[jira] [Assigned] (SPARK-31317) Add withField method to Column class
[ https://issues.apache.org/jira/browse/SPARK-31317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-31317:
-----------------------------------

    Assignee: fqaiser94

> Add withField method to Column class
> ------------------------------------
>
>                 Key: SPARK-31317
>                 URL: https://issues.apache.org/jira/browse/SPARK-31317
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: DB Tsai
>            Assignee: fqaiser94
>            Priority: Major
>             Fix For: 3.1.0
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
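For context, the API added by this ticket lets callers replace or add a single field inside a StructType column without re-listing every other field. A sketch of the intended usage (assuming the `withField` method as it shipped in Spark 3.1; column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, struct}

// Sketch only: requires a Spark 3.1+ session.
val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

// A DataFrame with one struct column `s` containing fields a and b.
val df = Seq((1, 2)).toDF("a", "b").select(struct($"a", $"b").as("s"))

// Replace field b and add a new field c inside the struct,
// leaving the other fields untouched.
val updated = df
  .withColumn("s", $"s".withField("b", lit(10)))
  .withColumn("s", $"s".withField("c", lit("new")))

updated.printSchema()
```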
[jira] [Created] (SPARK-32219) Add SHOW CACHED TABLES Command
ulysses you created SPARK-32219:
--------------------------------

             Summary: Add SHOW CACHED TABLES Command
                 Key: SPARK-32219
                 URL: https://issues.apache.org/jira/browse/SPARK-32219
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: ulysses you
[jira] [Updated] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32159:
----------------------------------
       Fix Version/s:     (was: 3.0.1)
    Target Version/s: 3.0.1

> New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-32159
>                 URL: https://issues.apache.org/jira/browse/SPARK-32159
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Erik Erlandson
>            Priority: Major
>
> The new user-defined aggregator feature (SPARK-27296), based on calling
> 'functions.udaf(aggregator)', works fine when the aggregator input type is
> atomic, e.g. 'Aggregator[Double, _, _]'. However, if the input type is an
> array, like 'Aggregator[Array[Double], _, _]', it trips over the following:
>
> /**
>  * When constructing [[MapObjects]], the element type must be given, which may not be available
>  * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by
>  * [[MapObjects]] during analysis after the input data is resolved.
>  * Note that, ideally we should not serialize and send unresolved expressions to executors, but
>  * users may accidentally do this (e.g. mistakenly reference an encoder instance when implementing
>  * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is
>  * not serializable. Then even if users mistakenly reference unresolved expression and serialize it,
>  * it's just a performance issue (more network traffic), and will not fail.
>  */
> case class UnresolvedMapObjects(
>     @transient function: Expression => Expression,
>     child: Expression,
>     customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable {
>   override lazy val resolved = false
>   override def dataType: DataType =
>     customCollectionCls.map(ObjectType.apply).getOrElse {
>       throw new UnsupportedOperationException("not resolved")
>     }
> }
>
> *The '@transient' causes `function` to be deserialized as 'null' on the executors, and that causes a null-pointer exception here, when it calls 'function(loopVar)':*
>
> object MapObjects {
>   def apply(
>       function: Expression => Expression,
>       inputData: Expression,
>       elementType: DataType,
>       elementNullable: Boolean = true,
>       customCollectionCls: Option[Class[_]] = None): MapObjects = {
>     val loopVar = LambdaVariable("MapObject", elementType, elementNullable)
>     MapObjects(loopVar, function(loopVar), inputData, customCollectionCls)
>   }
> }
>
> *I believe it may be possible to just use 'loopVar' instead of 'function(loopVar)' whenever 'function' is null, but I need a second opinion from Catalyst developers on what a robust fix should be.*
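A minimal reproduction of the report above would register an Aggregator whose input type is an array. This is a sketch under the reporter's assumptions (Spark 3.0.x, where `functions.udaf` was introduced); the aggregator itself is made up for illustration:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions

// Hypothetical Aggregator over Array[Double] input: the input shape
// that, per the report, triggers the NPE in MapObjects on executors.
object SumFirst extends Aggregator[Array[Double], Double, Double] {
  def zero: Double = 0.0
  def reduce(b: Double, a: Array[Double]): Double = b + a.headOption.getOrElse(0.0)
  def merge(b1: Double, b2: Double): Double = b1 + b2
  def finish(r: Double): Double = r
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

val spark = SparkSession.builder().master("local[2]").getOrCreate()
spark.udf.register("sum_first", functions.udaf(SumFirst))
// Aggregating an array<double> column with this UDAF should, per the
// report, hit the UnresolvedMapObjects serialization problem.
```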
[jira] [Comment Edited] (SPARK-32205) Writing timestamp in mysql gets fails
[ https://issues.apache.org/jira/browse/SPARK-32205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153131#comment-17153131 ]

JinxinTang edited comment on SPARK-32205 at 7/8/20, 5:10 AM:
-------------------------------------------------------------

[~nileshr.patil] This seems to be an issue. We can insert a timestamp into a DATETIME column predefined in MySQL, but the table cannot currently be auto-created by Spark.

was (Author: jinxintang):
[~nileshr.patil] Seems a issue, we can insert timestamp to datetime column predefined in mysql. The table could not auto create by spark currently.

> Writing timestamp in mysql gets fails
> -------------------------------------
>
>                 Key: SPARK-32205
>                 URL: https://issues.apache.org/jira/browse/SPARK-32205
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.3.4
>            Reporter: Nilesh Patil
>            Priority: Major
>
> When we write to MySQL with a TIMESTAMP column, it supports only the range
> '1970-01-01 00:00:01' UTC to '2038-01-19 03:14:07' UTC. MySQL has a DATETIME
> datatype with the range '1000-01-01 00:00:00' to '9999-12-31 23:59:59'.
> How can we map the Spark timestamp datatype to the MySQL DATETIME datatype in
> order to use the wider supported range?
> [https://dev.mysql.com/doc/refman/5.7/en/datetime.html]
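One way to get a DATETIME column instead of TIMESTAMP when Spark creates the table is the JDBC writer's `createTableColumnTypes` option, which overrides the default type mapping at table-creation time. A sketch (the URL, credentials, and the `event_time`/`events` names are placeholders, not from the report):

```scala
// Sketch: override the DDL type Spark uses for one column when it
// auto-creates the MySQL table, so the wider DATETIME range applies.
// Assumes `df` has a TimestampType column named "event_time".
df.write
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb")
  .option("dbtable", "events")
  .option("user", "user")
  .option("password", "password")
  .option("createTableColumnTypes", "event_time DATETIME")
  .save()
```

Note this only affects table creation; writing to a preexisting DATETIME column, as the comment says, already works.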
[jira] [Created] (SPARK-32218) spark-ml must support one hot encoded output labels for classification
Raghuvarran V H created SPARK-32218:
------------------------------------

             Summary: spark-ml must support one hot encoded output labels for classification
                 Key: SPARK-32218
                 URL: https://issues.apache.org/jira/browse/SPARK-32218
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 2.4.0
            Reporter: Raghuvarran V H

In any classification algorithm, for target labels that have no ordinal relationship, it is advised to one-hot encode the target labels. Refer here:

[https://stackoverflow.com/questions/51384911/one-hot-encoding-of-output-labels/53291690#53291690]
[https://www.linkedin.com/pulse/why-using-one-hot-encoding-classifier-training-adwin-jahn/]

spark-ml does not support one-hot encoded target labels. When I try, I get the error below:

IllegalArgumentException: u'requirement failed: Column label_ohe must be of type numeric but was actually of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.'

So it would be nice if OHE were supported for target labels.
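For context, Spark ML classifiers expect a single numeric label column rather than a one-hot vector; the supported route is StringIndexer, and the indexed label is treated as categorical, so no ordinal relationship is imposed. A sketch of that workaround (column names are illustrative, not from the report):

```scala
import org.apache.spark.ml.feature.StringIndexer

// Sketch: instead of one-hot encoding the target, index it to a single
// numeric column, which is the label format Spark ML classifiers require.
// Assumes `df` has a string column "label".
val indexed = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")
  .fit(df)
  .transform(df)

// Train with .setLabelCol("label_idx") on any Spark ML classifier.
```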
[jira] [Comment Edited] (SPARK-32213) saveAsTable deletes all files in path
[ https://issues.apache.org/jira/browse/SPARK-32213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153200#comment-17153200 ]

angerszhu edited comment on SPARK-32213 at 7/8/20, 3:13 AM:
------------------------------------------------------------

It seems you hit the problem I mentioned in https://issues.apache.org/jira/browse/SPARK-28551

was (Author: angerszhuuu):
https://issues.apache.org/jira/browse/SPARK- [25290|https://github.com/apache/spark/pull/25290/files] seems you meet problem I mentioned in

> saveAsTable deletes all files in path
> -------------------------------------
>
>                 Key: SPARK-32213
>                 URL: https://issues.apache.org/jira/browse/SPARK-32213
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.0
>            Reporter: Yuval Rochman
>            Priority: Major
>
> The problem is presented in the following link:
> [https://stackoverflow.com/questions/62782637/saveastable-can-delete-all-my-files-in-desktop?noredirect=1#comment111026138_62782637]
> Apparently, with no warning, all files on the desktop were deleted after writing a file.
> There is no warning in PySpark that the "path" parameter causes that problem.
[jira] [Commented] (SPARK-32213) saveAsTable deletes all files in path
[ https://issues.apache.org/jira/browse/SPARK-32213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153200#comment-17153200 ]

angerszhu commented on SPARK-32213:
-----------------------------------

https://issues.apache.org/jira/browse/SPARK- [25290|https://github.com/apache/spark/pull/25290/files] seems you meet problem I mentioned in

> saveAsTable deletes all files in path
> -------------------------------------
>
>                 Key: SPARK-32213
>                 URL: https://issues.apache.org/jira/browse/SPARK-32213
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.0
>            Reporter: Yuval Rochman
>            Priority: Major
[jira] [Commented] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations
[ https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153180#comment-17153180 ]

Dongjoon Hyun commented on SPARK-32163:
---------------------------------------

This landed in branch-3.0 via https://github.com/apache/spark/pull/29027 .

> Nested pruning should still work for nested column extractors of attributes with cosmetic variations
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32163
>                 URL: https://issues.apache.org/jira/browse/SPARK-32163
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: L. C. Hsieh
>            Assignee: L. C. Hsieh
>            Priority: Major
>             Fix For: 3.1.0
>
> If the expressions extracting nested fields have cosmetic variations, like a qualifier difference, nested column pruning currently cannot work well.
> For example, two attributes that are semantically the same are referred to in a query, but their nested column extractors are treated differently when we deal with nested column pruning.
[jira] [Updated] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations
[ https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32163:
----------------------------------
    Fix Version/s: 3.0.1

> Nested pruning should still work for nested column extractors of attributes with cosmetic variations
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32163
>                 URL: https://issues.apache.org/jira/browse/SPARK-32163
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: L. C. Hsieh
>            Assignee: L. C. Hsieh
>            Priority: Major
>             Fix For: 3.0.1, 3.1.0
[jira] [Commented] (SPARK-30703) Add a documentation page for ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-30703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153177#comment-17153177 ]

Apache Spark commented on SPARK-30703:
--------------------------------------

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/29033

> Add a documentation page for ANSI mode
> --------------------------------------
>
>                 Key: SPARK-30703
>                 URL: https://issues.apache.org/jira/browse/SPARK-30703
>             Project: Spark
>          Issue Type: Documentation
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Xiao Li
>            Assignee: Takeshi Yamamuro
>            Priority: Major
>
> ANSI mode is introduced in Spark 3.0. We need to clearly document the behavior difference when spark.sql.ansi.enabled is on and off.
[jira] [Assigned] (SPARK-20680) Spark-sql do not support for void column datatype of view
[ https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20680:
------------------------------------

    Assignee:     (was: Apache Spark)

> Spark-sql do not support for void column datatype of view
> ----------------------------------------------------------
>
>                 Key: SPARK-20680
>                 URL: https://issues.apache.org/jira/browse/SPARK-20680
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0, 2.1.1, 2.4.6, 3.0.0
>            Reporter: Lantao Jin
>            Priority: Major
>
> Create a HIVE view:
> {quote}
> hive> create table bad as select 1 x, null z from dual;
> {quote}
> Because there's no type, Hive gives it the VOID type:
> {quote}
> hive> describe bad;
> OK
> x    int
> z    void
> {quote}
> In Spark 2.0.x, the behaviour when reading this view is normal:
> {quote}
> spark-sql> describe bad;
> x    int     NULL
> z    void    NULL
> Time taken: 4.431 seconds, Fetched 2 row(s)
> {quote}
> But in Spark 2.1.x, it fails with "SparkException: Cannot recognize hive type string: void":
> {quote}
> spark-sql> describe bad;
> 17/05/09 03:12:08 INFO execution.SparkSqlParser: Parsing command: describe bad
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: int
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: void
> 17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad]
> org.apache.spark.SparkException: Cannot recognize hive type string: void
>         at org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
>         at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>         at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>         at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>         at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
>         at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
> DataType void() is not supported.(line 1, pos 0)
>
> == SQL ==
> void
> ^^^
>
>         ... 61 more
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> {quote}
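A common way to avoid the VOID column in the first place is to give the NULL literal an explicit type when the table is created. A sketch (the `good` table name is made up; `bad` mirrors the report's example):

```scala
// Sketch: casting the NULL literal gives the column a concrete type,
// so neither Hive nor Spark has to represent it as VOID.
spark.sql("CREATE TABLE good AS SELECT 1 AS x, CAST(NULL AS STRING) AS z")
spark.sql("DESCRIBE good").show()
// z should now be reported as string rather than void
```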
[jira] [Resolved] (SPARK-20680) Spark-sql do not support for void column datatype of view
[ https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-20680.
-----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 28833
[https://github.com/apache/spark/pull/28833]

> Spark-sql do not support for void column datatype of view
> ----------------------------------------------------------
>
>                 Key: SPARK-20680
>                 URL: https://issues.apache.org/jira/browse/SPARK-20680
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0, 2.1.1, 2.4.6, 3.0.0
>            Reporter: Lantao Jin
>            Assignee: Lantao Jin
>            Priority: Major
>             Fix For: 3.1.0
[jira] [Assigned] (SPARK-20680) Spark-sql do not support for void column datatype of view
[ https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-20680:
-------------------------------------

    Assignee: Lantao Jin

> Spark-sql do not support for void column datatype of view
> ----------------------------------------------------------
>
>                 Key: SPARK-20680
>                 URL: https://issues.apache.org/jira/browse/SPARK-20680
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0, 2.1.1, 2.4.6, 3.0.0
>            Reporter: Lantao Jin
>            Assignee: Lantao Jin
>            Priority: Major
[jira] [Assigned] (SPARK-20680) Spark-sql do not support for void column datatype of view
[ https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20680:
------------------------------------

    Assignee: Apache Spark

> Spark-sql do not support for void column datatype of view
> ----------------------------------------------------------
>
>                 Key: SPARK-20680
>                 URL: https://issues.apache.org/jira/browse/SPARK-20680
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0, 2.1.1, 2.4.6, 3.0.0
>            Reporter: Lantao Jin
>            Assignee: Apache Spark
>            Priority: Major
[jira] [Updated] (SPARK-20680) Spark-sql do not support for void column datatype of view
[ https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-20680:
----------------------------------
    Affects Version/s: 2.4.6
                       3.0.0

> Spark-sql do not support for void column datatype of view
> ----------------------------------------------------------
>
>                 Key: SPARK-20680
>                 URL: https://issues.apache.org/jira/browse/SPARK-20680
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0, 2.1.1, 2.4.6, 3.0.0
>            Reporter: Lantao Jin
>            Priority: Major
[jira] [Reopened] (SPARK-20680) Spark-sql do not support for void column datatype of view
[ https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reopened SPARK-20680:
-----------------------------------

> Spark-sql do not support for void column datatype of view
> ----------------------------------------------------------
>
>                 Key: SPARK-20680
>                 URL: https://issues.apache.org/jira/browse/SPARK-20680
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0, 2.1.1
>            Reporter: Lantao Jin
>            Priority: Major
>              Labels: bulk-closed
[jira] [Updated] (SPARK-20680) Spark-sql do not support for void column datatype of view
[ https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-20680: -- Labels: (was: bulk-closed)
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153170#comment-17153170 ] AidenZhang commented on SPARK-29038: Hi [~cltlfcjin], thanks for your reply. Our company is about to implement materialized views in Spark SQL: we plan to optimize Catalyst to support query rewrite, replacing table scans with a materialized view where applicable. The materialized view's data is stored on HDFS, and its structure information is stored in the Hive metastore; our plan is to implement Spark SQL's materialized view management on top of Hive. There are two people on our team now. Could you please estimate how long it would take to implement this feature? > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to accelerate queries. By creating a materialized view through SQL, the data that can be cached is very flexible and can be configured for specific usage scenarios. The Materialization Manager automatically updates the cached data according to changes in the detail source tables, simplifying user work. When a user submits a query, the Spark optimizer rewrites the execution plan based on the available materialized views to determine the optimal execution plan. > Details in [design doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]
[jira] [Commented] (SPARK-31760) Simplification Based on Containment
[ https://issues.apache.org/jira/browse/SPARK-31760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153161#comment-17153161 ] Nikita Glashenko commented on SPARK-31760: -- Hi, [~yumwang]. I'd like to work on this. > Simplification Based on Containment > --- > > Key: SPARK-31760 > URL: https://issues.apache.org/jira/browse/SPARK-31760 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > Labels: starter > > https://docs.teradata.com/reader/Ws7YT1jvRK2vEr1LpVURug/V~FCwD9BL7gY4ac3WwHInw
[jira] [Assigned] (SPARK-32217) Track whether the worker is also being decommissioned along with an executor
[ https://issues.apache.org/jira/browse/SPARK-32217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32217: Assignee: Apache Spark > Track whether the worker is also being decommissioned along with an executor > > > Key: SPARK-32217 > URL: https://issues.apache.org/jira/browse/SPARK-32217 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Devesh Agrawal >Assignee: Apache Spark >Priority: Major > > When an executor is decommissioned, we would like to know if its shuffle data is truly going to be lost. In the case of an external shuffle service, this means knowing that the worker (or the node that the executor is on) is also going to be lost. > > (I don't think we need to worry about disaggregated remote shuffle storage at present, since it is only used in a couple of web companies – but with remote shuffle, the shuffle data won't be lost along with a decommissioned executor.) > > We know for sure that a worker is being decommissioned when the Master is asked to decommission a worker. In the case of other schedulers: > * YARN support for decommissioning isn't implemented yet. The idea would be for YARN preemption not to mark the worker as lost, but for machine-level decommissioning (e.g. for kernel upgrades) to mark it as such. > * K8s doesn't quite work with the external shuffle service yet, so when the executor is lost, the worker isn't lost with it. >
[jira] [Assigned] (SPARK-32217) Track whether the worker is also being decommissioned along with an executor
[ https://issues.apache.org/jira/browse/SPARK-32217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32217: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-32217) Track whether the worker is also being decommissioned along with an executor
[ https://issues.apache.org/jira/browse/SPARK-32217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153157#comment-17153157 ] Apache Spark commented on SPARK-32217: -- User 'agrawaldevesh' has created a pull request for this issue: https://github.com/apache/spark/pull/29032
[jira] [Created] (SPARK-32217) Track whether the worker is also being decommissioned along with an executor
Devesh Agrawal created SPARK-32217: -- Summary: Track whether the worker is also being decommissioned along with an executor Key: SPARK-32217 URL: https://issues.apache.org/jira/browse/SPARK-32217 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.1.0 Reporter: Devesh Agrawal When an executor is decommissioned, we would like to know if its shuffle data is truly going to be lost. In the case of an external shuffle service, this means knowing that the worker (or the node that the executor is on) is also going to be lost. (I don't think we need to worry about disaggregated remote shuffle storage at present, since it is only used in a couple of web companies – but with remote shuffle, the shuffle data won't be lost along with a decommissioned executor.) We know for sure that a worker is being decommissioned when the Master is asked to decommission a worker. In the case of other schedulers: * YARN support for decommissioning isn't implemented yet. The idea would be for YARN preemption not to mark the worker as lost, but for machine-level decommissioning (e.g. for kernel upgrades) to mark it as such. * K8s doesn't quite work with the external shuffle service yet, so when the executor is lost, the worker isn't lost with it.
[jira] [Comment Edited] (SPARK-32205) Writing timestamp in mysql gets fails
[ https://issues.apache.org/jira/browse/SPARK-32205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153131#comment-17153131 ] JinxinTang edited comment on SPARK-32205 at 7/8/20, 12:41 AM: -- [~nileshr.patil] Seems an issue; we can insert a timestamp into a DATETIME column predefined in MySQL. The table currently could not be auto-created by Spark. was (Author: jinxintang): [~nileshr.patil] Seems an issue; we can insert a timestamp into a DATETIME column predefined in MySQL. The table currently should not be auto-created by Spark. > Writing timestamp in mysql gets fails > -- > > Key: SPARK-32205 > URL: https://issues.apache.org/jira/browse/SPARK-32205 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.4 >Reporter: Nilesh Patil >Priority: Major > > When writing to MySQL, a TIMESTAMP column supports only the range '1970-01-01 00:00:01' UTC to '2038-01-19 03:14:07' UTC. MySQL's DATETIME datatype has the range '1000-01-01 00:00:00' to '9999-12-31 23:59:59'. > How can the Spark timestamp datatype be mapped to the MySQL DATETIME datatype in order to use the wider supported range? > [https://dev.mysql.com/doc/refman/5.7/en/datetime.html] >
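The two ranges discussed above can be checked mechanically before a write. A minimal plain-Python sketch (the function name is illustrative; this is not Spark or MySQL Connector code, just the range constants from the MySQL 5.7 documentation linked in the issue):

```python
from datetime import datetime

# TIMESTAMP range per the MySQL 5.7 docs cited in the issue;
# DATETIME extends from year 1000 to year 9999.
TIMESTAMP_MIN = datetime(1970, 1, 1, 0, 0, 1)
TIMESTAMP_MAX = datetime(2038, 1, 19, 3, 14, 7)

def fits_mysql_timestamp(value: datetime) -> bool:
    """Return True if the value can be stored in a MySQL TIMESTAMP column.

    Values outside this window need a DATETIME column instead, which is
    why the issue asks for a way to map Spark timestamps to DATETIME.
    """
    return TIMESTAMP_MIN <= value <= TIMESTAMP_MAX
```

A value like `datetime(1950, 1, 1)` fails this check, illustrating why writes of pre-epoch timestamps into an auto-created TIMESTAMP column error out.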
[jira] [Resolved] (SPARK-32057) SparkExecuteStatementOperation does not set CANCELED state correctly
[ https://issues.apache.org/jira/browse/SPARK-32057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32057. -- Fix Version/s: 3.1.0 3.0.1 Resolution: Fixed Issue resolved by pull request 28912 [https://github.com/apache/spark/pull/28912] > SparkExecuteStatementOperation does not set CANCELED state correctly > - > > Key: SPARK-32057 > URL: https://issues.apache.org/jira/browse/SPARK-32057 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Smesseim >Assignee: Ali Smesseim >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > https://github.com/apache/spark/pull/28671 changed the way cleanup is done in SparkExecuteStatementOperation. In cancel(), cleanup (killing jobs) used to be done after setting the state to CANCELED. Now the order is reversed: jobs are killed first, causing an exception to be thrown inside execute(), so the status of the operation becomes ERROR before being set to CANCELED. > cc [~juliuszsompolski]
[jira] [Assigned] (SPARK-32057) SparkExecuteStatementOperation does not set CANCELED state correctly
[ https://issues.apache.org/jira/browse/SPARK-32057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32057: Assignee: Ali Smesseim
[jira] [Assigned] (SPARK-32216) Remove redundant ProjectExec
[ https://issues.apache.org/jira/browse/SPARK-32216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32216: Assignee: (was: Apache Spark) > Remove redundant ProjectExec > > > Key: SPARK-32216 > URL: https://issues.apache.org/jira/browse/SPARK-32216 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Allison Wang >Priority: Major > > Currently a Spark executed plan can contain redundant `ProjectExec` nodes. For example: > After Filter: > {code:java} > == Physical Plan == > *(1) Project [a#14L, b#15L, c#16, key#17] > +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 5)) > +- *(1) ColumnarToRow > +- FileScan parquet [a#14L,b#15L,c#16,key#17] {code} > The `Project [a#14L, b#15L, c#16, key#17]` is redundant because its output is exactly the same as the filter's output. > Before Aggregate: > {code:java} > == Physical Plan == > *(2) HashAggregate(keys=[key#17], functions=[sum(a#14L), last(b#15L, false)], output=[sum_a#39L, key#17, last_b#41L]) > +- Exchange hashpartitioning(key#17, 5), true, [id=#77] > +- *(1) HashAggregate(keys=[key#17], functions=[partial_sum(a#14L), partial_last(b#15L, false)], output=[key#17, sum#49L, last#50L, valueSet#51]) > +- *(1) Project [key#17, a#14L, b#15L] > +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 100)) > +- *(1) ColumnarToRow > +- FileScan parquet [a#14L,b#15L,key#17] {code} > The `Project [key#17, a#14L, b#15L]` is redundant because hash aggregate doesn't require its child plan's output to be in a specific order. > > In general, a project is redundant when: > # It has the same output attributes, in the same order, as its child's output, when ordering of these attributes is required. > # It has the same output attributes as its child's output, when attribute output ordering is not required.
[jira] [Commented] (SPARK-32216) Remove redundant ProjectExec
[ https://issues.apache.org/jira/browse/SPARK-32216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153136#comment-17153136 ] Apache Spark commented on SPARK-32216: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/29031
[jira] [Assigned] (SPARK-32216) Remove redundant ProjectExec
[ https://issues.apache.org/jira/browse/SPARK-32216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32216: Assignee: Apache Spark
[jira] [Created] (SPARK-32216) Remove redundant ProjectExec
Allison Wang created SPARK-32216: Summary: Remove redundant ProjectExec Key: SPARK-32216 URL: https://issues.apache.org/jira/browse/SPARK-32216 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Allison Wang Currently a Spark executed plan can contain redundant `ProjectExec` nodes. For example: After Filter: {code:java} == Physical Plan == *(1) Project [a#14L, b#15L, c#16, key#17] +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 5)) +- *(1) ColumnarToRow +- FileScan parquet [a#14L,b#15L,c#16,key#17] {code} The `Project [a#14L, b#15L, c#16, key#17]` is redundant because its output is exactly the same as the filter's output. Before Aggregate: {code:java} == Physical Plan == *(2) HashAggregate(keys=[key#17], functions=[sum(a#14L), last(b#15L, false)], output=[sum_a#39L, key#17, last_b#41L]) +- Exchange hashpartitioning(key#17, 5), true, [id=#77] +- *(1) HashAggregate(keys=[key#17], functions=[partial_sum(a#14L), partial_last(b#15L, false)], output=[key#17, sum#49L, last#50L, valueSet#51]) +- *(1) Project [key#17, a#14L, b#15L] +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 100)) +- *(1) ColumnarToRow +- FileScan parquet [a#14L,b#15L,key#17] {code} The `Project [key#17, a#14L, b#15L]` is redundant because hash aggregate doesn't require its child plan's output to be in a specific order. In general, a project is redundant when: # It has the same output attributes, in the same order, as its child's output, when ordering of these attributes is required. # It has the same output attributes as its child's output, when attribute output ordering is not required.
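The two redundancy rules above can be sketched as a small predicate. Attributes are modeled as plain strings here rather than Catalyst attribute references; this is an illustration of the criteria, not the actual optimizer rule:

```python
# Illustrative sketch of the redundancy rules for a Project node:
# rule 1 - ordering required: same attributes in the same order as the child;
# rule 2 - ordering not required: same attribute set, order irrelevant.

def project_is_redundant(project_output, child_output, ordering_required):
    if ordering_required:
        # Rule 1: identical attribute lists, including order.
        return list(project_output) == list(child_output)
    # Rule 2: identical attribute sets, order ignored.
    return set(project_output) == set(child_output)
```

Under these rules, `Project [key#17, a#14L, b#15L]` over a filter producing `[a#14L, b#15L, key#17]` is redundant for a hash aggregate parent (rule 2), but would not be if the parent required that exact ordering (rule 1).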
[jira] [Comment Edited] (SPARK-32205) Writing timestamp in mysql gets fails
[ https://issues.apache.org/jira/browse/SPARK-32205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153131#comment-17153131 ] JinxinTang edited comment on SPARK-32205 at 7/7/20, 11:31 PM: -- [~nileshr.patil] Seems a issue, we can insert timestamp to datetime column predefined in mysql. The table should not auto create by spark currently. was (Author: jinxintang): [~nileshr.patil] Seems a issue, we can insert timestamp to datetime column predefined in mysql, but spark and mysql have different timestamp range. The table should not auto create by spark currently.
[jira] [Commented] (SPARK-32205) Writing timestamp in mysql gets fails
[ https://issues.apache.org/jira/browse/SPARK-32205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153131#comment-17153131 ] JinxinTang commented on SPARK-32205: [~nileshr.patil] Seems a issue, we can insert timestamp to datetime column predefined in mysql, but spark and mysql have different timestamp range. The table should not auto create by spark currently.
[jira] [Commented] (SPARK-32215) Expose end point on Master so that it can be informed about decommissioned workers out of band
[ https://issues.apache.org/jira/browse/SPARK-32215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153127#comment-17153127 ] Apache Spark commented on SPARK-32215: -- User 'agrawaldevesh' has created a pull request for this issue: https://github.com/apache/spark/pull/29015 > Expose end point on Master so that it can be informed about decommissioned workers out of band > -- > > Key: SPARK-32215 > URL: https://issues.apache.org/jira/browse/SPARK-32215 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 > Environment: Standalone Scheduler >Reporter: Devesh Agrawal >Priority: Major > Fix For: 3.1.0 > > > The use case here is to allow some external entity that has made a decommissioning decision to inform the Master (in the case of Standalone scheduling mode). > Currently, decommissioning is triggered by the Worker receiving a SIGPWR (possibly out of band, by some cleanup hook), which then informs the Master about it. This approach may not be feasible in some environments that cannot trigger a cleanup hook on the Worker. > Add a new POST endpoint {{/workers/kill}} on the MasterWebUI that allows an external agent to inform the Master about all the nodes being decommissioned in bulk. The workers are identified by either their {{host:port}} or just the host – in which case all workers on the host are decommissioned. > This API is merely a new entry point into the existing decommissioning logic. It does not change how the decommissioning request is handled at its core. > The path /workers/kill is chosen to be consistent with the other endpoint names on the MasterWebUI. > Since this is a sensitive operation, this API will be disabled by default.
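The worker-identification scheme described above ({{host:port}} for one worker, or a bare host for every worker on that host) can be sketched as a small matcher. This is a hypothetical plain-Python illustration of the matching rule, not the actual MasterWebUI handler:

```python
# Illustrative sketch: match a decommission target string against a worker.
# A target "host:port" matches exactly one worker; a bare "host" matches
# all workers running on that host.

def matches(target: str, worker_host: str, worker_port: int) -> bool:
    if ":" in target:
        host, port = target.rsplit(":", 1)
        return worker_host == host and worker_port == int(port)
    # Bare host: decommission every worker on this host.
    return worker_host == target
```

An external agent could then send one bulk request listing targets, and the Master would decommission every worker for which some target matches.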
[jira] [Commented] (SPARK-32215) Expose end point on Master so that it can be informed about decommissioned workers out of band
[ https://issues.apache.org/jira/browse/SPARK-32215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153126#comment-17153126 ] Apache Spark commented on SPARK-32215: -- User 'agrawaldevesh' has created a pull request for this issue: https://github.com/apache/spark/pull/29015
[jira] [Assigned] (SPARK-32215) Expose end point on Master so that it can be informed about decommissioned workers out of band
[ https://issues.apache.org/jira/browse/SPARK-32215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32215: Assignee: Apache Spark > Expose end point on Master so that it can be informed about decommissioned > workers out of band > -- -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32215) Expose end point on Master so that it can be informed about decommissioned workers out of band
[ https://issues.apache.org/jira/browse/SPARK-32215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32215: Assignee: (was: Apache Spark) > Expose end point on Master so that it can be informed about decommissioned > workers out of band > -- -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32215) Expose end point on Master so that it can be informed about decommissioned workers out of band
[ https://issues.apache.org/jira/browse/SPARK-32215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devesh Agrawal updated SPARK-32215: --- Description: The use case here is to allow some external entity that has made a decommissioning decision to inform the Master (in the case of Standalone scheduling mode). The current decommissioning is triggered by the Worker getting a SIGPWR (out of band, possibly by some cleanup hook), which then informs the Master about it. This approach may not be feasible in some environments that cannot trigger a cleanup hook on the Worker. Add a new POST endpoint {{/workers/kill}} on the MasterWebUI that allows an external agent to inform the Master about all the nodes being decommissioned in bulk. The workers are identified by either their {{host:port}} or just the host – in which case all workers on the host would be decommissioned. This API is merely a new entry point into the existing decommissioning logic. It does not change how the decommissioning request is handled at its core. The path /workers/kill was chosen to be consistent with the other endpoint names on the MasterWebUI. Since this is a sensitive operation, this API will be disabled by default. > Expose end point on Master so that it can be informed about decommissioned > workers out of band > -- -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32215) Expose end point on Master so that it can be informed about decommissioned workers out of band
Devesh Agrawal created SPARK-32215: -- Summary: Expose end point on Master so that it can be informed about decommissioned workers out of band Key: SPARK-32215 URL: https://issues.apache.org/jira/browse/SPARK-32215 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.1.0 Environment: Standalone Scheduler Reporter: Devesh Agrawal Fix For: 3.1.0 The use case here is to allow some external entity that has made a decommissioning decision to inform the Master (in the case of Standalone scheduling mode). The current decommissioning is triggered by the Worker getting a SIGPWR (out of band, possibly by some cleanup hook), which then informs the Master about it. This approach may not be feasible in some environments that cannot trigger a cleanup hook on the Worker. Add a new POST endpoint {{/workers/kill}} on the MasterWebUI that allows an external agent to inform the Master about all the nodes being decommissioned in bulk. The workers are identified by either their {{host:port}} or just the host -- in which case all workers on the host would be decommissioned. This API is merely a new entry point into the existing decommissioning logic. It does not change how the decommissioning request is handled at its core. The path /workers/kill was chosen to be consistent with the other endpoint names on the MasterWebUI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
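The bulk-decommission request described above can be sketched from the client side. Only the {{/workers/kill}} path appears in the ticket; the form-field names used here ("host" and "worker") and the helper itself are illustrative assumptions, not the final API:

```python
# Hypothetical client-side sketch for the proposed /workers/kill endpoint.
# The endpoint path comes from the ticket; the payload field names are
# assumptions for illustration only.

def build_kill_payload(identifiers):
    """Split worker identifiers into host-only entries (which decommission
    every worker on that host) and host:port entries (which target a
    single worker), as the ticket describes."""
    hosts, workers = [], []
    for ident in identifiers:
        # A lone hostname has no colon; "host:port" names one worker.
        (workers if ":" in ident else hosts).append(ident)
    return {"host": hosts, "worker": workers}

payload = build_kill_payload(["10.0.0.5:34567", "10.0.0.6"])
# An external agent could then POST this to the Master web UI, e.g.:
# requests.post("http://master:8080/workers/kill", data=payload)
```

Since the API is a sensitive operation and disabled by default, any such client would also need the corresponding master-side configuration enabled.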
[jira] [Commented] (SPARK-32093) Add hadoop-ozone-filesystem jar to ozone profile
[ https://issues.apache.org/jira/browse/SPARK-32093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153123#comment-17153123 ] Apache Spark commented on SPARK-32093: -- User 'bharatviswa504' has created a pull request for this issue: https://github.com/apache/spark/pull/29030 > Add hadoop-ozone-filesystem jar to ozone profile > > > Key: SPARK-32093 > URL: https://issues.apache.org/jira/browse/SPARK-32093 > Project: Spark > Issue Type: Bug > Components: Build > Affects Versions: 3.0.0 > Reporter: Bharat Viswanadham > Priority: Major > > This Jira is to include the Ozone filesystem jar via a new Maven profile, "ozone". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32093) Add hadoop-ozone-filesystem jar to ozone profile
[ https://issues.apache.org/jira/browse/SPARK-32093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32093: Assignee: Apache Spark > Add hadoop-ozone-filesystem jar to ozone profile > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32093) Add hadoop-ozone-filesystem jar to ozone profile
[ https://issues.apache.org/jira/browse/SPARK-32093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32093: Assignee: (was: Apache Spark) > Add hadoop-ozone-filesystem jar to ozone profile > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
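The kind of build change the ticket asks for would look roughly like the following pom.xml fragment. This is a sketch only: the artifact coordinates and version property are assumptions, not taken from the actual pull request.

```xml
<!-- Hypothetical "ozone" profile sketch; coordinates and the
     ${ozone.version} property are illustrative assumptions. -->
<profile>
  <id>ozone</id>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-ozone-filesystem</artifactId>
      <version>${ozone.version}</version>
    </dependency>
  </dependencies>
</profile>
```

A build would then opt in with `mvn -Pozone package`, keeping the Ozone jar out of the default distribution.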
[jira] [Updated] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Erlandson updated SPARK-32159: --- Fix Version/s: 3.0.1 > New udaf(Aggregator) has an integration bug with UnresolvedMapObjects > serialization > --- > > Key: SPARK-32159 > URL: https://issues.apache.org/jira/browse/SPARK-32159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Erik Erlandson >Priority: Major > Fix For: 3.0.1 > > > The new user defined aggregator feature (SPARK-27296) based on calling > 'functions.udaf(aggregator)' works fine when the aggregator input type is > atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an > array, like 'Aggregator[Array[Double], _, _]', it is tripping over the > following: > /** > * When constructing [[MapObjects]], the element type must be given, which > may not be available > * before analysis. This class acts like a placeholder for [[MapObjects]], > and will be replaced by > * [[MapObjects]] during analysis after the input data is resolved. > * Note that, ideally we should not serialize and send unresolved expressions > to executors, but > * users may accidentally do this(e.g. mistakenly reference an encoder > instance when implementing > * Aggregator). Here we mark `function` as transient because it may reference > scala Type, which is > * not serializable. Then even users mistakenly reference unresolved > expression and serialize it, > * it's just a performance issue(more network traffic), and will not fail. 
> */
> {code:scala}
> case class UnresolvedMapObjects(
>     @transient function: Expression => Expression,
>     child: Expression,
>     customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable {
>   override lazy val resolved = false
>   override def dataType: DataType =
>     customCollectionCls.map(ObjectType.apply).getOrElse {
>       throw new UnsupportedOperationException("not resolved")
>     }
> }
> {code}
> *The '@transient' is causing the function to be unpacked as 'null' over on the executors, and it is causing a null-pointer exception here, when it tries to do 'function(loopVar)':*
> {code:scala}
> object MapObjects {
>   def apply(
>       function: Expression => Expression,
>       inputData: Expression,
>       elementType: DataType,
>       elementNullable: Boolean = true,
>       customCollectionCls: Option[Class[_]] = None): MapObjects = {
>     val loopVar = LambdaVariable("MapObject", elementType, elementNullable)
>     MapObjects(loopVar, function(loopVar), inputData, customCollectionCls)
>   }
> }
> {code}
> *I believe it may be possible to just use 'loopVar' instead of 'function(loopVar)' whenever 'function' is null, but I need a second opinion from Catalyst developers on what a robust fix should be.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
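The failure mechanism quoted above, a field deliberately excluded from serialization arriving as null on the remote side, can be mimicked outside the JVM. This Python sketch uses pickle's `__getstate__` the way Scala uses `@transient`; every name here is an illustrative stand-in, not Spark's actual code:

```python
import pickle

class UnresolvedMapObjectsLike:
    """Toy stand-in: 'function' plays the role of the @transient field."""

    def __init__(self, function):
        self.function = function

    def __getstate__(self):
        # Like @transient: drop the field when serializing.
        return {"function": None}

    def apply_to(self, loop_var):
        # Mirrors MapObjects doing function(loopVar) on the executor:
        # after a serialization round trip the field is None, so this
        # call raises TypeError (the analogue of the NPE).
        return self.function(loop_var)

local = UnresolvedMapObjectsLike(lambda x: x + 1)
assert local.apply_to(1) == 2                # fine before shipping

remote = pickle.loads(pickle.dumps(local))   # "send to the executor"
assert remote.function is None               # the field did not survive
```

The upstream comment anticipated that serializing the unresolved expression would only cost extra network traffic; the sketch shows why it instead fails as soon as the deserialized side tries to *call* the dropped field.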
[jira] [Assigned] (SPARK-32214) The type conversion function generated in makeFromJava for "other" type uses a wrong variable.
[ https://issues.apache.org/jira/browse/SPARK-32214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32214: Assignee: Kousuke Saruta (was: Apache Spark) > The type conversion function generated in makeFromJava for "other" type uses > a wrong variable. > --- > > Key: SPARK-32214 > URL: https://issues.apache.org/jira/browse/SPARK-32214 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.4.6, 3.0.0, 3.1.0 > Reporter: Kousuke Saruta > Assignee: Kousuke Saruta > Priority: Minor > > `makeFromJava` in `EvaluatePython` creates a type conversion function for some Java/Scala types. > For the `other` type, the parameter of the type conversion function is named `obj`, but `other` is mistakenly used rather than `obj` in the function body. > {code:java} > case other => (obj: Any) => nullSafeConvert(other)(PartialFunction.empty) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32214) The type conversion function generated in makeFromJava for "other" type uses a wrong variable.
[ https://issues.apache.org/jira/browse/SPARK-32214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153118#comment-17153118 ] Apache Spark commented on SPARK-32214: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/29029 > The type conversion function generated in makeFromJava for "other" type uses > a wrong variable. > --- -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32214) The type conversion function generated in makeFromJava for "other" type uses a wrong variable.
[ https://issues.apache.org/jira/browse/SPARK-32214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32214: Assignee: Apache Spark (was: Kousuke Saruta) > The type conversion function generated in makeFromJava for "other" type uses > a wrong variable. > --- -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32214) The type conversion function generated in makeFromJava for "other" type uses a wrong variable.
Kousuke Saruta created SPARK-32214: -- Summary: The type conversion function generated in makeFromJava for "other" type uses a wrong variable. Key: SPARK-32214 URL: https://issues.apache.org/jira/browse/SPARK-32214 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 2.4.6, 3.1.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta `makeFromJava` in `EvaluatePython` creates a type conversion function for some Java/Scala types. For the `other` type, the parameter of the type conversion function is named `obj`, but `other` is mistakenly used rather than `obj` in the function body. {code:java} case other => (obj: Any) => nullSafeConvert(other)(PartialFunction.empty) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
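The bug pattern reported here, a factory returning a function whose body uses the factory's matched variable instead of the function's own parameter, is easy to reproduce in any language with closures. This Python analogue is illustrative only, not Spark's actual code:

```python
def make_converter_buggy(other):
    # Mirrors the report: the returned converter ignores its own
    # parameter `obj` and converts the captured `other` instead.
    return lambda obj: str(other)

def make_converter_fixed(other):
    # What the fix amounts to: convert the value actually passed in.
    return lambda obj: str(obj)

conv = make_converter_buggy(42)
assert conv("hello") == "42"      # wrong: the input is ignored

conv = make_converter_fixed(42)
assert conv("hello") == "hello"   # right: the argument is converted
```

The bug is silent because the buggy converter still type-checks and returns a value of the right shape; only the content is wrong, which is why it survived across three release lines (2.4.6, 3.0.0, 3.1.0).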
[jira] [Created] (SPARK-32213) saveAsTable deletes all files in path
Yuval Rochman created SPARK-32213: - Summary: saveAsTable deletes all files in path Key: SPARK-32213 URL: https://issues.apache.org/jira/browse/SPARK-32213 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.0.0 Reporter: Yuval Rochman The problem is presented in the following link: [https://stackoverflow.com/questions/62782637/saveastable-can-delete-all-my-files-in-desktop?noredirect=1#comment111026138_62782637] Apparently, without any warning, all files on the desktop were deleted after writing a file. There is no warning in PySpark that the "path" parameter can cause this problem. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31625) Unregister application from YARN resource manager outside the shutdown hook
[ https://issues.apache.org/jira/browse/SPARK-31625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31625. -- Resolution: Not A Problem > Unregister application from YARN resource manager outside the shutdown hook > --- > > Key: SPARK-31625 > URL: https://issues.apache.org/jira/browse/SPARK-31625 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Major > > Currently, an application is unregistered from YARN resource manager as a > shutdown hook. In the scenario where the shutdown hook does not run (e.g., > timeouts, etc.), the application is not unregistered, resulting in YARN > resubmitting the application even if it succeeded. > For example, you could see the following on the driver log: > {code:java} > 20/04/30 06:20:29 INFO SparkContext: Successfully stopped SparkContext > 20/04/30 06:20:29 INFO ApplicationMaster: Final app status: SUCCEEDED, > exitCode: 0 > 20/04/30 06:20:59 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, > java.util.concurrent.TimeoutException > java.util.concurrent.TimeoutException > at java.util.concurrent.FutureTask.get(FutureTask.java:205) > at > org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95) > {code} > On the YARN RM side: > {code:java} > 2020-04-30 06:21:25,083 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_1588227360159_0001_01_01 Container Transitioned from RUNNING to > COMPLETED > 2020-04-30 06:21:25,085 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > Updating application attempt appattempt_1588227360159_0001_01 with final > state: FAILED, and exit status: 0 > 2020-04-30 06:21:25,085 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > 
appattempt_1588227360159_0001_01 State change from RUNNING to > FINAL_SAVING on event = CONTAINER_FINISHED > {code} > You see that the final state of the application becomes FAILED since the > container is finished before the application is unregistered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
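The remedy suggested by the ticket title, performing the critical unregistration in normal control flow rather than relying only on a shutdown hook that may time out, can be shown in miniature. Python's `atexit` stands in for the JVM shutdown hook here; all names are illustrative, not Spark's actual code:

```python
import atexit

events = []

def unregister_from_rm():
    # Idempotent, so it is safe to call from both the main path and
    # the hook without unregistering twice.
    if "unregistered" not in events:
        events.append("unregistered")

def main():
    events.append("app finished")
    # Do the critical step here, while we still control execution...
    unregister_from_rm()

# ...and keep the hook only as a best-effort backstop for abnormal exits.
atexit.register(unregister_from_rm)

main()
assert events == ["app finished", "unregistered"]
```

If the hook were the only place the step ran, a timed-out or skipped hook would leave the application registered, which is exactly the FAILED-despite-SUCCEEDED log sequence quoted above.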
[jira] [Commented] (SPARK-32212) RDD.takeOrdered can choose to merge intermediate results in executor or driver
[ https://issues.apache.org/jira/browse/SPARK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153060#comment-17153060 ] Apache Spark commented on SPARK-32212: -- User 'izchen' has created a pull request for this issue: https://github.com/apache/spark/pull/29028 > RDD.takeOrdered can choose to merge intermediate results in executor or driver > -- > > Key: SPARK-32212 > URL: https://issues.apache.org/jira/browse/SPARK-32212 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.0.0 > Reporter: Chen Zhang > Priority: Major > > In the list of issues, I saw some discussions about exceeding the memory limit of the driver when using _RDD.takeOrdered_ or _SQL(order by xx limit xx)_. I think that the implementation of _RDD.takeOrdered_ can be improved. > In the original implementation of _RDD.takeOrdered_, the QuickSelect algorithm from Guava is used in the executor process to calculate the local TopK results of each RDD partition. These intermediate results are packaged into java.util.PriorityQueue and returned to the driver process. In the driver process, these intermediate results are merged to get the global TopK results. > The problem with this implementation is that if the intermediate results are too large and there are too many partitions, the intermediate results may accumulate in the memory of the driver process, causing excessive memory pressure. > We can use an optional config to determine whether the intermediate results (local TopK) of each partition in _RDD.takeOrdered_ are merged in the driver process or the executor process. If set to true, merge in the driver process (via util.PriorityQueue), which gives a shorter wait for the result; but if the intermediate results are too large and there are too many partitions, they may accumulate in the memory of the driver process, causing excessive memory pressure. If set to false, merge in the executor process (via guava.QuickSelect); intermediate results will not accumulate in driver memory, but runtimes will be longer. > Something like: > _(org.apache.spark.rdd.RDD)_ > {code:scala} > def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope { > if (num == 0 || partitions.length == 0) { > Array.empty > } else { > if (conf.getBoolean("spark.rdd.takeOrdered.mergeInDriver", true)) { > val mapRDDs = mapPartitions { items => > // Priority keeps the largest elements, so let's reverse the ordering. > val queue = new BoundedPriorityQueue[T](num)(ord.reverse) > queue ++= collectionUtils.takeOrdered(items, num)(ord) > Iterator.single(queue) > } > mapRDDs.reduce { (queue1, queue2) => > queue1 ++= queue2 > queue1 > }.toArray.sorted(ord) > } else { > mapPartitions { items => > collectionUtils.takeOrdered(items, num)(ord) > }.repartition(1).mapPartitions { items => > collectionUtils.takeOrdered(items, num)(ord) > }.collect() > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32212) RDD.takeOrdered can choose to merge intermediate results in executor or driver
[ https://issues.apache.org/jira/browse/SPARK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32212: Assignee: (was: Apache Spark) > RDD.takeOrdered can choose to merge intermediate results in executor or driver > -- -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32212) RDD.takeOrdered can choose to merge intermediate results in executor or driver
[ https://issues.apache.org/jira/browse/SPARK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153059#comment-17153059 ] Apache Spark commented on SPARK-32212: -- User 'izchen' has created a pull request for this issue: https://github.com/apache/spark/pull/29028 > RDD.takeOrdered can choose to merge intermediate results in executor or driver > -- > > Key: SPARK-32212 > URL: https://issues.apache.org/jira/browse/SPARK-32212 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Chen Zhang >Priority: Major > > In the list of issues, I saw some discussions about exceeding the memory > limit of the driver when using _RDD.takeOrdered_ or _SQL(order by xx limit > xx)_. I think that the implementation of _RDD.takeOrdered_ can be improved. > In the original code implementation of _RDD.takeOrdered_, the QuickSelect > algorithm in guava is used in the executor process to calculate the local > TopK results of each RDD partition. These intermediate results are packaged > into java.util.PriorityQueue and returned to the driver process. In the > driver process, these intermediate results are merged to get the global TopK > results. > The problem with this implementation is that if the intermediate results are > too large and too many partitions, the intermediate results may accumulate in > the memory of the driver process, causing excessive memory pressure. > We can use an optional config to determine whether the intermediate > results(local TopK) of each partition in _RDD.takeOrdered_ will be merged in > driver process or executor process. If set to true, merge in driver > process(by util.PriorityQueue), which will get shorter waiting time for > return. But if the intermediate results are too large and too many > partitions, the intermediate results may accumulate in the memory of the > driver process, causing excessive memory pressure. 
If set to false, merge in executor process (by guava.QuickSelect); intermediate results will not accumulate in memory, but will cause longer runtimes.
> something like:
> _(org.apache.spark.rdd.RDD)_
> {code:scala}
> def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
>   if (num == 0 || partitions.length == 0) {
>     Array.empty
>   } else {
>     if (conf.getBoolean("spark.rdd.takeOrdered.mergeInDriver", true)) {
>       val mapRDDs = mapPartitions { items =>
>         // Priority keeps the largest elements, so let's reverse the ordering.
>         val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
>         queue ++= collectionUtils.takeOrdered(items, num)(ord)
>         Iterator.single(queue)
>       }
>       mapRDDs.reduce { (queue1, queue2) =>
>         queue1 ++= queue2
>         queue1
>       }.toArray.sorted(ord)
>     } else {
>       mapPartitions { items =>
>         collectionUtils.takeOrdered(items, num)(ord)
>       }.repartition(1).mapPartitions { items =>
>         collectionUtils.takeOrdered(items, num)(ord)
>       }.collect()
>     }
>   }
> }
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
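The driver-side strategy in the proposed snippet can be exercised without a Spark cluster: each "partition" fills a bounded queue, and the queues are folded together exactly as `mapRDDs.reduce` would. Below is a minimal plain-Scala sketch; `BoundedTopK` is a hypothetical stand-in for Spark's internal `BoundedPriorityQueue`, not the real class.

```scala
import scala.collection.mutable

// Toy stand-in for Spark's internal BoundedPriorityQueue: keeps the `num`
// smallest elements under `ord` by evicting the current largest.
class BoundedTopK[T](num: Int)(implicit ord: Ordering[T]) {
  // max-heap with respect to `ord`, so `head` is the largest retained element
  private val heap = mutable.PriorityQueue.empty[T](ord)

  def +=(x: T): this.type = {
    if (heap.size < num) heap.enqueue(x)
    else if (ord.lt(x, heap.head)) { heap.dequeue(); heap.enqueue(x) }
    this
  }

  // Merge another bounded queue into this one (mirrors queue1 ++= queue2)
  def ++=(other: BoundedTopK[T]): this.type = { other.heap.foreach(this += _); this }

  def toSortedArray: Array[T] = heap.toArray.sorted(ord)
}

object TakeOrderedSketch {
  // "Driver-side" merge: one bounded queue per partition, folded together,
  // mirroring the mapRDDs.reduce branch of the snippet above.
  def takeOrdered[T](partitions: Seq[Seq[T]], num: Int)(implicit ord: Ordering[T]): Array[T] = {
    val perPartition = partitions.map { items =>
      val q = new BoundedTopK[T](num); items.foreach(q += _); q
    }
    perPartition.reduce(_ ++= _).toSortedArray
  }
}
```

Each queue never holds more than `num` elements, which is why the driver-side merge is only a problem when `num` itself is large or there are very many partitions.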
[jira] [Assigned] (SPARK-32212) RDD.takeOrdered can choose to merge intermediate results in executor or driver
[ https://issues.apache.org/jira/browse/SPARK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32212: Assignee: Apache Spark > RDD.takeOrdered can choose to merge intermediate results in executor or driver > -- > > Key: SPARK-32212 > URL: https://issues.apache.org/jira/browse/SPARK-32212 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Chen Zhang >Assignee: Apache Spark >Priority: Major > > In the list of issues, I saw some discussions about exceeding the memory > limit of the driver when using _RDD.takeOrdered_ or _SQL(order by xx limit > xx)_. I think that the implementation of _RDD.takeOrdered_ can be improved. > In the original code implementation of _RDD.takeOrdered_, the QuickSelect > algorithm in guava is used in the executor process to calculate the local > TopK results of each RDD partition. These intermediate results are packaged > into java.util.PriorityQueue and returned to the driver process. In the > driver process, these intermediate results are merged to get the global TopK > results. > The problem with this implementation is that if the intermediate results are > too large and too many partitions, the intermediate results may accumulate in > the memory of the driver process, causing excessive memory pressure. > We can use an optional config to determine whether the intermediate > results(local TopK) of each partition in _RDD.takeOrdered_ will be merged in > driver process or executor process. If set to true, merge in driver > process(by util.PriorityQueue), which will get shorter waiting time for > return. But if the intermediate results are too large and too many > partitions, the intermediate results may accumulate in the memory of the > driver process, causing excessive memory pressure. If set to false, merge in > executor process(by guava.QuickSelect), intermediate results will not > accumulate in memory, but will cause longer runtimes. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32212) RDD.takeOrdered can choose to merge intermediate results in executor or driver
[ https://issues.apache.org/jira/browse/SPARK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated SPARK-32212: --- Summary: RDD.takeOrdered can choose to merge intermediate results in executor or driver (was: RDD.takeOrdered merge intermediate results can be configured in driver or executor) > RDD.takeOrdered can choose to merge intermediate results in executor or driver > -- > > Key: SPARK-32212 > URL: https://issues.apache.org/jira/browse/SPARK-32212 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Chen Zhang >Priority: Major > > In the list of issues, I saw some discussions about exceeding the memory > limit of the driver when using _RDD.takeOrdered_ or _SQL(order by xx limit > xx)_. I think that the implementation of _RDD.takeOrdered_ can be improved. > In the original code implementation of _RDD.takeOrdered_, the QuickSelect > algorithm in guava is used in the executor process to calculate the local > TopK results of each RDD partition. These intermediate results are packaged > into java.util.PriorityQueue and returned to the driver process. In the > driver process, these intermediate results are merged to get the global TopK > results. > The problem with this implementation is that if the intermediate results are > too large and too many partitions, the intermediate results may accumulate in > the memory of the driver process, causing excessive memory pressure. > We can use an optional config to determine whether the intermediate > results(local TopK) of each partition in _RDD.takeOrdered_ will be merged in > driver process or executor process. If set to true, merge in driver > process(by util.PriorityQueue), which will get shorter waiting time for > return. But if the intermediate results are too large and too many > partitions, the intermediate results may accumulate in the memory of the > driver process, causing excessive memory pressure. 
If set to false, merge in executor process (by guava.QuickSelect); intermediate results will not accumulate in memory, but will cause longer runtimes.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations
[ https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152977#comment-17152977 ] Apache Spark commented on SPARK-32163: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/29027 > Nested pruning should still work for nested column extractors of attributes > with cosmetic variations > > > Key: SPARK-32163 > URL: https://issues.apache.org/jira/browse/SPARK-32163 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.1.0 > > > If the expressions extracting nested fields have cosmetic variations like > qualifier difference, currently nested column pruning cannot work well. > For example, two attributes which are semantically the same, are referred in > a query, but the nested column extractors of them are treated differently > when we deal with nested column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations
[ https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32163. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28988 [https://github.com/apache/spark/pull/28988] > Nested pruning should still work for nested column extractors of attributes > with cosmetic variations > > > Key: SPARK-32163 > URL: https://issues.apache.org/jira/browse/SPARK-32163 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.1.0 > > > If the expressions extracting nested fields have cosmetic variations like > qualifier difference, currently nested column pruning cannot work well. > For example, two attributes which are semantically the same, are referred in > a query, but the nested column extractors of them are treated differently > when we deal with nested column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
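The "cosmetic variation" problem SPARK-32163 describes can be modeled in a few lines. The names below (`AttrRef`, `semanticEquals`, `dedupSemantic`) are illustrative only, not Spark's actual internal classes: the point is that two attribute references differing only in qualifier must compare equal when deciding which nested fields to prune.

```scala
// Toy model: an attribute reference may carry an optional qualifier such as
// a table alias. `t1.item` and plain `item` with the same expression id are
// semantically the same column, so pruning bookkeeping must ignore the
// qualifier even though structural (case-class) equality does not.
case class AttrRef(name: String, exprId: Long, qualifier: Option[String]) {
  def semanticEquals(other: AttrRef): Boolean =
    name == other.name && exprId == other.exprId
}

object NestedPruningSketch {
  // Deduplicate extractor roots semantically before collecting used fields,
  // so cosmetic variants of the same attribute do not defeat pruning.
  def dedupSemantic(attrs: Seq[AttrRef]): Seq[AttrRef] =
    attrs.foldLeft(Vector.empty[AttrRef]) { (acc, a) =>
      if (acc.exists(_.semanticEquals(a))) acc else acc :+ a
    }
}
```

With structural equality the two variants would be counted as two distinct extractor roots and the pruning rule would conservatively keep the whole struct; semantic deduplication collapses them into one.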
[jira] [Created] (SPARK-32212) RDD.takeOrdered merge intermediate results can be configured in driver or executor
Chen Zhang created SPARK-32212: -- Summary: RDD.takeOrdered merge intermediate results can be configured in driver or executor Key: SPARK-32212 URL: https://issues.apache.org/jira/browse/SPARK-32212 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Chen Zhang

In the list of issues, I saw some discussions about exceeding the memory limit of the driver when using _RDD.takeOrdered_ or _SQL(order by xx limit xx)_. I think the implementation of _RDD.takeOrdered_ can be improved.

In the current implementation of _RDD.takeOrdered_, the QuickSelect algorithm from guava is used in the executor process to compute the local TopK results of each RDD partition. These intermediate results are packaged into a java.util.PriorityQueue and returned to the driver process, where they are merged to get the global TopK results.

The problem with this implementation is that if the intermediate results are large and there are many partitions, they may accumulate in the memory of the driver process, causing excessive memory pressure.

We can use an optional config to determine whether the intermediate results (local TopK) of each partition in _RDD.takeOrdered_ are merged in the driver process or in the executor process. If set to true, merging happens in the driver (via util.PriorityQueue), which returns results sooner; but if the intermediate results are large and there are many partitions, they may accumulate in driver memory, causing excessive memory pressure. If set to false, merging happens in an executor (via guava.QuickSelect); intermediate results will not accumulate in driver memory, but runtimes will be longer.
something like: _(org.apache.spark.rdd.RDD)_

{code:scala}
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  if (num == 0 || partitions.length == 0) {
    Array.empty
  } else {
    if (conf.getBoolean("spark.rdd.takeOrdered.mergeInDriver", true)) {
      val mapRDDs = mapPartitions { items =>
        // Priority keeps the largest elements, so let's reverse the ordering.
        val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
        queue ++= collectionUtils.takeOrdered(items, num)(ord)
        Iterator.single(queue)
      }
      mapRDDs.reduce { (queue1, queue2) =>
        queue1 ++= queue2
        queue1
      }.toArray.sorted(ord)
    } else {
      mapPartitions { items =>
        collectionUtils.takeOrdered(items, num)(ord)
      }.repartition(1).mapPartitions { items =>
        collectionUtils.takeOrdered(items, num)(ord)
      }.collect()
    }
  }
}
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32174) toPandas attempted Arrow optimization but has reached an error and can not continue
[ https://issues.apache.org/jira/browse/SPARK-32174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152958#comment-17152958 ] Bryan Cutler commented on SPARK-32174: --

From the stacktrace, it looks like you are using JDK 9 or above, on which Arrow (really Netty) needs the JVM system property {{io.netty.tryReflectionSetAccessible}} set to true; see https://issues.apache.org/jira/browse/SPARK-29923 and the release notes. Could you confirm whether this solves your issue?

> toPandas attempted Arrow optimization but has reached an error and can not continue
> ---
>
> Key: SPARK-32174
> URL: https://issues.apache.org/jira/browse/SPARK-32174
> Project: Spark
> Issue Type: Bug
> Components: Optimizer, PySpark
> Affects Versions: 3.0.0
> Environment: Spark 3.0.0, running in *stand-alone* mode
> Reporter: Ramin Hazegh
> Priority: Major
>
> h4. Converting a DataFrame to a pandas DataFrame using toPandas() fails.
>
> *Spark 3.0.0 running in stand-alone mode* using docker containers based on the jupyter docker stack here:
> [https://github.com/jupyter/docker-stacks/blob/master/pyspark-notebook/Dockerfile]
>
> $ conda list | grep arrow
> *arrow-cpp 0.17.1* py38h1234567_5_cpu conda-forge
> *pyarrow 0.17.1* py38h1234567_5_cpu conda-forge
> $ conda list | grep pandas
> *pandas 1.0.5* py38hcb8c335_0 conda-forge
>
> *To reproduce:*
> {code:java}
> import numpy as np
> import pandas as pd
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("spark://10.0.1.40:7077") \
>     .config("spark.sql.execution.arrow.enabled", "true") \
>     .appName('test_arrow') \
>     .getOrCreate()
>
> # Generate a pandas DataFrame
> pdf = pd.DataFrame(np.random.rand(100, 3))
> # Create a Spark DataFrame from a pandas DataFrame using Arrow
> df = spark.createDataFrame(pdf)
> # Convert the Spark DataFrame back to a pandas DataFrame using Arrow
> result_pdf = df.select("*").toPandas()
> {code}
>
> ==
> /usr/local/spark/python/pyspark/sql/pandas/conversion.py:134:
UserWarning: > toPandas attempted Arrow optimization because > 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached > the error below and can not continue. Note that > 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect > on failures in the middle of computation. > An error occurred while calling o55.getResult. > : org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:302) > at > org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:88) > at > org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:84) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:282) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:238) > at java.base/java.lang.Thread.run(Thread.java:834) > Caused by: org.apache.spark.SparkException: Job aborted due to stage > failure: Task 14 in stage 0.0 failed 4 times, most recent failure: Lost task > 14.3 in stage 0.0 (TID 31, 10.0.1.43, executor 0): > java.lang.UnsupportedOperationException: sun.misc.Unsafe or > java.nio.DirectByteBuffer.(long, int) not available > at > io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:490) > at io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243) > at 
io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233) > at io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245) > at org.apache.arrow.vector.ipc.ReadChannel.readFully(ReadChannel.java:81) > at > org.apache.arrow.vector.ipc.message.MessageSerializer.readMessageBody(MessageSerializer.java:696) > at > org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:344) > at > org.apache.spark.sql.execution.arrow.ArrowConverters$.loadBatch(ArrowConverters.scala:189) > at > org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.nextBatch(ArrowConverters.scala:165) > at > org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.(ArrowConverters.scala:144) >
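The property Bryan Cutler's comment refers to is a JVM flag, so on JDK 9+ it must be passed to both the driver and executor JVMs (per SPARK-29923 and the Spark 3.0 release notes). One way to set it, sketched as a `spark-defaults.conf` fragment; adjust to however your deployment passes JVM options:

```properties
# spark-defaults.conf -- required for Arrow/Netty on JDK 9+ (SPARK-29923)
spark.driver.extraJavaOptions    -Dio.netty.tryReflectionSetAccessible=true
spark.executor.extraJavaOptions  -Dio.netty.tryReflectionSetAccessible=true
```

The same `-D` flag can equally be supplied via `--conf` on `spark-submit` or through the notebook's Spark configuration.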
[jira] [Updated] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations
[ https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32163: -- Description: If the expressions extracting nested fields have cosmetic variations like qualifier difference, currently nested column pruning cannot work well. For example, two attributes which are semantically the same, are referred in a query, but the nested column extractors of them are treated differently when we deal with nested column pruning. was: Note that this is just an optimization issue and not a regression. The newly introduced optimizer doesn't optimize this corner case. If the expressions extracting nested fields have cosmetic variations like qualifier difference, currently nested column pruning cannot work well. For example, two attributes which are semantically the same, are referred in a query, but the nested column extractors of them are treated differently when we deal with nested column pruning. > Nested pruning should still work for nested column extractors of attributes > with cosmetic variations > > > Key: SPARK-32163 > URL: https://issues.apache.org/jira/browse/SPARK-32163 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > If the expressions extracting nested fields have cosmetic variations like > qualifier difference, currently nested column pruning cannot work well. > For example, two attributes which are semantically the same, are referred in > a query, but the nested column extractors of them are treated differently > when we deal with nested column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31317) Add withField method to Column class
[ https://issues.apache.org/jira/browse/SPARK-31317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152894#comment-17152894 ] fqaiser94 commented on SPARK-31317: --- Done. > Add withField method to Column class > > > Key: SPARK-31317 > URL: https://issues.apache.org/jira/browse/SPARK-31317 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: DB Tsai >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32211) Pin mariadb-plugin-gssapi-server version to fix MariaDBKrbIntegrationSuite
[ https://issues.apache.org/jira/browse/SPARK-32211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32211: -- Summary: Pin mariadb-plugin-gssapi-server version to fix MariaDBKrbIntegrationSuite (was: MariaDBKrbIntegrationSuite fails because of unwanted server upgrade) > Pin mariadb-plugin-gssapi-server version to fix MariaDBKrbIntegrationSuite > -- > > Key: SPARK-32211 > URL: https://issues.apache.org/jira/browse/SPARK-32211 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.1.0 > > > The test fails with the following error: > {code:java} > 2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' > user: 'unauthenticated' host: '172.17.0.1' (This connection closed normally > without authentication) > {code} > This is because the docker image contains MariaDB version > 1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and > mariadb-plugin-gssapi-server installation triggered unwanted database upgrade > inside the docker image. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32211) MariaDBKrbIntegrationSuite fails because of unwanted server upgrade
[ https://issues.apache.org/jira/browse/SPARK-32211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32211: - Assignee: Gabor Somogyi > MariaDBKrbIntegrationSuite fails because of unwanted server upgrade > --- > > Key: SPARK-32211 > URL: https://issues.apache.org/jira/browse/SPARK-32211 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > > The test fails with the following error: > {code:java} > 2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' > user: 'unauthenticated' host: '172.17.0.1' (This connection closed normally > without authentication) > {code} > This is because the docker image contains MariaDB version > 1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and > mariadb-plugin-gssapi-server installation triggered unwanted database upgrade > inside the docker image. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32211) MariaDBKrbIntegrationSuite fails because of unwanted server upgrade
[ https://issues.apache.org/jira/browse/SPARK-32211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32211. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29025 [https://github.com/apache/spark/pull/29025] > MariaDBKrbIntegrationSuite fails because of unwanted server upgrade > --- > > Key: SPARK-32211 > URL: https://issues.apache.org/jira/browse/SPARK-32211 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.1.0 > > > The test fails with the following error: > {code:java} > 2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' > user: 'unauthenticated' host: '172.17.0.1' (This connection closed normally > without authentication) > {code} > This is because the docker image contains MariaDB version > 1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and > mariadb-plugin-gssapi-server installation triggered unwanted database upgrade > inside the docker image. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
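The fix described in the summary, pinning `mariadb-plugin-gssapi-server`, amounts to forcing apt to install the version that matches the server already baked into the image, so the plugin install cannot drag in a newer `mariadb-server`. A hypothetical Dockerfile sketch; the version string is the known-good one from the ticket, the rest is illustrative and not necessarily the actual test image:

```dockerfile
# Pin the GSSAPI plugin to the release already in the image so that
# installing it cannot trigger a transitive MariaDB server upgrade.
RUN apt-get update && apt-get install -y \
    mariadb-plugin-gssapi-server=1:10.4.12+maria~bionic
```

An apt preferences file pinning the whole `1:10.4.12+maria~bionic` release would achieve the same effect.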
[jira] [Resolved] (SPARK-31317) Add withField method to Column class
[ https://issues.apache.org/jira/browse/SPARK-31317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31317. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27066 [https://github.com/apache/spark/pull/27066] > Add withField method to Column class > > > Key: SPARK-31317 > URL: https://issues.apache.org/jira/browse/SPARK-31317 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: DB Tsai >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32018) Fix UnsafeRow set overflowed decimal
[ https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152871#comment-17152871 ] Apache Spark commented on SPARK-32018: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29026 > Fix UnsafeRow set overflowed decimal > > > Key: SPARK-32018 > URL: https://issues.apache.org/jira/browse/SPARK-32018 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Allison Wang >Priority: Major > > There is a bug that writing an overflowed decimal into UnsafeRow is fine but > reading it out will throw ArithmeticException. This exception is thrown when > calling {{getDecimal}} in UnsafeRow with input decimal's precision greater > than the input precision. Setting the value of the overflowed decimal to null > when writing into UnsafeRow should fix this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
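The failure mode above, a value written without complaint that then throws on read at the declared precision, can be mimicked with plain `java.math.BigDecimal`. This is an analogy only, not Spark's actual `UnsafeRow` code: `RoundingMode.UNNECESSARY` plays the role of `getDecimal`'s strict precision check, and returning `None` models the proposed "write null on overflow" fix.

```scala
import java.math.{BigDecimal => JBigDecimal, MathContext, RoundingMode}

object OverflowSketch {
  // Returns Some(v) if `v` fits in `precision` significant digits without
  // rounding, None otherwise -- roughly what "set the overflowed decimal to
  // null when writing" means, instead of failing later on read.
  def fitOrNull(v: JBigDecimal, precision: Int): Option[JBigDecimal] =
    try Some(v.round(new MathContext(precision, RoundingMode.UNNECESSARY)))
    catch { case _: ArithmeticException => None }
}
```

With UNNECESSARY rounding, `round` returns the value untouched when it already fits and throws `ArithmeticException` when any digit would have to be discarded, which is the same exception class the reader hits in the bug report.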
[jira] [Assigned] (SPARK-32018) Fix UnsafeRow set overflowed decimal
[ https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32018: Assignee: (was: Apache Spark) > Fix UnsafeRow set overflowed decimal > > > Key: SPARK-32018 > URL: https://issues.apache.org/jira/browse/SPARK-32018 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Allison Wang >Priority: Major > > There is a bug that writing an overflowed decimal into UnsafeRow is fine but > reading it out will throw ArithmeticException. This exception is thrown when > calling {{getDecimal}} in UnsafeRow with input decimal's precision greater > than the input precision. Setting the value of the overflowed decimal to null > when writing into UnsafeRow should fix this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32018) Fix UnsafeRow set overflowed decimal
[ https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32018: Assignee: Apache Spark > Fix UnsafeRow set overflowed decimal > > > Key: SPARK-32018 > URL: https://issues.apache.org/jira/browse/SPARK-32018 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Allison Wang >Assignee: Apache Spark >Priority: Major > > There is a bug that writing an overflowed decimal into UnsafeRow is fine but > reading it out will throw ArithmeticException. This exception is thrown when > calling {{getDecimal}} in UnsafeRow with input decimal's precision greater > than the input precision. Setting the value of the overflowed decimal to null > when writing into UnsafeRow should fix this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32018) Fix UnsafeRow set overflowed decimal
[ https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152869#comment-17152869 ] Apache Spark commented on SPARK-32018: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29026 > Fix UnsafeRow set overflowed decimal > > > Key: SPARK-32018 > URL: https://issues.apache.org/jira/browse/SPARK-32018 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Allison Wang >Priority: Major > > There is a bug that writing an overflowed decimal into UnsafeRow is fine but > reading it out will throw ArithmeticException. This exception is thrown when > calling {{getDecimal}} in UnsafeRow with input decimal's precision greater > than the input precision. Setting the value of the overflowed decimal to null > when writing into UnsafeRow should fix this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled
[ https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152867#comment-17152867 ] Apache Spark commented on SPARK-28067: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29026

> Incorrect results in decimal aggregation with whole-stage code gen enabled
> --
>
> Key: SPARK-28067
> URL: https://issues.apache.org/jira/browse/SPARK-28067
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
> Reporter: Mark Sirek
> Assignee: Sunitha Kambhampati
> Priority: Critical
> Labels: correctness
> Fix For: 3.1.0
>
> The following test case involving a join followed by a sum aggregation returns the wrong answer for the sum:
>
> {code:java}
> val df = Seq(
>   (BigDecimal("1000"), 1),
>   (BigDecimal("1000"), 1),
>   (BigDecimal("1000"), 2),
>   (BigDecimal("1000"), 2),
>   (BigDecimal("1000"), 2),
>   (BigDecimal("1000"), 2),
>   (BigDecimal("1000"), 2),
>   (BigDecimal("1000"), 2),
>   (BigDecimal("1000"), 2),
>   (BigDecimal("1000"), 2),
>   (BigDecimal("1000"), 2),
>   (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
> scala> df2.show(40,false)
> ---
> sum(decNum)
> ---
> 4000.00
> ---
> {code}
>
> The result should be 104000..
> It appears a partial sum is computed for each join key, as the result returned would be the answer for all rows matching intNum === 1.
> If only the rows with intNum === 2 are included, the answer given is null:
>
> {code:java}
> scala> val df3 = df.filter($"intNum" === lit(2))
> df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: decimal(38,18), intNum: int]
> scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, "intNum").agg(sum("decNum"))
> df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
> scala> df4.show(40,false)
> ---
> sum(decNum)
> ---
> null
> ---
> {code}
>
> The correct answer, 10., doesn't fit in the DataType picked for the result, decimal(38,18), so an overflow occurs, which Spark then converts to null.
> The first example, which doesn't filter out the intNum === 1 values, should also return null, indicating overflow, but it doesn't. This may mislead the user into thinking a valid sum was computed.
> If whole-stage code gen is turned off:
> spark.conf.set("spark.sql.codegen.wholeStage", false)
> ... incorrect results are not returned because the overflow is caught as an exception:
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 exceeds max precision 38

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations
[ https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32163: -- Description: Note that this is just an optimization issue and not a regression. The newly introduced optimizer doesn't optimize this corner case. If the expressions extracting nested fields have cosmetic variations like qualifier difference, currently nested column pruning cannot work well. For example, two attributes which are semantically the same, are referred in a query, but the nested column extractors of them are treated differently when we deal with nested column pruning. was: If the expressions extracting nested fields have cosmetic variations like qualifier difference, currently nested column pruning cannot work well. For example, two attributes which are semantically the same, are referred in a query, but the nested column extractors of them are treated differently when we deal with nested column pruning. > Nested pruning should still work for nested column extractors of attributes > with cosmetic variations > > > Key: SPARK-32163 > URL: https://issues.apache.org/jira/browse/SPARK-32163 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Note that this is just an optimization issue and not a regression. The newly > introduced optimizer doesn't optimize this corner case. > If the expressions extracting nested fields have cosmetic variations like > qualifier difference, currently nested column pruning cannot work well. > For example, two attributes which are semantically the same, are referred in > a query, but the nested column extractors of them are treated differently > when we deal with nested column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
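As a rough illustration of what "cosmetic variations" means for pruning, the sketch below groups nested-field extractors after normalizing away the qualifier, so that `a.b` and `tbl.a.b` are treated as the same column. The names and the normalization rule are illustrative assumptions, not Spark's internal API:

```python
# Hedged sketch of the idea behind the fix: canonicalize extractors before
# grouping them for nested-column pruning, so two semantically-equal
# attributes that differ only in qualifier collapse into one group.
def canonical(extractor: str) -> str:
    # keep only the last two path segments: attribute name and nested field
    return ".".join(extractor.split(".")[-2:])

extractors = ["tbl.a.b", "a.b", "other.a.c"]
groups = {}
for e in extractors:
    groups.setdefault(canonical(e), []).append(e)

print(sorted(groups))  # ['a.b', 'a.c'] -- the two qualified forms collapse
```

Without the canonicalization step, `tbl.a.b` and `a.b` would land in separate groups and pruning would conservatively keep the whole struct.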
[jira] [Resolved] (SPARK-32164) GeneralizedLinearRegressionSummary optimization
[ https://issues.apache.org/jira/browse/SPARK-32164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao resolved SPARK-32164. Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28990 [https://github.com/apache/spark/pull/28990] > GeneralizedLinearRegressionSummary optimization > --- > > Key: SPARK-32164 > URL: https://issues.apache.org/jira/browse/SPARK-32164 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.1.0 > > > compute several statistics on single pass -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
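The one-line description ("compute several statistics on single pass") refers to folding multiple aggregates into one traversal of the data. A generic single-pass mean/variance sketch (Welford's algorithm) illustrates the pattern; this is not the GLR summary code itself:

```python
# Hedged illustration of the single-pass idea: Welford's algorithm maintains
# count, mean, and the sum of squared deviations (m2) together, so mean and
# sample variance come out of one traversal instead of two.
def single_pass_stats(xs):
    count, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance

print(single_pass_stats([2.0, 4.0, 6.0]))  # (3, 4.0, 4.0)
```

In Spark the same shape maps naturally onto a treeAggregate, which is why fusing the statistics into one pass saves whole scans of the dataset.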
[jira] [Assigned] (SPARK-32164) GeneralizedLinearRegressionSummary optimization
[ https://issues.apache.org/jira/browse/SPARK-32164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao reassigned SPARK-32164: -- Assignee: zhengruifeng > GeneralizedLinearRegressionSummary optimization > --- > > Key: SPARK-32164 > URL: https://issues.apache.org/jira/browse/SPARK-32164 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > compute several statistics on single pass -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.
[ https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-30985: Description: SPARK_CONF_DIR hosts configuration files like, 1) spark-defaults.conf - containing all the spark properties. 2) log4j.properties - Logger configuration. 3) spark-env.sh - Environment variables to be setup at driver and executor. 4) core-site.xml - Hadoop related configuration. 5) fairscheduler.xml - Spark's fair scheduling policy at the job level. 6) metrics.properties - Spark metrics. 7) Any user specific - library or framework specific configuration file. Traditionally, SPARK_CONF_DIR has been the home to all user specific configuration files. So this feature, will let the user specific configuration files be mounted on the driver and executor pods' SPARK_CONF_DIR. Please review the attached design doc, for more details. [https://docs.google.com/document/d/1DUmNqMz5ky55yfegdh4e_CeItM_nqtrglFqFxsTxeeA/edit?usp=sharing] was: SPARK_CONF_DIR hosts configuration files like, 1) spark-defaults.conf - containing all the spark properties. 2) log4j.properties - Logger configuration. 3) spark-env.sh - Environment variables to be setup at driver and executor. 4) core-site.xml - Hadoop related configuration. 5) fairscheduler.xml - Spark's fair scheduling policy at the job level. 6) metrics.properties - Spark metrics. 7) Any user specific - library or framework specific configuration file. Traditionally, SPARK_CONF_DIR has been the home to all user specific configuration files. So this feature, will let the user specific configuration files be mounted on the driver and executor pods' SPARK_CONF_DIR. Please review the attached design doc, for more details. https://drive.google.com/file/d/1p6gaJyOJdlB1rosJDFner3bj5VekTCJ3/view?usp=sharing > Propagate SPARK_CONF_DIR files to driver and exec pods. 
> --- > > Key: SPARK-30985 > URL: https://issues.apache.org/jira/browse/SPARK-30985 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > > SPARK_CONF_DIR hosts configuration files like, > 1) spark-defaults.conf - containing all the spark properties. > 2) log4j.properties - Logger configuration. > 3) spark-env.sh - Environment variables to be setup at driver and executor. > 4) core-site.xml - Hadoop related configuration. > 5) fairscheduler.xml - Spark's fair scheduling policy at the job level. > 6) metrics.properties - Spark metrics. > 7) Any user specific - library or framework specific configuration file. > Traditionally, SPARK_CONF_DIR has been the home to all user specific > configuration files. > So this feature, will let the user specific configuration files be mounted on > the driver and executor pods' SPARK_CONF_DIR. > Please review the attached design doc, for more details. > > [https://docs.google.com/document/d/1DUmNqMz5ky55yfegdh4e_CeItM_nqtrglFqFxsTxeeA/edit?usp=sharing] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32047) Add provider disable possibility just like in delegation token provider
[ https://issues.apache.org/jira/browse/SPARK-32047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152784#comment-17152784 ] Gabor Somogyi commented on SPARK-32047: --- Started to work on this. > Add provider disable possibility just like in delegation token provider > --- > > Key: SPARK-32047 > URL: https://issues.apache.org/jira/browse/SPARK-32047 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > There is an enable flag in the delegation provider area, > "spark.security.credentials.%s.enabled". > It would be good to add a similar flag to the JDBC secure connection provider area, > because this would make embedded providers interchangeable (an embedded provider can be > turned off and another provider w/ a different name can be registered). This > makes sense only if we create an API for the secure JDBC connection provider. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
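To make the proposal concrete, the sketch below models the existing delegation-token pattern the ticket points at: a provider is skipped when its flag is explicitly set to false. The provider names and the exact key format for JDBC providers are illustrative assumptions, not the merged design:

```python
# Hedged sketch modeled on the delegation token flag
# "spark.security.credentials.%s.enabled": an embedded provider can be
# turned off via config and replaced by one registered under another name.
FLAG = "spark.security.credentials.%s.enabled"

def enabled_providers(providers, conf):
    # a provider stays enabled unless its flag is explicitly "false"
    return [p for p in providers if conf.get(FLAG % p, "true") != "false"]

conf = {FLAG % "mariadb": "false"}   # turn the embedded provider off
print(enabled_providers(["mariadb", "postgres"], conf))  # ['postgres']
```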
[jira] [Resolved] (SPARK-32209) Re-use GetTimestamp in ParseToDate
[ https://issues.apache.org/jira/browse/SPARK-32209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32209. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28999 [https://github.com/apache/spark/pull/28999] > Re-use GetTimestamp in ParseToDate > -- > > Key: SPARK-32209 > URL: https://issues.apache.org/jira/browse/SPARK-32209 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.1.0 > > > Replace the combination of expressions SecondsToTimestamp and UnixTimestamp > by GetTimestamp in ParseToDate. This will allow to eliminate unnecessary > parsing overhead in: string -> timestamp -> long (seconds) -> timestamp -> > date. After the changes, the chain will be: string -> timestamp -> date. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
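The optimization above shortens the expression chain from string -> timestamp -> seconds -> timestamp -> date down to string -> timestamp -> date. A plain-Python analogy (not Spark's expression tree) shows that both chains land on the same date, so dropping the round-trip through seconds is safe:

```python
from datetime import datetime

# Hedged illustration: the old ParseToDate plan effectively parsed the string,
# converted to epoch seconds, rebuilt a timestamp, and then took the date;
# reusing GetTimestamp takes the date directly from the parsed timestamp.
s = "2020-07-07 12:43:13"
ts = datetime.strptime(s, "%Y-%m-%d %H:%M:%S")       # string -> timestamp
old = datetime.fromtimestamp(ts.timestamp()).date()  # -> seconds -> timestamp -> date
new = ts.date()                                      # timestamp -> date
print(old == new)  # same result with one parse and no seconds round-trip
```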
[jira] [Assigned] (SPARK-32209) Re-use GetTimestamp in ParseToDate
[ https://issues.apache.org/jira/browse/SPARK-32209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32209: - Assignee: Maxim Gekk > Re-use GetTimestamp in ParseToDate > -- > > Key: SPARK-32209 > URL: https://issues.apache.org/jira/browse/SPARK-32209 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > Replace the combination of expressions SecondsToTimestamp and UnixTimestamp > by GetTimestamp in ParseToDate. This will allow to eliminate unnecessary > parsing overhead in: string -> timestamp -> long (seconds) -> timestamp -> > date. After the changes, the chain will be: string -> timestamp -> date. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32211) MariaDBKrbIntegrationSuite fails because of unwanted server upgrade
[ https://issues.apache.org/jira/browse/SPARK-32211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-32211: -- Description: The test fails with the following error: {code:java} 2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' user: 'unauthenticated' host: '172.17.0.1' (This connection closed normally without authentication) {code} This is because the docker image contains MariaDB version 1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and mariadb-plugin-gssapi-server installation triggered unwanted database upgrade inside the docker image. was: The test fails with the following error: {code:java} 2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' user: 'unauthenticated' host: '172.17.0.1' (This connection closed normally without authentication) {code} This is because the docker image contains MariaDB version 1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and mariadb-plugin-gssapi-server installation triggered database upgrade inside the docker image. > MariaDBKrbIntegrationSuite fails because of unwanted server upgrade > --- > > Key: SPARK-32211 > URL: https://issues.apache.org/jira/browse/SPARK-32211 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > The test fails with the following error: > {code:java} > 2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' > user: 'unauthenticated' host: '172.17.0.1' (This connection closed normally > without authentication) > {code} > This is because the docker image contains MariaDB version > 1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and > mariadb-plugin-gssapi-server installation triggered unwanted database upgrade > inside the docker image. 
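One minimal way to avoid such an unwanted upgrade is to pin the plugin package to the server's exact version when building the test image. This Dockerfile fragment is an assumed sketch of that approach, not necessarily the fix that was merged:

```dockerfile
# Assumed sketch: pin mariadb-plugin-gssapi-server to the server's package
# version so apt does not pull in a newer mariadb-server and trigger an
# in-image database upgrade.
ARG MARIADB_VERSION=1:10.4.12+maria~bionic
RUN apt-get update && \
    apt-get install -y mariadb-plugin-gssapi-server=${MARIADB_VERSION}
```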
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32211) MariaDBKrbIntegrationSuite fails because of unwanted server upgrade
[ https://issues.apache.org/jira/browse/SPARK-32211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152763#comment-17152763 ] Apache Spark commented on SPARK-32211: -- User 'gaborgsomogyi' has created a pull request for this issue: https://github.com/apache/spark/pull/29025 > MariaDBKrbIntegrationSuite fails because of unwanted server upgrade > --- > > Key: SPARK-32211 > URL: https://issues.apache.org/jira/browse/SPARK-32211 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > The test fails with the following error: > {code:java} > 2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' > user: 'unauthenticated' host: '172.17.0.1' (This connection closed normally > without authentication) > {code} > This is because the docker image contains MariaDB version > 1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and > mariadb-plugin-gssapi-server installation triggered database upgrade inside > the docker image. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32211) MariaDBKrbIntegrationSuite fails because of unwanted server upgrade
[ https://issues.apache.org/jira/browse/SPARK-32211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32211: Assignee: (was: Apache Spark) > MariaDBKrbIntegrationSuite fails because of unwanted server upgrade > --- > > Key: SPARK-32211 > URL: https://issues.apache.org/jira/browse/SPARK-32211 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > The test fails with the following error: > {code:java} > 2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' > user: 'unauthenticated' host: '172.17.0.1' (This connection closed normally > without authentication) > {code} > This is because the docker image contains MariaDB version > 1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and > mariadb-plugin-gssapi-server installation triggered database upgrade inside > the docker image. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32211) MariaDBKrbIntegrationSuite fails because of unwanted server upgrade
[ https://issues.apache.org/jira/browse/SPARK-32211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32211: Assignee: Apache Spark > MariaDBKrbIntegrationSuite fails because of unwanted server upgrade > --- > > Key: SPARK-32211 > URL: https://issues.apache.org/jira/browse/SPARK-32211 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Apache Spark >Priority: Major > > The test fails with the following error: > {code:java} > 2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' > user: 'unauthenticated' host: '172.17.0.1' (This connection closed normally > without authentication) > {code} > This is because the docker image contains MariaDB version > 1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and > mariadb-plugin-gssapi-server installation triggered database upgrade inside > the docker image. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32211) MariaDBKrbIntegrationSuite fails because of unwanted server upgrade
Gabor Somogyi created SPARK-32211: - Summary: MariaDBKrbIntegrationSuite fails because of unwanted server upgrade Key: SPARK-32211 URL: https://issues.apache.org/jira/browse/SPARK-32211 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Gabor Somogyi The test fails with the following error: {code:java} 2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' user: 'unauthenticated' host: '172.17.0.1' (This connection closed normally without authentication) {code} This is because the docker image contains MariaDB version 1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and mariadb-plugin-gssapi-server installation triggered database upgrade inside the docker image. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32210) Failed to serialize large MapStatuses
Yuming Wang created SPARK-32210: --- Summary: Failed to serialize large MapStatuses Key: SPARK-32210 URL: https://issues.apache.org/jira/browse/SPARK-32210 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.4 Reporter: Yuming Wang Driver side exception: {noformat} 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-3] spark.MapOutputTrackerMaster:91 : java.lang.NegativeArraySizeException at org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) at org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) at org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) at org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) at org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) at org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-5] spark.MapOutputTrackerMaster:91 : java.lang.NegativeArraySizeException at org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) at org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) at org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) at org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) at 
org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) at org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-2] spark.MapOutputTrackerMaster:91 : {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
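One plausible reading of the `NegativeArraySizeException` in `ByteArrayOutputStream.toByteArray` is a 32-bit size overflow: the stream tracks its length in a Java `int`, so once the serialized MapStatuses payload passes 2 GiB the count wraps negative and `new byte[count]` fails. A quick Python check of the wraparound (the 2.2 GB payload size is a hypothetical number for illustration):

```python
import ctypes

# Hedged illustration: what a Java int holds when a byte count exceeds
# Integer.MAX_VALUE (2^31 - 1) -- the wrapped value is negative, which is
# exactly the size a NegativeArraySizeException complains about.
size = 2_200_000_000                   # hypothetical payload > 2^31 - 1 bytes
as_int32 = ctypes.c_int32(size).value  # the value a Java int would hold
print(as_int32)                        # negative => negative array size
```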
[jira] [Assigned] (SPARK-31975) Throw user facing error when use WindowFunction directly
[ https://issues.apache.org/jira/browse/SPARK-31975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31975: --- Assignee: ulysses you > Throw user facing error when use WindowFunction directly > > > Key: SPARK-31975 > URL: https://issues.apache.org/jira/browse/SPARK-31975 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31975) Throw user facing error when use WindowFunction directly
[ https://issues.apache.org/jira/browse/SPARK-31975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31975. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28808 [https://github.com/apache/spark/pull/28808] > Throw user facing error when use WindowFunction directly > > > Key: SPARK-31975 > URL: https://issues.apache.org/jira/browse/SPARK-31975 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152738#comment-17152738 ] Lantao Jin edited comment on SPARK-29038 at 7/7/20, 1:14 PM: - Hi [~AidenZhang], our focus for MV in recent months has been on two areas. One is rewrite-algorithm optimization, such as forbidding count-distinct post-aggregation and avoiding unnecessary rewrites when doing relation replacement. The other is bug fixes in MV refresh, using a Spark listener to deliver metastore events that trigger the refresh. Some parts depend on third-party systems, so maybe only the interfaces will be available in community Spark. I haven't done partial/incremental refresh since it's not a blocker for us. I am not sure the community is still interested in the feature, but we are moving our existing implementation to Spark 3.0 now. was (Author: cltlfcjin): Hi [~AidenZhang], my focus for MV in recent months has been on two areas. One is rewrite-algorithm optimization, such as forbidding count-distinct post-aggregation and avoiding unnecessary rewrites when doing relation replacement. The other is bug fixes in MV refresh, using a Spark listener to deliver metastore events that trigger the refresh. Some parts depend on third-party systems, so maybe only the interfaces will be available in community Spark. I haven't done partial/incremental refresh since it's not a blocker for us. I am not sure the community is still interested in the feature, but we are moving our existing implementation to Spark 3.0 now. > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. 

The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152738#comment-17152738 ] Lantao Jin commented on SPARK-29038: Hi [~AidenZhang], my focus for MV in recent months has been on two areas. One is rewrite-algorithm optimization, such as forbidding count-distinct post-aggregation and avoiding unnecessary rewrites when doing relation replacement. The other is bug fixes in MV refresh, using a Spark listener to deliver metastore events that trigger the refresh. Some parts depend on third-party systems, so maybe only the interfaces will be available in community Spark. I haven't done partial/incremental refresh since it's not a blocker for us. I am not sure the community is still interested in the feature, but we are moving our existing implementation to Spark 3.0 now. > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32001: Assignee: Apache Spark > Create Kerberos authentication provider API in JDBC connector > - > > Key: SPARK-32001 > URL: https://issues.apache.org/jira/browse/SPARK-32001 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Apache Spark >Priority: Major > > Adding embedded provider to all the possible databases would generate high > maintenance cost on Spark side. > Instead an API can be introduced which would allow to implement further > providers independently. > One important requirement what I suggest is: JDBC connection providers must > be loaded independently just like delegation token providers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32001: Assignee: (was: Apache Spark) > Create Kerberos authentication provider API in JDBC connector > - > > Key: SPARK-32001 > URL: https://issues.apache.org/jira/browse/SPARK-32001 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > Adding embedded provider to all the possible databases would generate high > maintenance cost on Spark side. > Instead an API can be introduced which would allow to implement further > providers independently. > One important requirement what I suggest is: JDBC connection providers must > be loaded independently just like delegation token providers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152725#comment-17152725 ] Apache Spark commented on SPARK-32001: -- User 'gaborgsomogyi' has created a pull request for this issue: https://github.com/apache/spark/pull/29024 > Create Kerberos authentication provider API in JDBC connector > - > > Key: SPARK-32001 > URL: https://issues.apache.org/jira/browse/SPARK-32001 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > Adding embedded provider to all the possible databases would generate high > maintenance cost on Spark side. > Instead an API can be introduced which would allow to implement further > providers independently. > One important requirement what I suggest is: JDBC connection providers must > be loaded independently just like delegation token providers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32205) Writing timestamp in mysql gets fails
[ https://issues.apache.org/jira/browse/SPARK-32205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152683#comment-17152683 ] Nilesh Patil commented on SPARK-32205: -- [~JinxinTang] In MySQL there is a DATETIME datatype that accepts the *-12-31 23:59:59* value. Using Spark, is it possible to write to a DATETIME column? > Writing timestamp in mysql gets fails > -- > > Key: SPARK-32205 > URL: https://issues.apache.org/jira/browse/SPARK-32205 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.4 >Reporter: Nilesh Patil >Priority: Major > > When we are writing to mysql with TIMESTAMP column it supports only range > '1970-01-01 00:00:01' UTC to '2038-01-19 03:14:07'. Mysql has DATETIME > datatype which has 1000-01-01 00:00:00' to '-12-31 23:59:59' range. > How to map spark timestamp datatype to mysql datetime datatype in order to > use higher supporting range ? > [https://dev.mysql.com/doc/refman/5.7/en/datetime.html] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
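Spark's JDBC writer does expose a `createTableColumnTypes` option that overrides the column type in the CREATE TABLE statement Spark issues, which is the usual way to get a MySQL DATETIME column (wider range) instead of TIMESTAMP. The sketch below only builds the option map; the table and column names are illustrative assumptions, and the commented-out write requires a live SparkSession:

```python
# Hedged sketch: override the DDL type Spark uses when it creates the target
# table, so the timestamp column becomes DATETIME on the MySQL side.
jdbc_options = {
    "url": "jdbc:mysql://host:3306/db",        # assumed connection URL
    "dbtable": "events",                        # assumed table name
    "createTableColumnTypes": "event_time DATETIME",
}
# With a DataFrame `df` and a SparkSession in scope (not created here):
# df.write.format("jdbc").options(**jdbc_options).mode("overwrite").save()
print(jdbc_options["createTableColumnTypes"])
```

Note this only helps when Spark creates the table; writing into a pre-existing TIMESTAMP column still hits the narrower range.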
[jira] [Comment Edited] (SPARK-32205) Writing timestamp in mysql gets fails
[ https://issues.apache.org/jira/browse/SPARK-32205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152679#comment-17152679 ] JinxinTang edited comment on SPARK-32205 at 7/7/20, 11:44 AM: -- [~nileshr.patil] Sure, this problem seems to be on the MySQL side, because MySQL cannot accept *-12-31 23:59:59* as a TIMESTAMP value. We can insert directly from the mysql client to make sure. was (Author: jinxintang): Sure, this problem seems to be on the MySQL side, because we cannot insert *-12-31 23:59:59* into the MySQL TIMESTAMP type. > Writing timestamp in mysql gets fails > -- > > Key: SPARK-32205 > URL: https://issues.apache.org/jira/browse/SPARK-32205 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.4 >Reporter: Nilesh Patil >Priority: Major > > When we are writing to mysql with TIMESTAMP column it supports only range > '1970-01-01 00:00:01' UTC to '2038-01-19 03:14:07'. Mysql has DATETIME > datatype which has 1000-01-01 00:00:00' to '-12-31 23:59:59' range. > How to map spark timestamp datatype to mysql datetime datatype in order to > use higher supporting range ? > [https://dev.mysql.com/doc/refman/5.7/en/datetime.html] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32205) Writing timestamp in mysql gets fails
[ https://issues.apache.org/jira/browse/SPARK-32205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152679#comment-17152679 ] JinxinTang commented on SPARK-32205: Sure, this problem seems to be on the MySQL side, because we cannot insert *9999-12-31 23:59:59* into the MySQL TIMESTAMP type. > Writing timestamp in mysql gets fails > -- > > Key: SPARK-32205 > URL: https://issues.apache.org/jira/browse/SPARK-32205 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.4 >Reporter: Nilesh Patil >Priority: Major > > When we write to MySQL with a TIMESTAMP column, it supports only the range > '1970-01-01 00:00:01' UTC to '2038-01-19 03:14:07' UTC. MySQL has a DATETIME > datatype whose range is '1000-01-01 00:00:00' to '9999-12-31 23:59:59'. > How can the Spark timestamp datatype be mapped to the MySQL DATETIME datatype in order to > use the wider supported range? > [https://dev.mysql.com/doc/refman/5.7/en/datetime.html] >
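One possible way to reach MySQL's wider DATETIME range from Spark is the JDBC writer's `createTableColumnTypes` option, which overrides the column type Spark would otherwise use when it creates the table. A minimal sketch, assuming a local MySQL instance; the connection URL, credentials, table name `events`, and column name `event_time` are all hypothetical:

```scala
// Sketch only: requires a SparkSession `spark` and a reachable MySQL server.
// Maps the Spark TimestampType column to MySQL DATETIME at table creation,
// so values outside the TIMESTAMP range (1970..2038) can be stored.
val df = spark.sql("SELECT timestamp'3000-01-01 00:00:00' AS event_time")

df.write
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb") // hypothetical
  .option("dbtable", "events")                         // hypothetical
  .option("user", "root")
  .option("password", "secret")
  // Ask Spark to create this column as DATETIME instead of the default.
  .option("createTableColumnTypes", "event_time DATETIME")
  .mode("overwrite")
  .save()
```

This only affects tables Spark creates; for a pre-existing table the column type is whatever the table already declares.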
[jira] [Commented] (SPARK-31635) Spark SQL Sort fails when sorting big data points
[ https://issues.apache.org/jira/browse/SPARK-31635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152674#comment-17152674 ] George George commented on SPARK-31635: --- Hello [~Chen Zhang], Thanks a lot for getting back on this. I would agree with you that it is an improvement. However, because it failed when using the DataFrame API and there is no documentation about this behavior, I considered it a bug. Your suggestion sounds really good to me, and I think it is good to give the user the opportunity to configure this. The user can then decide whether to wait a little longer for the result or to put more pressure on the driver. I could also try to submit a PR, but I would need some more time for it. Just let me know whether you would rather wait for my PR or do it yourself. Best, George > Spark SQL Sort fails when sorting big data points > - > > Key: SPARK-31635 > URL: https://issues.apache.org/jira/browse/SPARK-31635 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: George George >Priority: Major > > Please have a look at the example below: > {code:java} > case class Point(x:Double, y:Double) > case class Nested(a: Long, b: Seq[Point]) > val test = spark.sparkContext.parallelize((1L to 100L).map(a => > Nested(a, Seq.fill[Point](25)(Point(1, 2)))), 100) > test.toDF().as[Nested].sort("a").take(1) > {code} > *Sorting* big data objects using *Spark Dataframe* fails with the following > exception: > {code:java} > 2020-05-04 08:01:00 ERROR TaskSetManager:70 - Total size of serialized > results of 14 tasks (107.8 MB) is bigger than spark.driver.maxResultSize > (100.0 MB) > [Stage 0:==> (12 + 3) / > 100]org.apache.spark.SparkException: Job aborted due to stage failure: Total > size of serialized results of 13 tasks (100.1 MB) is bigger than > spark.driver.maxResu > {code} > However, using the *RDD API* works and no exception is thrown: > {code:java} > case class Point(x:Double, y:Double) > case class Nested(a: Long, b: Seq[Point]) > val test = spark.sparkContext.parallelize((1L to 100L).map(a => > Nested(a, Seq.fill[Point](25)(Point(1, 2)))), 100) > test.sortBy(_.a).take(1) > {code} > For both code snippets we started the spark shell with exactly the same > arguments: > {code:java} > spark-shell --driver-memory 6G --conf "spark.driver.maxResultSize=100MB" > {code} > Even if we increase spark.driver.maxResultSize, the executors still get > killed for our use case. The interesting thing is that when using the RDD API > directly the problem is not there. *It looks like there is a bug in DataFrame > sort, because it is shuffling too much data to the driver.* > Note: this is a small example and I reduced spark.driver.maxResultSize to > a smaller size, but in our application I tried setting it to 8GB and, as > mentioned above, the job was still killed. >
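The RDD-side pattern discussed in this issue can be sketched as follows, reusing the issue's own classes (a sketch only, not a definitive fix; it assumes a running SparkSession named `spark`):

```scala
// Sketch: assumes a SparkSession `spark`; Point/Nested are from the issue.
case class Point(x: Double, y: Double)
case class Nested(a: Long, b: Seq[Point])

val test = spark.sparkContext.parallelize(
  (1L to 100L).map(a => Nested(a, Seq.fill[Point](25)(Point(1, 2)))), 100)

// RDD.takeOrdered keeps only the requested top rows per partition
// (via a bounded priority queue) before anything reaches the driver,
// so the driver merges 100 tiny results instead of whole partitions.
val smallest: Array[Nested] = test.takeOrdered(1)(Ordering.by(_.a))
```

Whether this stays under `spark.driver.maxResultSize` in every case depends on the requested count and the row width; with a large count the merged per-partition queues can still be sizeable.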
[jira] [Assigned] (SPARK-32209) Re-use GetTimestamp in ParseToDate
[ https://issues.apache.org/jira/browse/SPARK-32209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32209: Assignee: Apache Spark > Re-use GetTimestamp in ParseToDate > -- > > Key: SPARK-32209 > URL: https://issues.apache.org/jira/browse/SPARK-32209 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Replace the combination of the SecondsToTimestamp and UnixTimestamp expressions > with GetTimestamp in ParseToDate. This eliminates unnecessary > parsing overhead in the chain string -> timestamp -> long (seconds) -> timestamp -> > date. After the changes, the chain will be string -> timestamp -> date. >
[jira] [Assigned] (SPARK-32209) Re-use GetTimestamp in ParseToDate
[ https://issues.apache.org/jira/browse/SPARK-32209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32209: Assignee: (was: Apache Spark) > Re-use GetTimestamp in ParseToDate > -- > > Key: SPARK-32209 > URL: https://issues.apache.org/jira/browse/SPARK-32209 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > > Replace the combination of the SecondsToTimestamp and UnixTimestamp expressions > with GetTimestamp in ParseToDate. This eliminates unnecessary > parsing overhead in the chain string -> timestamp -> long (seconds) -> timestamp -> > date. After the changes, the chain will be string -> timestamp -> date. >
[jira] [Commented] (SPARK-32209) Re-use GetTimestamp in ParseToDate
[ https://issues.apache.org/jira/browse/SPARK-32209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152641#comment-17152641 ] Apache Spark commented on SPARK-32209: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/28999 > Re-use GetTimestamp in ParseToDate > -- > > Key: SPARK-32209 > URL: https://issues.apache.org/jira/browse/SPARK-32209 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > > Replace the combination of the SecondsToTimestamp and UnixTimestamp expressions > with GetTimestamp in ParseToDate. This eliminates unnecessary > parsing overhead in the chain string -> timestamp -> long (seconds) -> timestamp -> > date. After the changes, the chain will be string -> timestamp -> date. >
[jira] [Created] (SPARK-32209) Re-use GetTimestamp in ParseToDate
Maxim Gekk created SPARK-32209: -- Summary: Re-use GetTimestamp in ParseToDate Key: SPARK-32209 URL: https://issues.apache.org/jira/browse/SPARK-32209 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Replace the combination of the SecondsToTimestamp and UnixTimestamp expressions with GetTimestamp in ParseToDate. This eliminates unnecessary parsing overhead in the chain string -> timestamp -> long (seconds) -> timestamp -> date. After the changes, the chain will be string -> timestamp -> date.
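The two parsing chains in the description can be illustrated with plain `java.time` (a conceptual sketch, not Spark's internal expressions; the pattern and sample value are illustrative):

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
val s = "2020-07-07 11:44:00"

// Old chain: string -> timestamp -> long (seconds) -> timestamp -> date
val ts = LocalDateTime.parse(s, fmt)
val seconds = ts.toEpochSecond(ZoneOffset.UTC)
val dateOld = LocalDateTime.ofEpochSecond(seconds, 0, ZoneOffset.UTC).toLocalDate

// New chain: string -> timestamp -> date (no epoch round-trip)
val dateNew = LocalDateTime.parse(s, fmt).toLocalDate

// Both chains yield the same date; the second simply skips two conversions.
```

The result is identical either way; the improvement is purely about dropping the redundant timestamp -> seconds -> timestamp hop.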
[jira] [Comment Edited] (SPARK-31635) Spark SQL Sort fails when sorting big data points
[ https://issues.apache.org/jira/browse/SPARK-31635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147653#comment-17147653 ] Chen Zhang edited comment on SPARK-31635 at 7/7/20, 10:14 AM: -- In fact, the RDD API corresponding to _DF.sort().take()_ is _RDD.takeOrdered()_. The execution logic of _RDD.sortBy().take()_ is reservoir sampling plus global bucket sorting; the requested number of rows is returned after the global sorting result is obtained. All major computation is performed in the executor processes. The execution logic of _RDD.takeOrdered()_ is to compute the TOP-K of each RDD partition in the executor process (by QuickSelect), and then return each TOP-K result to the driver process for merging (by PriorityQueue). To get the same result, the second method, based on QuickSelect/PriorityQueue, obviously has better performance. I think the implementation of _RDD.takeOrdered()_ can be improved with a configurable option that decides whether the TOP-K merge happens in the driver process or in an executor process. If it happens in the driver process, it reduces the time spent waiting for computation; if it happens in an executor process, it reduces the memory pressure on the driver process. Something like this (in the org.apache.spark.rdd.RDD class):
{code:scala}
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  if (num == 0 || partitions.length == 0) {
    Array.empty
  } else {
    if (conf.getBoolean("spark.rdd.takeOrdered.mergeInDriver", true)) {
      val mapRDDs = mapPartitions { items =>
        // Priority keeps the largest elements, so let's reverse the ordering.
        val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
        queue ++= collectionUtils.takeOrdered(items, num)(ord)
        Iterator.single(queue)
      }
      mapRDDs.reduce { (queue1, queue2) =>
        queue1 ++= queue2
        queue1
      }.toArray.sorted(ord)
    } else {
      mapPartitions { items =>
        collectionUtils.takeOrdered(items, num)(ord)
      }.repartition(1).mapPartitions { items =>
        collectionUtils.takeOrdered(items, num)(ord)
      }.collect()
    }
  }
}
{code}
was (Author: chen zhang): In fact, the RDD API corresponding to _DF.sort().take()_ is _RDD.takeOrdered()_. The execution logic of _RDD.sortBy().take()_ is reservoir sampling plus global bucket sorting; the requested number of rows is returned after the global sorting result is obtained. All major computation is performed in the executor processes. The execution logic of _RDD.takeOrdered()_ is to compute the TOP-K of each RDD partition in the executor process (by QuickSelect), and then return each TOP-K result to the driver process for merging (by PriorityQueue). To get the same result, the second method, based on PriorityQueue, obviously has better performance. I think the implementation of _RDD.takeOrdered()_ can be improved with a configurable option that decides whether the TOP-K merge happens in the driver process or in an executor process. If it happens in the driver process, it reduces the time spent waiting for computation; if it happens in an executor process, it reduces the memory pressure on the driver process. Something like this (in the org.apache.spark.rdd.RDD class):
{code:scala}
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  if (num == 0 || partitions.length == 0) {
    Array.empty
  } else {
    if (conf.getBoolean("spark.rdd.takeOrdered.mergeInDriver", true)) {
      val mapRDDs = mapPartitions { items =>
        // Priority keeps the largest elements, so let's reverse the ordering.
        val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
        queue ++= collectionUtils.takeOrdered(items, num)(ord)
        Iterator.single(queue)
      }
      mapRDDs.reduce { (queue1, queue2) =>
        queue1 ++= queue2
        queue1
      }.toArray.sorted(ord)
    } else {
      mapPartitions { items =>
        collectionUtils.takeOrdered(items, num)(ord)
      }.repartition(1).mapPartitions { items =>
        collectionUtils.takeOrdered(items, num)(ord)
      }.collect()
    }
  }
}
{code}
> Spark SQL Sort fails when sorting big data points > - > > Key: SPARK-31635 > URL: https://issues.apache.org/jira/browse/SPARK-31635 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: George George >Priority: Major > > Please have a look at the example below: > {code:java} > case class Point(x:Double, y:Double) > case class Nested(a: Long, b: Seq[Point]) > val test = spark.sparkContext.parallelize((1L to 100L).map(a => >
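The bounded-priority-queue merge sketched in the comment above can be approximated in plain Scala without Spark's internal `BoundedPriorityQueue` (a standalone illustration, not Spark's actual implementation; `boundedInsert` and `takeOrderedMerge` are hypothetical helper names):

```scala
import scala.collection.mutable

// Keep at most `num` smallest elements in a max-heap: the largest retained
// element sits at the head and is evicted first (the "reverse ordering" trick).
def boundedInsert[T](queue: mutable.PriorityQueue[T], num: Int, x: T)(
    implicit ord: Ordering[T]): Unit = {
  if (queue.size < num) queue.enqueue(x)
  else if (ord.lt(x, queue.head)) { queue.dequeue(); queue.enqueue(x) }
}

// Streams every partition's elements through one bounded queue; the result
// equals computing per-partition top-k and then merging those small queues.
def takeOrderedMerge[T](partitions: Seq[Seq[T]], num: Int)(
    implicit ord: Ordering[T]): List[T] = {
  val queue = mutable.PriorityQueue.empty[T](ord) // max-heap under ord
  for (part <- partitions; x <- part) boundedInsert(queue, num, x)
  queue.dequeueAll.reverse.toList // ascending order
}

val result = takeOrderedMerge(Seq(Seq(9, 3, 7), Seq(1, 8, 2)), 3)
// result == List(1, 2, 3)
```

The memory bound is the point: whoever performs the merge (driver or executor) only ever holds `num * numPartitions` elements, not whole sorted partitions.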
[jira] [Commented] (SPARK-31635) Spark SQL Sort fails when sorting big data points
[ https://issues.apache.org/jira/browse/SPARK-31635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152618#comment-17152618 ] Chen Zhang commented on SPARK-31635: This problem is not a bug, but I think it is necessary to improve the code implementation. I will create a new Improvement issue to discuss it and try to submit a PR. > Spark SQL Sort fails when sorting big data points > - > > Key: SPARK-31635 > URL: https://issues.apache.org/jira/browse/SPARK-31635 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: George George >Priority: Major > > Please have a look at the example below: > {code:java} > case class Point(x:Double, y:Double) > case class Nested(a: Long, b: Seq[Point]) > val test = spark.sparkContext.parallelize((1L to 100L).map(a => > Nested(a, Seq.fill[Point](25)(Point(1, 2)))), 100) > test.toDF().as[Nested].sort("a").take(1) > {code} > *Sorting* big data objects using *Spark Dataframe* fails with the following > exception: > {code:java} > 2020-05-04 08:01:00 ERROR TaskSetManager:70 - Total size of serialized > results of 14 tasks (107.8 MB) is bigger than spark.driver.maxResultSize > (100.0 MB) > [Stage 0:==> (12 + 3) / > 100]org.apache.spark.SparkException: Job aborted due to stage failure: Total > size of serialized results of 13 tasks (100.1 MB) is bigger than > spark.driver.maxResu > {code} > However, using the *RDD API* works and no exception is thrown: > {code:java} > case class Point(x:Double, y:Double) > case class Nested(a: Long, b: Seq[Point]) > val test = spark.sparkContext.parallelize((1L to 100L).map(a => > Nested(a, Seq.fill[Point](25)(Point(1, 2)))), 100) > test.sortBy(_.a).take(1) > {code} > For both code snippets we started the spark shell with exactly the same > arguments: > {code:java} > spark-shell --driver-memory 6G --conf "spark.driver.maxResultSize=100MB" > {code} > Even if we increase spark.driver.maxResultSize, the executors still get killed for our use case. 
The interesting thing is that when using the RDD API > directly the problem is not there. *It looks like there is a bug in DataFrame > sort, because it is shuffling too much data to the driver.* > Note: this is a small example and I reduced spark.driver.maxResultSize to > a smaller size, but in our application I tried setting it to 8GB and, as > mentioned above, the job was still killed. >
[jira] [Comment Edited] (SPARK-31635) Spark SQL Sort fails when sorting big data points
[ https://issues.apache.org/jira/browse/SPARK-31635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147653#comment-17147653 ] Chen Zhang edited comment on SPARK-31635 at 7/7/20, 10:12 AM: -- In fact, the RDD API corresponding to _DF.sort().take()_ is _RDD.takeOrdered()_. The execution logic of _RDD.sortBy().take()_ is reservoir sampling plus global bucket sorting; the requested number of rows is returned after the global sorting result is obtained. All major computation is performed in the executor processes. The execution logic of _RDD.takeOrdered()_ is to compute the TOP-K of each RDD partition in the executor process (by QuickSelect), and then return each TOP-K result to the driver process for merging (by PriorityQueue). To get the same result, the second method, based on PriorityQueue, obviously has better performance. I think the implementation of _RDD.takeOrdered()_ can be improved with a configurable option that decides whether the TOP-K merge happens in the driver process or in an executor process. If it happens in the driver process, it reduces the time spent waiting for computation; if it happens in an executor process, it reduces the memory pressure on the driver process. Something like this (in the org.apache.spark.rdd.RDD class):
{code:scala}
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  if (num == 0 || partitions.length == 0) {
    Array.empty
  } else {
    if (conf.getBoolean("spark.rdd.takeOrdered.mergeInDriver", true)) {
      val mapRDDs = mapPartitions { items =>
        // Priority keeps the largest elements, so let's reverse the ordering.
        val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
        queue ++= collectionUtils.takeOrdered(items, num)(ord)
        Iterator.single(queue)
      }
      mapRDDs.reduce { (queue1, queue2) =>
        queue1 ++= queue2
        queue1
      }.toArray.sorted(ord)
    } else {
      mapPartitions { items =>
        collectionUtils.takeOrdered(items, num)(ord)
      }.repartition(1).mapPartitions { items =>
        collectionUtils.takeOrdered(items, num)(ord)
      }.collect()
    }
  }
}
{code}
was (Author: chen zhang): In fact, the RDD API corresponding to _DF.sort().take()_ is _RDD.takeOrdered()_. The execution logic of _RDD.sortBy().take()_ is reservoir sampling plus global bucket sorting; the requested number of rows is returned after the global sorting result is obtained. All major computation is performed in the executor processes. The execution logic of _RDD.takeOrdered()_ is to compute the TOP-K (by PriorityQueue) in each RDD partition in the executor process, and then return each TOP-K result to the driver process for merging. To get the same result, the second method, based on PriorityQueue, obviously has better performance. I think the implementation of _RDD.takeOrdered()_ can be improved with a configurable option that decides whether the TOP-K merge happens in the driver process or in an executor process. If it happens in the driver process, it reduces the time spent waiting for computation; if it happens in an executor process, it reduces the memory pressure on the driver process. Something like this (in the org.apache.spark.rdd.RDD class):
{code:scala}
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  if (num == 0 || partitions.length == 0) {
    Array.empty
  } else {
    if (conf.getBoolean("spark.rdd.take.ordered.driver.merge", true)) {
      val mapRDDs = mapPartitions { items =>
        // Priority keeps the largest elements, so let's reverse the ordering.
        val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
        queue ++= collectionUtils.takeOrdered(items, num)(ord)
        Iterator.single(queue)
      }
      mapRDDs.reduce { (queue1, queue2) =>
        queue1 ++= queue2
        queue1
      }.toArray.sorted(ord)
    } else {
      mapPartitions { items =>
        collectionUtils.takeOrdered(items, num)(ord)
      }.repartition(1).mapPartitions { items =>
        collectionUtils.takeOrdered(items, num)(ord)
      }.collect()
    }
  }
}
{code}
> Spark SQL Sort fails when sorting big data points > - > > Key: SPARK-31635 > URL: https://issues.apache.org/jira/browse/SPARK-31635 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: George George >Priority: Major > > Please have a look at the example below: > {code:java} > case class Point(x:Double, y:Double) > case class Nested(a: Long, b: Seq[Point]) > val test = spark.sparkContext.parallelize((1L to 100L).map(a => > Nested(a, Seq.fill[Point](25)(Point(1, 2)))), 100) >