[jira] [Commented] (SPARK-30983) Support more than 5 typed column in typed Dataset.select API
[ https://issues.apache.org/jira/browse/SPARK-30983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047278#comment-17047278 ]

L. C. Hsieh commented on SPARK-30983:
-------------------------------------

cc [~cloud_fan]

> Support more than 5 typed column in typed Dataset.select API
> ------------------------------------------------------------
>
>                 Key: SPARK-30983
>                 URL: https://issues.apache.org/jira/browse/SPARK-30983
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: L. C. Hsieh
>            Priority: Major
>
> Because Dataset only provides typed select overloads for at most 5 typed
> columns, a select call with more than 5 typed columns resolves to the
> untyped API. As a result, users currently cannot call a typed select with
> more than 5 typed columns.
> There are a few options:
> 1. Expose Dataset.selectUntyped (possibly renamed) to accept any number of
> typed columns (at most 22 in practice, due to the limit of
> ExpressionEncoder.tuple). Pros: little code needs to be added to Dataset.
> Cons: the return type is the generic Dataset[_], not a specific one such as
> Dataset[(U1, U2)] as with the existing overloads.
> 2. Add more typed select overloads, up to 22 typed column inputs. Pros:
> precise return types. Cons: a lot of code added to Dataset for a corner
> case.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30983) Support more than 5 typed column in typed Dataset.select API
L. C. Hsieh created SPARK-30983:
-----------------------------------

             Summary: Support more than 5 typed column in typed Dataset.select API
                 Key: SPARK-30983
                 URL: https://issues.apache.org/jira/browse/SPARK-30983
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: L. C. Hsieh

Because Dataset only provides typed select overloads for at most 5 typed columns, a select call with more than 5 typed columns resolves to the untyped API. As a result, users currently cannot call a typed select with more than 5 typed columns.

There are a few options:

1. Expose Dataset.selectUntyped (possibly renamed) to accept any number of typed columns (at most 22 in practice, due to the limit of ExpressionEncoder.tuple). Pros: little code needs to be added to Dataset. Cons: the return type is the generic Dataset[_], not a specific one such as Dataset[(U1, U2)] as with the existing overloads.

2. Add more typed select overloads, up to 22 typed column inputs. Pros: precise return types. Cons: a lot of code added to Dataset for a corner case.
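A note on option 2's cost: the 22-column ceiling comes from Scala's Tuple22 limit, which ExpressionEncoder.tuple inherits. One way to gauge how much code the extra overloads would add is to generate their signatures mechanically. The sketch below is plain Python emitting hypothetical Scala signatures; it is an illustration of the overload shape, not code from the Spark codebase:

```python
# Sketch: generate the Scala signatures that option 2 would add to Dataset,
# one typed-select overload per arity from 6 up to 22 (the
# Tuple22/ExpressionEncoder.tuple ceiling mentioned above).

def typed_select_signature(arity: int) -> str:
    """Build the (hypothetical) Scala signature for a typed select of `arity` columns."""
    type_params = ", ".join(f"U{i}" for i in range(1, arity + 1))
    params = ", ".join(f"c{i}: TypedColumn[T, U{i}]" for i in range(1, arity + 1))
    return f"def select[{type_params}]({params}): Dataset[({type_params})]"

# Overloads for arity 6 through 22 would need to be added.
overloads = [typed_select_signature(n) for n in range(6, 23)]
```

Even mechanically generated, this is 17 new public methods, which is the "a lot of code for a corner case" trade-off the description names.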
[jira] [Updated] (SPARK-30982) List All the removed APIs of Spark SQL and Core
[ https://issues.apache.org/jira/browse/SPARK-30982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng updated SPARK-30982:
---------------------------------
    Attachment: sql_signature.diff
                added_sql_class
                1_process_sql_script.sh

> List All the removed APIs of Spark SQL and Core
> -----------------------------------------------
>
>                 Key: SPARK-30982
>                 URL: https://issues.apache.org/jira/browse/SPARK-30982
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core, SQL
>    Affects Versions: 3.0.0
>            Reporter: Xiao Li
>            Assignee: zhengruifeng
>            Priority: Major
>         Attachments: 1_process_core_script.sh, 1_process_sql_script.sh, added_core_class, added_sql_class, core_signature.diff, sql_signature.diff
[jira] [Updated] (SPARK-30982) List All the removed APIs of Spark SQL and Core
[ https://issues.apache.org/jira/browse/SPARK-30982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng updated SPARK-30982:
---------------------------------
    Attachment: core_signature.diff
                added_core_class
                1_process_core_script.sh

> List All the removed APIs of Spark SQL and Core
> -----------------------------------------------
>
>                 Key: SPARK-30982
>                 URL: https://issues.apache.org/jira/browse/SPARK-30982
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core, SQL
>    Affects Versions: 3.0.0
>            Reporter: Xiao Li
>            Assignee: zhengruifeng
>            Priority: Major
>         Attachments: 1_process_core_script.sh, added_core_class, core_signature.diff
[jira] [Assigned] (SPARK-30982) List All the removed APIs of Spark SQL and Core
[ https://issues.apache.org/jira/browse/SPARK-30982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li reassigned SPARK-30982:
-------------------------------
    Assignee: zhengruifeng  (was: Xiao Li)

> List All the removed APIs of Spark SQL and Core
> -----------------------------------------------
>
>                 Key: SPARK-30982
>                 URL: https://issues.apache.org/jira/browse/SPARK-30982
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core, SQL
>    Affects Versions: 3.0.0
>            Reporter: Xiao Li
>            Assignee: zhengruifeng
>            Priority: Major
[jira] [Created] (SPARK-30982) List All the removed APIs of Spark SQL and Core
Xiao Li created SPARK-30982:
-------------------------------

             Summary: List All the removed APIs of Spark SQL and Core
                 Key: SPARK-30982
                 URL: https://issues.apache.org/jira/browse/SPARK-30982
             Project: Spark
          Issue Type: Sub-task
          Components: Spark Core, SQL
    Affects Versions: 3.0.0
            Reporter: Xiao Li
            Assignee: Xiao Li
[jira] [Assigned] (SPARK-30981) Fix flaky "Test basic decommissioning" test
[ https://issues.apache.org/jira/browse/SPARK-30981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-30981:
-------------------------------------
    Assignee:     (was: Dongjoon Hyun)

> Fix flaky "Test basic decommissioning" test
> -------------------------------------------
>
>                 Key: SPARK-30981
>                 URL: https://issues.apache.org/jira/browse/SPARK-30981
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes, Tests
>    Affects Versions: 3.1.0
>            Reporter: Dongjoon Hyun
>            Priority: Major
>
> - https://github.com/apache/spark/pull/27721
> {code}
> - Test basic decommissioning *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 126 times
>   over 2.010095245067 minutes. Last failure message: "++ id -u
> {code}
[jira] [Commented] (SPARK-30981) Fix flaky "Test basic decommissioning" test
[ https://issues.apache.org/jira/browse/SPARK-30981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047270#comment-17047270 ]

Dongjoon Hyun commented on SPARK-30981:
---------------------------------------

Could you take a look at this, [~holden]?

> Fix flaky "Test basic decommissioning" test
> -------------------------------------------
>
>                 Key: SPARK-30981
>                 URL: https://issues.apache.org/jira/browse/SPARK-30981
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes, Tests
>    Affects Versions: 3.1.0
>            Reporter: Dongjoon Hyun
>            Priority: Major
>
> - https://github.com/apache/spark/pull/27721
> {code}
> - Test basic decommissioning *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 126 times
>   over 2.010095245067 minutes. Last failure message: "++ id -u
> {code}
[jira] [Created] (SPARK-30981) Fix flaky "Test basic decommissioning" test
Dongjoon Hyun created SPARK-30981:
-------------------------------------

             Summary: Fix flaky "Test basic decommissioning" test
                 Key: SPARK-30981
                 URL: https://issues.apache.org/jira/browse/SPARK-30981
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes, Tests
    Affects Versions: 3.1.0
            Reporter: Dongjoon Hyun
            Assignee: Dongjoon Hyun

- https://github.com/apache/spark/pull/27721

{code}
- Test basic decommissioning *** FAILED ***
  The code passed to eventually never returned normally. Attempted 126 times
  over 2.010095245067 minutes. Last failure message: "++ id -u
{code}
[jira] [Updated] (SPARK-25474) Update the docs for spark.sql.statistics.fallBackToHdfs
[ https://issues.apache.org/jira/browse/SPARK-25474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-25474:
----------------------------------
    Summary: Update the docs for spark.sql.statistics.fallBackToHdfs  (was: Support `spark.sql.statistics.fallBackToHdfs` in data source tables)

> Update the docs for spark.sql.statistics.fallBackToHdfs
> --------------------------------------------------------
>
>                 Key: SPARK-25474
>                 URL: https://issues.apache.org/jira/browse/SPARK-25474
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1, 2.4.3
>        Environment: Spark 2.3.1
>                     Hadoop 2.7.2
>            Reporter: Ayush Anubhava
>            Assignee: Yuming Wang
>            Priority: Major
>             Fix For: 3.0.0
>
> *Description:* The estimated size in bytes (sizeInBytes) of the query comes out in EB for a Parquet data source table. This impacts performance, since join queries then always go for a Sort Merge Join.
> *Precondition:* spark.sql.statistics.fallBackToHdfs = true
> Steps:
> {code:java}
> 0: jdbc:hive2://10.xx:23040/default> create table t1110 (a int, b string) using parquet PARTITIONED BY (b);
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (2,'b');
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (1,'a');
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.xx.xx:23040/default> select * from t1110;
> +++--+
> | a | b |
> +++--+
> | 1 | a |
> | 2 | b |
> +++--+
> {code}
> *{color:#d04437}Cost of the query shows sizeInBytes in EB{color}*
> {code:java}
> explain cost select * from t1110;
> == Optimized Logical Plan ==
> Relation[a#23,b#24] parquet, Statistics(sizeInBytes=8.0 EB, hints=none)
> == Physical Plan ==
> *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: Parquet, Location: CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {code}
> *{color:#d04437}This leads to a Sort Merge Join in case of a join query{color}*
> {code:java}
> 0: jdbc:hive2://10.xx.xx:23040/default> create table t110 (a int, b string) using parquet PARTITIONED BY (b);
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.xx.xx:23040/default> insert into t110 values (1,'a');
> +-+--+
> | Result |
> +-+--+
> +-+--+
> explain select * from t1110 t1 join t110 t2 on t1.a=t2.a;
> == Physical Plan ==
> *(5) SortMergeJoin [a#23], [a#55], Inner
> :- *(2) Sort [a#23 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(a#23, 200)
> :     +- *(1) Project [a#23, b#24]
> :        +- *(1) Filter isnotnull(a#23)
> :           +- *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: Parquet, Location: CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct
> +- *(4) Sort [a#55 ASC NULLS FIRST], false, 0
>    +- Exchange hashpartitioning(a#55, 200)
>       +- *(3) Project [a#55, b#56]
>          +- *(3) Filter isnotnull(a#55)
>             +- *(3) FileScan parquet open.t110[a#55,b#56] Batched: true, Format: Parquet, Location: CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t110], PartitionCount: 1, PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct
> {code}
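A note on the behavior reported above: when a partitioned data source table has no table-level statistics, its sizeInBytes falls back to spark.sql.defaultSizeInBytes (Long.MaxValue, which renders as 8.0 EB), so the planner never considers a broadcast join. One mitigation, sketched here under the assumption of the t1110/t110 tables created in the steps above, is to compute statistics explicitly:

```sql
-- Sketch of a workaround (assumes the t1110/t110 tables from the steps above):
-- compute table-level statistics so sizeInBytes reflects actual data size
-- instead of the spark.sql.defaultSizeInBytes fallback.
ANALYZE TABLE t1110 COMPUTE STATISTICS;
ANALYZE TABLE t110 COMPUTE STATISTICS;
```

With realistic statistics in place, tables below spark.sql.autoBroadcastJoinThreshold become eligible for a broadcast join instead of the Sort Merge Join shown in the plan.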
[jira] [Resolved] (SPARK-30902) default table provider should be decided by catalog implementations
[ https://issues.apache.org/jira/browse/SPARK-30902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-30902.
----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 27650
[https://github.com/apache/spark/pull/27650]

> default table provider should be decided by catalog implementations
> -------------------------------------------------------------------
>
>                 Key: SPARK-30902
>                 URL: https://issues.apache.org/jira/browse/SPARK-30902
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>             Fix For: 3.0.0
[jira] [Closed] (SPARK-26599) BroardCast hint can not work with PruneFileSourcePartitions
[ https://issues.apache.org/jira/browse/SPARK-26599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun closed SPARK-26599.
---------------------------------

> BroardCast hint can not work with PruneFileSourcePartitions
> -----------------------------------------------------------
>
>                 Key: SPARK-26599
>                 URL: https://issues.apache.org/jira/browse/SPARK-26599
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: eaton
>            Priority: Major
>
> The broadcast hint does not work with `PruneFileSourcePartitions`. For example, when the filter condition p is a partition field, table b in the SQL below cannot be broadcast:
> {code:java}
> sql("select /*+ broadcastjoin(b) */ * from (select a from empty_test where p=1) a " +
>   "join (select a,b from par_1 where p=1) b on a.a=b.a").explain
> {code}
[jira] [Updated] (SPARK-26599) BroardCast hint can not work with PruneFileSourcePartitions
[ https://issues.apache.org/jira/browse/SPARK-26599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-26599:
----------------------------------
    Issue Type: Bug  (was: Improvement)

> BroardCast hint can not work with PruneFileSourcePartitions
> -----------------------------------------------------------
>
>                 Key: SPARK-26599
>                 URL: https://issues.apache.org/jira/browse/SPARK-26599
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: eaton
>            Priority: Major
>
> The broadcast hint does not work with `PruneFileSourcePartitions`. For example, when the filter condition p is a partition field, table b in the SQL below cannot be broadcast:
> {code:java}
> sql("select /*+ broadcastjoin(b) */ * from (select a from empty_test where p=1) a " +
>   "join (select a,b from par_1 where p=1) b on a.a=b.a").explain
> {code}
[jira] [Updated] (SPARK-30980) Issue not resolved of Caught Hive MetaException attempting to get partition metadata by filter from Hive
[ https://issues.apache.org/jira/browse/SPARK-30980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradyumn Agrawal updated SPARK-30980:
-------------------------------------
    Shepherd: Apache Spark

> Issue not resolved of Caught Hive MetaException attempting to get partition
> metadata by filter from Hive
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-30980
>                 URL: https://issues.apache.org/jira/browse/SPARK-30980
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 2.4.2
>        Environment: 2.4.0-CDH6.3.1 (which I guess points to Spark Version 2.4.2)
>            Reporter: Pradyumn Agrawal
>            Priority: Major
>
> I am querying a table created in Hive and repeatedly get the following exception when querying data:
> {code:java}
> java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
>   at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684)
>   at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1258)
>   at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1251)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1251)
>   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
>   at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957)
>   at org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
>   at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
>   at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.sc
> {code}
[jira] [Updated] (SPARK-30980) Issue not resolved of Caught Hive MetaException attempting to get partition metadata by filter from Hive
[ https://issues.apache.org/jira/browse/SPARK-30980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradyumn Agrawal updated SPARK-30980:
-------------------------------------
    Description:

I am querying a table created in Hive and repeatedly get the following exception when querying data:

{code:java}
java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
  at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
  at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1258)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1251)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
  at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1251)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957)
  at org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
  at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
  at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
{code}
[jira] [Created] (SPARK-30980) Issue not resolved of Caught Hive MetaException attempting to get partition metadata by filter from Hive
Pradyumn Agrawal created SPARK-30980:
----------------------------------------

             Summary: Issue not resolved of Caught Hive MetaException attempting to get partition metadata by filter from Hive
                 Key: SPARK-30980
                 URL: https://issues.apache.org/jira/browse/SPARK-30980
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.2, 2.4.0
         Environment: 2.4.0-CDH6.3.1 (which I guess points to Spark Version 2.4.2)
            Reporter: Pradyumn Agrawal

I am querying a table created in Hive and repeatedly get the following exception when querying data:

{code:java}
java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
  at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
  at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1258)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1251)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
  at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1251)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957)
  at org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
  at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
  at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPla
{code}
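The exception message itself names the only documented mitigation. As a configuration sketch of that workaround (carrying the message's own caveat: with metastore-managed partitions disabled, Spark no longer pushes partition filters to the Hive metastore, so performance degrades):

```
# spark-defaults.conf -- workaround quoted from the error message above;
# disables metastore-managed partition handling for file source tables,
# at the cost of degraded performance (no metastore-side partition pruning).
spark.sql.hive.manageFilesourcePartitions  false
```

The same setting can be applied per-session with `spark.conf.set(...)` or `--conf` on spark-submit.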
[jira] [Assigned] (SPARK-30972) PruneHiveTablePartitions should be executed as earlyScanPushDownRules
[ https://issues.apache.org/jira/browse/SPARK-30972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-30972:
-----------------------------------
    Assignee: wuyi

> PruneHiveTablePartitions should be executed as earlyScanPushDownRules
> ---------------------------------------------------------------------
>
>                 Key: SPARK-30972
>                 URL: https://issues.apache.org/jira/browse/SPARK-30972
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: wuyi
>            Assignee: wuyi
>            Priority: Major
>
> Similar to PruneFileSourcePartitions, PruneHiveTablePartitions should also be executed as part of earlyScanPushDownRules, to eliminate its impact on later statistics computation.
[jira] [Resolved] (SPARK-30972) PruneHiveTablePartitions should be executed as earlyScanPushDownRules
[ https://issues.apache.org/jira/browse/SPARK-30972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-30972.
---------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 27723
[https://github.com/apache/spark/pull/27723]

> PruneHiveTablePartitions should be executed as earlyScanPushDownRules
> ---------------------------------------------------------------------
>
>                 Key: SPARK-30972
>                 URL: https://issues.apache.org/jira/browse/SPARK-30972
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: wuyi
>            Assignee: wuyi
>            Priority: Major
>             Fix For: 3.0.0
>
> Similar to PruneFileSourcePartitions, PruneHiveTablePartitions should also be executed as part of earlyScanPushDownRules, to eliminate its impact on later statistics computation.
[jira] [Resolved] (SPARK-30681) Add higher order functions API to PySpark
[ https://issues.apache.org/jira/browse/SPARK-30681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30681. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27406 [https://github.com/apache/spark/pull/27406] > Add higher order functions API to PySpark > - > > Key: SPARK-30681 > URL: https://issues.apache.org/jira/browse/SPARK-30681 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > As of 3.0.0 higher order functions are available in SQL and Scala, but not in > PySpark, forcing Python users to invoke these through {{expr}}, > {{selectExpr}} or {{sql}}. > This is error-prone and not well documented. Spark should provide > {{pyspark.sql}} wrappers that accept plain Python functions (of course within > limits of {{(*Column) -> Column}}) as arguments. > {code:python} > df.select(transform("values", lambda c: trim(upper(c)))) > def increment_values(k: Column, v: Column) -> Column: > return v + 1 > df.select(transform_values("data", increment_values)) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
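For readers unfamiliar with the higher-order functions being wrapped, their intended semantics can be sketched in plain Python. This is only an illustration of what SQL {{transform}} and {{transform_values}} compute over array and map values; it is not the PySpark API, whose wrappers build Column expressions instead of touching Python data:

```python
# Plain-Python sketch of SQL higher-order function semantics.
# Not the PySpark API: these helpers operate on Python lists/dicts,
# whereas the real wrappers operate on Column expressions.

def transform(values, f):
    """Apply f to every element of an array, like SQL transform."""
    return [f(v) for v in values]

def transform_values(mapping, f):
    """Apply f(key, value) to every map entry, keeping the keys."""
    return {k: f(k, v) for k, v in mapping.items()}

# The JIRA's increment_values example, expressed over plain data:
def increment_values(k, v):
    return v + 1

print(transform(["  a ", " b"], lambda c: c.strip().upper()))  # ['A', 'B']
print(transform_values({"x": 1, "y": 2}, increment_values))    # {'x': 2, 'y': 3}
```

The actual PySpark wrappers accept a Python function of Columns and translate it into a Catalyst lambda expression, so no data crosses into Python at runtime.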
[jira] [Assigned] (SPARK-30682) Add higher order functions API to SparkR
[ https://issues.apache.org/jira/browse/SPARK-30682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30682: Assignee: Maciej Szymkiewicz > Add higher order functions API to SparkR > > > Key: SPARK-30682 > URL: https://issues.apache.org/jira/browse/SPARK-30682 > Project: Spark > Issue Type: Improvement > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > As of 3.0.0 higher order functions are available in SQL and Scala, but not in > SparkR forcing R users to invoke these through {{expr}}, {{selectExpr}} or > {{sql}}. > It would be great if Spark provided high level wrappers that accept plain R > functions operating on SQL expressions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30682) Add higher order functions API to SparkR
[ https://issues.apache.org/jira/browse/SPARK-30682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30682. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27433 [https://github.com/apache/spark/pull/27433] > Add higher order functions API to SparkR > > > Key: SPARK-30682 > URL: https://issues.apache.org/jira/browse/SPARK-30682 > Project: Spark > Issue Type: Improvement > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > As of 3.0.0 higher order functions are available in SQL and Scala, but not in > SparkR forcing R users to invoke these through {{expr}}, {{selectExpr}} or > {{sql}}. > It would be great if Spark provided high level wrappers that accept plain R > functions operating on SQL expressions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30681) Add higher order functions API to PySpark
[ https://issues.apache.org/jira/browse/SPARK-30681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30681: Assignee: Maciej Szymkiewicz > Add higher order functions API to PySpark > - > > Key: SPARK-30681 > URL: https://issues.apache.org/jira/browse/SPARK-30681 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > As of 3.0.0 higher order functions are available in SQL and Scala, but not in > PySpark, forcing Python users to invoke these through {{expr}}, > {{selectExpr}} or {{sql}}. > This is error-prone and not well documented. Spark should provide > {{pyspark.sql}} wrappers that accept plain Python functions (of course within > limits of {{(*Column) -> Column}}) as arguments. > {code:python} > df.select(transform("values", lambda c: trim(upper(c)))) > def increment_values(k: Column, v: Column) -> Column: > return v + 1 > df.select(transform_values("data", increment_values)) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30955) Exclude Generate output when aliasing in nested column pruning
[ https://issues.apache.org/jira/browse/SPARK-30955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30955. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27702 [https://github.com/apache/spark/pull/27702] > Exclude Generate output when aliasing in nested column pruning > -- > > Key: SPARK-30955 > URL: https://issues.apache.org/jira/browse/SPARK-30955 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.0.0 > > > When aliasing in nested column pruning on Project on top of Generate, we > should exclude Generate outputs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-14643. - Resolution: Won't Fix > Remove overloaded methods which become ambiguous in Scala 2.12 > -- > > Key: SPARK-14643 > URL: https://issues.apache.org/jira/browse/SPARK-14643 > Project: Spark > Issue Type: Task > Components: Build, Project Infra >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Blocker > > Spark 1.x's Dataset API runs into subtle source incompatibility problems for > Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a > nutshell, the current API has overloaded methods whose signatures are > ambiguous when resolving calls that use the Java 8 lambda syntax (only if > Spark is build against Scala 2.12). > This issue is somewhat subtle, so there's a full writeup at > https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing > which describes the exact circumstances under which the current APIs are > problematic. The writeup also proposes a solution which involves the removal > of certain overloads only in Scala 2.12 builds of Spark and the introduction > of implicit conversions for retaining source compatibility. > We don't need to implement any of these changes until we add Scala 2.12 > support since the changes must only be applied when building against Scala > 2.12 and will be done via traits + shims which are mixed in via > per-Scala-version source directories (like how we handle the > Scala-version-specific parts of the REPL). For now, this JIRA acts as a > placeholder so that the parent JIRA reflects the complete set of tasks which > need to be finished for 2.12 support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
[ https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047170#comment-17047170 ] Hyukjin Kwon commented on SPARK-26836: -- I am lowering the priority to Critical as it's at least not a regression and doesn't look like it blocks Spark 3.0; however, we should indeed treat correctness issues as at least Critical+. > Columns get switched in Spark SQL using Avro backed Hive table if schema > evolves > > > Key: SPARK-26836 > URL: https://issues.apache.org/jira/browse/SPARK-26836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0 > Environment: I tested with Hive and HCatalog which run on version > 2.3.4 and with Spark 2.3.1 and 2.4 >Reporter: Tamás Németh >Priority: Critical > Labels: correctness > Attachments: doctors.avro, doctors_evolved.avro, > doctors_evolved.json, original.avsc > > > I have a Hive Avro table where the Avro schema is stored on S3 next to the > Avro files. > In the table definition the avro.schema.url always points to the latest > partition's _schema.avsc file, which is always the latest schema. (Avro schemas > are backward and forward compatible in a table.) > When new data comes in, I always add a new partition where the > avro.schema.url property is also set to the _schema.avsc which was used when > it was added, and of course I always update the table's avro.schema.url property > to the latest one. > Querying this table works fine until the schema evolves in a way that a new > optional property is added in the middle. > When this happens, after a Spark SQL query the columns in the old > partition get mixed up and show the wrong data. > If I query the table with Hive then everything is perfectly fine and it gives > me back the correct columns, both for the partitions which were created with the old > schema and for the new ones which were created with the evolved schema.
> > Here is how I could reproduce it with the > [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro] > example data in the sql test suite. > # I have created two partition folders: > {code:java} > [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/ > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors > /dt=2019-02-05/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-05/doctors.avro > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors > /dt=2019-02-06/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-06/doctors_evolved.avro{code} > Here the first partition had data created with the schema before > evolving, and the second one had the evolved one. (The evolved schema is the > same as in your testcase, except that I moved the extra_field column from the second > position to the last, and I generated two lines of Avro data with the evolved > schema.) > # I have created a hive table with the following command: > > {code:java} > CREATE EXTERNAL TABLE `default.doctors` > PARTITIONED BY ( > `dt` string > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > WITH SERDEPROPERTIES ( > 'avro.schema.url'='s3://somelocation/doctors/ > /dt=2019-02-06/_schema.avsc') > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > LOCATION > 's3://somelocation/doctors/' > TBLPROPERTIES ( > 'transient_lastDdlTime'='1538130975'){code} > > Here, as you can see, the table schema URL points to the latest schema. > 3. I ran an _msck repair table_ to pick up all the partitions. > Fyi: If I run my select * query at this point then everything is fine and no > column switch happens. > 4. 
Then I changed the first partition's avro.schema.url to point to the > schema which is under that partition's folder (the non-evolved one -> > s3://somelocation/doctors/ > /dt=2019-02-05/_schema.avsc) > Then if you run a _select * from default.spark_test_ the columns will be > mixed up (on the data below the first_name column becomes the extra_field > column, I guess because in the latest schema it is the second column): > > {code:java} > number,extra_field,first_name,last_name,dt > 6,Colin,Baker,null,2019-02-05 > 3,Jon,Pertwee,null,2019-02-05 > 4,Tom,Baker,null,2019-02-05 > 5,Peter,Davison,null,2019-02-05 > 11,Matt,Smith,null,2019-02-05 > 1,William,Hartnell,null,2019-02-05 > 7,Sylvester,McCoy,null,2019-02-05 > 8,Paul,McGann,null,2019-02-05 > 2,Patrick,Troughton,null,2019-02-05 > 9,Christo
[jira] [Updated] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
[ https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26836: - Priority: Critical (was: Blocker) > Columns get switched in Spark SQL using Avro backed Hive table if schema > evolves > > > Key: SPARK-26836 > URL: https://issues.apache.org/jira/browse/SPARK-26836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0 > Environment: I tested with Hive and HCatalog which run on version > 2.3.4 and with Spark 2.3.1 and 2.4 >Reporter: Tamás Németh >Priority: Critical > Labels: correctness > Attachments: doctors.avro, doctors_evolved.avro, > doctors_evolved.json, original.avsc > > > I have a Hive Avro table where the Avro schema is stored on S3 next to the > Avro files. > In the table definition the avro.schema.url always points to the latest > partition's _schema.avsc file, which is always the latest schema. (Avro schemas > are backward and forward compatible in a table.) > When new data comes in, I always add a new partition where the > avro.schema.url property is also set to the _schema.avsc which was used when > it was added, and of course I always update the table's avro.schema.url property > to the latest one. > Querying this table works fine until the schema evolves in a way that a new > optional property is added in the middle. > When this happens, after a Spark SQL query the columns in the old > partition get mixed up and show the wrong data. > If I query the table with Hive then everything is perfectly fine and it gives > me back the correct columns, both for the partitions which were created with the old > schema and for the new ones which were created with the evolved schema. > > Here is how I could reproduce it with the > [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro] > example data in the sql test suite. 
> # I have created two partition folder: > {code:java} > [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/ > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors > /dt=2019-02-05/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-05/doctors.avro > Found 2 items > -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors > /dt=2019-02-06/_schema.avsc > -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors > /dt=2019-02-06/doctors_evolved.avro{code} > Here the first partition had data which was created with the schema before > evolving and the second one had the evolved one. (the evolved schema is the > same as in your testcase except I moved the extra_field column to the last > from the second and I generated two lines of avro data with the evolved > schema. > # I have created a hive table with the following command: > > {code:java} > CREATE EXTERNAL TABLE `default.doctors` > PARTITIONED BY ( > `dt` string > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > WITH SERDEPROPERTIES ( > 'avro.schema.url'='s3://somelocation/doctors/ > /dt=2019-02-06/_schema.avsc') > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > LOCATION > 's3://somelocation/doctors/' > TBLPROPERTIES ( > 'transient_lastDdlTime'='1538130975'){code} > > Here as you can see the table schema url points to the latest schema > 3. I ran an msck _repair table_ to pick up all the partitions. > Fyi: If I run my select * query from here then everything is fine and no > columns switch happening. > 4. 
Then I changed the first partition's avro.schema.url to point to the > schema which is under that partition's folder (the non-evolved one -> > s3://somelocation/doctors/ > /dt=2019-02-05/_schema.avsc) > Then if you run a _select * from default.spark_test_ the columns will be > mixed up (on the data below the first_name column becomes the extra_field > column, I guess because in the latest schema it is the second column): > > {code:java} > number,extra_field,first_name,last_name,dt > 6,Colin,Baker,null,2019-02-05 > 3,Jon,Pertwee,null,2019-02-05 > 4,Tom,Baker,null,2019-02-05 > 5,Peter,Davison,null,2019-02-05 > 11,Matt,Smith,null,2019-02-05 > 1,William,Hartnell,null,2019-02-05 > 7,Sylvester,McCoy,null,2019-02-05 > 8,Paul,McGann,null,2019-02-05 > 2,Patrick,Troughton,null,2019-02-05 > 9,Christopher,Eccleston,null,2019-02-05 > 10,David,Tennant,null,2019-02-05 > 21,fishfinger,Jim,Baker,2019-02-06 > 24,fishfinger,Bean,Pertwee,2019-02-06 > {code} > If I try the same query from Hive and not fro
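The reported symptom is what positional (rather than name-based) field resolution produces when an optional field is inserted mid-schema. A minimal plain-Python sketch of the two strategies, assuming hypothetical schemas modeled on the report (not actual Avro or Spark resolution code):

```python
# Sketch: why inserting an optional field mid-schema misorders columns
# when a reader pairs values with columns by POSITION instead of by NAME.
# Hypothetical schemas for illustration; not Avro/Spark code.

old_schema = ["number", "first_name", "last_name"]                 # writer schema
new_schema = ["number", "extra_field", "first_name", "last_name"]  # reader schema

record = {"number": 6, "first_name": "Colin", "last_name": "Baker"}

def resolve_by_name(record, reader_schema):
    """Correct: look each reader column up by name, None when absent
    (the behavior the reporter sees from Hive)."""
    return {col: record.get(col) for col in reader_schema}

def resolve_by_position(record, writer_schema, reader_schema):
    """Buggy: pair the writer's i-th value with the reader's i-th column,
    padding the tail with None."""
    values = [record[col] for col in writer_schema]
    values += [None] * (len(reader_schema) - len(values))
    return dict(zip(reader_schema, values))

print(resolve_by_name(record, new_schema))
# {'number': 6, 'extra_field': None, 'first_name': 'Colin', 'last_name': 'Baker'}
print(resolve_by_position(record, old_schema, new_schema))
# {'number': 6, 'extra_field': 'Colin', 'first_name': 'Baker', 'last_name': None}
```

The positional result reproduces the row "6,Colin,Baker,null" landing under "number,extra_field,first_name,last_name" in the report.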
[jira] [Updated] (SPARK-30979) spark-submit - no need to resolve dependencies in kubernetes cluster mode
[ https://issues.apache.org/jira/browse/SPARK-30979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dyno updated SPARK-30979: - Summary: spark-submit - no need to resolve dependencies in kubernetes cluster mode (was: no need to resolve dependencies in kubernetes cluster mode) > spark-submit - no need to resolve dependencies in kubernetes cluster mode > - > > Key: SPARK-30979 > URL: https://issues.apache.org/jira/browse/SPARK-30979 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.4.5 > Environment: spark-operator with spark-2.4.4 >Reporter: Dyno >Priority: Minor > > [https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L301] > > When using spark-operator, we observed that spark-submit tries to download > all the dependencies, but this is not necessary in Kubernetes cluster mode because the driver will do it > again. > Should the check be: ``` > if (!isMesosCluster && !isStandAloneCluster && !isKubernetesCluster) { > } > ``` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30979) no need to resolve dependencies in kubernetes cluster mode
Dyno created SPARK-30979: Summary: no need to resolve dependencies in kubernetes cluster mode Key: SPARK-30979 URL: https://issues.apache.org/jira/browse/SPARK-30979 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.4.5 Environment: spark-operator with spark-2.4.4 Reporter: Dyno [https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L301] When using spark-operator, we observed that spark-submit tries to download all the dependencies, but this is not necessary in Kubernetes cluster mode because the driver will do it again. Should the check be: ``` if (!isMesosCluster && !isStandAloneCluster && !isKubernetesCluster) { } ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
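The proposal amounts to gating client-side dependency resolution on the cluster manager and deploy mode. A plain-Python sketch of that decision, with a hypothetical helper name (the real logic lives in SparkSubmit.scala and operates on boolean flags like isMesosCluster):

```python
# Sketch of the proposed gating: skip client-side dependency resolution
# whenever the driver, which resolves dependencies itself, runs inside
# the cluster. Hypothetical helper; not Spark code.

# Managers whose cluster-mode drivers resolve dependencies themselves,
# per the proposed `!isMesosCluster && !isStandAloneCluster && !isKubernetesCluster` check.
SELF_RESOLVING_CLUSTER_MANAGERS = {"mesos", "standalone", "kubernetes"}

def should_resolve_dependencies(master, deploy_mode):
    """Return True if spark-submit itself should download dependencies."""
    if deploy_mode == "cluster" and master in SELF_RESOLVING_CLUSTER_MANAGERS:
        return False  # the in-cluster driver will download them again anyway
    return True

print(should_resolve_dependencies("kubernetes", "cluster"))  # False
print(should_resolve_dependencies("kubernetes", "client"))   # True
```

Under this sketch, YARN cluster mode would still resolve on the client, matching the existing behavior the proposed condition leaves untouched.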
[jira] [Assigned] (SPARK-30968) Upgrade aws-java-sdk-sts to 1.11.655
[ https://issues.apache.org/jira/browse/SPARK-30968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30968: - Assignee: Dongjoon Hyun > Upgrade aws-java-sdk-sts to 1.11.655 > > > Key: SPARK-30968 > URL: https://issues.apache.org/jira/browse/SPARK-30968 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30968) Upgrade aws-java-sdk-sts to 1.11.655
[ https://issues.apache.org/jira/browse/SPARK-30968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30968. --- Fix Version/s: 3.0.0 2.4.6 Resolution: Fixed Issue resolved by pull request 27720 [https://github.com/apache/spark/pull/27720] > Upgrade aws-java-sdk-sts to 1.11.655 > > > Key: SPARK-30968 > URL: https://issues.apache.org/jira/browse/SPARK-30968 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.6, 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30855) Issue using 'explode' function followed by a (*)star expand selection of resulting struct
[ https://issues.apache.org/jira/browse/SPARK-30855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047029#comment-17047029 ] Dongjoon Hyun commented on SPARK-30855: --- According to the public plan, no. Probably, 3.0.0 RC1? - https://spark.apache.org/versioning-policy.html 3.0.0 will arrive before Spark Summit 2020. > Issue using 'explode' function followed by a (*)star expand selection of > resulting struct > - > > Key: SPARK-30855 > URL: https://issues.apache.org/jira/browse/SPARK-30855 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Benoit Roy >Priority: Major > > An exception occurs when trying to use a _* expand_ selection after > performing an explode on an array of structs. > I am testing this on the preview2 release of Spark. > Here's a public repo containing a very simple scala test case that reproduces > the issue: > {code:java} > git clone g...@github.com:benoitroy/spark-30855.git{code} > Simply execute the *Spark30855Tests* class. > On a simple schema such as: > {code:java} > root > |-- k1: string (nullable = true) > |-- k2: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- k2.k1: struct (nullable = true) > ||||-- k2.k1.k1: string (nullable = true) > ||||-- k2.k1.k2: string (nullable = true) > |||-- k2.k2: string (nullable = true) {code} > The following test case will fail on the 'col.*' selection. > {code:java} > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.functions._ > import org.scalatest.funsuite.AnyFunSuite > class Spark30855Tests extends AnyFunSuite { > test("") { > // > val path = "src/test/data/json/data.json" > // > val spark = SparkSession > .builder() > .appName("Testing.") > .config("spark.master", "local") > .getOrCreate() > // > val df = spark.read.json(path) > // SUCCESS! > df.printSchema() > // SUCCESS! > df.select(explode(col("k2"))).show() > // SUCCESS! 
> df.select(explode(col("k2"))).select("col.*").printSchema() > // FAIL! > df.select(explode(col("k2"))).select("col.*").show() > } > } {code} > > The test class demonstrates two cases, one where it fails (as shown above) > and another where it succeeds. There is only a slight variation on the > schema of both cases. The succeeding case works on the following schema: > {code:java} > root > |-- k1: string (nullable = true) > |-- k2: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- k2.k1: struct (nullable = true) > ||||-- k2.k1.k1: string (nullable = true) > |||-- k2.k2: string (nullable = true) {code} > You will notice that this schema simply removes a field from the nested > struct 'k2.k1'. > > The stacktrace produced by the failing case is show below: > {code:java} > Binding attribute, tree: _gen_alias_23#23Binding attribute, tree: > _gen_alias_23#23org.apache.spark.sql.catalyst.errors.package$TreeNodeException: > Binding attribute, tree: _gen_alias_23#23 at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:291) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:291) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) > at > org.ap
[jira] [Comment Edited] (SPARK-30855) Issue using 'explode' function followed by a (*)star expand selection of resulting struct
[ https://issues.apache.org/jira/browse/SPARK-30855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047029#comment-17047029 ] Dongjoon Hyun edited comment on SPARK-30855 at 2/27/20 11:00 PM: - According to the public plan, no. Probably, 3.0.0 RC1? - https://spark.apache.org/versioning-policy.html 3.0.0 will arrive before Spark Summit 2020. was (Author: dongjoon): According to the public plan, no. Probably, 3.0.0 RC1? - https://spark.apache.org/versioning-policy.html 3.0.0 will arrive before Spark Summit 2020. > Issue using 'explode' function followed by a (*)star expand selection of > resulting struct > - > > Key: SPARK-30855 > URL: https://issues.apache.org/jira/browse/SPARK-30855 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Benoit Roy >Priority: Major > > An exception occurs when trying to use a _* expand_ selection after > performing an explode on a array of struct. > I am testing this on preview2 release of spark. > Here's a public repo containing a very simple scala test case that reproduces > the issue > {code:java} > git clone g...@github.com:benoitroy/spark-30855.git{code} > Simply execute the *Spark30855Tests* class. > On a simple schema such as: > {code:java} > root > |-- k1: string (nullable = true) > |-- k2: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- k2.k1: struct (nullable = true) > ||||-- k2.k1.k1: string (nullable = true) > ||||-- k2.k1.k2: string (nullable = true) > |||-- k2.k2: string (nullable = true) {code} > The following test case will fail on the 'col.*' selection. 
> {code:java} > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.functions._ > import org.scalatest.funsuite.AnyFunSuite > class Spark38055Tests extends AnyFunSuite { > test("") { > // > val path = "src/test/data/json/data.json" > // > val spark = SparkSession > .builder() > .appName("Testing.") > .config("spark.master", "local") > .getOrCreate(); > // > val df = spark.read.json(path) > // SUCCESS! > df.printSchema() > // SUCCESS! > df.select(explode(col("k2"))).show() > // SUCCESS! > df.select(explode(col("k2"))).select("col.*").printSchema() > // FAIL! > df.select(explode(col("k2"))).select("col.*").show() > } > } {code} > > The test class demonstrates two cases, one where it fails (as shown above) > and another where it succeeds. There is only a slight variation on the > schema of both cases. The succeeding case works on the following schema: > {code:java} > root > |-- k1: string (nullable = true) > |-- k2: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- k2.k1: struct (nullable = true) > ||||-- k2.k1.k1: string (nullable = true) > |||-- k2.k2: string (nullable = true) {code} > You will notice that this schema simply removes a field from the nested > struct 'k2.k1'. 
> > The stacktrace produced by the failing case is show below: > {code:java} > Binding attribute, tree: _gen_alias_23#23Binding attribute, tree: > _gen_alias_23#23org.apache.spark.sql.catalyst.errors.package$TreeNodeException: > Binding attribute, tree: _gen_alias_23#23 at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:291) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:291) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376) > at > org.apache.spark.sql.catalyst.trees.TreeNode
[jira] [Commented] (SPARK-30442) Write mode ignored when using CodecStreams
[ https://issues.apache.org/jira/browse/SPARK-30442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047024#comment-17047024 ] Abhishek Madav commented on SPARK-30442: In case of a task failure, say the task fails to write to local disk or is interrupted, an empty file is still materialized on the file system. The next task attempt that retries the write to this location sees the file and fails with a FileAlreadyExistsException, which makes the write not resilient to task failures. > Write mode ignored when using CodecStreams > -- > > Key: SPARK-30442 > URL: https://issues.apache.org/jira/browse/SPARK-30442 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.4 >Reporter: Jesse Collins >Priority: Major > > Overwrite is hardcoded to false in the codec stream. This can cause issues, > particularly with AWS tools, that make it impossible to retry. > Ideally, this should be read from the write mode set for the DataWriter that > is writing through this codec class. > [https://github.com/apache/spark/blame/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala#L81] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
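The resilience problem can be demonstrated outside Spark: a writer hard-coded to exclusive create fails on retry once a partial file exists, while an overwrite-capable writer succeeds. A plain-Python sketch using stdlib file flags (the `overwrite` parameter here illustrates the behavior being requested of CodecStreams; it is not Spark's API):

```python
# Sketch of the retry problem: exclusive create (overwrite hard-coded
# to False) vs. truncate-or-create (overwrite honored).
import os
import tempfile

def open_for_write(path, overwrite):
    """Exclusive create when overwrite=False, truncate-or-create otherwise."""
    flags = os.O_WRONLY | os.O_CREAT
    flags |= os.O_TRUNC if overwrite else os.O_EXCL
    return os.fdopen(os.open(flags=flags, path=path), "w")

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "part-00000")

    # First attempt "dies" right after materializing an empty file.
    open_for_write(path, overwrite=False).close()

    # Retry with overwrite hard-coded to False: the failure mode reported.
    try:
        open_for_write(path, overwrite=False)
        retried_ok = True
    except FileExistsError:
        retried_ok = False

    # Retry that honors the writer's overwrite mode succeeds.
    with open_for_write(path, overwrite=True) as f:
        f.write("data")
    with open(path) as f:
        content = f.read()

print(retried_ok, content)  # False data
```

The FileExistsError corresponds to the FileAlreadyExistsException a retried Spark task would hit against a real file system.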
[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046958#comment-17046958 ] Jorge Machado commented on SPARK-26412: --- Thanks for the tip. It helps. > Allow Pandas UDF to take an iterator of pd.DataFrames > - > > Key: SPARK-26412 > URL: https://issues.apache.org/jira/browse/SPARK-26412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Pandas UDF is the ideal connection between PySpark and DL model inference > workloads. However, the user needs to load the model file first to make > predictions. It is common to see models of size ~100MB or bigger. If the > Pandas UDF execution is limited to each batch, the user needs to repeatedly load > the same model for every batch in the same Python worker process, which is > inefficient. > We can provide users the iterator of batches in pd.DataFrame and let user > code handle it: > {code} > @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) > def predict(batch_iter): > model = ... # load model > for batch in batch_iter: > yield model.predict(batch) > {code} > The type of each batch is: > * a pd.Series if the UDF is called with a single non-struct-type column > * a tuple of pd.Series if the UDF is called with more than one Spark DF column > * a pd.DataFrame if the UDF is called with a single StructType column > Examples: > {code} > @pandas_udf(...) > def evaluate(batch_iter): > model = ... # load model > for features, label in batch_iter: > pred = model.predict(features) > yield (pred - label).abs() > df.select(evaluate(col("features"), col("label")).alias("err")) > {code} > {code} > @pandas_udf(...) > def evaluate(pdf_iter): > model = ... 
# load model > for pdf in pdf_iter: > pred = model.predict(pdf['x']) > yield (pred - pdf['y']).abs() > df.select(evaluate(struct(col("features"), col("label"))).alias("err")) > {code} > If the UDF doesn't return the same number of records for the entire > partition, the user should see an error. We don't require that every yield > match the input batch size. > Another benefit is that, with the iterator interface and asyncio from Python, it is > flexible for users to implement data pipelining. > cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]
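The one-time model load that motivates the iterator API can be sketched in plain Python, with no Spark required. The doubling "model" below is a stand-in for an expensive model load; the point is that setup happens once per worker process while batches are still processed lazily:

```python
def predict(batch_iter):
    # Expensive one-time setup: runs once per worker process, not per batch.
    model = lambda xs: [x * 2 for x in xs]  # stand-in for a loaded model
    for batch in batch_iter:
        yield model(batch)

batches = [[1.0, 2.0], [3.0]]
predictions = list(predict(batches))
print(predictions)  # [[2.0, 4.0], [6.0]]
```

Because `predict` is a generator, downstream consumption can also overlap with upstream batch production, which is the data-pipelining benefit mentioned above.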
[jira] [Comment Edited] (SPARK-30928) ML, GraphX 3.0 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-30928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043874#comment-17043874 ] Huaxin Gao edited comment on SPARK-30928 at 2/27/20 7:55 PM: - I audited all the ML, MLlib, GraphX-related MiMa exclusions added for 3.0: a few of them are not necessary. Will open a PR to remove those. The following are false positives: https://issues.apache.org/jira/browse/SPARK-16872 (private constructor) https://issues.apache.org/jira/browse/SPARK-25838 (private or protected member variables) https://issues.apache.org/jira/browse/SPARK-11215 (protected methods) https://issues.apache.org/jira/browse/SPARK-26616 (private constructor or private object) https://issues.apache.org/jira/browse/SPARK-25765 (private constructor) https://issues.apache.org/jira/browse/SPARK-23042 (private object) https://issues.apache.org/jira/browse/SPARK-30329 (private method and also ReversedMissingMethodProblem) The following are caused by inheritance structure changes. The external APIs are still the same for users, so we don't need to document these in the migration guide. https://issues.apache.org/jira/browse/SPARK-29645 (the param is moved from the individual algorithm to shared Params) https://issues.apache.org/jira/browse/SPARK-28968 (the param is moved from the individual algorithm to shared Params) https://issues.apache.org/jira/browse/SPARK-3037 (AFT extends Estimator -> AFT extends Regressor) Need to check the migration guide for the following: Removed deprecated APIs: https://issues.apache.org/jira/browse/SPARK-28980 https://issues.apache.org/jira/browse/SPARK-27410 https://issues.apache.org/jira/browse/SPARK-26127 https://issues.apache.org/jira/browse/SPARK-26090 https://issues.apache.org/jira/browse/SPARK-25382 Binary incompatible changes: https://issues.apache.org/jira/browse/SPARK-28780 https://issues.apache.org/jira/browse/SPARK-26133 https://issues.apache.org/jira/browse/SPARK-30144 https://issues.apache.org/jira/browse/SPARK-30630 > ML, GraphX 3.0 QA: API: Binary incompatible changes > --- > > Key: SPARK-30928 > URL: https://issues.apache.org/jira/browse/SPARK-30928 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Blocker > Fix For: 3.0.0 > > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice.
[jira] [Commented] (SPARK-30969) Remove resource coordination support from Standalone
[ https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046929#comment-17046929 ] Xingbo Jiang commented on SPARK-30969: -- I created https://issues.apache.org/jira/browse/SPARK-30978 to deprecate the multiple-workers-on-the-same-host support in the Standalone backend. > Remove resource coordination support from Standalone > > > Key: SPARK-30969 > URL: https://issues.apache.org/jira/browse/SPARK-30969 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Critical > > Resource coordination is used for the case where multiple workers run on > the same host. However, it should be a rare or even impossible use case in > the current Standalone backend (which already allows multiple executors in a single worker). > We should remove support for it to simplify the implementation and reduce the > potential maintenance cost in the future.
[jira] [Updated] (SPARK-30978) Remove multiple workers on the same host support from Standalone backend
[ https://issues.apache.org/jira/browse/SPARK-30978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingbo Jiang updated SPARK-30978: - Description: Based on our experience, there is no scenario that necessarily requires deploying multiple Workers on the same node with the Standalone backend. A Worker should claim all the resources reserved to Spark on the host where it is launched, and can then allocate those resources to one or more executors launched by this Worker. Since each executor runs in a separate JVM, we can limit the memory of each executor to avoid long GC pauses. The remaining concern is that local-cluster mode is implemented by launching multiple Workers on the local host, so we might need to re-implement LocalSparkCluster to launch only one Worker and multiple executors. This should be fine because local-cluster mode is only used for running Spark unit test cases, so end users should not be affected by this change. Removing support for multiple workers on the same host would simplify the deployment model of the Standalone backend and also reduce the burden of supporting legacy deployment patterns in future feature development. The proposal is to update the documentation to deprecate support for the environment variable `SPARK_WORKER_INSTANCES` in 3.0, and remove the support in the next major version (3.1.0). > Remove multiple workers on the same host support from Standalone backend > > > Key: SPARK-30978 > URL: https://issues.apache.org/jira/browse/SPARK-30978 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0, 3.1.0 >Reporter: Xingbo Jiang >Assignee: Xingbo Jiang >Priority: Major > > Based on our experience, there is no scenario that necessarily requires > deploying multiple Workers on the same node with the Standalone backend. A Worker > should claim all the resources reserved to Spark on the host where it is launched, > and can then allocate those resources to one or more executors launched by this Worker. > Since each executor runs in a separate JVM, we can limit the > memory of each executor to avoid long GC pauses. > The remaining concern is that local-cluster mode is implemented by launching > multiple Workers on the local host, so we might need to re-implement > LocalSparkCluster to launch only one Worker and multiple executors. This should > be fine because local-cluster mode is only used for running Spark unit test > cases, so end users should not be affected by this change. > Removing support for multiple workers on the same host would simplify the deployment > model of the Standalone backend and also reduce the burden of supporting legacy > deployment patterns in future feature development. > The proposal is to update the documentation to deprecate support for the > environment variable `SPARK_WORKER_INSTANCES` in 3.0, and remove the support in the > next major version (3.1.0). 
[jira] [Created] (SPARK-30978) Remove multiple workers on the same host support from Standalone backend
Xingbo Jiang created SPARK-30978: Summary: Remove multiple workers on the same host support from Standalone backend Key: SPARK-30978 URL: https://issues.apache.org/jira/browse/SPARK-30978 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 3.0.0, 3.1.0 Reporter: Xingbo Jiang Assignee: Xingbo Jiang Based on our experience, there is no scenario that necessarily requires deploying multiple Workers on the same node with the Standalone backend. A Worker should claim all the resources reserved to Spark on the host where it is launched, and can then allocate those resources to one or more executors launched by this Worker. Since each executor runs in a separate JVM, we can limit the memory of each executor to avoid long GC pauses. The remaining concern is that local-cluster mode is implemented by launching multiple Workers on the local host, so we might need to re-implement LocalSparkCluster to launch only one Worker and multiple executors. This should be fine because local-cluster mode is only used for running Spark unit test cases, so end users should not be affected by this change. Removing support for multiple workers on the same host would simplify the deployment model of the Standalone backend and also reduce the burden of supporting legacy deployment patterns in future feature development.
[jira] [Updated] (SPARK-30977) ResourceProfile and Builder should be private in spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-30977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-30977: -- Target Version/s: 3.0.0 > ResourceProfile and Builder should be private in spark 3.0 > -- > > Key: SPARK-30977 > URL: https://issues.apache.org/jira/browse/SPARK-30977 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > it looks like ResourceProfile and ResourceProfileBuilder accidentally got > opened up - they should be private[spark] until the stage level scheduling > feature is complete, which won't make the 3.0 release. So make them private > in 3.0 branch -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30977) ResourceProfile and Builder should be private in spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-30977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-30977: - Assignee: (was: Thomas Graves) > ResourceProfile and Builder should be private in spark 3.0 > -- > > Key: SPARK-30977 > URL: https://issues.apache.org/jira/browse/SPARK-30977 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > it looks like ResourceProfile and ResourceProfileBuilder accidentally got > opened up - they should be private[spark] until the stage level scheduling > feature is complete, which won't make the 3.0 release. So make them private > in 3.0 branch -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30977) ResourceProfile and Builder should be private in spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-30977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046891#comment-17046891 ] Thomas Graves commented on SPARK-30977: --- I'm working on this; I should have a PR up by end of day. > ResourceProfile and Builder should be private in spark 3.0 > -- > > Key: SPARK-30977 > URL: https://issues.apache.org/jira/browse/SPARK-30977 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > It looks like ResourceProfile and ResourceProfileBuilder accidentally got > opened up; they should be private[spark] until the stage-level scheduling > feature is complete, which won't make the 3.0 release. So make them private > in the 3.0 branch.
[jira] [Created] (SPARK-30977) ResourceProfile and Builder should be private in spark 3.0
Thomas Graves created SPARK-30977: - Summary: ResourceProfile and Builder should be private in spark 3.0 Key: SPARK-30977 URL: https://issues.apache.org/jira/browse/SPARK-30977 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Thomas Graves Assignee: Thomas Graves it looks like ResourceProfile and ResourceProfileBuilder accidentally got opened up - they should be private[spark] until the stage level scheduling feature is complete, which won't make the 3.0 release. So make them private in 3.0 branch -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Deleted] (SPARK-30976) Improve Maven Install Logic in build/mvn
[ https://issues.apache.org/jira/browse/SPARK-30976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai deleted SPARK-30976: - > Improve Maven Install Logic in build/mvn > > > Key: SPARK-30976 > URL: https://issues.apache.org/jira/browse/SPARK-30976 > Project: Spark > Issue Type: Improvement >Reporter: Wesley Hsiao >Priority: Major > > The current code lacks a validation step to test the installed Maven > binary. This is a point of failure where Apache Jenkins jobs can > fail because the Maven binary fails to run due to a corrupted download from an > Apache mirror. > To improve the stability of Apache Jenkins builds, Maven binary > test logic should be added after the Maven download to verify that the Maven > binary works. If it doesn't pass the test, then download and install from the > archive apache rep
[jira] [Updated] (SPARK-30976) Improve Maven Install Logic in build/mvn
[ https://issues.apache.org/jira/browse/SPARK-30976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wesley Hsiao updated SPARK-30976: - Description: The current code at lacks a validation step to test the installed maven binary at This is a point of failure where apache jenkins machine jobs can fail where a maven binary can fail to run due to a corrupted download from an apache mirror. To improve the stability of apache jenkins machine builds, a maven binary test logic should be added after maven download to verify that the maven binary works. If it doesn't pass the test, then download and install from archive apache rep was: The current code at [https://github.com/databricks/runtime/blob/master/build/mvn] lacks a validation step to test the installed maven binary at [https://github.com/databricks/runtime/blob/db9c17c77bb8e46f45038a992b4f12427e2a2692/build/mvn#L88-L102.] This is a point of failure where apache jenkins machine jobs can fail where a maven binary can fail to run due to a corrupted download from an apache mirror. To improve the stability of apache jenkins machine builds, a maven binary test logic should be added after maven download to verify that the maven binary works. If it doesn't pass the test, then download and install from archive apache rep > Improve Maven Install Logic in build/mvn > > > Key: SPARK-30976 > URL: https://issues.apache.org/jira/browse/SPARK-30976 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Wesley Hsiao >Priority: Major > > The current code at lacks a validation step to test the installed maven > binary at This is a point of failure where apache jenkins machine jobs can > fail where a maven binary can fail to run due to a corrupted download from an > apache mirror. > To improve the stability of apache jenkins machine builds, a maven binary > test logic should be added after maven download to verify that the maven > binary works. 
If it doesn't pass the test, then download and install from > archive apache rep
[jira] [Created] (SPARK-30976) Improve Maven Install Logic in build/mvn
Wesley Hsiao created SPARK-30976: Summary: Improve Maven Install Logic in build/mvn Key: SPARK-30976 URL: https://issues.apache.org/jira/browse/SPARK-30976 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: Wesley Hsiao The current code at [https://github.com/databricks/runtime/blob/master/build/mvn] lacks a validation step to test the installed maven binary at [https://github.com/databricks/runtime/blob/db9c17c77bb8e46f45038a992b4f12427e2a2692/build/mvn#L88-L102.] This is a point of failure where apache jenkins machine jobs can fail where a maven binary can fail to run due to a corrupted download from an apache mirror. To improve the stability of apache jenkins machine builds, a maven binary test logic should be added after maven download to verify that the maven binary works. If it doesn't pass the test, then download and install from archive apache rep -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-30951: -- Description: tl;dr: We recently discovered some Spark 2.x sites that have lots of data containing dates before October 15, 1582. This could be an issue when such sites try to upgrade to Spark 3.0. >From SPARK-26651: {quote}"The changes might impact on the results for dates and timestamps before October 15, 1582 (Gregorian) {quote} We recently discovered that some large scale Spark 2.x applications rely on dates before October 15, 1582. Two cases came up recently: * An application that uses a commercial third-party library to encode sensitive dates. On insert, the library encodes the actual date as some other date. On select, the library decodes the date back to the original date. The encoded value could be any date, including one before October 15, 1582 (e.g., "0602-04-04"). * An application that uses a specific unlikely date (e.g., "1200-01-01") as a marker to indicate "unknown date" (in lieu of null) Both sites ran into problems after another component in their system was upgraded to use the proleptic Gregorian calendar. Spark applications that read files created by the upgraded component were interpreting encoded or marker dates incorrectly, and vice versa. Also, their data now had a mix of calendars (hybrid and proleptic Gregorian) with no metadata to indicate which file used which calendar. Both sites had enormous amounts of existing data, so re-encoding the dates using some other scheme was not a feasible solution. This is relevant to Spark 3: Any Spark 2 application that uses such date-encoding schemes may run into trouble when run on Spark 3. The application may not properly interpret the dates previously written by Spark 2. 
Also, once the Spark 3 version of the application writes data, the tables will have a mix of calendars (hybrid and proleptic gregorian) with no metadata to indicate which file uses which calendar. Similarly, sites might run with mixed Spark versions, resulting in data written by one version that cannot be interpreted by the other. And as above, the tables will now have a mix of calendars with no way to detect which file uses which calendar. As with the two real-life example cases, these applications may have enormous amounts of legacy data, so re-encoding the dates using some other scheme may not be feasible. We might want to consider a configuration setting to allow the user to specify the calendar for storing and retrieving date and timestamp values (not sure how such a flag would affect other date and timestamp-related functions). I realize the change is far bigger than just adding a configuration setting. Here's a quick example of where trouble may happen, using the real-life case of the marker date. In Spark 2.4: {noformat} scala> spark.read.orc(s"$home/data/datefile").filter("dt == '1200-01-01'").count res0: Long = 1 scala> {noformat} In Spark 3.0 (reading from the same legacy file): {noformat} scala> spark.read.orc(s"$home/data/datefile").filter("dt == '1200-01-01'").count res0: Long = 0 scala> {noformat} By the way, Hive had a similar problem. Hive switched from hybrid calendar to proleptic Gregorian calendar between 2.x and 3.x. After some upgrade headaches related to dates before 1582, the Hive community made the following changes: * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive checks a configuration setting to determine which calendar to use. * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive stores the calendar type in the metadata. * When reading date or timestamp data from ORC, Parquet, and Avro files, Hive checks the metadata for the calendar type. 
* When reading date or timestamp data from ORC, Parquet, and Avro files that lack calendar metadata, Hive's behavior is determined by a configuration setting. This allows Hive to read legacy data (note: if the data already consists of a mix of calendar types with no metadata, there is no good solution).
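The calendar difference at the heart of this issue can be checked with Python's datetime module, which, like Spark 3.x, uses the proleptic Gregorian calendar:

```python
from datetime import date

# In the proleptic Gregorian calendar the dates 1582-10-05 through
# 1582-10-14 are valid, so this difference is 11 days. In the hybrid
# Julian/Gregorian calendar used by Spark 2.x those ten days were
# skipped entirely, which is why dates before the 1582 cutover
# round-trip differently between the two calendars.
gap = (date(1582, 10, 15) - date(1582, 10, 4)).days
print(gap)  # 11
```

This is the same mismatch that makes the `'1200-01-01'` marker date in the filter example above resolve to a different underlying day count in Spark 2.4 and Spark 3.0.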
[jira] [Commented] (SPARK-30961) Arrow enabled: to_pandas with date column fails
[ https://issues.apache.org/jira/browse/SPARK-30961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046801#comment-17046801 ] Bryan Cutler commented on SPARK-30961: -- Yes, we should be able to keep Spark 3.x up to date with the latest pyarrow. It is currently being tested against 0.15.1 and I've tested manually with 0.16.0 also. > Arrow enabled: to_pandas with date column fails > --- > > Key: SPARK-30961 > URL: https://issues.apache.org/jira/browse/SPARK-30961 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5 > Environment: Apache Spark 2.4.5 >Reporter: Nicolas Renkamp >Priority: Major > Labels: ready-to-commit > > Hi, > there seems to be a bug in the Arrow-enabled toPandas conversion from Spark > DataFrame to pandas DataFrame when the DataFrame has a column of type > DateType. Here is a minimal example to reproduce the issue: > {code:java} > spark = SparkSession.builder.getOrCreate() > is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled") > print("Arrow optimization is enabled: " + is_arrow_enabled) > spark_df = spark.createDataFrame( > [['2019-12-06']], 'created_at: string') \ > .withColumn('created_at', F.to_date('created_at')) > # works > spark_df.toPandas() > spark.conf.set("spark.sql.execution.arrow.enabled", 'true') > is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled") > print("Arrow optimization is enabled: " + is_arrow_enabled) > # raises AttributeError: Can only use .dt accessor with datetimelike values > # series is still of type object, .dt does not exist > spark_df.toPandas(){code} > A fix would be to modify the _check_series_convert_date function in > pyspark.sql.types to: > {code:java} > def _check_series_convert_date(series, data_type): > """ > Cast the series to datetime.date if it's a date type, otherwise returns > the original series. > :param series: pandas.Series > :param data_type: a Spark data type for the series > """ > from pyspark.sql.utils import require_minimum_pandas_version > require_minimum_pandas_version() > from pandas import to_datetime > if type(data_type) == DateType: > return to_datetime(series).dt.date > else: > return series > {code} > Let me know if I should prepare a Pull Request for the 2.4.5 branch. > I have not tested the behavior on the master branch. > > Thanks, > Nicolas
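The core of the proposed fix can be exercised directly in pandas (assuming pandas is installed), with no Spark session: an object-dtype series of date strings has no usable `.dt` accessor, but converting through `to_datetime` first and then taking `.dt.date` yields `datetime.date` values.

```python
import datetime
import pandas as pd

# An object-dtype series of date strings, as in the bug report.
series = pd.Series(["2019-12-06"])

# Converting via to_datetime first makes the .dt accessor available;
# .dt.date then produces real datetime.date objects.
dates = pd.to_datetime(series).dt.date
print(dates[0])  # 2019-12-06
```

Calling `series.dt.date` directly on the original object-dtype series would raise the same `AttributeError` quoted in the report.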
[jira] [Commented] (SPARK-30969) Remove resource coordination support from Standalone
[ https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046795#comment-17046795 ] Xiangrui Meng commented on SPARK-30969: --- [~Ngone51] [~jiangxb1987] Is there a JIRA to deprecate multiple workers running on the same host? Could you create one and link it here? I think we should deprecate it in 3.0. > Remove resource coordination support from Standalone > > > Key: SPARK-30969 > URL: https://issues.apache.org/jira/browse/SPARK-30969 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Critical > > Resource coordination is used for the case where multiple workers run on > the same host. However, it should be a rare or even impossible use case in > the current Standalone backend (which already allows multiple executors in a single worker). > We should remove support for it to simplify the implementation and reduce the > potential maintenance cost in the future.
[jira] [Updated] (SPARK-30969) Remove resource coordination support from Standalone
[ https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30969: -- Environment: (was: Resource coordination is used for the case where multiple workers running on the same host. However, it should be a rarely or event impossible use case in current Standalone(which already allow multiple executor in a single worker). We should remove support for it to simply the implementation and reduce the potential maintain cost in the future.) > Remove resource coordination support from Standalone > > > Key: SPARK-30969 > URL: https://issues.apache.org/jira/browse/SPARK-30969 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Critical > > Resource coordination is used for the case where multiple workers running on > the same host. However, it should be a rarely or event impossible use case in > current Standalone(which already allow multiple executor in a single worker). > We should remove support for it to simply the implementation and reduce the > potential maintain cost in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30969) Remove resource coordination support from Standalone
[ https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30969: -- Priority: Critical (was: Major) > Remove resource coordination support from Standalone > > > Key: SPARK-30969 > URL: https://issues.apache.org/jira/browse/SPARK-30969 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 > Environment: Resource coordination is used for the case where > multiple workers run on the same host. However, it should be a rare or > even impossible use case in current Standalone (which already allows multiple > executors in a single worker). We should remove support for it to simplify the > implementation and reduce the potential maintenance cost in the future. >Reporter: wuyi >Priority: Critical > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30969) Remove resource coordination support from Standalone
[ https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-30969: - Assignee: wuyi > Remove resource coordination support from Standalone > > > Key: SPARK-30969 > URL: https://issues.apache.org/jira/browse/SPARK-30969 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 > Environment: Resource coordination is used for the case where > multiple workers run on the same host. However, it should be a rare or > even impossible use case in current Standalone (which already allows multiple > executors in a single worker). We should remove support for it to simplify the > implementation and reduce the potential maintenance cost in the future. >Reporter: wuyi >Assignee: wuyi >Priority: Critical > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30969) Remove resource coordination support from Standalone
[ https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30969: -- Description: Resource coordination is used for the case where multiple workers run on the same host. However, it should be a rare or even impossible use case in current Standalone (which already allows multiple executors in a single worker). We should remove support for it to simplify the implementation and reduce the potential maintenance cost in the future. > Remove resource coordination support from Standalone > > > Key: SPARK-30969 > URL: https://issues.apache.org/jira/browse/SPARK-30969 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 > Environment: Resource coordination is used for the case where > multiple workers run on the same host. However, it should be a rare or > even impossible use case in current Standalone (which already allows multiple > executors in a single worker). We should remove support for it to simplify the > implementation and reduce the potential maintenance cost in the future. >Reporter: wuyi >Assignee: wuyi >Priority: Critical > > Resource coordination is used for the case where multiple workers run on > the same host. However, it should be a rare or even impossible use case in > current Standalone (which already allows multiple executors in a single worker). > We should remove support for it to simplify the implementation and reduce the > potential maintenance cost in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30975) Rename config for spark.<>.memoryOverhead
[ https://issues.apache.org/jira/browse/SPARK-30975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miquel Angel Andreu updated SPARK-30975: Summary: Rename config for spark.<>.memoryOverhead (was: Rename config to spark.executor.memoryOverhead) > Rename config for spark.<>.memoryOverhead > - > > Key: SPARK-30975 > URL: https://issues.apache.org/jira/browse/SPARK-30975 > Project: Spark > Issue Type: Task > Components: Documentation, Spark Submit >Affects Versions: 2.4.5 >Reporter: Miquel Angel Andreu >Priority: Minor > Fix For: 2.4.6 > > > The Spark configuration was changed recently, and to keep the code consistent > we need to rename OverHeadMemory in the code to the new name: > {{spark.executor.memoryOverhead}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30975) Rename config to spark.executor.memoryOverhead
Miquel Angel Andreu created SPARK-30975: --- Summary: Rename config to spark.executor.memoryOverhead Key: SPARK-30975 URL: https://issues.apache.org/jira/browse/SPARK-30975 Project: Spark Issue Type: Task Components: Documentation, Spark Submit Affects Versions: 2.4.5 Reporter: Miquel Angel Andreu Fix For: 2.4.6 The Spark configuration was changed recently, and to keep the code consistent we need to rename OverHeadMemory in the code to the new name: {{spark.executor.memoryOverhead}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28994) Document working of Adaptive
[ https://issues.apache.org/jira/browse/SPARK-28994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046710#comment-17046710 ] Takeshi Yamamuro commented on SPARK-28994: -- Adaptive? This means adaptive execution? btw, is it worth documenting this in the SQL references? > Document working of Adaptive > > > Key: SPARK-28994 > URL: https://issues.apache.org/jira/browse/SPARK-28994 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28993) Document Working of Bucketing
[ https://issues.apache.org/jira/browse/SPARK-28993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046708#comment-17046708 ] Takeshi Yamamuro commented on SPARK-28993: -- Any update? btw, is it worth documenting this in the SQL references? > Document Working of Bucketing > - > > Key: SPARK-28993 > URL: https://issues.apache.org/jira/browse/SPARK-28993 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28965) Document workings of CBO
[ https://issues.apache.org/jira/browse/SPARK-28965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046706#comment-17046706 ] Takeshi Yamamuro commented on SPARK-28965: -- Any update? btw, is it worth documenting this in the SQL references? > Document workings of CBO > > > Key: SPARK-28965 > URL: https://issues.apache.org/jira/browse/SPARK-28965 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: Dilip Biswal >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28995) Document working of Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-28995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-28995. -- Resolution: Invalid > Document working of Spark Streaming > --- > > Key: SPARK-28995 > URL: https://issues.apache.org/jira/browse/SPARK-28995 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28995) Document working of Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-28995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046704#comment-17046704 ] Takeshi Yamamuro commented on SPARK-28995: -- I think this is not related to the SQL refs, so I'll close this. Please reopen it if there is any problem. > Document working of Spark Streaming > --- > > Key: SPARK-28995 > URL: https://issues.apache.org/jira/browse/SPARK-28995 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29458) Document scalar functions usage in APIs in SQL getting started.
[ https://issues.apache.org/jira/browse/SPARK-29458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046701#comment-17046701 ] Takeshi Yamamuro commented on SPARK-29458: -- [~dkbiswal] Any update? > Document scalar functions usage in APIs in SQL getting started. > --- > > Key: SPARK-29458 > URL: https://issues.apache.org/jira/browse/SPARK-29458 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: Dilip Biswal >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30095) create function syntax has to be enhance in Doc for multiple dependent jars
[ https://issues.apache.org/jira/browse/SPARK-30095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046698#comment-17046698 ] Takeshi Yamamuro commented on SPARK-30095: -- [~abhishek.akg] Any update? > create function syntax has to be enhance in Doc for multiple dependent jars > > > Key: SPARK-30095 > URL: https://issues.apache.org/jira/browse/SPARK-30095 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > The CREATE FUNCTION example and syntax have to be enhanced as below: > 1. Case 1: How to use multiple dependent jars in the path while creating a > function is not clear -- the syntax should be given. > 2. Case 2: The different schemes supported (like file:///) are not documented > -- the supported schemes should be listed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30635) Document PARTITIONED BY Clause of CREATE statement in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-30635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-30635. -- Resolution: Duplicate I think this has been resolved by SPARK-28794 ([https://github.com/apache/spark/pull/26759/files]). Please reopen it if you have any problem. > Document PARTITIONED BY Clause of CREATE statement in SQL Reference > > > Key: SPARK-30635 > URL: https://issues.apache.org/jira/browse/SPARK-30635 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.4 >Reporter: jobit mathew >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30693) Document STORED AS Clause of CREATE statement in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-30693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-30693. -- Resolution: Duplicate I think this has been resolved by SPARK-28794 ([https://github.com/apache/spark/pull/26759/files]). Please reopen it if you have any problem. > Document STORED AS Clause of CREATE statement in SQL Reference > -- > > Key: SPARK-30693 > URL: https://issues.apache.org/jira/browse/SPARK-30693 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.4 >Reporter: jobit mathew >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30635) Document PARTITIONED BY Clause of CREATE statement in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-30635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046690#comment-17046690 ] Takeshi Yamamuro commented on SPARK-30635: -- [~jobitmathew] still working on it? > Document PARTITIONED BY Clause of CREATE statement in SQL Reference > > > Key: SPARK-30635 > URL: https://issues.apache.org/jira/browse/SPARK-30635 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.4 >Reporter: jobit mathew >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30974) org.apache.spark.sql.AnalysisException: expression 'default.udfvalidation.`empname`' is neither present in the group by, nor is it an aggregate function.
[ https://issues.apache.org/jira/browse/SPARK-30974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akshay updated SPARK-30974: --- Summary: org.apache.spark.sql.AnalysisException: expression 'default.udfvalidation.`empname`' is neither present in the group by, nor is it an aggregate function. (was: org.apache.spark.sql.AnalysisException: expression 'default.udfvalidation.`empname`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;) > org.apache.spark.sql.AnalysisException: expression > 'default.udfvalidation.`empname`' is neither present in the group by, nor is > it an aggregate function. > -- > > Key: SPARK-30974 > URL: https://issues.apache.org/jira/browse/SPARK-30974 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.2 >Reporter: Akshay >Priority: Minor > > I'm getting the following exception while executing the query in spark 2.4.2 > !image-2020-02-27-20-07-03-701.png! > > > !image-2020-02-27-20-03-01-399.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30974) org.apache.spark.sql.AnalysisException: expression 'default.udfvalidation.`empname`' is neither present in the group by, nor is it an aggregate function. Add to group by
Akshay created SPARK-30974: -- Summary: org.apache.spark.sql.AnalysisException: expression 'default.udfvalidation.`empname`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;; Key: SPARK-30974 URL: https://issues.apache.org/jira/browse/SPARK-30974 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.2 Reporter: Akshay I'm getting the following exception while executing the query in spark 2.4.2 !image-2020-02-27-20-07-03-701.png! !image-2020-02-27-20-03-01-399.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
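The AnalysisException in this report is Spark's analyzer enforcing standard GROUP BY semantics: every selected column must either be part of the grouping key or be wrapped in an aggregate, because a bare non-grouped column has no single well-defined value per group. A small pure-Python sketch of that ambiguity and of the `first()` workaround the error message suggests (the data and the `group_sum` helper are hypothetical, not Spark API):

```python
from itertools import groupby

# Hypothetical rows: (dept, empname, salary)
rows = [("eng", "ann", 10), ("eng", "bob", 20), ("ops", "cid", 30)]

def group_sum(rows, key_idx, val_idx, extra_idx=None):
    # Summing the value column per group is well defined, but a bare
    # non-grouped column has no single value per group -- that is what the
    # analyzer rejects. first()/first_value picks an arbitrary one instead.
    out = {}
    for key, grp in groupby(sorted(rows), key=lambda r: r[key_idx]):
        grp = list(grp)
        total = sum(r[val_idx] for r in grp)
        if extra_idx is None:
            out[key] = total
        else:
            # "first_value" semantics: take the first row's value.
            out[key] = (grp[0][extra_idx], total)
    return out

print(group_sum(rows, 0, 2))               # {'eng': 30, 'ops': 30}
print(group_sum(rows, 0, 2, extra_idx=1))  # {'eng': ('ann', 30), 'ops': ('cid', 30)}
```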
[jira] [Comment Edited] (SPARK-29969) parse_url function result in incorrect result
[ https://issues.apache.org/jira/browse/SPARK-29969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17045059#comment-17045059 ] YoungGyu Chun edited comment on SPARK-29969 at 2/27/20 2:16 PM: [~xiaoxigua] is this issue related to Spark? The example provided looks like an issue with Beeline or Hive. was (Author: younggyuchun): [~xiaoxigua] is this issue is relating to Spark? The example provided looks like the issue of the beeline or hive. > parse_url function result in incorrect result > - > > Key: SPARK-29969 > URL: https://issues.apache.org/jira/browse/SPARK-29969 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.4 >Reporter: Victor Zhang >Priority: Major > Attachments: hive-result.jpg, spark-result.jpg > > > In this Jira, java.net.URI is used instead of java.net.URL for performance > reasons. > https://issues.apache.org/jira/browse/SPARK-16826 > However, in the case of some unconventional parameters, it can lead to > incorrect results. > For example, when the URL is encoded, the function cannot produce the correct > result. > > {code} > 0: jdbc:hive2://localhost:1> SELECT > parse_url('http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%', > 'HOST'); > ++--+ > | > parse_url(http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%, > HOST) | > ++--+ > | NULL | > ++--+ > 1 row selected (0.094 seconds) > > hive> SELECT > parse_url('http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%', > 'HOST'); > OK > HEADER: _c0 > uzzf.down.gsxzq.com > Time taken: 4.423 seconds, Fetched: 1 row(s) > {code} > > Here's a similar problem. > https://issues.apache.org/jira/browse/SPARK-23056 > Our team used the spark function to run data for months, but now we have to > run it again.
> It's just too painful. :(:(:( -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
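The NULL-vs-host discrepancy above comes down to strict versus lenient URL parsing: java.net.URI (which SPARK-16826 switched parse_url to for speed) rejects the truncated percent-escape at the end of the path, while Hive's java.net.URL-based parsing appears to tolerate it. A minimal Python sketch of the lenient behavior, using urllib.parse purely as a stand-in for a tolerant parser (not Spark's actual code path):

```python
from urllib.parse import urlsplit

# urlsplit never validates or decodes percent-escapes, so it recovers the
# host even though the path ends with the truncated escape "%A4%".
# java.net.URI, by contrast, rejects the whole string, and Spark's
# parse_url(..., 'HOST') then returns NULL.
def lenient_host(url):
    host = urlsplit(url).netloc
    return host or None

url = ("http://uzzf.down.gsxzq.com/download/"
       "%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%")
print(lenient_host(url))  # uzzf.down.gsxzq.com
```

This is why the same query yields `uzzf.down.gsxzq.com` in Hive but NULL in Spark: the input is the same, only the parser's strictness differs.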
[jira] [Resolved] (SPARK-30956) Use intercept instead of try-catch to assert failures in IntervalUtilsSuite
[ https://issues.apache.org/jira/browse/SPARK-30956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-30956. -- Fix Version/s: 3.0.0 Assignee: Kent Yao Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/27700] > Use intercept instead of try-catch to assert failures in IntervalUtilsSuite > --- > > Key: SPARK-30956 > URL: https://issues.apache.org/jira/browse/SPARK-30956 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.0.0 > > > Addressed the comment from > https://github.com/apache/spark/pull/27672#discussion_r383719562 to use > `intercept` instead of `try-catch` block to assert failures in the > IntervalUtilsSuite -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
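`intercept` here is ScalaTest's exception-asserting helper, but the same refactor pattern exists in most test frameworks. A rough Python analogue using unittest, with a hypothetical `parse_interval` standing in for the IntervalUtils parser under test (not Spark's real code):

```python
import unittest

def parse_interval(s):
    # Hypothetical stand-in for interval parsing: reject non-numeric input.
    if not s.lstrip("-").isdigit():
        raise ValueError(f"invalid interval: {s}")
    return int(s)

class IntervalSuite(unittest.TestCase):
    def test_failure_try_except(self):
        # Before: manual try/except; it's easy to forget the
        # "did not raise" branch, which silently passes.
        try:
            parse_interval("abc")
            self.fail("expected ValueError")
        except ValueError as e:
            self.assertIn("invalid interval", str(e))

    def test_failure_intercept_style(self):
        # After: assertRaises intercepts the failure, analogous to
        # ScalaTest's intercept[IllegalArgumentException] { ... }.
        with self.assertRaises(ValueError) as cm:
            parse_interval("abc")
        self.assertIn("invalid interval", str(cm.exception))
```

The second style is shorter and fails loudly when no exception is thrown, which is the motivation behind the PR.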
[jira] [Commented] (SPARK-30855) Issue using 'explode' function followed by a (*)star expand selection of resulting struct
[ https://issues.apache.org/jira/browse/SPARK-30855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046654#comment-17046654 ] Benoit Roy commented on SPARK-30855: Ok, thanks for letting me know. Are there any plans for another preview release in the coming months? > Issue using 'explode' function followed by a (*)star expand selection of > resulting struct > - > > Key: SPARK-30855 > URL: https://issues.apache.org/jira/browse/SPARK-30855 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Benoit Roy >Priority: Major > > An exception occurs when trying to use a _* expand_ selection after > performing an explode on an array of structs. > I am testing this on the preview2 release of Spark. > Here's a public repo containing a very simple Scala test case that reproduces > the issue > {code:java} > git clone g...@github.com:benoitroy/spark-30855.git{code} > Simply execute the *Spark30855Tests* class. > On a simple schema such as: > {code:java} > root > |-- k1: string (nullable = true) > |-- k2: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- k2.k1: struct (nullable = true) > ||||-- k2.k1.k1: string (nullable = true) > ||||-- k2.k1.k2: string (nullable = true) > |||-- k2.k2: string (nullable = true) {code} > The following test case will fail on the 'col.*' selection. > {code:java} > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.functions._ > import org.scalatest.funsuite.AnyFunSuite > class Spark38055Tests extends AnyFunSuite { > test("") { > // > val path = "src/test/data/json/data.json" > // > val spark = SparkSession > .builder() > .appName("Testing.") > .config("spark.master", "local") > .getOrCreate(); > // > val df = spark.read.json(path) > // SUCCESS! > df.printSchema() > // SUCCESS! > df.select(explode(col("k2"))).show() > // SUCCESS! > df.select(explode(col("k2"))).select("col.*").printSchema() > // FAIL! 
> df.select(explode(col("k2"))).select("col.*").show() > } > } {code} > > The test class demonstrates two cases, one where it fails (as shown above) > and another where it succeeds. There is only a slight variation on the > schema of both cases. The succeeding case works on the following schema: > {code:java} > root > |-- k1: string (nullable = true) > |-- k2: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- k2.k1: struct (nullable = true) > ||||-- k2.k1.k1: string (nullable = true) > |||-- k2.k2: string (nullable = true) {code} > You will notice that this schema simply removes a field from the nested > struct 'k2.k1'. > > The stacktrace produced by the failing case is show below: > {code:java} > Binding attribute, tree: _gen_alias_23#23Binding attribute, tree: > _gen_alias_23#23org.apache.spark.sql.catalyst.errors.package$TreeNodeException: > Binding attribute, tree: _gen_alias_23#23 at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:291) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:291) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(
[jira] [Resolved] (SPARK-30937) Migration guide for Hive 2.3
[ https://issues.apache.org/jira/browse/SPARK-30937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30937. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27670 [https://github.com/apache/spark/pull/27670] > Migration guide for Hive 2.3 > > > Key: SPARK-30937 > URL: https://issues.apache.org/jira/browse/SPARK-30937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.0.0 > > > Add a migration guide for users after Spark upgraded the built-in Hive from 1.2 to > 2.3. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30937) Migration guide for Hive 2.3
[ https://issues.apache.org/jira/browse/SPARK-30937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30937: --- Assignee: wuyi > Migration guide for Hive 2.3 > > > Key: SPARK-30937 > URL: https://issues.apache.org/jira/browse/SPARK-30937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > Add a migration guide for users after Spark upgraded the built-in Hive from 1.2 to > 2.3. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30973) ScriptTransformationExec should wait for the termination of process when scriptOutputReader hasNext return false
Sun Ke created SPARK-30973: -- Summary: ScriptTransformationExec should wait for the termination of process when scriptOutputReader hasNext return false Key: SPARK-30973 URL: https://issues.apache.org/jira/browse/SPARK-30973 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.5, 2.4.4 Reporter: Sun Ke -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30972) PruneHiveTablePartitions should be executed as earlyScanPushDownRules
wuyi created SPARK-30972: Summary: PruneHiveTablePartitions should be executed as earlyScanPushDownRules Key: SPARK-30972 URL: https://issues.apache.org/jira/browse/SPARK-30972 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: wuyi Similar to PruneFileSourcePartitions, PruneHiveTablePartitions should also be executed as part of earlyScanPushDownRules to eliminate the impact on statistics computation later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30971) Support MySQL Kerberos login in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-30971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi resolved SPARK-30971. --- Resolution: Won't Do > Support MySQL Kerberos login in JDBC connector > -- > > Key: SPARK-30971 > URL: https://issues.apache.org/jira/browse/SPARK-30971 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30971) Support MySQL Kerberos login in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-30971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046550#comment-17046550 ] Gabor Somogyi commented on SPARK-30971: --- Just for the record, I've created this JIRA because MySQL doesn't provide Kerberos authentication at the moment. > Support MySQL Kerberos login in JDBC connector > -- > > Key: SPARK-30971 > URL: https://issues.apache.org/jira/browse/SPARK-30971 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30971) Support MySQL Kerberos login in JDBC connector
Gabor Somogyi created SPARK-30971: - Summary: Support MySQL Kerberos login in JDBC connector Key: SPARK-30971 URL: https://issues.apache.org/jira/browse/SPARK-30971 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Gabor Somogyi -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046542#comment-17046542 ] Hyukjin Kwon edited comment on SPARK-26412 at 2/27/20 12:03 PM: You cannot separate one iterator to multiple iterators since iterator is supposed to be consumed once. Python doesn't support this way. You should do something like {code} class SomeClass(): def __init__(a,b,c): pass def map_func(batch_iter): for a, b, c in batch_iter dataset = SomeClass(a, b, c) {code} You can just pass strings and do {{json.loads}} which is pretty easy. There is no standard type for JSON in Spark which isn't ANSI standard. Adding new types is a huge job because you should think about how to ser/de in Scala, R, Java for instance. was (Author: hyukjin.kwon): You cannot separate one iterator to multiple iterators since iterator is supposed to be consumed once. Python doesn't support this way. You should do something like {code} class SomeClass(): def __init__(a,b,c): pass def map_func(batch_iter): for a, b, c in batch_iter dataset = SomeClass(a, b, c) {code} You can just pass strings and do {{json.loads}} which is pretty easy. There is no standard type for JSON in Spark which isn't ANSI standard. Adding new types is a huge job because you should think about how to ser/de in Scala, R, Java for instance. > Allow Pandas UDF to take an iterator of pd.DataFrames > - > > Key: SPARK-26412 > URL: https://issues.apache.org/jira/browse/SPARK-26412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Pandas UDF is the ideal connection between PySpark and DL model inference > workload. However, user needs to load the model file first to make > predictions. It is common to see models of size ~100MB or bigger. 
If the > Pandas UDF execution is limited to each batch, user needs to repeatedly load > the same model for every batch in the same python worker process, which is > inefficient. > We can provide users the iterator of batches in pd.DataFrame and let user > code handle it: > {code} > @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) > def predict(batch_iter): > model = ... # load model > for batch in batch_iter: > yield model.predict(batch) > {code} > The type of each batch is: > * a pd.Series if UDF is called with a single non-struct-type column > * a tuple of pd.Series if UDF is called with more than one Spark DF columns > * a pd.DataFrame if UDF is called with a single StructType column > Examples: > {code} > @pandas_udf(...) > def evaluate(batch_iter): > model = ... # load model > for features, label in batch_iter: > pred = model.predict(features) > yield (pred - label).abs() > df.select(evaluate(col("features"), col("label")).alias("err")) > {code} > {code} > @pandas_udf(...) > def evaluate(pdf_iter): > model = ... # load model > for pdf in pdf_iter: > pred = model.predict(pdf['x']) > yield (pred - pdf['y']).abs() > df.select(evaluate(struct(col("features"), col("label"))).alias("err")) > {code} > If the UDF doesn't return the same number of records for the entire > partition, user should see an error. We don't restrict that every yield > should match the input batch size. > Another benefit is with iterator interface and asyncio from Python, it is > flexible for users to implement data pipelining. > cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046542#comment-17046542 ] Hyukjin Kwon commented on SPARK-26412: -- You cannot split one iterator into multiple iterators, since an iterator is supposed to be consumed only once; Python does not support this. You should do something like
{code}
class SomeClass():
    def __init__(self, a, b, c):
        pass

def map_func(batch_iter):
    for a, b, c in batch_iter:
        dataset = SomeClass(a, b, c)
{code}
You can just pass strings and call {{json.loads}}, which is pretty easy. There is no standard type for JSON in Spark, since JSON is not part of the ANSI standard. Adding a new type is a huge job because you have to think about how to ser/de it in Scala, R, and Java, for instance.
> Allow Pandas UDF to take an iterator of pd.DataFrames > - > > Key: SPARK-26412 > URL: https://issues.apache.org/jira/browse/SPARK-26412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Pandas UDF is the ideal connection between PySpark and DL model inference > workload. However, user needs to load the model file first to make > predictions. It is common to see models of size ~100MB or bigger. If the > Pandas UDF execution is limited to each batch, user needs to repeatedly load > the same model for every batch in the same python worker process, which is > inefficient. > We can provide users the iterator of batches in pd.DataFrame and let user > code handle it: > {code} > @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) > def predict(batch_iter): > model = ...
# load model > for batch in batch_iter: > yield model.predict(batch) > {code} > The type of each batch is: > * a pd.Series if UDF is called with a single non-struct-type column > * a tuple of pd.Series if UDF is called with more than one Spark DF columns > * a pd.DataFrame if UDF is called with a single StructType column > Examples: > {code} > @pandas_udf(...) > def evaluate(batch_iter): > model = ... # load model > for features, label in batch_iter: > pred = model.predict(features) > yield (pred - label).abs() > df.select(evaluate(col("features"), col("label")).alias("err")) > {code} > {code} > @pandas_udf(...) > def evaluate(pdf_iter): > model = ... # load model > for pdf in pdf_iter: > pred = model.predict(pdf['x']) > yield (pred - pdf['y']).abs() > df.select(evaluate(struct(col("features"), col("label"))).alias("err")) > {code} > If the UDF doesn't return the same number of records for the entire > partition, user should see an error. We don't restrict that every yield > should match the input batch size. > Another benefit is with iterator interface and asyncio from Python, it is > flexible for users to implement data pipelining. > cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
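The {{json.loads}} suggestion works without any new Spark type: ship JSON as plain strings and decode them inside the UDF body. A minimal Spark-free sketch (the payloads are made up):

```python
import json

def map_func(batch_iter):
    # Each element is a JSON string; decode it in Python instead of
    # waiting for a dedicated JSON type, which Spark does not have
    # (JSON is not part of the ANSI SQL standard).
    for s in batch_iter:
        yield json.loads(s)

rows = ['{"a": 1, "b": 2}', '{"a": 3, "b": 4}']
decoded = list(map_func(iter(rows)))
```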
[jira] [Commented] (SPARK-30970) Fix NPE in resolving k8s master url
[ https://issues.apache.org/jira/browse/SPARK-30970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046454#comment-17046454 ] Dongjoon Hyun commented on SPARK-30970: --- BTW, [~Qin Yao]. Could you check 2.3.4 behavior and update the Affected Version if needed? > Fix NPE in resolving k8s master url > --- > > Key: SPARK-30970 > URL: https://issues.apache.org/jira/browse/SPARK-30970 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Minor > > {code:java} > ``` > bin/spark-sql --master k8s:///https://kubernetes.docker.internal:6443 --conf > spark.kubernetes.container.image=yaooqinn/spark:v2.4.4 > Exception in thread "main" java.lang.NullPointerException > at > org.apache.spark.util.Utils$.checkAndGetK8sMasterUrl(Utils.scala:2739) > at > org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:261) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > ``` > {code} > Althrough k8s:///https://kubernetes.docker.internal:6443 is a wrong master > url but should not throw npe -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30970) Fix NPE in resolving k8s master url
[ https://issues.apache.org/jira/browse/SPARK-30970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046453#comment-17046453 ] Dongjoon Hyun commented on SPARK-30970: --- Since the root cause is the user mistakes, the prevention logic will be a minor bug fix. > Fix NPE in resolving k8s master url > --- > > Key: SPARK-30970 > URL: https://issues.apache.org/jira/browse/SPARK-30970 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Minor > > {code:java} > ``` > bin/spark-sql --master k8s:///https://kubernetes.docker.internal:6443 --conf > spark.kubernetes.container.image=yaooqinn/spark:v2.4.4 > Exception in thread "main" java.lang.NullPointerException > at > org.apache.spark.util.Utils$.checkAndGetK8sMasterUrl(Utils.scala:2739) > at > org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:261) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > ``` > {code} > Althrough k8s:///https://kubernetes.docker.internal:6443 is a wrong master > url but should not throw npe -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30970) Fix NPE in resolving k8s master url
[ https://issues.apache.org/jira/browse/SPARK-30970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30970: -- Priority: Minor (was: Major) > Fix NPE in resolving k8s master url > --- > > Key: SPARK-30970 > URL: https://issues.apache.org/jira/browse/SPARK-30970 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Minor > > {code:java} > ``` > bin/spark-sql --master k8s:///https://kubernetes.docker.internal:6443 --conf > spark.kubernetes.container.image=yaooqinn/spark:v2.4.4 > Exception in thread "main" java.lang.NullPointerException > at > org.apache.spark.util.Utils$.checkAndGetK8sMasterUrl(Utils.scala:2739) > at > org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:261) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > ``` > {code} > Althrough k8s:///https://kubernetes.docker.internal:6443 is a wrong master > url but should not throw npe -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30929) ML, GraphX 3.0 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-30929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-30929: - Environment: (was: Audit new public Scala APIs added to MLlib & GraphX. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please create JIRAs and link them to this issue. For *user guide issues* link the new JIRAs to the relevant user guide QA issue) > ML, GraphX 3.0 QA: API: New Scala APIs, docs > > > Key: SPARK-30929 > URL: https://issues.apache.org/jira/browse/SPARK-30929 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30929) ML, GraphX 3.0 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-30929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-30929: - Description: Audit new public Scala APIs added to MLlib & GraphX. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please create JIRAs and link them to this issue. For *user guide issues* link the new JIRAs to the relevant user guide QA issue > ML, GraphX 3.0 QA: API: New Scala APIs, docs > > > Key: SPARK-30929 > URL: https://issues.apache.org/jira/browse/SPARK-30929 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. > For *user guide issues* link the new JIRAs to the relevant user guide QA issue -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30932) ML 3.0 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-30932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046228#comment-17046228 ] zhengruifeng edited comment on SPARK-30932 at 2/27/20 9:47 AM: --- I checked added classes from {{added_ml_class:}} * FMClassifier, FMRegressor has related Java example and doc; * RobustScaler has related Java example and doc; * MultilabelClassificationEvaluator,RankingEvaluator do not have related Java examples; However, other evaluators do not have examples, either; *We may need to add some basic description in doc/ml-tuning.* * -org.apache.spark.ml.functions has no related doc, is only used in {{FunctionsSuite}}; *I am not sure we should make it public;*- * -org.apache.spark.ml.\{FitStart, FitEnd, LoadInstanceStart, LoadInstanceEnd, SaveInstanceStart, SaveInstanceEnd, TransformStart, TransformEnd} are marked {{Unstable}} and has no related doc;- was (Author: podongfeng): I checked added classes from {{added_ml_class:}} * FMClassifier, FMRegressor has related Java example and doc; * RobustScaler has related Java example and doc; * MultilabelClassificationEvaluator,RankingEvaluator do not have related Java examples; However, other evaluators do not have examples, either; *We may need to add some basic description in doc/ml-tuning.* * org.apache.spark.ml.functions has no related doc, is only used in \{{FunctionsSuite}}; *I am not sure we should make it public;* * org.apache.spark.ml.\{FitStart, FitEnd, LoadInstanceStart, LoadInstanceEnd, SaveInstanceStart, SaveInstanceEnd, TransformStart, TransformEnd} are marked \{{Unstable}} and has no related doc; > ML 3.0 QA: API: Java compatibility, docs > > > Key: SPARK-30932 > URL: https://issues.apache.org/jira/browse/SPARK-30932 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > Attachments: 1_process_script.sh, added_ml_class, common_ml_class, > signature.diff > > 
> Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so we can make this > task easier in the future! 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30970) Fix NPE in resolving k8s master url
Kent Yao created SPARK-30970: Summary: Fix NPE in resolving k8s master url Key: SPARK-30970 URL: https://issues.apache.org/jira/browse/SPARK-30970 Project: Spark Issue Type: Bug Components: Kubernetes, Spark Core Affects Versions: 2.4.5, 3.0.0, 3.1.0 Reporter: Kent Yao
{code:java}
bin/spark-sql --master k8s:///https://kubernetes.docker.internal:6443 --conf spark.kubernetes.container.image=yaooqinn/spark:v2.4.4
Exception in thread "main" java.lang.NullPointerException
	at org.apache.spark.util.Utils$.checkAndGetK8sMasterUrl(Utils.scala:2739)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:261)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}
Although k8s:///https://kubernetes.docker.internal:6443 is an invalid master URL, it should not cause an NPE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
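The fix the report asks for amounts to validating the `k8s://` URL before dereferencing its parts. This hypothetical Python sketch (not Spark's Scala `checkAndGetK8sMasterUrl`) shows the defensive shape: reject the malformed URL with a clear error instead of an NPE:

```python
from urllib.parse import urlparse

def check_k8s_master_url(master):
    # Hypothetical sketch of a defensive master-URL check.
    if not master.startswith("k8s://"):
        raise ValueError("Invalid k8s master URL: " + master)
    remainder = master[len("k8s://"):]
    parsed = urlparse(remainder)
    if not parsed.netloc:
        # e.g. "k8s:///https://host:6443" parses with an empty authority,
        # which is where a null would otherwise surface later.
        raise ValueError("Invalid k8s master URL (no host): " + master)
    return "k8s://{}://{}".format(parsed.scheme or "https", parsed.netloc)
```

Feeding it the URL from the report raises `ValueError` with a readable message rather than crashing on a null component.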
[jira] [Created] (SPARK-30969) Remove resource coordination support from Standalone
wuyi created SPARK-30969: Summary: Remove resource coordination support from Standalone Key: SPARK-30969 URL: https://issues.apache.org/jira/browse/SPARK-30969 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Environment: Resource coordination is used for the case where multiple workers run on the same host. However, this should be a rare or even impossible use case in current Standalone, which already allows multiple executors in a single worker. We should remove support for it to simplify the implementation and reduce the potential maintenance cost in the future. Reporter: wuyi -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046373#comment-17046373 ] Jorge Machado commented on SPARK-26412: --- Well, I was thinking of something more, like handing a, b, c to another object:
{code:java}
class SomeClass():
    def __init__(self, a, b, c):
        pass

def map_func(batch_iter):
    dataset = SomeClass(batch_iter[0], batch_iter[1], batch_iter[2])  # <- this does not work
{code}
Another thing: it would be great if we could just yield, for example, JSON instead of these fixed types.
> Allow Pandas UDF to take an iterator of pd.DataFrames > - > > Key: SPARK-26412 > URL: https://issues.apache.org/jira/browse/SPARK-26412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Pandas UDF is the ideal connection between PySpark and DL model inference > workload. However, user needs to load the model file first to make > predictions. It is common to see models of size ~100MB or bigger. If the > Pandas UDF execution is limited to each batch, user needs to repeatedly load > the same model for every batch in the same python worker process, which is > inefficient. > We can provide users the iterator of batches in pd.DataFrame and let user > code handle it: > {code} > @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) > def predict(batch_iter): > model = ... # load model > for batch in batch_iter: > yield model.predict(batch) > {code} > The type of each batch is: > * a pd.Series if UDF is called with a single non-struct-type column > * a tuple of pd.Series if UDF is called with more than one Spark DF columns > * a pd.DataFrame if UDF is called with a single StructType column > Examples: > {code} > @pandas_udf(...) > def evaluate(batch_iter): > model = ...
# load model > for features, label in batch_iter: > pred = model.predict(features) > yield (pred - label).abs() > df.select(evaluate(col("features"), col("label")).alias("err")) > {code} > {code} > @pandas_udf(...) > def evaluate(pdf_iter): > model = ... # load model > for pdf in pdf_iter: > pred = model.predict(pdf['x']) > yield (pred - pdf['y']).abs() > df.select(evaluate(struct(col("features"), col("label"))).alias("err")) > {code} > If the UDF doesn't return the same number of records for the entire > partition, user should see an error. We don't restrict that every yield > should match the input batch size. > Another benefit is with iterator interface and asyncio from Python, it is > flexible for users to implement data pipelining. > cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
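Why the indexed version in the comment fails can be shown in a few lines of plain Python: an iterator supports no subscripting, but each element (a tuple of per-column values) can be unpacked inside the loop. `SomeClass` and the tuples below are illustrative:

```python
class SomeClass:
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c

batch_iter = iter([(1, 2, 3), (4, 5, 6)])

# Subscripting an iterator raises TypeError: it is lazy and has no
# random access, so batch_iter[0] cannot work.
try:
    batch_iter[0]
    subscript_ok = True
except TypeError:
    subscript_ok = False

# Unpacking each batch inside the loop is the working pattern.
datasets = [SomeClass(a, b, c) for a, b, c in batch_iter]
```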
[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046365#comment-17046365 ] Hyukjin Kwon commented on SPARK-26412: -- You can do it via: {code} def map_func(batch_iter): for a, b, c in batch_iter: yield a, b, c {code} > Allow Pandas UDF to take an iterator of pd.DataFrames > - > > Key: SPARK-26412 > URL: https://issues.apache.org/jira/browse/SPARK-26412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Pandas UDF is the ideal connection between PySpark and DL model inference > workload. However, user needs to load the model file first to make > predictions. It is common to see models of size ~100MB or bigger. If the > Pandas UDF execution is limited to each batch, user needs to repeatedly load > the same model for every batch in the same python worker process, which is > inefficient. > We can provide users the iterator of batches in pd.DataFrame and let user > code handle it: > {code} > @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) > def predict(batch_iter): > model = ... # load model > for batch in batch_iter: > yield model.predict(batch) > {code} > The type of each batch is: > * a pd.Series if UDF is called with a single non-struct-type column > * a tuple of pd.Series if UDF is called with more than one Spark DF columns > * a pd.DataFrame if UDF is called with a single StructType column > Examples: > {code} > @pandas_udf(...) > def evaluate(batch_iter): > model = ... # load model > for features, label in batch_iter: > pred = model.predict(features) > yield (pred - label).abs() > df.select(evaluate(col("features"), col("label")).alias("err")) > {code} > {code} > @pandas_udf(...) > def evaluate(pdf_iter): > model = ... 
# load model > for pdf in pdf_iter: > pred = model.predict(pdf['x']) > yield (pred - pdf['y']).abs() > df.select(evaluate(struct(col("features"), col("label"))).alias("err")) > {code} > If the UDF doesn't return the same number of records for the entire > partition, user should see an error. We don't restrict that every yield > should match the input batch size. > Another benefit is with iterator interface and asyncio from Python, it is > flexible for users to implement data pipelining. > cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
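The "consumed once" constraint behind this advice is easy to demonstrate in plain Python; it is why the per-batch values must be handled inside the single pass over the iterator:

```python
it = iter([1, 2, 3])
first_pass = list(it)   # drains the iterator completely
second_pass = list(it)  # already exhausted: yields nothing
# itertools.tee can duplicate an iterator, but only by buffering
# consumed items, so it is not a free way to split one.
```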
[jira] [Commented] (SPARK-29969) parse_url function result in incorrect result
[ https://issues.apache.org/jira/browse/SPARK-29969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046359#comment-17046359 ] Victor Zhang commented on SPARK-29969: -- [~younggyuchun] The description may be a bit confusing. I use beeline to connect to the Spark Thrift Server. The parse_url function in Spark depends on java.net.URI, while Hive's depends on java.net.URL.
{code:java}
spark-sql> SELECT parse_url('http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%', 'HOST');
NULL
Time taken: 1.211 seconds, Fetched 1 row(s)
{code}
{code:java}
hive> SELECT parse_url('http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%', 'HOST');
OK
HEADER: _c0
uzzf.down.gsxzq.com
Time taken: 0.039 seconds, Fetched: 1 row(s)
{code}
> parse_url function result in incorrect result > - > > Key: SPARK-29969 > URL: https://issues.apache.org/jira/browse/SPARK-29969 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.4 >Reporter: Victor Zhang >Priority: Major > Attachments: hive-result.jpg, spark-result.jpg > > > In this Jira using java.net.URI instead of java.net.URL for performance > reason. > https://issues.apache.org/jira/browse/SPARK-16826 > However, in the case of some unconventional parameters, it can lead to > incorrect results. > For example, when the URL is encoded, the function cannot resolve the correct > result.
> > {code} > 0: jdbc:hive2://localhost:1> SELECT > parse_url('http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%', > 'HOST'); > ++--+ > | > parse_url(http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%, > HOST) | > ++--+ > | NULL | > ++--+ > 1 row selected (0.094 seconds) > > hive> SELECT > parse_url('http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%', > 'HOST'); > OK > HEADER: _c0 > uzzf.down.gsxzq.com > Time taken: 4.423 seconds, Fetched: 1 row(s) > {code} > > Here's a similar problem. > https://issues.apache.org/jira/browse/SPARK-23056 > Our team used the spark function to run data for months, but now we have to > run it again. > It's just too painful.:(:(:( > > > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
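The difference reported here is strict vs. lenient parsing: java.net.URI rejects the truncated percent-escape `%A4%` at the end of the path (so Spark's parse_url returns NULL), while java.net.URL does not validate the path at all. Python's `urllib.parse` is similarly lenient, which can be used to sketch a host-extraction workaround for such URLs:

```python
from urllib.parse import urlsplit

url = ('http://uzzf.down.gsxzq.com/download/'
       '%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%')

# urlsplit only splits on delimiters and does not validate percent-escapes,
# so the malformed tail does not prevent host extraction.
host = urlsplit(url).hostname
```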
[jira] [Commented] (SPARK-30961) Arrow enabled: to_pandas with date column fails
[ https://issues.apache.org/jira/browse/SPARK-30961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046337#comment-17046337 ] Nicolas Renkamp commented on SPARK-30961: - [~bryanc] Thanks for the quick reply. Actually, I was using pyarrow 0.15.1 in this example. I did not realize that Spark 2.4.x should be used with pyarrow 0.8.0 up to at most 0.11.1. Thanks for the background information and the links to the other issues. From a user's perspective it would be great if Spark 3.x were compatible with the latest pyarrow version. Are you aiming for that? > Arrow enabled: to_pandas with date column fails > --- > > Key: SPARK-30961 > URL: https://issues.apache.org/jira/browse/SPARK-30961 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5 > Environment: Apache Spark 2.4.5 >Reporter: Nicolas Renkamp >Priority: Major > Labels: ready-to-commit > > Hi, > there seems to be a bug in the arrow enabled to_pandas conversion from spark > dataframe to pandas dataframe when the dataframe has a column of type > DateType.
Here is a minimal example to reproduce the issue:
> {code:java}
> spark = SparkSession.builder.getOrCreate()
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> spark_df = spark.createDataFrame(
>     [['2019-12-06']], 'created_at: string') \
>     .withColumn('created_at', F.to_date('created_at'))
> # works
> spark_df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", 'true')
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> # raises AttributeError: Can only use .dt accessor with datetimelike values
> # series is still of type object, .dt does not exist
> spark_df.toPandas()
> {code}
> A fix would be to modify the _check_series_convert_date function in pyspark.sql.types to:
> {code:java}
> def _check_series_convert_date(series, data_type):
>     """
>     Cast the series to datetime.date if it's a date type, otherwise returns
>     the original series.
>
>     :param series: pandas.Series
>     :param data_type: a Spark data type for the series
>     """
>     from pyspark.sql.utils import require_minimum_pandas_version
>     require_minimum_pandas_version()
>
>     from pandas import to_datetime
>     if type(data_type) == DateType:
>         return to_datetime(series).dt.date
>     else:
>         return series
> {code}
> Let me know if I should prepare a Pull Request for the 2.4.5 branch. I have not tested the behavior on the master branch.
>
> Thanks,
> Nicolas -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
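The core of the proposed fix is a single pandas conversion: `to_datetime(...).dt.date` turns an object-dtype series of date strings into `datetime.date` values, which is what the Arrow path fails to produce for DateType columns. A sketch assuming pandas is installed (the sample dates are made up):

```python
import datetime

import pandas as pd

# An object-dtype series of date strings, as the Arrow path may hand back.
series = pd.Series(['2019-12-06', '2020-02-27'])

# Equivalent of the DateType branch in the patched _check_series_convert_date.
converted = pd.to_datetime(series).dt.date
```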