[jira] [Updated] (SPARK-43491) In expression not compatible with EqualTo Expression
[ https://issues.apache.org/jira/browse/SPARK-43491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

KuijianLiu updated SPARK-43491:
-------------------------------
Description:
The query results of Spark SQL 3.1.1 and Hive SQL 3.1.0 are inconsistent for the same SQL. Spark SQL evaluates {{0 in ('00')}} as false, which differs from the {{=}} operator, while Hive evaluates it as true. Hive 3.1.0 handles the {{in}} keyword consistently with {{=}}, but Spark SQL does not. When the data types of the elements in an {{In}} expression are comparable, it should behave the same as a BinaryComparison such as {{EqualTo}}.

Test SQL:
{code:java}
scala> spark.sql("select 1 as test where 0 = '00'").show
+----+
|test|
+----+
|   1|
+----+

scala> spark.sql("select 1 as test where 0 in ('00')").show
+----+
|test|
+----+
+----+{code}
!image-2023-05-13-13-14-55-853.png!

> In expression not compatible with EqualTo Expression
> ----------------------------------------------------
>
> Key: SPARK-43491
> URL: https://issues.apache.org/jira/browse/SPARK-43491
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.1
> Reporter: KuijianLiu
> Priority: Major
> Attachments: image-2023-05-13-13-14-55-853.png, image-2023-05-13-13-15-50-685.png

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
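The divergence the report describes can be pictured outside Spark. Below is a minimal Python sketch, not Spark's actual coercion code and with invented helper names: an EqualTo-style comparison effectively casts the string operand to the numeric side, while an In-style comparison resolves all operands to a common string type, so the same operand pair compares differently.

```python
# Hypothetical illustration of the two coercion strategies described above.
# These helpers are invented for the sketch; they are NOT Spark internals.

def equal_to(left, right):
    # EqualTo-style coercion: the string side is cast to a numeric type,
    # so '00' becomes 0.0 and the comparison succeeds.
    return float(left) == float(right)

def in_list(left, values):
    # In-style coercion: operands are resolved to a common string type,
    # so 0 becomes "0" and does not match "00".
    return str(left) in [str(v) for v in values]

print(equal_to(0, '00'))    # True  (mirrors: 0 = '00' returning a row)
print(in_list(0, ['00']))   # False (mirrors: 0 in ('00') returning none)
```

Under this reading, the ticket's request is that In pick the same common type as EqualTo when the element types are comparable.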
[jira] [Assigned] (SPARK-43522) Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'
[ https://issues.apache.org/jira/browse/SPARK-43522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-43522:
-----------------------------------
Assignee: Jia Fan

> Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-43522
> URL: https://issues.apache.org/jira/browse/SPARK-43522
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Heedo Lee
> Assignee: Jia Fan
> Priority: Minor
>
> When creating a struct column in a DataFrame, code that ran without problems in version 3.3.1 fails in version 3.4.0.
>
> Example:
> {code:java}
> val testDF = Seq("a=b,c=d,d=f").toDF
>   .withColumn("key_value", split('value, ","))
>   .withColumn("map_entry", transform(col("key_value"),
>     x => struct(split(x, "=").getItem(0), split(x, "=").getItem(1)))){code}
>
> In 3.3.1:
> {code:java}
> testDF.show()
> +-----------+---------------+--------------------+
> |      value|      key_value|           map_entry|
> +-----------+---------------+--------------------+
> |a=b,c=d,d=f|[a=b, c=d, d=f]|[{a, b}, {c, d}, ...|
> +-----------+---------------+--------------------+
>
> testDF.printSchema()
> root
>  |-- value: string (nullable = true)
>  |-- key_value: array (nullable = true)
>  |    |-- element: string (containsNull = false)
>  |-- map_entry: array (nullable = true)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- col1: string (nullable = true)
>  |    |    |-- col2: string (nullable = true){code}
>
> In 3.4.0:
> {code:java}
> org.apache.spark.sql.AnalysisException: [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING] Cannot resolve "struct(split(namedlambdavariable(), =, -1)[0], split(namedlambdavariable(), =, -1)[1])" due to data type mismatch: Only foldable `STRING` expressions are allowed to appear at odd position, but they are ["0", "1"].;
> 'Project [value#41, key_value#45, transform(key_value#45, lambdafunction(struct(0, split(lambda x_3#49, =, -1)[0], 1, split(lambda x_3#49, =, -1)[1]), lambda x_3#49, false)) AS map_entry#48]
> +- Project [value#41, split(value#41, ,, -1) AS key_value#45]
>    +- LocalRelation [value#41]
>   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:269)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:256)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431){code}
>
> However, if you alias the struct elements, you get the same result as in the previous version:
> {code:java}
> val testDF = Seq("a=b,c=d,d=f").toDF
>   .withColumn("key_value", split('value, ","))
>   .withColumn("map_entry", transform(col("key_value"),
>     x => struct(split(x, "=").getItem(0).as("col1"), split(x, "=").getItem(1).as("col2")))){code}
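The error message hints at the mechanism: a struct column is lowered to named_struct(name1, value1, name2, value2, ...), and the names at odd positions (1-based) must be foldable string literals. Without an alias, the plan above filled those slots with the integer literals 0 and 1. A rough Python sketch of that pairing check follows; the helper is invented for illustration and is not Spark's analyzer code.

```python
# Hypothetical sketch of the named_struct name/value pairing rule.
# Invented helper for illustration; NOT Spark's implementation.

def named_struct(*args):
    if len(args) % 2 != 0:
        raise ValueError("named_struct expects alternating name/value pairs")
    names, values = args[0::2], args[1::2]
    for name in names:
        # Mirrors CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING:
        # every field name must be a literal string.
        if not isinstance(name, str):
            raise TypeError(f"field name must be a foldable STRING, got {name!r}")
    return dict(zip(names, values))

print(named_struct("col1", "b", "col2", "d"))  # {'col1': 'b', 'col2': 'd'}

try:
    named_struct(0, "b", 1, "d")  # integer names, like struct(0, ..., 1, ...)
except TypeError as e:
    print(e)                      # analogous to the 3.4.0 analysis error
```

In this reading, adding .as("col1") / .as("col2") supplies the literal string names explicitly, which is why the aliased variant passes analysis.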
[jira] [Resolved] (SPARK-43522) Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'
[ https://issues.apache.org/jira/browse/SPARK-43522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-43522.
---------------------------------
Fix Version/s: 3.5.0, 3.4.1
Resolution: Fixed

Issue resolved by pull request 41187
[https://github.com/apache/spark/pull/41187]

> Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-43522
> URL: https://issues.apache.org/jira/browse/SPARK-43522
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Heedo Lee
> Assignee: Jia Fan
> Priority: Minor
> Fix For: 3.5.0, 3.4.1
[jira] [Assigned] (SPARK-43157) TreeNode tags can become corrupted and hang driver when the dataset is cached
[ https://issues.apache.org/jira/browse/SPARK-43157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-43157:
-----------------------------------
Assignee: Rob Reeves

> TreeNode tags can become corrupted and hang driver when the dataset is cached
> -----------------------------------------------------------------------------
>
> Key: SPARK-43157
> URL: https://issues.apache.org/jira/browse/SPARK-43157
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0, 3.5.0
> Reporter: Rob Reeves
> Assignee: Rob Reeves
> Priority: Major
>
> If a cached dataset is used by multiple other datasets materialized in separate threads, it can corrupt the TreeNode.tags map in any of the cached plan nodes, hanging the driver forever. This happens because TreeNode.tags is not thread-safe. The sequence is:
> # Multiple datasets that reference the same cached dataset are materialized at the same time in different threads.
> # AdaptiveSparkPlanExec.onUpdatePlan calls ExplainMode.fromString.
> # ExplainUtils uses the TreeNode.tags map to store the operator Id for every node in the plan. This is usually safe because the plan is cloned, but when there is an InMemoryTableScanExec, the InMemoryRelation.cachedPlan is not cloned, so multiple threads can set the operator Id concurrently.
> Making the TreeNode.tags field thread-safe does not solve the problem, because a correctness issue remains: the threads may overwrite each other's operator Ids, which can differ.
>
> Example stack trace of the infinite loop:
> {code:scala}
> scala.collection.mutable.HashTable.resize(HashTable.scala:265)
> scala.collection.mutable.HashTable.addEntry0(HashTable.scala:158)
> scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:170)
> scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> scala.collection.mutable.HashMap.put(HashMap.scala:126)
> scala.collection.mutable.HashMap.update(HashMap.scala:131)
> org.apache.spark.sql.catalyst.trees.TreeNode.setTagValue(TreeNode.scala:108)
> org.apache.spark.sql.execution.ExplainUtils$.setOpId$1(ExplainUtils.scala:134)
> …
> org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:175)
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.onUpdatePlan(AdaptiveSparkPlanExec.scala:662){code}
>
> Example to show the cachedPlan object is not cloned:
> {code:scala}
> import org.apache.spark.sql.execution.SparkPlan
> import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec
> import spark.implicits._
>
> def findCacheOperator(plan: SparkPlan): Option[InMemoryTableScanExec] = {
>   if (plan.isInstanceOf[InMemoryTableScanExec]) {
>     Some(plan.asInstanceOf[InMemoryTableScanExec])
>   } else if (plan.children.isEmpty && plan.subqueries.isEmpty) {
>     None
>   } else {
>     (plan.subqueries.flatMap(p => findCacheOperator(p)) ++
>       plan.children.flatMap(findCacheOperator)).headOption
>   }
> }
>
> val df = spark.range(10).filter($"id" < 100).cache()
> val df1 = df.limit(1)
> val df2 = df.limit(1)
>
> // Get the cache operator (InMemoryTableScanExec) in each plan
> val plan1 = findCacheOperator(df1.queryExecution.executedPlan).get
> val plan2 = findCacheOperator(df2.queryExecution.executedPlan).get
>
> // Check if the InMemoryTableScanExec references point to the same object
> println(plan1.eq(plan2))
> // returns false
>
> // Check if the InMemoryRelation references point to the same object
> println(plan1.relation.eq(plan2.relation))
> // returns false
>
> // Check if the cached SparkPlan references point to the same object
> println(plan1.relation.cachedPlan.eq(plan2.relation.cachedPlan))
> // returns true
> // This shows that the two cloned plans still share the same cachedPlan object{code}
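The sharing the Scala snippet demonstrates reduces to a few lines: cloning copies a node's children recursively, but a plan held in an ordinary field (like InMemoryRelation.cachedPlan) is carried over by reference, so every clone mutates the same tags map on the cached subtree. Below is a toy Python model; the classes are invented for illustration and are not Spark's TreeNode.

```python
# Toy model of the cloning behavior; invented classes, NOT Spark's TreeNode.

class Node:
    def __init__(self, children=(), cached_plan=None):
        self.children = list(children)
        self.tags = {}                  # plain mutable dict, like TreeNode.tags
        self.cached_plan = cached_plan  # an ordinary field, not a child

    def clone(self):
        # Children are cloned recursively, but cached_plan is copied
        # by reference, so clones still share it.
        return Node([c.clone() for c in self.children], self.cached_plan)

cached = Node()
scan1 = Node(cached_plan=cached)
scan2 = scan1.clone()

print(scan1 is scan2)                           # False: distinct scan nodes
print(scan1.cached_plan is scan2.cached_plan)   # True: shared cached plan

# Two threads setting operator Ids would both write into this one dict:
scan1.cached_plan.tags["opId"] = 1
print(scan2.cached_plan.tags)                   # {'opId': 1}
```

This is why making the map thread-safe alone is insufficient, as the report notes: the clones would still race to write different operator Ids into the one shared map.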
[jira] [Resolved] (SPARK-43157) TreeNode tags can become corrupted and hang driver when the dataset is cached
[ https://issues.apache.org/jira/browse/SPARK-43157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-43157.
---------------------------------
Fix Version/s: 3.5.0, 3.4.1
Resolution: Fixed

Issue resolved by pull request 40812
[https://github.com/apache/spark/pull/40812]

> TreeNode tags can become corrupted and hang driver when the dataset is cached
> -----------------------------------------------------------------------------
>
> Key: SPARK-43157
> URL: https://issues.apache.org/jira/browse/SPARK-43157
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0, 3.5.0
> Reporter: Rob Reeves
> Assignee: Rob Reeves
> Priority: Major
> Fix For: 3.5.0, 3.4.1
[jira] [Created] (SPARK-43571) Enable DateOpsTests.test_sub for pandas 2.0.0.
Haejoon Lee created SPARK-43571:
-----------------------------------
Summary: Enable DateOpsTests.test_sub for pandas 2.0.0.
Key: SPARK-43571
URL: https://issues.apache.org/jira/browse/SPARK-43571
Project: Spark
Issue Type: Sub-task
Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Enable DateOpsTests.test_sub for pandas 2.0.0.
[jira] [Created] (SPARK-43570) Enable DateOpsTests.test_rsub for pandas 2.0.0.
Haejoon Lee created SPARK-43570:
-----------------------------------
Summary: Enable DateOpsTests.test_rsub for pandas 2.0.0.
Key: SPARK-43570
URL: https://issues.apache.org/jira/browse/SPARK-43570
Project: Spark
Issue Type: Sub-task
Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Enable DateOpsTests.test_rsub for pandas 2.0.0.
[jira] [Created] (SPARK-43569) Remove workaround for HADOOP-14067
BingKun Pan created SPARK-43569:
-----------------------------------
Summary: Remove workaround for HADOOP-14067
Key: SPARK-43569
URL: https://issues.apache.org/jira/browse/SPARK-43569
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.5.0
Reporter: BingKun Pan
[jira] [Created] (SPARK-43568) Enable CategoricalIndexTests.test_categories_setter for pandas 2.0.0.
Haejoon Lee created SPARK-43568:
-----------------------------------
Summary: Enable CategoricalIndexTests.test_categories_setter for pandas 2.0.0.
Key: SPARK-43568
URL: https://issues.apache.org/jira/browse/SPARK-43568
Project: Spark
Issue Type: Sub-task
Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Enable CategoricalIndexTests.test_categories_setter for pandas 2.0.0.
[jira] [Assigned] (SPARK-43548) Remove workaround for HADOOP-16255
[ https://issues.apache.org/jira/browse/SPARK-43548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-43548:
-------------------------------------
Assignee: BingKun Pan

> Remove workaround for HADOOP-16255
> ----------------------------------
>
> Key: SPARK-43548
> URL: https://issues.apache.org/jira/browse/SPARK-43548
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.5.0
> Reporter: BingKun Pan
> Assignee: BingKun Pan
> Priority: Minor
[jira] [Created] (SPARK-43567) Enable CategoricalIndexTests.test_factorize for pandas 2.0.0.
Haejoon Lee created SPARK-43567:
-----------------------------------
Summary: Enable CategoricalIndexTests.test_factorize for pandas 2.0.0.
Key: SPARK-43567
URL: https://issues.apache.org/jira/browse/SPARK-43567
Project: Spark
Issue Type: Sub-task
Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Enable CategoricalIndexTests.test_factorize for pandas 2.0.0.
[jira] [Resolved] (SPARK-43548) Remove workaround for HADOOP-16255
[ https://issues.apache.org/jira/browse/SPARK-43548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-43548.
-----------------------------------
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 41209
[https://github.com/apache/spark/pull/41209]

> Remove workaround for HADOOP-16255
> ----------------------------------
>
> Key: SPARK-43548
> URL: https://issues.apache.org/jira/browse/SPARK-43548
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.5.0
> Reporter: BingKun Pan
> Assignee: BingKun Pan
> Priority: Minor
> Fix For: 3.5.0
[jira] [Created] (SPARK-43566) Enable CategoricalTests.test_categories_setter for pandas 2.0.0.
Haejoon Lee created SPARK-43566:
-----------------------------------
Summary: Enable CategoricalTests.test_categories_setter for pandas 2.0.0.
Key: SPARK-43566
URL: https://issues.apache.org/jira/browse/SPARK-43566
Project: Spark
Issue Type: Sub-task
Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Enable CategoricalTests.test_categories_setter for pandas 2.0.0.
[jira] [Created] (SPARK-43565) Enable CategoricalTests.test_as_ordered_unordered for pandas 2.0.0.
Haejoon Lee created SPARK-43565:
-----------------------------------
Summary: Enable CategoricalTests.test_as_ordered_unordered for pandas 2.0.0.
Key: SPARK-43565
URL: https://issues.apache.org/jira/browse/SPARK-43565
Project: Spark
Issue Type: Sub-task
Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Enable CategoricalTests.test_as_ordered_unordered for pandas 2.0.0.
[jira] [Created] (SPARK-43564) Enable CategoricalTests.test_factorize for pandas 2.0.0.
Haejoon Lee created SPARK-43564:
-----------------------------------
Summary: Enable CategoricalTests.test_factorize for pandas 2.0.0.
Key: SPARK-43564
URL: https://issues.apache.org/jira/browse/SPARK-43564
Project: Spark
Issue Type: Sub-task
Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Enable CategoricalTests.test_factorize for pandas 2.0.0.
[jira] [Created] (SPARK-43563) Enable CsvTests.test_read_csv_with_squeeze for pandas 2.0.0.
Haejoon Lee created SPARK-43563:
-----------------------------------
Summary: Enable CsvTests.test_read_csv_with_squeeze for pandas 2.0.0.
Key: SPARK-43563
URL: https://issues.apache.org/jira/browse/SPARK-43563
Project: Spark
Issue Type: Sub-task
Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Enable CsvTests.test_read_csv_with_squeeze for pandas 2.0.0.
[jira] [Created] (SPARK-43562) Enable DataFrameTests.test_append for pandas 2.0.0.
Haejoon Lee created SPARK-43562:
-----------------------------------
Summary: Enable DataFrameTests.test_append for pandas 2.0.0.
Key: SPARK-43562
URL: https://issues.apache.org/jira/browse/SPARK-43562
Project: Spark
Issue Type: Sub-task
Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Enable DataFrameTests.test_append for pandas 2.0.0.
[jira] [Created] (SPARK-43561) Enable DataFrameConversionTests.test_to_latex for pandas 2.0.0.
Haejoon Lee created SPARK-43561:
-----------------------------------
Summary: Enable DataFrameConversionTests.test_to_latex for pandas 2.0.0.
Key: SPARK-43561
URL: https://issues.apache.org/jira/browse/SPARK-43561
Project: Spark
Issue Type: Sub-task
Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Enable DataFrameConversionTests.test_to_latex for pandas 2.0.0.
[jira] [Created] (SPARK-43559) Enable DataFrameSlowTests.test_iteritems for pandas 2.0.0.
Haejoon Lee created SPARK-43559:
-----------------------------------
Summary: Enable DataFrameSlowTests.test_iteritems for pandas 2.0.0.
Key: SPARK-43559
URL: https://issues.apache.org/jira/browse/SPARK-43559
Project: Spark
Issue Type: Sub-task
Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Enable DataFrameSlowTests.test_iteritems for pandas 2.0.0.
[jira] [Created] (SPARK-43560) Enable DataFrameSlowTests.test_mad for pandas 2.0.0.
Haejoon Lee created SPARK-43560:
-----------------------------------
Summary: Enable DataFrameSlowTests.test_mad for pandas 2.0.0.
Key: SPARK-43560
URL: https://issues.apache.org/jira/browse/SPARK-43560
Project: Spark
Issue Type: Sub-task
Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Enable DataFrameSlowTests.test_mad for pandas 2.0.0.
[jira] [Created] (SPARK-43558) Enable DataFrameSlowTests.test_product for pandas 2.0.0.
Haejoon Lee created SPARK-43558:
-----------------------------------
Summary: Enable DataFrameSlowTests.test_product for pandas 2.0.0.
Key: SPARK-43558
URL: https://issues.apache.org/jira/browse/SPARK-43558
Project: Spark
Issue Type: Sub-task
Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Enable DataFrameSlowTests.test_product for pandas 2.0.0.
[jira] [Created] (SPARK-43557) Enable DataFrameSlowTests.test_between_time for pandas 2.0.0.
Haejoon Lee created SPARK-43557:
-----------------------------------
Summary: Enable DataFrameSlowTests.test_between_time for pandas 2.0.0.
Key: SPARK-43557
URL: https://issues.apache.org/jira/browse/SPARK-43557
Project: Spark
Issue Type: Sub-task
Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Enable DataFrameSlowTests.test_between_time for pandas 2.0.0.
[jira] [Created] (SPARK-43556) Enable DataFrameSlowTests.test_describe for pandas 2.0.0.
Haejoon Lee created SPARK-43556:
-----------------------------------
Summary: Enable DataFrameSlowTests.test_describe for pandas 2.0.0.
Key: SPARK-43556
URL: https://issues.apache.org/jira/browse/SPARK-43556
Project: Spark
Issue Type: Sub-task
Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Enable DataFrameSlowTests.test_describe for pandas 2.0.0.
[jira] [Created] (SPARK-43555) Enable GroupByTests.test_groupby_multiindex_columns for pandas 2.0.0.
Haejoon Lee created SPARK-43555:
-----------------------------------
Summary: Enable GroupByTests.test_groupby_multiindex_columns for pandas 2.0.0.
Key: SPARK-43555
URL: https://issues.apache.org/jira/browse/SPARK-43555
Project: Spark
Issue Type: Sub-task
Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Enable GroupByTests.test_groupby_multiindex_columns for pandas 2.0.0.
[jira] [Created] (SPARK-43554) Enable GroupByTests.test_basic_stat_funcs for pandas 2.0.0.
Haejoon Lee created SPARK-43554:
-----------------------------------
Summary: Enable GroupByTests.test_basic_stat_funcs for pandas 2.0.0.
Key: SPARK-43554
URL: https://issues.apache.org/jira/browse/SPARK-43554
Project: Spark
Issue Type: Sub-task
Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Enable GroupByTests.test_basic_stat_funcs for pandas 2.0.0.
[jira] [Created] (SPARK-43553) Enable GroupByTests.test_mad for pandas 2.0.0.
Haejoon Lee created SPARK-43553: --- Summary: Enable GroupByTests.test_mad for pandas 2.0.0. Key: SPARK-43553 URL: https://issues.apache.org/jira/browse/SPARK-43553 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee Enable GroupByTests.test_mad for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43552) Enable GroupByTests.test_nth for pandas 2.0.0.
Haejoon Lee created SPARK-43552: --- Summary: Enable GroupByTests.test_nth for pandas 2.0.0. Key: SPARK-43552 URL: https://issues.apache.org/jira/browse/SPARK-43552 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee Enable GroupByTests.test_nth for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43551) Enable GroupByTests.test_prod for pandas 2.0.0.
Haejoon Lee created SPARK-43551: --- Summary: Enable GroupByTests.test_prod for pandas 2.0.0. Key: SPARK-43551 URL: https://issues.apache.org/jira/browse/SPARK-43551 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee Enable GroupByTests.test_prod for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43550) Enable SeriesTests.test_factorize for pandas 2.0.0.
Haejoon Lee created SPARK-43550: --- Summary: Enable SeriesTests.test_factorize for pandas 2.0.0. Key: SPARK-43550 URL: https://issues.apache.org/jira/browse/SPARK-43550 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee Enable SeriesTests.test_factorize for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43549) Assign a name to the error class _LEGACY_ERROR_TEMP_0035
[ https://issues.apache.org/jira/browse/SPARK-43549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723779#comment-17723779 ] BingKun Pan commented on SPARK-43549: - I'll work on it. > Assign a name to the error class _LEGACY_ERROR_TEMP_0035 > > > Key: SPARK-43549 > URL: https://issues.apache.org/jira/browse/SPARK-43549 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43549) Assign a name to the error class _LEGACY_ERROR_TEMP_0035
BingKun Pan created SPARK-43549: --- Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_0035 Key: SPARK-43549 URL: https://issues.apache.org/jira/browse/SPARK-43549 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.5.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43547) Update "Supported Pandas API" page to point out the proper pandas docs
[ https://issues.apache.org/jira/browse/SPARK-43547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43547: Assignee: Haejoon Lee > Update "Supported Pandas API" page to point out the proper pandas docs > -- > > Key: SPARK-43547 > URL: https://issues.apache.org/jira/browse/SPARK-43547 > Project: Spark > Issue Type: Bug > Components: Documentation, Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > [https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/supported_pandas_api.html#supported-pandas-api] > currently points to the wrong pandas version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43547) Update "Supported Pandas API" page to point out the proper pandas docs
[ https://issues.apache.org/jira/browse/SPARK-43547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43547. -- Fix Version/s: 3.4.1 Resolution: Fixed Issue resolved by pull request 41208 [https://github.com/apache/spark/pull/41208] > Update "Supported Pandas API" page to point out the proper pandas docs > -- > > Key: SPARK-43547 > URL: https://issues.apache.org/jira/browse/SPARK-43547 > Project: Spark > Issue Type: Bug > Components: Documentation, Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.1 > > > [https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/supported_pandas_api.html#supported-pandas-api] > currently points to the wrong pandas version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40964) Cannot run spark history server with shaded hadoop jar
[ https://issues.apache.org/jira/browse/SPARK-40964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723777#comment-17723777 ] Shuaipeng Lee edited comment on SPARK-40964 at 5/18/23 2:53 AM: Thanks for your commits. I rebuilt hadoop-client-api and can now start the history server successfully. I changed the pom.xml of hadoop-3.3.1-src/hadoop-client-modules/hadoop-client-api, deleting the following config javax/servlet/ ${shaded.dependency.prefix}.javax.servlet. **/pom.xml then rebuilt hadoop-client-api with mvn package -DskipTests was (Author: bigboy001): Thanks for your commits. I rebuilt hadoop-client-api and can now start the history server successfully. I changed the pom.xml of hadoop-3.3.1-src/hadoop-client-modules/hadoop-client-api, deleting the following config ```xml javax/servlet/ ${shaded.dependency.prefix}.javax.servlet. **/pom.xml ``` then rebuilt hadoop-client-api with ```shell mvn package -DskipTests ``` > Cannot run spark history server with shaded hadoop jar > -- > > Key: SPARK-40964 > URL: https://issues.apache.org/jira/browse/SPARK-40964 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.2.2 >Reporter: YUBI LEE >Priority: Major > > Since SPARK-33212, Spark uses shaded client jars from Hadoop 3.x+. > If you try to start Spark History Server with shaded client jars and enable > security using > org.apache.hadoop.security.authentication.server.AuthenticationFilter, you > will hit the following exception.
> {code} > # spark-env.sh > export > SPARK_HISTORY_OPTS='-Dspark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter > > -Dspark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=kerberos,kerberos.principal=HTTP/some.example@example.com,kerberos.keytab=/etc/security/keytabs/spnego.service.keytab"' > {code} > {code} > # spark history server's out file > 22/10/27 15:29:48 INFO AbstractConnector: Started > ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081} > 22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' > on port 18081. > 22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: > org.apache.hadoop.security.authentication.server.AuthenticationFilter > 22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer > java.lang.IllegalStateException: class > org.apache.hadoop.security.authentication.server.AuthenticationFilter is not > a javax.servlet.Filter > at > org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103) > at > org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73) > at > org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730) > at > java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) > at > java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742) > at > java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647) > at > org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:755) > at > org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:379) > at > org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:910) > at > org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:288) > at > org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73) > at 
org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:491) > at org.apache.spark.ui.WebUI.$anonfun$bind$3(WebUI.scala:148) > at org.apache.spark.ui.WebUI.$anonfun$bind$3$adapted(WebUI.scala:148) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at org.apache.spark.ui.WebUI.bind(WebUI.scala:148) > at > org.apache.spark.deploy.history.HistoryServer.bind(HistoryServer.scala:164) > at > org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:310) > at > org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala) > {code} > I think "AuthenticationFilter" in the shaded jar imports > "org.apache.hadoop.shaded.javax.servlet.Filter", not "javax.servlet.Filter". > {code} > ❯ grep -r org.apache.hadoop.shaded.javax.servlet.Filter * > Binary file hadoop-client-runtime-3.3.1.jar matches > {code} > It causes the exception I mentioned. > I'm not sure what is the best answer. > Worka
[jira] [Commented] (SPARK-40964) Cannot run spark history server with shaded hadoop jar
[ https://issues.apache.org/jira/browse/SPARK-40964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723777#comment-17723777 ] Shuaipeng Lee commented on SPARK-40964: --- Thanks for your commits. I rebuilt hadoop-client-api and can now start the history server successfully. I changed the pom.xml of hadoop-3.3.1-src/hadoop-client-modules/hadoop-client-api, deleting the following config ```xml javax/servlet/ ${shaded.dependency.prefix}.javax.servlet. **/pom.xml ``` then rebuilt hadoop-client-api with ```shell mvn package -DskipTests ``` > Cannot run spark history server with shaded hadoop jar > -- > > Key: SPARK-40964 > URL: https://issues.apache.org/jira/browse/SPARK-40964 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.2.2 >Reporter: YUBI LEE >Priority: Major > > Since SPARK-33212, Spark uses shaded client jars from Hadoop 3.x+. > If you try to start Spark History Server with shaded client jars and enable > security using > org.apache.hadoop.security.authentication.server.AuthenticationFilter, you > will hit the following exception. > {code} > # spark-env.sh > export > SPARK_HISTORY_OPTS='-Dspark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter > > -Dspark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=kerberos,kerberos.principal=HTTP/some.example@example.com,kerberos.keytab=/etc/security/keytabs/spnego.service.keytab"' > {code} > {code} > # spark history server's out file > 22/10/27 15:29:48 INFO AbstractConnector: Started > ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081} > 22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' > on port 18081. 
> 22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: > org.apache.hadoop.security.authentication.server.AuthenticationFilter > 22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer > java.lang.IllegalStateException: class > org.apache.hadoop.security.authentication.server.AuthenticationFilter is not > a javax.servlet.Filter > at > org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103) > at > org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73) > at > org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730) > at > java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) > at > java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742) > at > java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647) > at > org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:755) > at > org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:379) > at > org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:910) > at > org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:288) > at > org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73) > at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:491) > at org.apache.spark.ui.WebUI.$anonfun$bind$3(WebUI.scala:148) > at org.apache.spark.ui.WebUI.$anonfun$bind$3$adapted(WebUI.scala:148) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at org.apache.spark.ui.WebUI.bind(WebUI.scala:148) > at > org.apache.spark.deploy.history.HistoryServer.bind(HistoryServer.scala:164) > at > 
org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:310) > at > org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala) > {code} > I think "AuthenticationFilter" in the shaded jar imports > "org.apache.hadoop.shaded.javax.servlet.Filter", not "javax.servlet.Filter". > {code} > ❯ grep -r org.apache.hadoop.shaded.javax.servlet.Filter * > Binary file hadoop-client-runtime-3.3.1.jar matches > {code} > It causes the exception I mentioned. > I'm not sure what is the best answer. > Workaround is not to use spark with pre-built for Apache Hadoop, specify > `HADOOP_HOME` or `SPARK_DIST_CLASSPATH` in spark-env.sh for Spark History > Server. > May be the possible options are: > - Not to shade "javax.servlet.Filter" at hadoop shaded jar > - Or, shade "javax.servlet.Filter" also at jetty. -- This message was sent by Atlassian Jira (v8.20.10#820010) -
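The relocation config quoted in the comments above is mangled by the mail archive. As a rough reconstruction (element layout guessed from standard maven-shade-plugin usage, not copied from the Hadoop source; verify against the actual hadoop-3.3.1-src hadoop-client-api pom.xml), the rule the commenter deleted likely looks like:

```xml
<!-- maven-shade-plugin relocation in hadoop-client-modules/hadoop-client-api/pom.xml
     (approximate reconstruction). Removing it leaves javax.servlet classes
     un-relocated, so AuthenticationFilter keeps implementing the real
     javax.servlet.Filter that Jetty checks for. -->
<relocation>
  <pattern>javax/servlet/</pattern>
  <shadedPattern>${shaded.dependency.prefix}.javax.servlet.</shadedPattern>
  <excludes>
    <exclude>**/pom.xml</exclude>
  </excludes>
</relocation>
```

After deleting the rule, the module is rebuilt with `mvn package -DskipTests`, as the comment describes.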
[jira] [Updated] (SPARK-43548) Remove workaround for HADOOP-16255
[ https://issues.apache.org/jira/browse/SPARK-43548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-43548: Component/s: Structured Streaming (was: SQL) > Remove workaround for HADOOP-16255 > -- > > Key: SPARK-43548 > URL: https://issues.apache.org/jira/browse/SPARK-43548 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43548) Remove workaround for HADOOP-16255
BingKun Pan created SPARK-43548: --- Summary: Remove workaround for HADOOP-16255 Key: SPARK-43548 URL: https://issues.apache.org/jira/browse/SPARK-43548 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43488) bitmap function
[ https://issues.apache.org/jira/browse/SPARK-43488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723772#comment-17723772 ] Jia Fan commented on SPARK-43488: - Hi, [~cloud_fan] If we want to achieve this feature, should we implement a new data type like BitMap, where BitMap uses RoaringBitmap (or just bigint) as the data layer? Or should we just use the bigint data type, so that bitmapBuild(array[int]) returns a bigint? The second way would be easier. The first way would be more flexible: we could implement different data layers for different array sizes, just like RoaringBitmap. I want to work on this feature, but I'm not sure which plan to choose. > bitmap function > --- > > Key: SPARK-43488 > URL: https://issues.apache.org/jira/browse/SPARK-43488 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: yiku123 >Priority: Major > > Maybe Spark needs some bitmap functions, for example bitmapBuild, > bitmapAnd, and bitmapAndCardinality, as in ClickHouse or other OLAP engines. > These are often used in user-profiling applications, but I can't find them in Spark. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
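To make the trade-off in the comment concrete, the "just use bigint" plan can be sketched in plain Python (hypothetical helper names mirroring ClickHouse's bitmapBuild/bitmapAnd/bitmapAndCardinality; a real Spark implementation would be a Catalyst expression, and a single 64-bit word only covers element ids 0-63, which is exactly why a RoaringBitmap-backed layout is more flexible):

```python
def bitmap_build(values):
    """Build a bitmap from small non-negative ints by setting one bit per value."""
    bm = 0
    for v in values:
        bm |= 1 << v
    return bm

def bitmap_and(a, b):
    """Intersection of two bitmaps is just bitwise AND."""
    return a & b

def bitmap_and_cardinality(a, b):
    """Number of elements present in both bitmaps (popcount of the AND)."""
    return bin(a & b).count("1")

# User-profiling style example: users {1, 3} intersected with users {1, 2}.
a = bitmap_build([1, 3])  # 0b1010
b = bitmap_build([1, 2])  # 0b0110
print(bitmap_and_cardinality(a, b))  # -> 1 (only user 1 is in both sets)
```

The same operations on a RoaringBitmap-style layer would keep this API but swap the single integer for a compressed container per 2^16-value chunk.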
[jira] [Updated] (SPARK-43503) Deserialisation Failure on State Store Schema Evolution (Spark Structured Streaming)
[ https://issues.apache.org/jira/browse/SPARK-43503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Arora updated SPARK-43503: Priority: Critical (was: Major) > Deserialisation Failure on State Store Schema Evolution (Spark Structured > Streaming) > > > Key: SPARK-43503 > URL: https://issues.apache.org/jira/browse/SPARK-43503 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.1 >Reporter: Varun Arora >Priority: Critical > > In a streaming query, state is persisted in RocksDB using the mapGroupsWithState > function. We use Encoders.bean to serialise state and store it in the State > Store. Code snippet: > > {code:java} > df > .groupByKey((MapFunction) event -> > event.getAs("stateGroupingId"), Encoders.STRING()) > .mapGroupsWithState(mapGroupsWithStateFunction, > Encoders.bean(StateInfo.class), Encoders.bean(StateOutput.class), > GroupStateTimeout.ProcessingTimeTimeout()); {code} > As per the above example, the StateInfo bean contains the state information that is > stored in the State Store. However, on adding/removing a field from the StateInfo bean > and re-running the query, we get a deserialisation exception. Is there a way to > handle this scenario, or to provide custom deserialisation to handle schema > evolution? 
> Exception :- > {code:java} > Stack: [0x7cd0a000,0x7ce0a000], sp=0x7ce08400, free > space=1017k > Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native > code) > V [libjvm.dylib+0x57c2b5] Unsafe_GetLong+0x55 > J 8700 sun.misc.Unsafe.getLong(Ljava/lang/Object;J)J (0 bytes) @ > 0x00010fe9e6be [0x00010fe9e600+0xbe] > j org.apache.spark.unsafe.Platform.getLong(Ljava/lang/Object;J)J+5 > j > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.pointTo(Ljava/lang/Object;JI)V+2 > j > org.apache.spark.sql.catalyst.expressions.UnsafeMapData.pointTo(Ljava/lang/Object;JI)V+187 > j > org.apache.spark.sql.catalyst.expressions.UnsafeRow.getMap(I)Lorg/apache/spark/sql/catalyst/expressions/UnsafeMapData;+52 > j > org.apache.spark.sql.catalyst.expressions.UnsafeRow.getMap(I)Lorg/apache/spark/sql/catalyst/util/MapData;+2 > j > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.MapObjects_0$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificSafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/util/ArrayData;+53 > j > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.StaticInvoke_0$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificSafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/util/Map;+14 > j > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.initializeJavaBean_0_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificSafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/example/streaming/StateInfo;)V+2 > j > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Ljava/lang/Object;)Ljava/lang/Object;+74 > j > 
org.apache.spark.sql.execution.ObjectOperator$.$anonfun$deserializeRowToObject$1(Lorg/apache/spark/sql/catalyst/expressions/package$Projection;Lorg/apache/spark/sql/catalyst/expressions/Expression;Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/lang/Object;+2 > j > org.apache.spark.sql.execution.ObjectOperator$$$Lambda$2600.apply(Ljava/lang/Object;)Ljava/lang/Object;+12 > j > org.apache.spark.sql.execution.streaming.state.FlatMapGroupsWithStateExecHelper$StateManagerImplBase.getStateObject(Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;)Ljava/lang/Object;+9 > j > org.apache.spark.sql.execution.streaming.state.FlatMapGroupsWithStateExecHelper$StateManagerImplBase.getState(Lorg/apache/spark/sql/execution/streaming/state/StateStore;Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;)Lorg/apache/spark/sql/execution/streaming/state/FlatMapGroupsWithStateExecHelper$StateData;+16 > j > org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec$InputProcessor.$anonfun$processNewData$1(Lorg/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec$InputProcessor;Lscala/Tuple2;)Lscala/collection/GenTraversableOnce;+45 > j > org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec$InputProcessor$$Lambda$3237.apply(Ljava/lang/Object;)Ljava/lang/Object;+8 > J 5928 C2 scala.collection.Iterator$$anon$11.hasNext()Z (35 bytes) @ > 0x00010e9a6a58 [0x00010e9a6620+0x438] > j org.apache.spark.util.CompletionIterator.hasNext()Z+4 > j scala.collection.Iterator$ConcatIterator.hasNext()Z+22 > j org.apache.spark.util.CompletionIterator.hasNext()Z+4 > J 2875 C1 scala.collection.It
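Encoders.bean offers no schema-evolution hook for state, so a common application-level workaround (sketched here in Python for brevity; the field names in STATE_DEFAULTS are hypothetical, not taken from the reporter's StateInfo bean) is to persist state as a single JSON string column and fill defaults for fields the stored state lacks:

```python
import json

# Hypothetical state schema with per-field defaults; adding a field here
# only requires adding a default, and removed fields are silently dropped.
STATE_DEFAULTS = {"count": 0, "last_seen": None, "tags": []}

def serialize_state(state: dict) -> str:
    return json.dumps(state)

def deserialize_state(raw: str) -> dict:
    """Tolerates schema drift: unknown keys are dropped, missing keys get defaults."""
    stored = json.loads(raw)
    return {k: stored.get(k, default) for k, default in STATE_DEFAULTS.items()}

# State written before the `tags` field existed still deserialises cleanly:
old = serialize_state({"count": 3, "last_seen": "2023-05-18"})
print(deserialize_state(old))  # -> {'count': 3, 'last_seen': '2023-05-18', 'tags': []}
```

In the Spark job the state type would then be a single-string bean (or `Encoders.STRING()`), with this mapping applied inside the mapGroupsWithState function; the trade-off is losing Catalyst's typed columnar state layout.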
[jira] [Assigned] (SPARK-43022) protobuf functions
[ https://issues.apache.org/jira/browse/SPARK-43022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-43022: Assignee: Yang Jie > protobuf functions > -- > > Key: SPARK-43022 > URL: https://issues.apache.org/jira/browse/SPARK-43022 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43022) protobuf functions
[ https://issues.apache.org/jira/browse/SPARK-43022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-43022. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40654 [https://github.com/apache/spark/pull/40654] > protobuf functions > -- > > Key: SPARK-43022 > URL: https://issues.apache.org/jira/browse/SPARK-43022 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43542) Define a new error class and apply for the case where streaming query fails due to concurrent run of streaming query with same checkpoint
[ https://issues.apache.org/jira/browse/SPARK-43542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-43542: - Affects Version/s: 3.5.0 (was: 1.6.3) > Define a new error class and apply for the case where streaming query fails > due to concurrent run of streaming query with same checkpoint > - > > Key: SPARK-43542 > URL: https://issues.apache.org/jira/browse/SPARK-43542 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Eric Marnadi >Priority: Major > > We are migrating to a new error framework in order to surface errors in a > friendlier way to customers. This PR defines a new error class specifically > for when there are concurrent updates to the log for the same batch ID -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43547) Update "Supported Pandas API" page to point out the proper pandas docs
Haejoon Lee created SPARK-43547: --- Summary: Update "Supported Pandas API" page to point out the proper pandas docs Key: SPARK-43547 URL: https://issues.apache.org/jira/browse/SPARK-43547 Project: Spark Issue Type: Bug Components: Documentation, Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee [https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/supported_pandas_api.html#supported-pandas-api] currently points to the wrong pandas version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43544) Fix nested MapType behavior in Pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-43544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-43544: - Summary: Fix nested MapType behavior in Pandas UDF (was: Standardize nested non-atomic input type support in Pandas UDF) > Fix nested MapType behavior in Pandas UDF > - > > Key: SPARK-43544 > URL: https://issues.apache.org/jira/browse/SPARK-43544 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Xinrong Meng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43509) Support creating multiple sessions for Spark Connect in PySpark
[ https://issues.apache.org/jira/browse/SPARK-43509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723755#comment-17723755 ] Ignite TC Bot commented on SPARK-43509: --- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/41206 > Support creating multiple sessions for Spark Connect in PySpark > --- > > Key: SPARK-43509 > URL: https://issues.apache.org/jira/browse/SPARK-43509 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43546) Complete Pandas UDF parity tests
Xinrong Meng created SPARK-43546: Summary: Complete Pandas UDF parity tests Key: SPARK-43546 URL: https://issues.apache.org/jira/browse/SPARK-43546 Project: Spark Issue Type: Test Components: Connect, PySpark Affects Versions: 3.5.0 Reporter: Xinrong Meng Tests as shown below should be added to Connect. test_pandas_udf_grouped_agg.py test_pandas_udf_scalar.py test_pandas_udf_window.py -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43545) Remove outdated UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION
Xinrong Meng created SPARK-43545: Summary: Remove outdated UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION Key: SPARK-43545 URL: https://issues.apache.org/jira/browse/SPARK-43545 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.5.0 Reporter: Xinrong Meng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43544) Standardize nested non-atomic input type support in Pandas UDF
Xinrong Meng created SPARK-43544: Summary: Standardize nested non-atomic input type support in Pandas UDF Key: SPARK-43544 URL: https://issues.apache.org/jira/browse/SPARK-43544 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.5.0 Reporter: Xinrong Meng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43543) Standardize Nested Complex DataTypes Support
Xinrong Meng created SPARK-43543: Summary: Standardize Nested Complex DataTypes Support Key: SPARK-43543 URL: https://issues.apache.org/jira/browse/SPARK-43543 Project: Spark Issue Type: Umbrella Components: Connect, PySpark Affects Versions: 3.5.0 Reporter: Xinrong Meng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43436) Upgrade rocksdbjni to 8.1.1.1
[ https://issues.apache.org/jira/browse/SPARK-43436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-43436. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41122 [https://github.com/apache/spark/pull/41122] > Upgrade rocksdbjni to 8.1.1.1 > - > > Key: SPARK-43436 > URL: https://issues.apache.org/jira/browse/SPARK-43436 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > > https://github.com/facebook/rocksdb/releases/tag/v8.1.1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43436) Upgrade rocksdbjni to 8.1.1.1
[ https://issues.apache.org/jira/browse/SPARK-43436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-43436: - Assignee: Yang Jie > Upgrade rocksdbjni to 8.1.1.1 > - > > Key: SPARK-43436 > URL: https://issues.apache.org/jira/browse/SPARK-43436 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > https://github.com/facebook/rocksdb/releases/tag/v8.1.1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43542) Define a new error class and apply for the case where streaming query fails due to concurrent run of streaming query with same checkpoint
Eric Marnadi created SPARK-43542: Summary: Define a new error class and apply for the case where streaming query fails due to concurrent run of streaming query with same checkpoint Key: SPARK-43542 URL: https://issues.apache.org/jira/browse/SPARK-43542 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 1.6.3 Reporter: Eric Marnadi We are migrating to a new error framework in order to surface errors in a friendlier way to customers. This PR defines a new error class specifically for when there are concurrent updates to the log for the same batch ID -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43541) Incorrect column resolution on FULL OUTER JOIN with USING
[ https://issues.apache.org/jira/browse/SPARK-43541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-43541: - Description: This was tested on Spark 3.3.2 and Spark 3.4.0. {code} Causes [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `aws_dbr_a`.`key` cannot be resolved. Did you mean one of the following? [`key`].; line 4, pos 7 {code} FULL OUTER JOIN with USING and/or the WHERE seems relevant since I can get the query to work with any of these modifications. {code} # -- FULL OUTER JOIN WITH aws_dbr_a AS (select key from values ('a') t(key)), gcp_pro_b AS (select key from values ('a') t(key)) SELECT aws_dbr_a.key FROM aws_dbr_a FULL OUTER JOIN gcp_pro_b USING (key) WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%'; [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `aws_dbr_a`.`key` cannot be resolved. Did you mean one of the following? [`key`].; line 4 pos 7 # -- INNER JOIN WITH aws_dbr_a AS (select key from values ('a') t(key)), gcp_pro_b AS (select key from values ('a') t(key)) SELECT aws_dbr_a.key FROM aws_dbr_a JOIN gcp_pro_b USING (key) WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%'; +-+ | key | |-| | a | +-+ 1 row in set Time: 0.507s # -- NO Filter WITH aws_dbr_a AS (select key from values ('a') t(key)), gcp_pro_b AS (select key from values ('a') t(key)) SELECT aws_dbr_a.key FROM aws_dbr_a FULL OUTER JOIN gcp_pro_b USING (key); +-+ | key | |-| | a | +-+ 1 row in set Time: 1.021s # -- ON instead of USING WITH aws_dbr_a AS (select key from values ('a') t(key)), gcp_pro_b AS (select key from values ('a') t(key)) SELECT aws_dbr_a.key FROM aws_dbr_a FULL OUTER JOIN gcp_pro_b ON aws_dbr_a.key = gcp_pro_b.key WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%'; +-+ | key | |-| | a | +-+ 1 row in set Time: 0.514s {code} was: This was tested on Spark 3.3.2 and Spark 3.4.0. 
{code} Causes [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `aws_dbr_a`.`key` cannot be resolved. Did you mean one of the following? [`key`].; line 4, pos 7 {code} FULL OUTER JOIN with USING and/or the WHERE seems relevant since I can get the query to work with any of these modifications. {code} # WITH aws_dbr_a AS (select key from values ('a') t(key)), gcp_pro_b AS (select key from values ('a') t(key)) SELECT aws_dbr_a.key FROM aws_dbr_a FULL OUTER JOIN gcp_pro_b USING (key) WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%'; [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `aws_dbr_a`.`key` cannot be resolved. Did you mean one of the following? [`key`].; line 4 pos 7 # -- INNER JOIN WITH aws_dbr_a AS (select key from values ('a') t(key)), gcp_pro_b AS (select key from values ('a') t(key)) SELECT aws_dbr_a.key FROM aws_dbr_a JOIN gcp_pro_b USING (key) WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%'; +-+ | key | |-| | a | +-+ 1 row in set Time: 0.507s # -- NO Filter WITH aws_dbr_a AS (select key from values ('a') t(key)), gcp_pro_b AS (select key from values ('a') t(key)) SELECT aws_dbr_a.key FROM aws_dbr_a FULL OUTER JOIN gcp_pro_b USING (key); +-+ | key | |-| | a | +-+ 1 row in set Time: 1.021s # -- ON instead of USING WITH aws_dbr_a AS (select key from values ('a') t(key)), gcp_pro_b AS (select key from values ('a') t(key)) SELECT aws_dbr_a.key FROM aws_dbr_a FULL OUTER JOIN gcp_pro_b ON aws_dbr_a.key = gcp_pro_b.key WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%'; +-+ | key | |-| | a | +-+ 1 row in set Time: 0.514s {code} > Incorrect column resolution on FULL OUTER JOIN with USING > - > > Key: SPARK-43541 > URL: https://issues.apache.org/jira/browse/SPARK-43541 > Project: Spark > Issue Type: Bug >
[jira] [Created] (SPARK-43541) Incorrect column resolution on FULL OUTER JOIN with USING
Max Gekk created SPARK-43541:

Summary: Incorrect column resolution on FULL OUTER JOIN with USING
Key: SPARK-43541
URL: https://issues.apache.org/jira/browse/SPARK-43541
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.4.0, 3.3.2
Reporter: Max Gekk
Assignee: Max Gekk

This was tested on Spark 3.3.2 and Spark 3.4.0.

{code}
Causes [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `aws_dbr_a`.`key` cannot be resolved. Did you mean one of the following? [`key`].; line 4, pos 7
{code}

The combination of FULL OUTER JOIN with USING and/or the WHERE clause seems relevant, since I can get the query to work with any of the following modifications.

{code}
# -- FULL OUTER JOIN (fails)
WITH aws_dbr_a AS (select key from values ('a') t(key)),
     gcp_pro_b AS (select key from values ('a') t(key))
SELECT aws_dbr_a.key
FROM aws_dbr_a
FULL OUTER JOIN gcp_pro_b USING (key)
WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';

[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `aws_dbr_a`.`key` cannot be resolved. Did you mean one of the following? [`key`].; line 4 pos 7

# -- INNER JOIN
WITH aws_dbr_a AS (select key from values ('a') t(key)),
     gcp_pro_b AS (select key from values ('a') t(key))
SELECT aws_dbr_a.key
FROM aws_dbr_a
JOIN gcp_pro_b USING (key)
WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';

+-----+
| key |
|-----|
| a   |
+-----+
1 row in set
Time: 0.507s

# -- NO Filter
WITH aws_dbr_a AS (select key from values ('a') t(key)),
     gcp_pro_b AS (select key from values ('a') t(key))
SELECT aws_dbr_a.key
FROM aws_dbr_a
FULL OUTER JOIN gcp_pro_b USING (key);

+-----+
| key |
|-----|
| a   |
+-----+
1 row in set
Time: 1.021s

# -- ON instead of USING
WITH aws_dbr_a AS (select key from values ('a') t(key)),
     gcp_pro_b AS (select key from values ('a') t(key))
SELECT aws_dbr_a.key
FROM aws_dbr_a
FULL OUTER JOIN gcp_pro_b ON aws_dbr_a.key = gcp_pro_b.key
WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';

+-----+
| key |
|-----|
| a   |
+-----+
1 row in set
Time: 0.514s
{code}
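A possible reason for the behavior above (a toy model, not Spark's actual analyzer code): with a USING clause, the join output exposes a single coalesced `key` column that belongs to neither input relation, so the qualified name `aws_dbr_a.key` has nothing to bind to after a FULL OUTER JOIN, while the unqualified `key` still resolves. The helper below is hypothetical and only illustrates this coalescing semantics.

```python
def full_outer_join_using(left, right, col):
    """Toy FULL OUTER JOIN ... USING (col) over lists of dicts: the output
    carries one coalesced `col`, not a left.col / right.col pair."""
    out, matched_right = [], set()
    for l in left:
        hits = [r for r in right if r[col] == l[col]]
        if hits:
            for r in hits:
                matched_right.add(id(r))
                out.append({col: l[col]})   # coalesce(left.col, right.col)
        else:
            out.append({col: l[col]})       # right side is NULL; keep left.col
    for r in right:
        if id(r) not in matched_right:
            out.append({col: r[col]})       # left side is NULL; keep right.col
    return out

rows = full_outer_join_using([{"key": "a"}], [{"key": "a"}], "key")
# Workaround matching the error's own suggestion: filter on the
# unqualified, coalesced column instead of `aws_dbr_a.key`.
kept = [row for row in rows if not row["key"].startswith("spark.clusterUsageTags.")]
```

This also suggests why rewriting the query with ON instead of USING works: ON keeps both input columns visible, so the qualified reference remains resolvable.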
[jira] [Resolved] (SPARK-43537) Upgrade the asm deps in the tools module to 9.4
[ https://issues.apache.org/jira/browse/SPARK-43537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-43537. -- Fix Version/s: 3.5.0 Assignee: Yang Jie Resolution: Fixed Resolved by https://github.com/apache/spark/pull/41198 > Upgrade the asm deps in the tools module to 9.4 > --- > > Key: SPARK-43537 > URL: https://issues.apache.org/jira/browse/SPARK-43537 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43537) Upgrade the asm deps in the tools module to 9.4
[ https://issues.apache.org/jira/browse/SPARK-43537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-43537: - Priority: Minor (was: Major) > Upgrade the asm deps in the tools module to 9.4 > --- > > Key: SPARK-43537 > URL: https://issues.apache.org/jira/browse/SPARK-43537 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43540) Add working directory into classpath on the driver in K8S cluster mode
[ https://issues.apache.org/jira/browse/SPARK-43540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Wang updated SPARK-43540: - Summary: Add working directory into classpath on the driver in K8S cluster mode (was: Add current working directory into classpath on the driver in K8S cluster mode) > Add working directory into classpath on the driver in K8S cluster mode > -- > > Key: SPARK-43540 > URL: https://issues.apache.org/jira/browse/SPARK-43540 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Fei Wang >Priority: Major > > In Yarn cluster modes, the passed files/jars are able to be accessed in the > classloader. Looks like this is not the case in Kubernetes cluster mode. > After SPARK-33782, it places spark.files, spark.jars and spark.files under > the current working directory on the driver in K8S cluster mode. but the > spark.files and spark.jars seems are not accessible by the classloader. > > we need to add the current working directory into classpath. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43540) Add current working directory into classpath on the driver in K8S cluster mode
[ https://issues.apache.org/jira/browse/SPARK-43540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Wang updated SPARK-43540: - Description: In Yarn cluster modes, the passed files/jars are able to be accessed in the classloader. Looks like this is not the case in Kubernetes cluster mode. After SPARK-33782, it places spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode. but the spark.files and spark.jars seems are not accessible by the classloader. we need to add the current working directory into classpath. was: In Yarn cluster modes, the passed files/jars are able to be accessed in the classloader. Looks like this is not the case in Kubernetes cluster mode. After SPARK-33782, it places spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode. but the spark.files and spark.jars seems are not accessible by the classloader. we need to add the current working directory to classpath. > Add current working directory into classpath on the driver in K8S cluster mode > -- > > Key: SPARK-43540 > URL: https://issues.apache.org/jira/browse/SPARK-43540 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Fei Wang >Priority: Major > > In Yarn cluster modes, the passed files/jars are able to be accessed in the > classloader. Looks like this is not the case in Kubernetes cluster mode. > After SPARK-33782, it places spark.files, spark.jars and spark.files under > the current working directory on the driver in K8S cluster mode. but the > spark.files and spark.jars seems are not accessible by the classloader. > > we need to add the current working directory into classpath. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43540) Add current working directory into classpath on the driver in K8S cluster mode
[ https://issues.apache.org/jira/browse/SPARK-43540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Wang updated SPARK-43540: - Description: In Yarn cluster modes, the passed files/jars are able to be accessed in the classloader. Looks like this is not the case in Kubernetes cluster mode. After SPARK-33782, it places spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode. but the spark.files and spark.jars seems are not accessible by the classloader. we need to add the current working directory to classpath. was: In Yarn cluster modes, the passed files/jars are able to be accessed in the classloader. Looks like this is not the case in Kubernetes cluster mode. After SPARK-33782, for Kubernetes cluster mode, it places > Add current working directory into classpath on the driver in K8S cluster mode > -- > > Key: SPARK-43540 > URL: https://issues.apache.org/jira/browse/SPARK-43540 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Fei Wang >Priority: Major > > In Yarn cluster modes, the passed files/jars are able to be accessed in the > classloader. Looks like this is not the case in Kubernetes cluster mode. > After SPARK-33782, it places spark.files, spark.jars and spark.files under > the current working directory on the driver in K8S cluster mode. but the > spark.files and spark.jars seems are not accessible by the classloader. > > we need to add the current working directory to classpath. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43540) Add current working directory into classpath on the driver in K8S cluster mode
[ https://issues.apache.org/jira/browse/SPARK-43540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Wang updated SPARK-43540: - Description: In Yarn cluster modes, the passed files/jars are able to be accessed in the classloader. Looks like this is not the case in Kubernetes cluster mode. After SPARK-33782, for Kubernetes cluster mode, it places was: In Yarn cluster modes, the passed files/jars are able to be accessed in the classloader. Looks like this is not the case in Kubernetes cluster mode. > Add current working directory into classpath on the driver in K8S cluster mode > -- > > Key: SPARK-43540 > URL: https://issues.apache.org/jira/browse/SPARK-43540 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Fei Wang >Priority: Major > > In Yarn cluster modes, the passed files/jars are able to be accessed in the > classloader. Looks like this is not the case in Kubernetes cluster mode. > After SPARK-33782, for Kubernetes cluster mode, it places -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43540) Add current working directory into classpath on the driver in K8S cluster mode
Fei Wang created SPARK-43540:

Summary: Add current working directory into classpath on the driver in K8S cluster mode
Key: SPARK-43540
URL: https://issues.apache.org/jira/browse/SPARK-43540
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.4.0
Reporter: Fei Wang

In YARN cluster mode, the files/jars passed to the application are accessible through the classloader. This does not appear to be the case in Kubernetes cluster mode.
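The proposed fix can be sketched as follows. This is a hedged illustration, not Spark's launcher code: `driver_classpath` is a hypothetical helper showing what "add the working directory to the classpath" means for the driver JVM, given that K8S cluster mode localizes spark.files/spark.jars into that directory.

```python
import os

def driver_classpath(existing_cp, cwd=None):
    """Prepend the driver's current working directory (where Kubernetes
    cluster mode places localized files/jars) to an existing Java
    classpath string, so the classloader can see those entries."""
    cwd = cwd or os.getcwd()
    entries = [p for p in existing_cp.split(os.pathsep) if p]
    return os.pathsep.join([cwd] + entries)

# Example with an assumed Spark image layout:
cp = driver_classpath("/opt/spark/jars/*", cwd="/opt/spark/work-dir")
```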
[jira] [Commented] (SPARK-43537) Upgrade the asm deps in the tools module to 9.4
[ https://issues.apache.org/jira/browse/SPARK-43537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723478#comment-17723478 ] GridGain Integration commented on SPARK-43537: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/41198 > Upgrade the asm deps in the tools module to 9.4 > --- > > Key: SPARK-43537 > URL: https://issues.apache.org/jira/browse/SPARK-43537 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43535) Adjust the ImportOrderChecker rule to resolve long-standing import issues
[ https://issues.apache.org/jira/browse/SPARK-43535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-43535. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41184 [https://github.com/apache/spark/pull/41184] > Adjust the ImportOrderChecker rule to resolve long-standing import issues > - > > Key: SPARK-43535 > URL: https://issues.apache.org/jira/browse/SPARK-43535 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43535) Adjust the ImportOrderChecker rule to resolve long-standing import issues
[ https://issues.apache.org/jira/browse/SPARK-43535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-43535: Assignee: BingKun Pan > Adjust the ImportOrderChecker rule to resolve long-standing import issues > - > > Key: SPARK-43535 > URL: https://issues.apache.org/jira/browse/SPARK-43535 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-43538) Spark Homebrew Formulae currently depends on non-officially-supported Java 20
[ https://issues.apache.org/jira/browse/SPARK-43538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723433#comment-17723433 ] Yuming Wang edited comment on SPARK-43538 at 5/17/23 12:36 PM: --- Yes. I think so: https://github.com/Homebrew/homebrew-core/pull/131189 was (Author: q79969786): Yes. I think so. > Spark Homebrew Formulae currently depends on non-officially-supported Java 20 > - > > Key: SPARK-43538 > URL: https://issues.apache.org/jira/browse/SPARK-43538 > Project: Spark > Issue Type: Request > Components: Java API >Affects Versions: 3.2.4, 3.3.2, 3.4.0 > Environment: Homebrew (e.g., macOS) >Reporter: Ghislain Fourny >Priority: Minor > > I am not sure if homebrew-related issues can also be reported here? The > Homebrew formulae for apache-spark runs on (latest) openjdk 20. > [https://formulae.brew.sh/formula/apache-spark] > However, Apache Spark is documented to work with Java 8/11/17: > [https://spark.apache.org/docs/latest/] > Is this an overlook, or is Java 20 officially supported, too? > Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43538) Spark Homebrew Formulae currently depends on non-officially-supported Java 20
[ https://issues.apache.org/jira/browse/SPARK-43538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723433#comment-17723433 ] Yuming Wang commented on SPARK-43538: - Yes. I think so. > Spark Homebrew Formulae currently depends on non-officially-supported Java 20 > - > > Key: SPARK-43538 > URL: https://issues.apache.org/jira/browse/SPARK-43538 > Project: Spark > Issue Type: Request > Components: Java API >Affects Versions: 3.2.4, 3.3.2, 3.4.0 > Environment: Homebrew (e.g., macOS) >Reporter: Ghislain Fourny >Priority: Minor > > I am not sure if homebrew-related issues can also be reported here? The > Homebrew formulae for apache-spark runs on (latest) openjdk 20. > [https://formulae.brew.sh/formula/apache-spark] > However, Apache Spark is documented to work with Java 8/11/17: > [https://spark.apache.org/docs/latest/] > Is this an overlook, or is Java 20 officially supported, too? > Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-43538) Spark Homebrew Formulae currently depends on non-officially-supported Java 20
[ https://issues.apache.org/jira/browse/SPARK-43538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723430#comment-17723430 ] Ghislain Fourny edited comment on SPARK-43538 at 5/17/23 12:04 PM: --- Thanks, Yuming Wang! Does it mean the apache-spark Homebrew Formulae should then be adapted to openjdk@17 (or 8 or 11) to avoid unpredictable behavior? was (Author: JIRAUSER300463): Thanks, Yuming Wang! Does it mean the Homebrew Formulae should then be adapted to openjdk@17 (or 8 or 11) to avoid unpredictable behavior? > Spark Homebrew Formulae currently depends on non-officially-supported Java 20 > - > > Key: SPARK-43538 > URL: https://issues.apache.org/jira/browse/SPARK-43538 > Project: Spark > Issue Type: Request > Components: Java API >Affects Versions: 3.2.4, 3.3.2, 3.4.0 > Environment: Homebrew (e.g., macOS) >Reporter: Ghislain Fourny >Priority: Minor > > I am not sure if homebrew-related issues can also be reported here? The > Homebrew formulae for apache-spark runs on (latest) openjdk 20. > [https://formulae.brew.sh/formula/apache-spark] > However, Apache Spark is documented to work with Java 8/11/17: > [https://spark.apache.org/docs/latest/] > Is this an overlook, or is Java 20 officially supported, too? > Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43538) Spark Homebrew Formulae currently depends on non-officially-supported Java 20
[ https://issues.apache.org/jira/browse/SPARK-43538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723430#comment-17723430 ] Ghislain Fourny commented on SPARK-43538: - Thanks, Yuming Wang! Does it mean the Homebrew Formulae should then be adapted to openjdk@17 (or 8 or 11) to avoid unpredictable behavior? > Spark Homebrew Formulae currently depends on non-officially-supported Java 20 > - > > Key: SPARK-43538 > URL: https://issues.apache.org/jira/browse/SPARK-43538 > Project: Spark > Issue Type: Request > Components: Java API >Affects Versions: 3.2.4, 3.3.2, 3.4.0 > Environment: Homebrew (e.g., macOS) >Reporter: Ghislain Fourny >Priority: Minor > > I am not sure if homebrew-related issues can also be reported here? The > Homebrew formulae for apache-spark runs on (latest) openjdk 20. > [https://formulae.brew.sh/formula/apache-spark] > However, Apache Spark is documented to work with Java 8/11/17: > [https://spark.apache.org/docs/latest/] > Is this an overlook, or is Java 20 officially supported, too? > Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43538) Spark Homebrew Formulae currently depends on non-officially-supported Java 20
[ https://issues.apache.org/jira/browse/SPARK-43538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723429#comment-17723429 ] Yuming Wang commented on SPARK-43538: - We have not tested on Java 20, because Java 20 is not LTS. > Spark Homebrew Formulae currently depends on non-officially-supported Java 20 > - > > Key: SPARK-43538 > URL: https://issues.apache.org/jira/browse/SPARK-43538 > Project: Spark > Issue Type: Request > Components: Java API >Affects Versions: 3.2.4, 3.3.2, 3.4.0 > Environment: Homebrew (e.g., macOS) >Reporter: Ghislain Fourny >Priority: Minor > > I am not sure if homebrew-related issues can also be reported here? The > Homebrew formulae for apache-spark runs on (latest) openjdk 20. > [https://formulae.brew.sh/formula/apache-spark] > However, Apache Spark is documented to work with Java 8/11/17: > [https://spark.apache.org/docs/latest/] > Is this an overlook, or is Java 20 officially supported, too? > Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43539) Assign a name to the error class _LEGACY_ERROR_TEMP_0003
BingKun Pan created SPARK-43539: --- Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_0003 Key: SPARK-43539 URL: https://issues.apache.org/jira/browse/SPARK-43539 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.5.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43536) Statsd sink reporter reports incorrect counter metrics.
[ https://issues.apache.org/jira/browse/SPARK-43536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723415#comment-17723415 ] Ignite TC Bot commented on SPARK-43536: --- User 'venkateshbalaji99' has created a pull request for this issue: https://github.com/apache/spark/pull/41199 > Statsd sink reporter reports incorrect counter metrics. > --- > > Key: SPARK-43536 > URL: https://issues.apache.org/jira/browse/SPARK-43536 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.3 >Reporter: Abhishek Modi >Priority: Major > > There is a mismatch between the definition of counter metrics between > dropwizard (which is used by spark) and statsD. While Dropwizard interprets > counters as cumulative metrics, statsD interprets them as delta metrics. This > causes double aggregation in statsd causing inconsistent metrics. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43538) Spark Homebrew Formulae currently depends on non-officially-supported Java 20
[ https://issues.apache.org/jira/browse/SPARK-43538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ghislain Fourny updated SPARK-43538: Summary: Spark Homebrew Formulae currently depends on non-officially-supported Java 20 (was: Spark Homebrew recipe currently depends on non-officially-supported Java 20) > Spark Homebrew Formulae currently depends on non-officially-supported Java 20 > - > > Key: SPARK-43538 > URL: https://issues.apache.org/jira/browse/SPARK-43538 > Project: Spark > Issue Type: Request > Components: Java API >Affects Versions: 3.2.4, 3.3.2, 3.4.0 > Environment: Homebrew (e.g., macOS) >Reporter: Ghislain Fourny >Priority: Minor > > I am not sure if homebrew-related issues can also be reported here? The > Homebrew formulae for apache-spark runs on (latest) openjdk 20. > [https://formulae.brew.sh/formula/apache-spark] > However, Apache Spark is documented to work with Java 8/11/17: > [https://spark.apache.org/docs/latest/] > Is this an overlook, or is Java 20 officially supported, too? > Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43538) Spark Homebrew recipe currently depends on non-officially-supported Java 20
[ https://issues.apache.org/jira/browse/SPARK-43538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ghislain Fourny updated SPARK-43538: Summary: Spark Homebrew recipe currently depends on non-officially-supported Java 20 (was: Spark Homebrew recipe falls back to non-officially-supported Java 20) > Spark Homebrew recipe currently depends on non-officially-supported Java 20 > --- > > Key: SPARK-43538 > URL: https://issues.apache.org/jira/browse/SPARK-43538 > Project: Spark > Issue Type: Request > Components: Java API >Affects Versions: 3.2.4, 3.3.2, 3.4.0 > Environment: Homebrew (e.g., macOS) >Reporter: Ghislain Fourny >Priority: Minor > > I am not sure if homebrew-related issues can also be reported here? The > Homebrew formulae for apache-spark runs on (latest) openjdk 20. > [https://formulae.brew.sh/formula/apache-spark] > However, Apache Spark is documented to work with Java 8/11/17: > [https://spark.apache.org/docs/latest/] > Is this an overlook, or is Java 20 officially supported, too? > Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43538) Spark Homebrew recipe falls back to non-officially-supported Java 20
Ghislain Fourny created SPARK-43538:

Summary: Spark Homebrew recipe falls back to non-officially-supported Java 20
Key: SPARK-43538
URL: https://issues.apache.org/jira/browse/SPARK-43538
Project: Spark
Issue Type: Request
Components: Java API
Affects Versions: 3.4.0, 3.3.2, 3.2.4
Environment: Homebrew (e.g., macOS)
Reporter: Ghislain Fourny

I am not sure whether Homebrew-related issues can also be reported here. The Homebrew formula for apache-spark runs on the latest OpenJDK 20: [https://formulae.brew.sh/formula/apache-spark]. However, Apache Spark is documented to work with Java 8/11/17: [https://spark.apache.org/docs/latest/]. Is this an oversight, or is Java 20 officially supported, too? Thanks!
[jira] [Created] (SPARK-43537) Upgrade the asm deps in the tools module to 9.4
Yang Jie created SPARK-43537: Summary: Upgrade the asm deps in the tools module to 9.4 Key: SPARK-43537 URL: https://issues.apache.org/jira/browse/SPARK-43537 Project: Spark Issue Type: Improvement Components: Build, Project Infra Affects Versions: 3.5.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43536) Statsd sink reporter reports incorrect counter metrics.
Abhishek Modi created SPARK-43536:

Summary: Statsd sink reporter reports incorrect counter metrics
Key: SPARK-43536
URL: https://issues.apache.org/jira/browse/SPARK-43536
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.1.3
Reporter: Abhishek Modi

There is a mismatch in the definition of counter metrics between Dropwizard (which Spark uses) and StatsD. Dropwizard interprets counters as cumulative metrics, while StatsD interprets them as delta metrics. This causes double aggregation on the StatsD side, producing inconsistent metrics.
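The double aggregation can be illustrated with a minimal sketch (plain Python, not Spark's StatsdSink code): a Dropwizard-style counter reports its cumulative value on every flush, while a StatsD server sums whatever values it receives, so forwarding cumulative snapshots over-counts. Reporting only the delta since the previous flush avoids this.

```python
# Counter value observed at three successive report intervals
# (one increment, then two more, then three more: true total is 6).
cumulative_snapshots = [1, 3, 6]

# Naive forwarding: StatsD sums the cumulative snapshots,
# aggregating already-aggregated values.
statsd_naive = sum(cumulative_snapshots)  # over-counted

# Fix: send only the increment since the previous flush.
deltas, prev = [], 0
for snap in cumulative_snapshots:
    deltas.append(snap - prev)
    prev = snap
statsd_delta = sum(deltas)  # matches the true counter value
```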
[jira] [Commented] (SPARK-43535) Adjust the ImportOrderChecker rule to resolve long-standing import issues
[ https://issues.apache.org/jira/browse/SPARK-43535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723389#comment-17723389 ] ASF GitHub Bot commented on SPARK-43535: User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/41184 > Adjust the ImportOrderChecker rule to resolve long-standing import issues > - > > Key: SPARK-43535 > URL: https://issues.apache.org/jira/browse/SPARK-43535 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723388#comment-17723388 ] caican edited comment on SPARK-43526 at 5/17/23 9:03 AM: - [~yumwang] Tpcds tests show performance gains for most queries and we plan to use shuffledHashJoin preferentially to eliminate sort consumption when the small table meets a certain threshold, but q95 in tpcds has a serious performance regression and we are not sure if it can be turned on by default. with shuffledHashJoin: !image-2023-05-17-16-53-42-302.png|width=691,height=344! sortMergeJoin is preferred: !image-2023-05-17-16-54-59-053.png|width=722,height=319! was (Author: JIRAUSER280464): [~yumwang] Tpcds tests show performance gains for most queries and we plan to use shuffledHashJoin preferentially to eliminate sort consumption when the small table meets a certain threshold, but q95 in tpcds has a serious performance regression and we are not sure if it can be turned on by default. with shuffledHashJoin: !image-2023-05-17-16-53-42-302.png|width=691,height=344! without shuffledHashJoin: !image-2023-05-17-16-54-59-053.png|width=722,height=319! > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-21-35-493.png, > image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, > image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, > image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, > image-2023-05-17-16-54-59-053.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. 
> > Performance difference: from 3.9min (sortMergeJoin) to > 8.1min (shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, GC is very heavy, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it? Thanks!
[jira] [Comment Edited] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723388#comment-17723388 ] caican edited comment on SPARK-43526 at 5/17/23 9:02 AM: - [~yumwang] Tpcds tests show performance gains for most queries and we plan to use shuffledHashJoin preferentially to eliminate sort consumption when the small table meets a certain threshold, but q95 in tpcds has a serious performance regression and we are not sure if it can be turned on by default. with shuffledHashJoin: !image-2023-05-17-16-53-42-302.png|width=691,height=344! without shuffledHashJoin: !image-2023-05-17-16-54-59-053.png|width=722,height=319! was (Author: JIRAUSER280464): [~yumwang] We plan to use shuffledHashJoin preferentially to eliminate sort consumption when the small table meets a certain threshold, but q95 in tpcds has a serious performance regression and we are not sure if it can be turned on by default. with shuffledHashJoin: !image-2023-05-17-16-53-42-302.png|width=691,height=344! without shuffledHashJoin: !image-2023-05-17-16-54-59-053.png|width=722,height=319! > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-21-35-493.png, > image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, > image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, > image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, > image-2023-05-17-16-54-59-053.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. 
enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, GC is very heavy, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it? Thanks!
[jira] [Updated] (SPARK-43523) Memory leak in Spark UI
[ https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amine Bagdouri updated SPARK-43523: --- Description: We have a distributed Spark application running on Azure HDInsight using Spark version 2.4.4. After a few days of active processing on our application, we have noticed that the GC CPU time ratio of the driver is close to 100%. We suspected a memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse Memory Analyzer. Here is some interesting data from the driver's heap dump (heap size is 8 GB): * The estimated retained heap size of String objects (~5M instances) is 3.3 GB. It seems that most of these instances correspond to spark events. * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB. * The number of LiveJob objects with status "RUNNING" is 18K, knowing that there shouldn't be more than 16 live running jobs since we use a fixed size thread pool of 16 threads to run spark queries. * The number of LiveTask objects is 485K. * The AsyncEventQueue instance associated to the AppStatusListener has a value of 854 for dropped events count and a value of 10001 for total events count, knowing that the dropped events counter is reset every minute and that the queue's default capacity is 1. We think that there is a memory leak in Spark UI. Here is our analysis of the root cause of this leak: * AppStatusListener is notified of Spark events using a bounded queue in AsyncEventQueue. * AppStatusListener updates its state (kvstore, liveTasks, liveStages, liveJobs, ...) based on the received events. For example, onTaskStart adds a task to liveTasks map and onTaskEnd removes the task from liveTasks map. * When the rate of events is very high, the bounded queue in AsyncEventQueue is full, some events are dropped and don't make it to AppStatusListener. * Dropped events that signal the end of a processing unit prevent the state of AppStatusListener from being cleaned. 
For example, a dropped onTaskEnd event will prevent the task from being removed from the liveTasks map, and the task will remain in the heap until the driver's JVM is stopped. We were able to confirm our analysis by reducing the capacity of the AsyncEventQueue (spark.scheduler.listenerbus.eventqueue.capacity=10). After having launched many Spark queries using this config, we observed that the number of active jobs in Spark UI increased rapidly and remained high even though all submitted queries had completed. We have also noticed that some executor task counters in Spark UI were negative, which confirms that AppStatusListener state does not accurately reflect reality and that it can be a victim of event drops. Suggested fix: There are some limits today on the number of "dead" objects in AppStatusListener's maps (for example: spark.ui.retainedJobs). We suggest enforcing another configurable limit on the total number of objects in AppStatusListener's maps and kvstore. This should limit the leak in the case of a high event rate, but AppStatusListener stats will remain inaccurate.
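The mitigations available today can be expressed as configuration. A hedged sketch (these properties exist in current Spark releases, but the values below are purely illustrative and must be tuned per workload): a larger listener queue reduces event drops, and tighter retention limits bound how many "dead" UI objects the driver keeps on the heap.

{code:none}
# Illustrative values only, not recommendations.
# Larger queue -> fewer events dropped before reaching AppStatusListener.
spark.scheduler.listenerbus.eventqueue.capacity=30000
# Smaller retention -> fewer completed jobs/stages/tasks held by the Spark UI.
spark.ui.retainedJobs=200
spark.ui.retainedStages=200
spark.ui.retainedTasks=10000
{code}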
[jira] [Commented] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723388#comment-17723388 ] caican commented on SPARK-43526: [~yumwang] We plan to use shuffledHashJoin preferentially to eliminate sort consumption when the small table meets a certain threshold, but q95 in tpcds has a serious performance regression and we are not sure if it can be turned on by default. with shuffledHashJoin: !image-2023-05-17-16-53-42-302.png|width=691,height=344! without shuffledHashJoin: !image-2023-05-17-16-54-59-053.png|width=722,height=319! > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-21-35-493.png, > image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, > image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, > image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, > image-2023-05-17-16-54-59-053.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. 
> !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it? Thanks!
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-17-16-54-59-053.png > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-21-35-493.png, > image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, > image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, > image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, > image-2023-05-17-16-54-59-053.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-17-16-53-42-302.png > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-21-35-493.png, > image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, > image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, > image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
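For anyone reproducing the q95 comparison above, the join strategy can be steered with existing configuration and join hints. A hedged sketch (table names are placeholders, and the settings shown are a common way to run this A/B test, not the reporter's exact setup):

{code:sql}
-- Bias the planner away from sort-merge join for equi-joins.
SET spark.sql.join.preferSortMergeJoin=false;
-- Disable broadcast so the test compares SHJ vs SMJ only.
SET spark.sql.autoBroadcastJoinThreshold=-1;

-- Or force the strategy per-query with join hints (Spark 3.0+):
SELECT /*+ SHUFFLE_HASH(small_tbl) */ *
FROM big_tbl JOIN small_tbl ON big_tbl.k = small_tbl.k;

-- Sort-merge baseline for the same query:
SELECT /*+ MERGE(small_tbl) */ *
FROM big_tbl JOIN small_tbl ON big_tbl.k = small_tbl.k;
{code}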
[jira] [Comment Edited] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions
[ https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723216#comment-17723216 ] Svyatoslav Semenyuk edited comment on SPARK-43514 at 5/17/23 8:35 AM: -- -We applied "current workaround" to application code and this does not solve the issue.- UPD: issue was resolved in application by calling `.cache()` DF method. was (Author: JIRAUSER300434): ~We applied "current workaround" to application code and this does not solve the issue.~ UPD: issue was resolved in application by calling `.cache()` DF method. > Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML > features caused by certain SQL functions > -- > > Key: SPARK-43514 > URL: https://issues.apache.org/jira/browse/SPARK-43514 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 3.3.2, 3.4.0 > Environment: Scala version: 2.12.17 > Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0. > Spark 3.3.2 deployed on cluster was used to check the issue on real data. >Reporter: Svyatoslav Semenyuk >Priority: Major > Labels: ml, sql > > We designed a function that joins two DFs on common column with some > similarity. All next code will be on Scala 2.12. > I've added {{show}} calls for demonstration purposes. > {code:scala} > import org.apache.spark.ml.Pipeline > import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, > RegexTokenizer, MinHashLSHModel} > import org.apache.spark.sql.{DataFrame, Column} > /** > * Joins two data frames on a string column using LSH algorithm > * for similarity computation. > * > * If input data frames have columns with identical names, > * the resulting dataframe will have columns from them both > * with prefixes `datasetA` and `datasetB` respectively. > * > * For example, if both dataframes have a column with name `myColumn`, > * then the result will have columns `datasetAMyColumn` and > `datasetBMyColumn`. 
> */ > def similarityJoin( > df: DataFrame, > anotherDf: DataFrame, > joinExpr: String, > threshold: Double = 0.8, > ): DataFrame = { > df.show(false) > anotherDf.show(false) > val pipeline = new Pipeline().setStages(Array( > new RegexTokenizer() > .setPattern("") > .setMinTokenLength(1) > .setInputCol(joinExpr) > .setOutputCol("tokens"), > new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"), > new HashingTF().setInputCol("ngrams").setOutputCol("vectors"), > new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"), > ) > ) > val model = pipeline.fit(df) > val storedHashed = model.transform(df) > val landedHashed = model.transform(anotherDf) > val commonColumns = df.columns.toSet & anotherDf.columns.toSet > /** > * Converts column name from a data frame to the column of resulting > dataset. > */ > def convertColumn(datasetName: String)(columnName: String): Column = { > val newName = > if (commonColumns.contains(columnName)) > s"$datasetName${columnName.capitalize}" > else columnName > col(s"$datasetName.$columnName") as newName > } > val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++ > anotherDf.columns.map(convertColumn("datasetB")) > val result = model > .stages > .last > .asInstanceOf[MinHashLSHModel] > .approxSimilarityJoin(storedHashed, landedHashed, threshold, > "confidence") > .select(columnsToSelect.toSeq: _*) > result.show(false) > result > } > {code} > Now consider such simple example: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example runs with no errors and outputs 3 empty DFs. 
Let's add > {{distinct}} method to one data frame: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example outputs two empty DFs and then fails at {{result.show(false)}}. > Error: > {code:none} > org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user > defined function (LSHModel$$Lambda$3769/0x000101804840: > (struct,values:array>) => > array,values:array>>). > ... many elided > Caused by: java.lang.IllegalArgumentException: requirement failed: Must have > at least 1 non zero entry. > at scala.Predef$.require(Predef.scala:281) > at org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61) > at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99) > ... many more > {code}
[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions
[ https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Svyatoslav Semenyuk updated SPARK-43514: Description: We designed a function that joins two DFs on common column with some similarity. All next code will be on Scala 2.12. I've added {{show}} calls for demonstration purposes. {code:scala} import org.apache.spark.ml.Pipeline import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, RegexTokenizer, MinHashLSHModel} import org.apache.spark.sql.{DataFrame, Column} /** * Joins two data frames on a string column using LSH algorithm * for similarity computation. * * If input data frames have columns with identical names, * the resulting dataframe will have columns from them both * with prefixes `datasetA` and `datasetB` respectively. * * For example, if both dataframes have a column with name `myColumn`, * then the result will have columns `datasetAMyColumn` and `datasetBMyColumn`. */ def similarityJoin( df: DataFrame, anotherDf: DataFrame, joinExpr: String, threshold: Double = 0.8, ): DataFrame = { df.show(false) anotherDf.show(false) val pipeline = new Pipeline().setStages(Array( new RegexTokenizer() .setPattern("") .setMinTokenLength(1) .setInputCol(joinExpr) .setOutputCol("tokens"), new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"), new HashingTF().setInputCol("ngrams").setOutputCol("vectors"), new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"), ) ) val model = pipeline.fit(df) val storedHashed = model.transform(df) val landedHashed = model.transform(anotherDf) val commonColumns = df.columns.toSet & anotherDf.columns.toSet /** * Converts column name from a data frame to the column of resulting dataset. 
*/ def convertColumn(datasetName: String)(columnName: String): Column = { val newName = if (commonColumns.contains(columnName)) s"$datasetName${columnName.capitalize}" else columnName col(s"$datasetName.$columnName") as newName } val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++ anotherDf.columns.map(convertColumn("datasetB")) val result = model .stages .last .asInstanceOf[MinHashLSHModel] .approxSimilarityJoin(storedHashed, landedHashed, threshold, "confidence") .select(columnsToSelect.toSeq: _*) result.show(false) result } {code} Now consider such simple example: {code:scala} val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1" val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" similarityJoin(inputDF1, inputDF2, "name", 0.6) {code} This example runs with no errors and outputs 3 empty DFs. Let's add {{distinct}} method to one data frame: {code:scala} val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 2) as "df1" val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" similarityJoin(inputDF1, inputDF2, "name", 0.6) {code} This example outputs two empty DFs and then fails at {{result.show(false)}}. Error: {code:none} org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user defined function (LSHModel$$Lambda$3769/0x000101804840: (struct,values:array>) => array,values:array>>). ... many elided Caused by: java.lang.IllegalArgumentException: requirement failed: Must have at least 1 non zero entry. at scala.Predef$.require(Predef.scala:281) at org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61) at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99) ... many more {code} Now let's take a look on the example which is close to our application code. 
Define some helper functions: {code:scala} import org.apache.spark.sql.functions._ def process1(df: DataFrame): Unit = { val companies = df.select($"id", $"name") val directors = df .select(explode($"directors")) .select($"col.name", $"col.id") .dropDuplicates("id") val toBeMatched1 = companies .filter(length($"name") > 2) .select( $"name", $"id" as "sourceLegalEntityId", ) val toBeMatched2 = directors .filter(length($"name") > 2) .select( $"name", $"id" as "directorId", ) similarityJoin(toBeMatched1, toBeMatched2, "name", 0.6) } def process2(df: DataFrame): Unit = { def process_financials(column: Column): Column = { transform( column, x => x.withField("date", to_timestamp(x("date"), "dd MMM ")), ) }
[jira] [Comment Edited] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions
[ https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723216#comment-17723216 ] Svyatoslav Semenyuk edited comment on SPARK-43514 at 5/17/23 8:34 AM: -- ~We applied "current workaround" to application code and this does not solve the issue.~ UPD: issue was resolved in application by calling `.cache()` DF method. was (Author: JIRAUSER300434): We applied "current workaround" to application code and this does not solve the issue. > Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML > features caused by certain SQL functions > -- > > Key: SPARK-43514 > URL: https://issues.apache.org/jira/browse/SPARK-43514 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 3.3.2, 3.4.0 > Environment: Scala version: 2.12.17 > Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0. > Spark 3.3.2 deployed on cluster was used to check the issue on real data. >Reporter: Svyatoslav Semenyuk >Priority: Major > Labels: ml, sql > > We designed a function that joins two DFs on common column with some > similarity. All next code will be on Scala 2.12. > I've added {{show}} calls for demonstration purposes. > {code:scala} > import org.apache.spark.ml.Pipeline > import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, > RegexTokenizer, MinHashLSHModel} > import org.apache.spark.sql.{DataFrame, Column} > /** > * Joins two data frames on a string column using LSH algorithm > * for similarity computation. > * > * If input data frames have columns with identical names, > * the resulting dataframe will have columns from them both > * with prefixes `datasetA` and `datasetB` respectively. > * > * For example, if both dataframes have a column with name `myColumn`, > * then the result will have columns `datasetAMyColumn` and > `datasetBMyColumn`. 
> */ > def similarityJoin( > df: DataFrame, > anotherDf: DataFrame, > joinExpr: String, > threshold: Double = 0.8, > ): DataFrame = { > df.show(false) > anotherDf.show(false) > val pipeline = new Pipeline().setStages(Array( > new RegexTokenizer() > .setPattern("") > .setMinTokenLength(1) > .setInputCol(joinExpr) > .setOutputCol("tokens"), > new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"), > new HashingTF().setInputCol("ngrams").setOutputCol("vectors"), > new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"), > ) > ) > val model = pipeline.fit(df) > val storedHashed = model.transform(df) > val landedHashed = model.transform(anotherDf) > val commonColumns = df.columns.toSet & anotherDf.columns.toSet > /** > * Converts column name from a data frame to the column of resulting > dataset. > */ > def convertColumn(datasetName: String)(columnName: String): Column = { > val newName = > if (commonColumns.contains(columnName)) > s"$datasetName${columnName.capitalize}" > else columnName > col(s"$datasetName.$columnName") as newName > } > val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++ > anotherDf.columns.map(convertColumn("datasetB")) > val result = model > .stages > .last > .asInstanceOf[MinHashLSHModel] > .approxSimilarityJoin(storedHashed, landedHashed, threshold, > "confidence") > .select(columnsToSelect.toSeq: _*) > result.show(false) > result > } > {code} > Now consider such simple example: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example runs with no errors and outputs 3 empty DFs. 
Let's add > {{distinct}} method to one data frame: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example outputs two empty DFs and then fails at {{result.show(false)}}. > Error: > {code:none} > org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user > defined function (LSHModel$$Lambda$3769/0x000101804840: > (struct,values:array>) => > array,values:array>>). > ... many elided > Caused by: java.lang.IllegalArgumentException: requirement failed: Must have > at least 1 non zero entry. > at scala.Predef$.require(Predef.scala:281) > at org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61) > at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99) > ... many more > {code}
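The reporter's updated comment says the application-level fix was calling {{.cache()}} on the DataFrame. A minimal sketch of that workaround applied to the example above (assuming the same {{similarityJoin}} helper and imports as in the earlier snippets):

{code:scala}
// Workaround described in the comment: materialize the distinct()-ed frame
// with .cache() before filtering, so downstream re-evaluation inside the
// LSH UDF no longer re-runs the distinct()/filter pipeline.
val inputDF1 = Seq("", null).toDF("name")
  .distinct()
  .cache()                          // materialize once
  .filter(length($"name") > 2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}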
[jira] [Commented] (SPARK-43534) Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided
[ https://issues.apache.org/jira/browse/SPARK-43534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723373#comment-17723373 ] Yuming Wang commented on SPARK-43534: - https://github.com/apache/spark/pull/41195 > Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided > -- > > Key: SPARK-43534 > URL: https://issues.apache.org/jira/browse/SPARK-43534 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > Attachments: hadoop log jars.png, log4j-1.2-api-2.20.0.jar, > log4j-slf4j2-impl-2.20.0.jar > > > Build Spark: > {code:sh} > ./dev/make-distribution.sh --name default --tgz -Phive -Phive-thriftserver > -Pyarn -Phadoop-provided > tar -zxf spark-3.5.0-SNAPSHOT-bin-default.tgz {code} > Remove the following jars from spark-3.5.0-SNAPSHOT-bin-default: > {noformat} > jars/log4j-1.2-api-2.20.0.jar > jars/log4j-slf4j2-impl-2.20.0.jar > {noformat} > Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-default/conf: > {code:none} > rootLogger.level = info > rootLogger.appenderRef.file.ref = File > rootLogger.appenderRef.stderr.ref = console > appender.console.type = Console > appender.console.name = console > appender.console.target = SYSTEM_ERR > appender.console.layout.type = PatternLayout > appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L > : %m%n > appender.file.type = RollingFile > appender.file.name = File > appender.file.fileName = /tmp/spark/logs/spark.log > appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log > appender.file.append = true > appender.file.layout.type = PatternLayout > appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : > %m%n > appender.file.policies.type = Policies > appender.file.policies.time.type = TimeBasedTriggeringPolicy > appender.file.policies.time.interval = 1 > appender.file.policies.time.modulate = true > appender.file.policies.size.type = 
SizeBasedTriggeringPolicy > appender.file.policies.size.size = 256M > appender.file.strategy.type = DefaultRolloverStrategy > appender.file.strategy.max = 100 > {code} > Start Spark thriftserver: > {code:java} > sbin/start-thriftserver.sh > {code} > Check the log: > {code:sh} > cat /tmp/spark/logs/spark.log > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
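The reproduction above boils down to checking whether a hadoop-provided distribution still ships the two Log4j bridge jars before the thriftserver is started. A minimal sketch of such a check; the `check_log4j_bridges` helper and its "missing:" output format are illustrative only, not part of Spark:

```shell
#!/bin/sh
# Hypothetical helper: report which Log4j bridge jars are missing from a
# Spark distribution's jars/ directory. Returns non-zero if any are absent.
check_log4j_bridges() {
  jars_dir="$1"
  missing=0
  for pattern in 'log4j-1.2-api-*.jar' 'log4j-slf4j2-impl-*.jar'; do
    # Unquoted $pattern lets the shell glob; ls fails when nothing matches.
    if ! ls "$jars_dir"/$pattern >/dev/null 2>&1; then
      echo "missing: $pattern"
      missing=1
    fi
  done
  return $missing
}
```

Run as `check_log4j_bridges spark-3.5.0-SNAPSHOT-bin-default/jars`; with the two jars removed as in the steps above, it would report both as missing.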
[jira] [Updated] (SPARK-43534) Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided
[ https://issues.apache.org/jira/browse/SPARK-43534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43534: Attachment: hadoop log jars.png
[jira] [Updated] (SPARK-43534) Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided
[ https://issues.apache.org/jira/browse/SPARK-43534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43534: Description: Build Spark: {code:sh} ./dev/make-distribution.sh --name default --tgz -Phive -Phive-thriftserver -Pyarn -Phadoop-provided tar -zxf spark-3.5.0-SNAPSHOT-bin-default.tgz {code} Remove the following jars from spark-3.5.0-SNAPSHOT-bin-default: {noformat} jars/log4j-1.2-api-2.20.0.jar jars/log4j-slf4j2-impl-2.20.0.jar {noformat} Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-default/conf: {code:none} rootLogger.level = info rootLogger.appenderRef.file.ref = File rootLogger.appenderRef.stderr.ref = console appender.console.type = Console appender.console.name = console appender.console.target = SYSTEM_ERR appender.console.layout.type = PatternLayout appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n appender.file.type = RollingFile appender.file.name = File appender.file.fileName = /tmp/spark/logs/spark.log appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log appender.file.append = true appender.file.layout.type = PatternLayout appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n appender.file.policies.type = Policies appender.file.policies.time.type = TimeBasedTriggeringPolicy appender.file.policies.time.interval = 1 appender.file.policies.time.modulate = true appender.file.policies.size.type = SizeBasedTriggeringPolicy appender.file.policies.size.size = 256M appender.file.strategy.type = DefaultRolloverStrategy appender.file.strategy.max = 100 {code} Start Spark thriftserver: {code:java} sbin/start-thriftserver.sh {code} Check the log: {code:sh} cat /tmp/spark/logs/spark.log {code} was: Build Spark: {code:sh} ./dev/make-distribution.sh --name default --tgz -Phive -Phive-thriftserver -Pyarn -Phadoop-provided tar -zxf spark-3.5.0-SNAPSHOT-bin-default.tgz {code} Remove the following jars to spark-3.5.0-SNAPSHOT-bin-default: 
{noformat} jars/log4j-1.2-api-2.20.0.jar jars/log4j-slf4j2-impl-2.20.0.jar {noformat} Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-default/conf: {code:none} rootLogger.level = info rootLogger.appenderRef.file.ref = File rootLogger.appenderRef.stderr.ref = console appender.console.type = Console appender.console.name = console appender.console.target = SYSTEM_ERR appender.console.layout.type = PatternLayout appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n appender.file.type = RollingFile appender.file.name = File appender.file.fileName = /tmp/spark/logs/spark.log appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log appender.file.append = true appender.file.layout.type = PatternLayout appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n appender.file.policies.type = Policies appender.file.policies.time.type = TimeBasedTriggeringPolicy appender.file.policies.time.interval = 1 appender.file.policies.time.modulate = true appender.file.policies.size.type = SizeBasedTriggeringPolicy appender.file.policies.size.size = 256M appender.file.strategy.type = DefaultRolloverStrategy appender.file.strategy.max = 100 {code} Start Spark thriftserver: {code:java} sbin/start-thriftserver.sh {code} Check the log: {code:sh} cat /tmp/spark/logs/spark.log {code} > Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided > -- > > Key: SPARK-43534 > URL: https://issues.apache.org/jira/browse/SPARK-43534 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > Attachments: log4j-1.2-api-2.20.0.jar, log4j-slf4j2-impl-2.20.0.jar > > > Build Spark: > {code:sh} > ./dev/make-distribution.sh --name default --tgz -Phive -Phive-thriftserver > -Pyarn -Phadoop-provided > tar -zxf spark-3.5.0-SNAPSHOT-bin-default.tgz {code} > Remove the following jars from spark-3.5.0-SNAPSHOT-bin-default: > {noformat} > 
jars/log4j-1.2-api-2.20.0.jar > jars/log4j-slf4j2-impl-2.20.0.jar > {noformat} > Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-default/conf: > {code:none} > rootLogger.level = info > rootLogger.appenderRef.file.ref = File > rootLogger.appenderRef.stderr.ref = console > appender.console.type = Console > appender.console.name = console > appender.console.target = SYSTEM_ERR > appender.console.layout.type = PatternLayout > appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L > : %m%n > appender.file.type = RollingFile > appender.file.name = File > appender.file.fileName = /tmp/spark/logs/spark.log > appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log > appender.file.append = true > appender.file.layout.type = PatternLa
[jira] [Created] (SPARK-43535) Adjust the ImportOrderChecker rule to resolve long-standing import issues
BingKun Pan created SPARK-43535: --- Summary: Adjust the ImportOrderChecker rule to resolve long-standing import issues Key: SPARK-43535 URL: https://issues.apache.org/jira/browse/SPARK-43535 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: BingKun Pan
[jira] [Resolved] (SPARK-43128) Streaming progress struct (especially in Scala)
[ https://issues.apache.org/jira/browse/SPARK-43128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-43128. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40892 [https://github.com/apache/spark/pull/40892] > Streaming progress struct (especially in Scala) > --- > > Key: SPARK-43128 > URL: https://issues.apache.org/jira/browse/SPARK-43128 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > > Streaming spark connect transfers streaming progress as full “json”. > This works ok for Python since it does not have any schema defined. > But in Scala, it is a full fledged class. We need to decide if we want to > match legacy Progress struct in spark-connect.
[jira] [Assigned] (SPARK-43128) Streaming progress struct (especially in Scala)
[ https://issues.apache.org/jira/browse/SPARK-43128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-43128: Assignee: Yang Jie
[jira] [Updated] (SPARK-43534) Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided
[ https://issues.apache.org/jira/browse/SPARK-43534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43534: Description: Build Spark: {code:sh} ./dev/make-distribution.sh --name default --tgz -Phive -Phive-thriftserver -Pyarn -Phadoop-provided tar -zxf spark-3.5.0-SNAPSHOT-bin-default.tgz {code} Remove the following jars to spark-3.5.0-SNAPSHOT-bin-default: {noformat} jars/log4j-1.2-api-2.20.0.jar jars/log4j-slf4j2-impl-2.20.0.jar {noformat} Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-default/conf: {code:none} rootLogger.level = info rootLogger.appenderRef.file.ref = File rootLogger.appenderRef.stderr.ref = console appender.console.type = Console appender.console.name = console appender.console.target = SYSTEM_ERR appender.console.layout.type = PatternLayout appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n appender.file.type = RollingFile appender.file.name = File appender.file.fileName = /tmp/spark/logs/spark.log appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log appender.file.append = true appender.file.layout.type = PatternLayout appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n appender.file.policies.type = Policies appender.file.policies.time.type = TimeBasedTriggeringPolicy appender.file.policies.time.interval = 1 appender.file.policies.time.modulate = true appender.file.policies.size.type = SizeBasedTriggeringPolicy appender.file.policies.size.size = 256M appender.file.strategy.type = DefaultRolloverStrategy appender.file.strategy.max = 100 {code} Start Spark thriftserver: {code:java} sbin/start-thriftserver.sh {code} Check the log: {code:sh} cat /tmp/spark/logs/spark.log {code} was: Build Spark: {code:sh} ./dev/make-distribution.sh --name provided --tgz -Phive -Phive-thriftserver -Pyarn -Phadoop-provided tar -zxf spark-3.5.0-SNAPSHOT-bin-provided.tgz {code} Copy the following jars to 
spark-3.5.0-SNAPSHOT-bin-provided/jars/: {noformat} guava-14.0.1.jar hadoop-client-api-3.3.5.jar hadoop-client-runtime-3.3.5.jar hadoop-shaded-guava-1.1.1.jar hadoop-yarn-server-web-proxy-3.3.5.jar slf4j-api-2.0.7.jar {noformat} Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-provided/conf: {code:none} rootLogger.level = info rootLogger.appenderRef.file.ref = File rootLogger.appenderRef.stderr.ref = console appender.console.type = Console appender.console.name = console appender.console.target = SYSTEM_ERR appender.console.layout.type = PatternLayout appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n appender.file.type = RollingFile appender.file.name = File appender.file.fileName = /tmp/spark/logs/spark.log appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log appender.file.append = true appender.file.layout.type = PatternLayout appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n appender.file.policies.type = Policies appender.file.policies.time.type = TimeBasedTriggeringPolicy appender.file.policies.time.interval = 1 appender.file.policies.time.modulate = true appender.file.policies.size.type = SizeBasedTriggeringPolicy appender.file.policies.size.size = 256M appender.file.strategy.type = DefaultRolloverStrategy appender.file.strategy.max = 100 {code} Start Spark thriftserver: {code:java} sbin/start-thriftserver.sh {code} Check the log: {code:sh} cat /tmp/spark/logs/spark.log {code} > Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided > -- > > Key: SPARK-43534 > URL: https://issues.apache.org/jira/browse/SPARK-43534 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > Attachments: log4j-1.2-api-2.20.0.jar, log4j-slf4j2-impl-2.20.0.jar > > > Build Spark: > {code:sh} > ./dev/make-distribution.sh --name default --tgz -Phive -Phive-thriftserver > -Pyarn -Phadoop-provided > tar -zxf 
spark-3.5.0-SNAPSHOT-bin-default.tgz {code} > Remove the following jars to spark-3.5.0-SNAPSHOT-bin-default: > {noformat} > jars/log4j-1.2-api-2.20.0.jar > jars/log4j-slf4j2-impl-2.20.0.jar > {noformat} > Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-default/conf: > {code:none} > rootLogger.level = info > rootLogger.appenderRef.file.ref = File > rootLogger.appenderRef.stderr.ref = console > appender.console.type = Console > appender.console.name = console > appender.console.target = SYSTEM_ERR > appender.console.layout.type = PatternLayout > appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L > : %m%n > appender.file.type = RollingFile > appender.file.name = File > appender.file.fileName = /tmp/spark/logs/spark.log > appender.file.filePattern = /tmp/
[jira] [Updated] (SPARK-43534) Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided
[ https://issues.apache.org/jira/browse/SPARK-43534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43534: Attachment: log4j-slf4j2-impl-2.20.0.jar
[jira] [Updated] (SPARK-43534) Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided
[ https://issues.apache.org/jira/browse/SPARK-43534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43534: Attachment: log4j-1.2-api-2.20.0.jar
[jira] [Updated] (SPARK-43534) Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided
[ https://issues.apache.org/jira/browse/SPARK-43534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43534: Description: Build Spark: {code:sh} ./dev/make-distribution.sh --name provided --tgz -Phive -Phive-thriftserver -Pyarn -Phadoop-provided tar -zxf spark-3.5.0-SNAPSHOT-bin-provided.tgz {code} Copy the following jars to spark-3.5.0-SNAPSHOT-bin-provided/jars/: {noformat} guava-14.0.1.jar hadoop-client-api-3.3.5.jar hadoop-client-runtime-3.3.5.jar hadoop-shaded-guava-1.1.1.jar hadoop-yarn-server-web-proxy-3.3.5.jar slf4j-api-2.0.7.jar {noformat} Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-provided/conf: {code:none} rootLogger.level = info rootLogger.appenderRef.file.ref = File rootLogger.appenderRef.stderr.ref = console appender.console.type = Console appender.console.name = console appender.console.target = SYSTEM_ERR appender.console.layout.type = PatternLayout appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n appender.file.type = RollingFile appender.file.name = File appender.file.fileName = /tmp/spark/logs/spark.log appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log appender.file.append = true appender.file.layout.type = PatternLayout appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n appender.file.policies.type = Policies appender.file.policies.time.type = TimeBasedTriggeringPolicy appender.file.policies.time.interval = 1 appender.file.policies.time.modulate = true appender.file.policies.size.type = SizeBasedTriggeringPolicy appender.file.policies.size.size = 256M appender.file.strategy.type = DefaultRolloverStrategy appender.file.strategy.max = 100 {code} Start Spark thriftserver: {code:java} sbin/start-thriftserver.sh {code} Check the log: {code:sh} cat /tmp/spark/logs/spark.log {code} was: Build Spark: {code:sh} ./dev/make-distribution.sh --name provided --tgz -Phive -Phive-thriftserver -Pyarn -Phadoop-provided tar -zxf 
spark-3.5.0-SNAPSHOT-bin-provided.tgz {code} Copy the fellowing jars to spark-3.5.0-SNAPSHOT-bin-provided/jars/: {noformat} guava-14.0.1.jar hadoop-client-api-3.3.5.jar hadoop-client-runtime-3.3.5.jar hadoop-shaded-guava-1.1.1.jar hadoop-yarn-server-web-proxy-3.3.5.jar slf4j-api-2.0.7.jar {noformat} Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-provided/conf: {code:none} rootLogger.level = info rootLogger.appenderRef.file.ref = File rootLogger.appenderRef.stderr.ref = console appender.console.type = Console appender.console.name = console appender.console.target = SYSTEM_ERR appender.console.layout.type = PatternLayout appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n appender.file.type = RollingFile appender.file.name = File appender.file.fileName = /tmp/spark/logs/spark.log appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log appender.file.append = true appender.file.layout.type = PatternLayout appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n appender.file.policies.type = Policies appender.file.policies.time.type = TimeBasedTriggeringPolicy appender.file.policies.time.interval = 1 appender.file.policies.time.modulate = true appender.file.policies.size.type = SizeBasedTriggeringPolicy appender.file.policies.size.size = 256M appender.file.strategy.type = DefaultRolloverStrategy appender.file.strategy.max = 100 {code} Start Spark thriftserver: {code:java} sbin/start-thriftserver.sh {code} Check the log: {code:sh} cat /tmp/spark/logs/spark.log {code} > Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided > -- > > Key: SPARK-43534 > URL: https://issues.apache.org/jira/browse/SPARK-43534 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > Build Spark: > {code:sh} > ./dev/make-distribution.sh --name provided --tgz -Phive -Phive-thriftserver > -Pyarn -Phadoop-provided > tar -zxf 
spark-3.5.0-SNAPSHOT-bin-provided.tgz {code} > Copy the following jars to spark-3.5.0-SNAPSHOT-bin-provided/jars/: > {noformat} > guava-14.0.1.jar > hadoop-client-api-3.3.5.jar > hadoop-client-runtime-3.3.5.jar > hadoop-shaded-guava-1.1.1.jar > hadoop-yarn-server-web-proxy-3.3.5.jar > slf4j-api-2.0.7.jar > {noformat} > Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-provided/conf: > {code:none} > rootLogger.level = info > rootLogger.appenderRef.file.ref = File > rootLogger.appenderRef.stderr.ref = console > appender.console.type = Console > appender.console.name = console > appender.console.target = SYSTEM_ERR > appender.console.layout.type = PatternLayout > appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L > : %m%n > a
[jira] [Created] (SPARK-43534) Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided
Yuming Wang created SPARK-43534: --- Summary: Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided Key: SPARK-43534 URL: https://issues.apache.org/jira/browse/SPARK-43534 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.4.0 Reporter: Yuming Wang Build Spark: {code:sh} ./dev/make-distribution.sh --name provided --tgz -Phive -Phive-thriftserver -Pyarn -Phadoop-provided tar -zxf spark-3.5.0-SNAPSHOT-bin-provided.tgz {code} Copy the following jars to spark-3.5.0-SNAPSHOT-bin-provided/jars/: {noformat} guava-14.0.1.jar hadoop-client-api-3.3.5.jar hadoop-client-runtime-3.3.5.jar hadoop-shaded-guava-1.1.1.jar hadoop-yarn-server-web-proxy-3.3.5.jar slf4j-api-2.0.7.jar {noformat} Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-provided/conf: {code:none} rootLogger.level = info rootLogger.appenderRef.file.ref = File rootLogger.appenderRef.stderr.ref = console appender.console.type = Console appender.console.name = console appender.console.target = SYSTEM_ERR appender.console.layout.type = PatternLayout appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n appender.file.type = RollingFile appender.file.name = File appender.file.fileName = /tmp/spark/logs/spark.log appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log appender.file.append = true appender.file.layout.type = PatternLayout appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n appender.file.policies.type = Policies appender.file.policies.time.type = TimeBasedTriggeringPolicy appender.file.policies.time.interval = 1 appender.file.policies.time.modulate = true appender.file.policies.size.type = SizeBasedTriggeringPolicy appender.file.policies.size.size = 256M appender.file.strategy.type = DefaultRolloverStrategy appender.file.strategy.max = 100 {code} Start Spark thriftserver: {code:java} sbin/start-thriftserver.sh {code} Check the log: {code:sh} cat /tmp/spark/logs/spark.log {code}