[jira] [Updated] (SPARK-43491) In expression not compatible with EqualTo Expression

2023-05-17 Thread KuijianLiu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KuijianLiu updated SPARK-43491:
---
Description: 
The query results of Spark SQL 3.1.1 and Hive SQL 3.1.0 are inconsistent for the 
same SQL. Spark SQL evaluates {{0 in ('00')}} as false, which behaves differently 
from the {{=}} operator, while Hive evaluates it as true. Hive 3.1.0 handles the 
{{in}} keyword consistently with {{=}}, but Spark SQL does not.

When the data types of the elements in an {{In}} expression are the same, it 
should behave the same as a BinaryComparison such as {{EqualTo}}.

Test SQL:
{code:java}
scala> spark.sql("select 1 as test where 0 = '00'").show
+----+
|test|
+----+
|   1|
+----+

scala> spark.sql("select 1 as test where 0 in ('00')").show
+----+
|test|
+----+
+----+{code}
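
For reference, a possible workaround sketch (an assumption on my side, not a confirmed fix): casting the element so both sides of {{In}} share the same data type makes it line up with the {{=}} behaviour above.
{code:java}
// Hypothetical workaround sketch: cast the string element to the integer's type
// so the In comparison uses one data type; this is expected to return the row,
// matching the result of `0 = '00'`.
scala> spark.sql("select 1 as test where 0 in (cast('00' as int))").show
{code}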
 

!image-2023-05-13-13-14-55-853.png!

  was:
The query results of Spark SQL 3.1.1 and Hive SQL 3.1.0 are inconsistent for the 
same SQL. Spark SQL evaluates {{0 in ('00')}} as false, which behaves differently 
from the {{=}} operator, while Hive evaluates it as true. Hive 3.1.0 handles the 
{{in}} keyword consistently with {{=}}, but Spark SQL does not.

When the data types of the elements in an {{In}} expression are the same, it 
should behave the same as a BinaryComparison such as {{EqualTo}}.

Test SQL:
{code:java}
scala> spark.sql("select 1 as test where 0 = '00'").show
+----+
|test|
+----+
|   1|
+----+

scala> spark.sql("select 1 as test where 0 in ('00')").show
+----+
|test|
+----+
+----+{code}
!image-2023-05-13-13-15-50-685.png!

!image-2023-05-13-13-14-55-853.png!


> In expression not compatible with EqualTo Expression
> 
>
> Key: SPARK-43491
> URL: https://issues.apache.org/jira/browse/SPARK-43491
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: KuijianLiu
>Priority: Major
> Attachments: image-2023-05-13-13-14-55-853.png, 
> image-2023-05-13-13-15-50-685.png
>
>
> The query results of Spark SQL 3.1.1 and Hive SQL 3.1.0 are inconsistent for 
> the same SQL. Spark SQL evaluates {{0 in ('00')}} as false, which behaves 
> differently from the {{=}} operator, while Hive evaluates it as true. Hive 
> 3.1.0 handles the {{in}} keyword consistently with {{=}}, but Spark SQL does not.
> When the data types of the elements in an {{In}} expression are the same, it 
> should behave the same as a BinaryComparison such as {{EqualTo}}.
> Test SQL:
> {code:java}
> scala> spark.sql("select 1 as test where 0 = '00'").show
> +----+
> |test|
> +----+
> |   1|
> +----+
> scala> spark.sql("select 1 as test where 0 in ('00')").show
> +----+
> |test|
> +----+
> +----+{code}
>  
> !image-2023-05-13-13-14-55-853.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43522) Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'

2023-05-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-43522:
---

Assignee: Jia Fan

> Creating struct column occurs  error 'org.apache.spark.sql.AnalysisException 
> [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'
> -
>
> Key: SPARK-43522
> URL: https://issues.apache.org/jira/browse/SPARK-43522
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Heedo Lee
>Assignee: Jia Fan
>Priority: Minor
>
> When creating a struct column in a DataFrame, code that ran without 
> problems in version 3.3.1 no longer works in version 3.4.0.
>  
> Example
> {code:java}
> val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, 
> ",")).withColumn("map_entry", transform(col("key_value"), x => 
> struct(split(x, "=").getItem(0), split(x, "=").getItem(1) ) )){code}
>  
> In 3.3.1
>  
> {code:java}
>  
> testDF.show()
> +-----------+---------------+--------------------+ 
> |      value|      key_value|           map_entry| 
> +-----------+---------------+--------------------+ 
> |a=b,c=d,d=f|[a=b, c=d, d=f]|[{a, b}, {c, d}, ...| 
> +-----------+---------------+--------------------+
>  
> testDF.printSchema()
> root
>  |-- value: string (nullable = true)
>  |-- key_value: array (nullable = true)
>  |    |-- element: string (containsNull = false)
>  |-- map_entry: array (nullable = true)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- col1: string (nullable = true)
>  |    |    |-- col2: string (nullable = true)
> {code}
>  
>  
> In 3.4.0
>  
> {code:java}
> org.apache.spark.sql.AnalysisException: 
> [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING] Cannot 
> resolve "struct(split(namedlambdavariable(), =, -1)[0], 
> split(namedlambdavariable(), =, -1)[1])" due to data type mismatch: Only 
> foldable `STRING` expressions are allowed to appear at odd position, but they 
> are ["0", "1"].;
> 'Project [value#41, key_value#45, transform(key_value#45, 
> lambdafunction(struct(0, split(lambda x_3#49, =, -1)[0], 1, split(lambda 
> x_3#49, =, -1)[1]), lambda x_3#49, false)) AS map_entry#48]
> +- Project [value#41, split(value#41, ,, -1) AS key_value#45]
>    +- LocalRelation [value#41]  at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:269)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:256)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
> 
>  
> {code}
>  
> However, if you add an alias to the struct elements, you get the same result 
> as in the previous version.
>  
> {code:java}
> val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, 
> ",")).withColumn("map_entry", transform(col("key_value"), x => 
> struct(split(x, "=").getItem(0).as("col1") , split(x, 
> "=").getItem(1).as("col2") ) )){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43522) Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'

2023-05-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-43522.
-
Fix Version/s: 3.5.0
   3.4.1
   Resolution: Fixed

Issue resolved by pull request 41187
[https://github.com/apache/spark/pull/41187]

> Creating struct column occurs  error 'org.apache.spark.sql.AnalysisException 
> [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'
> -
>
> Key: SPARK-43522
> URL: https://issues.apache.org/jira/browse/SPARK-43522
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Heedo Lee
>Assignee: Jia Fan
>Priority: Minor
> Fix For: 3.5.0, 3.4.1
>
>
> When creating a struct column in a DataFrame, code that ran without 
> problems in version 3.3.1 no longer works in version 3.4.0.
>  
> Example
> {code:java}
> val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, 
> ",")).withColumn("map_entry", transform(col("key_value"), x => 
> struct(split(x, "=").getItem(0), split(x, "=").getItem(1) ) )){code}
>  
> In 3.3.1
>  
> {code:java}
>  
> testDF.show()
> +-----------+---------------+--------------------+ 
> |      value|      key_value|           map_entry| 
> +-----------+---------------+--------------------+ 
> |a=b,c=d,d=f|[a=b, c=d, d=f]|[{a, b}, {c, d}, ...| 
> +-----------+---------------+--------------------+
>  
> testDF.printSchema()
> root
>  |-- value: string (nullable = true)
>  |-- key_value: array (nullable = true)
>  |    |-- element: string (containsNull = false)
>  |-- map_entry: array (nullable = true)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- col1: string (nullable = true)
>  |    |    |-- col2: string (nullable = true)
> {code}
>  
>  
> In 3.4.0
>  
> {code:java}
> org.apache.spark.sql.AnalysisException: 
> [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING] Cannot 
> resolve "struct(split(namedlambdavariable(), =, -1)[0], 
> split(namedlambdavariable(), =, -1)[1])" due to data type mismatch: Only 
> foldable `STRING` expressions are allowed to appear at odd position, but they 
> are ["0", "1"].;
> 'Project [value#41, key_value#45, transform(key_value#45, 
> lambdafunction(struct(0, split(lambda x_3#49, =, -1)[0], 1, split(lambda 
> x_3#49, =, -1)[1]), lambda x_3#49, false)) AS map_entry#48]
> +- Project [value#41, split(value#41, ,, -1) AS key_value#45]
>    +- LocalRelation [value#41]  at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:269)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:256)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
> 
>  
> {code}
>  
> However, if you add an alias to the struct elements, you get the same result 
> as in the previous version.
>  
> {code:java}
> val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, 
> ",")).withColumn("map_entry", transform(col("key_value"), x => 
> struct(split(x, "=").getItem(0).as("col1") , split(x, 
> "=").getItem(1).as("col2") ) )){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43157) TreeNode tags can become corrupted and hang driver when the dataset is cached

2023-05-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-43157:
---

Assignee: Rob Reeves

> TreeNode tags can become corrupted and hang driver when the dataset is cached
> -
>
> Key: SPARK-43157
> URL: https://issues.apache.org/jira/browse/SPARK-43157
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 3.5.0
>Reporter: Rob Reeves
>Assignee: Rob Reeves
>Priority: Major
>
> If a cached dataset is used by multiple other datasets materialized in 
> separate threads it can corrupt the TreeNode.tags map in any of the cached 
> plan nodes. This will hang the driver forever. This happens because 
> TreeNode.tags is not thread-safe. How this happens:
>  # Multiple datasets are materialized at the same time in different threads 
> that reference the same cached dataset
>  # AdaptiveSparkPlanExec.onUpdatePlan will call ExplainMode.fromString
>  # ExplainUtils uses the TreeNode.tags map to store the operator Id for every 
> node in the plan. This is usually okay because the plan is cloned. When there 
> is an InMemoryTableScanExec, the InMemoryRelation.cachedPlan is not cloned, so 
> multiple threads can set the operator Id.
> Making the TreeNode.tags field thread-safe does not solve this problem 
> because there is still a correctness issue. The threads may be overwriting 
> each other's operator Ids, which could be different.
> Example stack trace of the infinite loop:
> {code:scala}
> scala.collection.mutable.HashTable.resize(HashTable.scala:265)
> scala.collection.mutable.HashTable.addEntry0(HashTable.scala:158)
> scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:170)
> scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> scala.collection.mutable.HashMap.put(HashMap.scala:126)
> scala.collection.mutable.HashMap.update(HashMap.scala:131)
> org.apache.spark.sql.catalyst.trees.TreeNode.setTagValue(TreeNode.scala:108)
> org.apache.spark.sql.execution.ExplainUtils$.setOpId$1(ExplainUtils.scala:134)
> …
> org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:175)
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.onUpdatePlan(AdaptiveSparkPlanExec.scala:662){code}
> Example to show the cachedPlan object is not cloned:
> {code:java}
> import org.apache.spark.sql.execution.SparkPlan
> import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec
> import spark.implicits._
> def findCacheOperator(plan: SparkPlan): Option[InMemoryTableScanExec] = {
>   if (plan.isInstanceOf[InMemoryTableScanExec]) {
>     Some(plan.asInstanceOf[InMemoryTableScanExec])
>   } else if (plan.children.isEmpty && plan.subqueries.isEmpty) {
>     None
>   } else {
>     (plan.subqueries.flatMap(p => findCacheOperator(p)) ++
>       plan.children.flatMap(findCacheOperator)).headOption
>   }
> }
> val df = spark.range(10).filter($"id" < 100).cache()
> val df1 = df.limit(1)
> val df2 = df.limit(1)
> // Get the cache operator (InMemoryTableScanExec) in each plan
> val plan1 = findCacheOperator(df1.queryExecution.executedPlan).get
> val plan2 = findCacheOperator(df2.queryExecution.executedPlan).get
> // Check if InMemoryTableScanExec references point to the same object
> println(plan1.eq(plan2))
> // returns false
> // Check if InMemoryRelation references point to the same object
> println(plan1.relation.eq(plan2.relation))
> // returns false
> // Check if the cached SparkPlan references point to the same object
> println(plan1.relation.cachedPlan.eq(plan2.relation.cachedPlan))
> // returns true
> // This shows that the cloned plan2 still has references to the original 
> plan1 {code}
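> A hedged reproduction sketch (my assumption: AQE is enabled so explain-string 
> updates fire): materializing df1 and df2 concurrently lets both threads write 
> operator-Id tags into the shared, un-cloned cachedPlan.
> {code:scala}
> // Hypothetical sketch reusing df1/df2 from the block above.
> import scala.concurrent.{Await, Future}
> import scala.concurrent.duration._
> import scala.concurrent.ExecutionContext.Implicits.global
> 
> // Both actions run concurrently, so both threads may touch the tags map of
> // the same (un-cloned) InMemoryRelation.cachedPlan nodes.
> val jobs = Seq(df1, df2).map(d => Future(d.collect()))
> jobs.foreach(j => Await.result(j, 10.minutes))
> {code}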



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43157) TreeNode tags can become corrupted and hang driver when the dataset is cached

2023-05-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-43157.
-
Fix Version/s: 3.5.0
   3.4.1
   Resolution: Fixed

Issue resolved by pull request 40812
[https://github.com/apache/spark/pull/40812]

> TreeNode tags can become corrupted and hang driver when the dataset is cached
> -
>
> Key: SPARK-43157
> URL: https://issues.apache.org/jira/browse/SPARK-43157
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 3.5.0
>Reporter: Rob Reeves
>Assignee: Rob Reeves
>Priority: Major
> Fix For: 3.5.0, 3.4.1
>
>
> If a cached dataset is used by multiple other datasets materialized in 
> separate threads it can corrupt the TreeNode.tags map in any of the cached 
> plan nodes. This will hang the driver forever. This happens because 
> TreeNode.tags is not thread-safe. How this happens:
>  # Multiple datasets are materialized at the same time in different threads 
> that reference the same cached dataset
>  # AdaptiveSparkPlanExec.onUpdatePlan will call ExplainMode.fromString
>  # ExplainUtils uses the TreeNode.tags map to store the operator Id for every 
> node in the plan. This is usually okay because the plan is cloned. When there 
> is an InMemoryTableScanExec, the InMemoryRelation.cachedPlan is not cloned, so 
> multiple threads can set the operator Id.
> Making the TreeNode.tags field thread-safe does not solve this problem 
> because there is still a correctness issue. The threads may be overwriting 
> each other's operator Ids, which could be different.
> Example stack trace of the infinite loop:
> {code:scala}
> scala.collection.mutable.HashTable.resize(HashTable.scala:265)
> scala.collection.mutable.HashTable.addEntry0(HashTable.scala:158)
> scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:170)
> scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> scala.collection.mutable.HashMap.put(HashMap.scala:126)
> scala.collection.mutable.HashMap.update(HashMap.scala:131)
> org.apache.spark.sql.catalyst.trees.TreeNode.setTagValue(TreeNode.scala:108)
> org.apache.spark.sql.execution.ExplainUtils$.setOpId$1(ExplainUtils.scala:134)
> …
> org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:175)
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.onUpdatePlan(AdaptiveSparkPlanExec.scala:662){code}
> Example to show the cachedPlan object is not cloned:
> {code:java}
> import org.apache.spark.sql.execution.SparkPlan
> import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec
> import spark.implicits._
> def findCacheOperator(plan: SparkPlan): Option[InMemoryTableScanExec] = {
>   if (plan.isInstanceOf[InMemoryTableScanExec]) {
>     Some(plan.asInstanceOf[InMemoryTableScanExec])
>   } else if (plan.children.isEmpty && plan.subqueries.isEmpty) {
>     None
>   } else {
>     (plan.subqueries.flatMap(p => findCacheOperator(p)) ++
>       plan.children.flatMap(findCacheOperator)).headOption
>   }
> }
> val df = spark.range(10).filter($"id" < 100).cache()
> val df1 = df.limit(1)
> val df2 = df.limit(1)
> // Get the cache operator (InMemoryTableScanExec) in each plan
> val plan1 = findCacheOperator(df1.queryExecution.executedPlan).get
> val plan2 = findCacheOperator(df2.queryExecution.executedPlan).get
> // Check if InMemoryTableScanExec references point to the same object
> println(plan1.eq(plan2))
> // returns false
> // Check if InMemoryRelation references point to the same object
> println(plan1.relation.eq(plan2.relation))
> // returns false
> // Check if the cached SparkPlan references point to the same object
> println(plan1.relation.cachedPlan.eq(plan2.relation.cachedPlan))
> // returns true
> // This shows that the cloned plan2 still has references to the original 
> plan1 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43571) Enable DateOpsTests.test_sub for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43571:
---

 Summary: Enable DateOpsTests.test_sub for pandas 2.0.0.
 Key: SPARK-43571
 URL: https://issues.apache.org/jira/browse/SPARK-43571
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable DateOpsTests.test_sub for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43570) Enable DateOpsTests.test_rsub for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43570:
---

 Summary: Enable DateOpsTests.test_rsub for pandas 2.0.0.
 Key: SPARK-43570
 URL: https://issues.apache.org/jira/browse/SPARK-43570
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable DateOpsTests.test_rsub for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43569) Remove workaround for HADOOP-14067

2023-05-17 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43569:
---

 Summary: Remove workaround for HADOOP-14067
 Key: SPARK-43569
 URL: https://issues.apache.org/jira/browse/SPARK-43569
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43568) Enable CategoricalIndexTests.test_categories_setter for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43568:
---

 Summary: Enable CategoricalIndexTests.test_categories_setter for 
pandas 2.0.0.
 Key: SPARK-43568
 URL: https://issues.apache.org/jira/browse/SPARK-43568
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable CategoricalIndexTests.test_categories_setter for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43548) Remove workaround for HADOOP-16255

2023-05-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43548:
-

Assignee: BingKun Pan

> Remove workaround for HADOOP-16255
> --
>
> Key: SPARK-43548
> URL: https://issues.apache.org/jira/browse/SPARK-43548
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43567) Enable CategoricalIndexTests.test_factorize for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43567:
---

 Summary: Enable CategoricalIndexTests.test_factorize for pandas 
2.0.0.
 Key: SPARK-43567
 URL: https://issues.apache.org/jira/browse/SPARK-43567
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable CategoricalIndexTests.test_factorize for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43548) Remove workaround for HADOOP-16255

2023-05-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43548.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41209
[https://github.com/apache/spark/pull/41209]

> Remove workaround for HADOOP-16255
> --
>
> Key: SPARK-43548
> URL: https://issues.apache.org/jira/browse/SPARK-43548
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43566) Enable CategoricalTests.test_categories_setter for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43566:
---

 Summary: Enable CategoricalTests.test_categories_setter for pandas 
2.0.0.
 Key: SPARK-43566
 URL: https://issues.apache.org/jira/browse/SPARK-43566
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable CategoricalTests.test_categories_setter for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43565) Enable CategoricalTests.test_as_ordered_unordered for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43565:
---

 Summary: Enable CategoricalTests.test_as_ordered_unordered for 
pandas 2.0.0.
 Key: SPARK-43565
 URL: https://issues.apache.org/jira/browse/SPARK-43565
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable CategoricalTests.test_as_ordered_unordered for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43564) Enable CategoricalTests.test_factorize for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43564:
---

 Summary: Enable CategoricalTests.test_factorize for pandas 2.0.0.
 Key: SPARK-43564
 URL: https://issues.apache.org/jira/browse/SPARK-43564
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable CategoricalTests.test_factorize for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43563) Enable CsvTests.test_read_csv_with_squeeze for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43563:
---

 Summary: Enable CsvTests.test_read_csv_with_squeeze for pandas 
2.0.0.
 Key: SPARK-43563
 URL: https://issues.apache.org/jira/browse/SPARK-43563
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable CsvTests.test_read_csv_with_squeeze for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43562) Enable DataFrameTests.test_append for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43562:
---

 Summary: Enable DataFrameTests.test_append for pandas 2.0.0.
 Key: SPARK-43562
 URL: https://issues.apache.org/jira/browse/SPARK-43562
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable DataFrameTests.test_append for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43561) Enable DataFrameConversionTests.test_to_latex for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43561:
---

 Summary: Enable DataFrameConversionTests.test_to_latex for pandas 
2.0.0.
 Key: SPARK-43561
 URL: https://issues.apache.org/jira/browse/SPARK-43561
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable DataFrameConversionTests.test_to_latex for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43559) Enable DataFrameSlowTests.test_iteritems for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43559:
---

 Summary: Enable DataFrameSlowTests.test_iteritems for pandas 2.0.0.
 Key: SPARK-43559
 URL: https://issues.apache.org/jira/browse/SPARK-43559
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable DataFrameSlowTests.test_iteritems for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43560) Enable DataFrameSlowTests.test_mad for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43560:
---

 Summary: Enable DataFrameSlowTests.test_mad for pandas 2.0.0.
 Key: SPARK-43560
 URL: https://issues.apache.org/jira/browse/SPARK-43560
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable DataFrameSlowTests.test_mad for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43558) Enable DataFrameSlowTests.test_product for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43558:
---

 Summary: Enable DataFrameSlowTests.test_product for pandas 2.0.0.
 Key: SPARK-43558
 URL: https://issues.apache.org/jira/browse/SPARK-43558
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable DataFrameSlowTests.test_product for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43557) Enable DataFrameSlowTests.test_between_time for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43557:
---

 Summary: Enable DataFrameSlowTests.test_between_time for pandas 
2.0.0.
 Key: SPARK-43557
 URL: https://issues.apache.org/jira/browse/SPARK-43557
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable DataFrameSlowTests.test_between_time for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43556) Enable DataFrameSlowTests.test_describe for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43556:
---

 Summary: Enable DataFrameSlowTests.test_describe for pandas 2.0.0.
 Key: SPARK-43556
 URL: https://issues.apache.org/jira/browse/SPARK-43556
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable DataFrameSlowTests.test_describe for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43555) Enable GroupByTests.test_groupby_multiindex_columns for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43555:
---

 Summary: Enable GroupByTests.test_groupby_multiindex_columns for 
pandas 2.0.0.
 Key: SPARK-43555
 URL: https://issues.apache.org/jira/browse/SPARK-43555
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable GroupByTests.test_groupby_multiindex_columns for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43554) Enable GroupByTests.test_basic_stat_funcs for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43554:
---

 Summary: Enable GroupByTests.test_basic_stat_funcs for pandas 
2.0.0.
 Key: SPARK-43554
 URL: https://issues.apache.org/jira/browse/SPARK-43554
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable GroupByTests.test_basic_stat_funcs for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43553) Enable GroupByTests.test_mad for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43553:
---

 Summary: Enable GroupByTests.test_mad for pandas 2.0.0.
 Key: SPARK-43553
 URL: https://issues.apache.org/jira/browse/SPARK-43553
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable GroupByTests.test_mad for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43552) Enable GroupByTests.test_nth for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43552:
---

 Summary: Enable GroupByTests.test_nth for pandas 2.0.0.
 Key: SPARK-43552
 URL: https://issues.apache.org/jira/browse/SPARK-43552
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable GroupByTests.test_nth for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43551) Enable GroupByTests.test_prod for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43551:
---

 Summary: Enable GroupByTests.test_prod for pandas 2.0.0.
 Key: SPARK-43551
 URL: https://issues.apache.org/jira/browse/SPARK-43551
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable GroupByTests.test_prod for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43550) Enable SeriesTests.test_factorize for pandas 2.0.0.

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43550:
---

 Summary: Enable SeriesTests.test_factorize for pandas 2.0.0.
 Key: SPARK-43550
 URL: https://issues.apache.org/jira/browse/SPARK-43550
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable SeriesTests.test_factorize for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43549) Assign a name to the error class _LEGACY_ERROR_TEMP_0035

2023-05-17 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723779#comment-17723779
 ] 

BingKun Pan commented on SPARK-43549:
-

I will work on it.

> Assign a name to the error class _LEGACY_ERROR_TEMP_0035
> 
>
> Key: SPARK-43549
> URL: https://issues.apache.org/jira/browse/SPARK-43549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43549) Assign a name to the error class _LEGACY_ERROR_TEMP_0035

2023-05-17 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43549:
---

 Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_0035
 Key: SPARK-43549
 URL: https://issues.apache.org/jira/browse/SPARK-43549
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43547) Update "Supported Pandas API" page to point out the proper pandas docs

2023-05-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43547:


Assignee: Haejoon Lee

> Update "Supported Pandas API" page to point out the proper pandas docs
> --
>
> Key: SPARK-43547
> URL: https://issues.apache.org/jira/browse/SPARK-43547
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> [https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/supported_pandas_api.html#supported-pandas-api]
>  points to the wrong pandas version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43547) Update "Supported Pandas API" page to point out the proper pandas docs

2023-05-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43547.
--
Fix Version/s: 3.4.1
   Resolution: Fixed

Issue resolved by pull request 41208
[https://github.com/apache/spark/pull/41208]

> Update "Supported Pandas API" page to point out the proper pandas docs
> --
>
> Key: SPARK-43547
> URL: https://issues.apache.org/jira/browse/SPARK-43547
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.1
>
>
> [https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/supported_pandas_api.html#supported-pandas-api]
>  points to the wrong pandas version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40964) Cannot run spark history server with shaded hadoop jar

2023-05-17 Thread Shuaipeng Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723777#comment-17723777
 ] 

Shuaipeng Lee edited comment on SPARK-40964 at 5/18/23 2:53 AM:


Thanks for your commits. I rebuilt hadoop-client-api and can now start the history 
server successfully.

I changed the pom.xml of 
hadoop-3.3.1-src/hadoop-client-modules/hadoop-client-api and deleted the following 
config:

<relocation>
  <pattern>javax/servlet/</pattern>
  <shadedPattern>${shaded.dependency.prefix}.javax.servlet.</shadedPattern>
  <excludes>
    <exclude>**/pom.xml</exclude>
  </excludes>
</relocation>

Then I rebuilt hadoop-client-api:

mvn package -DskipTests



was (Author: bigboy001):
Thanks for your commits. I rebuilt hadoop-client-api and can now start the history 
server successfully.

I changed the pom.xml of 
hadoop-3.3.1-src/hadoop-client-modules/hadoop-client-api and deleted the following 
config:

```xml
<relocation>
  <pattern>javax/servlet/</pattern>
  <shadedPattern>${shaded.dependency.prefix}.javax.servlet.</shadedPattern>
  <excludes>
    <exclude>**/pom.xml</exclude>
  </excludes>
</relocation>
```

Then I rebuilt hadoop-client-api:

```shell
mvn package -DskipTests
```

> Cannot run spark history server with shaded hadoop jar
> --
>
> Key: SPARK-40964
> URL: https://issues.apache.org/jira/browse/SPARK-40964
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.2.2
>Reporter: YUBI LEE
>Priority: Major
>
> Since SPARK-33212, Spark uses shaded client jars from Hadoop 3.x+.
> If you try to start the Spark History Server with the shaded client jars and enable 
> security using 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter, you 
> will hit the following exception.
> {code}
> # spark-env.sh
> export 
> SPARK_HISTORY_OPTS='-Dspark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter
>  
> -Dspark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=kerberos,kerberos.principal=HTTP/some.example@example.com,kerberos.keytab=/etc/security/keytabs/spnego.service.keytab"'
> {code}
> {code}
> # spark history server's out file
> 22/10/27 15:29:48 INFO AbstractConnector: Started 
> ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
> 22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' 
> on port 18081.
> 22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter
> 22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
> java.lang.IllegalStateException: class 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter is not 
> a javax.servlet.Filter
> at 
> org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
> at 
> org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
> at 
> org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
> at 
> java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
> at 
> java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
> at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
> at 
> org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:755)
> at 
> org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:379)
> at 
> org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:910)
> at 
> org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:288)
> at 
> org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
> at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:491)
> at org.apache.spark.ui.WebUI.$anonfun$bind$3(WebUI.scala:148)
> at org.apache.spark.ui.WebUI.$anonfun$bind$3$adapted(WebUI.scala:148)
> at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at org.apache.spark.ui.WebUI.bind(WebUI.scala:148)
> at 
> org.apache.spark.deploy.history.HistoryServer.bind(HistoryServer.scala:164)
> at 
> org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:310)
> at 
> org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
> {code}
> I think "AuthenticationFilter" in the shaded jar imports 
> "org.apache.hadoop.shaded.javax.servlet.Filter", not "javax.servlet.Filter".
> {code}
> ❯ grep -r org.apache.hadoop.shaded.javax.servlet.Filter *
> Binary file hadoop-client-runtime-3.3.1.jar matches
> {code}
> It causes the exception I mentioned.
> I'm not sure what is the best answer.
> Worka

[jira] [Commented] (SPARK-40964) Cannot run spark history server with shaded hadoop jar

2023-05-17 Thread Shuaipeng Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723777#comment-17723777
 ] 

Shuaipeng Lee commented on SPARK-40964:
---

Thanks for your commits. I rebuilt hadoop-client-api and can now start the history 
server successfully.

I changed the pom.xml of 
hadoop-3.3.1-src/hadoop-client-modules/hadoop-client-api and deleted the following 
config:

```xml
<relocation>
  <pattern>javax/servlet/</pattern>
  <shadedPattern>${shaded.dependency.prefix}.javax.servlet.</shadedPattern>
  <excludes>
    <exclude>**/pom.xml</exclude>
  </excludes>
</relocation>
```

Then I rebuilt hadoop-client-api:

```shell
mvn package -DskipTests
```

> Cannot run spark history server with shaded hadoop jar
> --
>
> Key: SPARK-40964
> URL: https://issues.apache.org/jira/browse/SPARK-40964
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.2.2
>Reporter: YUBI LEE
>Priority: Major
>
> Since SPARK-33212, Spark uses shaded client jars from Hadoop 3.x+.
> If you try to start the Spark History Server with the shaded client jars and enable 
> security using 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter, you 
> will hit the following exception.
> {code}
> # spark-env.sh
> export 
> SPARK_HISTORY_OPTS='-Dspark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter
>  
> -Dspark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=kerberos,kerberos.principal=HTTP/some.example@example.com,kerberos.keytab=/etc/security/keytabs/spnego.service.keytab"'
> {code}
> {code}
> # spark history server's out file
> 22/10/27 15:29:48 INFO AbstractConnector: Started 
> ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
> 22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' 
> on port 18081.
> 22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter
> 22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
> java.lang.IllegalStateException: class 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter is not 
> a javax.servlet.Filter
> at 
> org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
> at 
> org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
> at 
> org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
> at 
> java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
> at 
> java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
> at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
> at 
> org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:755)
> at 
> org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:379)
> at 
> org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:910)
> at 
> org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:288)
> at 
> org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
> at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:491)
> at org.apache.spark.ui.WebUI.$anonfun$bind$3(WebUI.scala:148)
> at org.apache.spark.ui.WebUI.$anonfun$bind$3$adapted(WebUI.scala:148)
> at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at org.apache.spark.ui.WebUI.bind(WebUI.scala:148)
> at 
> org.apache.spark.deploy.history.HistoryServer.bind(HistoryServer.scala:164)
> at 
> org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:310)
> at 
> org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
> {code}
> I think "AuthenticationFilter" in the shaded jar imports 
> "org.apache.hadoop.shaded.javax.servlet.Filter", not "javax.servlet.Filter".
> {code}
> ❯ grep -r org.apache.hadoop.shaded.javax.servlet.Filter *
> Binary file hadoop-client-runtime-3.3.1.jar matches
> {code}
> It causes the exception I mentioned.
> I'm not sure what the best answer is.
> A workaround is not to use the Spark build pre-built for Apache Hadoop, and instead 
> specify `HADOOP_HOME` or `SPARK_DIST_CLASSPATH` in spark-env.sh for the Spark History 
> Server (see the sketch after the list below).
> Maybe the possible options are:
> - Not to shade "javax.servlet.Filter" in the hadoop shaded jar
> - Or, shade "javax.servlet.Filter" in Jetty as well.
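> A minimal sketch of that workaround (the paths here are assumptions, not from this 
> report):
> {code}
> # spark-env.sh for the History Server: point Spark at unshaded Hadoop client jars
> export HADOOP_HOME=/opt/hadoop-3.3.1
> export SPARK_DIST_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)
> {code}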



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-

[jira] [Updated] (SPARK-43548) Remove workaround for HADOOP-16255

2023-05-17 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-43548:

Component/s: Structured Streaming
 (was: SQL)

> Remove workaround for HADOOP-16255
> --
>
> Key: SPARK-43548
> URL: https://issues.apache.org/jira/browse/SPARK-43548
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43548) Remove workaround for HADOOP-16255

2023-05-17 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43548:
---

 Summary: Remove workaround for HADOOP-16255
 Key: SPARK-43548
 URL: https://issues.apache.org/jira/browse/SPARK-43548
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43488) bitmap function

2023-05-17 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723772#comment-17723772
 ] 

Jia Fan commented on SPARK-43488:
-

Hi, [~cloud_fan] If we want to achieve this feature, should we implement a new 
datatype like BitMap, so that BitMap can use RoaringBitmap (or just bigint) as the 
data layer? Or should we just use the bigint datatype, so that bitmapBuild(array[int]) 
returns bigint? The second way would be easier; the first way would be more flexible, 
since we could implement different data layers for different array sizes, just like 
`RoaringBitmap`.

I want to implement this feature, but I'm not sure which plan I should choose.
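
For context, a minimal sketch (my assumption: the semantics would sit on the 
org.roaringbitmap library, which ClickHouse-style bitmap functions are typically 
built on) of what the data layer could look like:
{code:scala}
// Hypothetical sketch of bitmapBuild / bitmapAnd / bitmapAndCardinality
// semantics on the data layer; not a proposed Spark API.
import org.roaringbitmap.RoaringBitmap

val a = RoaringBitmap.bitmapOf(1, 2, 3, 5)  // bitmapBuild(array(1, 2, 3, 5))
val b = RoaringBitmap.bitmapOf(3, 5, 7)     // bitmapBuild(array(3, 5, 7))
val ab = RoaringBitmap.and(a, b)            // bitmapAnd
println(ab.getCardinality)                  // bitmapAndCardinality => 2
{code}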

> bitmap function
> ---
>
> Key: SPARK-43488
> URL: https://issues.apache.org/jira/browse/SPARK-43488
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: yiku123
>Priority: Major
>
> Maybe Spark needs some bitmap functions, for example bitmapBuild, bitmapAnd, and 
> bitmapAndCardinality like in ClickHouse or other OLAP engines. This is often used 
> in user-profiling applications, but I don't find it in Spark.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43503) Deserialisation Failure on State Store Schema Evolution (Spark Structured Streaming)

2023-05-17 Thread Varun Arora (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Arora updated SPARK-43503:

Priority: Critical  (was: Major)

> Deserialisation Failure on State Store Schema Evolution (Spark Structured 
> Streaming)
> 
>
> Key: SPARK-43503
> URL: https://issues.apache.org/jira/browse/SPARK-43503
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.1
>Reporter: Varun Arora
>Priority: Critical
>
> In a streaming query, state is persisted in RocksDB using the mapGroupsWithState 
> function. We use Encoders.bean to serialise the state and store it in the State 
> Store. Code snippet:
>  
> {code:java}
> df
> .groupByKey((MapFunction) event -> 
> event.getAs("stateGroupingId"), Encoders.STRING())
> .mapGroupsWithState(mapGroupsWithStateFunction, 
> Encoders.bean(StateInfo.class), Encoders.bean(StateOutput.class), 
> GroupStateTimeout.ProcessingTimeTimeout()); {code}
> As per the above example, the StateInfo bean contains the state information that is 
> stored in the State Store. However, after adding/removing a field from the StateInfo 
> bean and re-running the query, we get a deserialisation exception. Is there a way to 
> handle this scenario, or to provide custom deserialisation to handle schema 
> evolution?
> Exception :-
> {code:java}
> Stack: [0x7cd0a000,0x7ce0a000],  sp=0x7ce08400,  free 
> space=1017k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> V  [libjvm.dylib+0x57c2b5]  Unsafe_GetLong+0x55
> J 8700  sun.misc.Unsafe.getLong(Ljava/lang/Object;J)J (0 bytes) @ 
> 0x00010fe9e6be [0x00010fe9e600+0xbe]
> j  org.apache.spark.unsafe.Platform.getLong(Ljava/lang/Object;J)J+5
> j  
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.pointTo(Ljava/lang/Object;JI)V+2
> j  
> org.apache.spark.sql.catalyst.expressions.UnsafeMapData.pointTo(Ljava/lang/Object;JI)V+187
> j  
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.getMap(I)Lorg/apache/spark/sql/catalyst/expressions/UnsafeMapData;+52
> j  
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.getMap(I)Lorg/apache/spark/sql/catalyst/util/MapData;+2
> j  
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.MapObjects_0$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificSafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/util/ArrayData;+53
> j  
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.StaticInvoke_0$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificSafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/util/Map;+14
> j  
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.initializeJavaBean_0_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificSafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/example/streaming/StateInfo;)V+2
> j  
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Ljava/lang/Object;)Ljava/lang/Object;+74
> j  
> org.apache.spark.sql.execution.ObjectOperator$.$anonfun$deserializeRowToObject$1(Lorg/apache/spark/sql/catalyst/expressions/package$Projection;Lorg/apache/spark/sql/catalyst/expressions/Expression;Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/lang/Object;+2
> j  
> org.apache.spark.sql.execution.ObjectOperator$$$Lambda$2600.apply(Ljava/lang/Object;)Ljava/lang/Object;+12
> j  
> org.apache.spark.sql.execution.streaming.state.FlatMapGroupsWithStateExecHelper$StateManagerImplBase.getStateObject(Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;)Ljava/lang/Object;+9
> j  
> org.apache.spark.sql.execution.streaming.state.FlatMapGroupsWithStateExecHelper$StateManagerImplBase.getState(Lorg/apache/spark/sql/execution/streaming/state/StateStore;Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;)Lorg/apache/spark/sql/execution/streaming/state/FlatMapGroupsWithStateExecHelper$StateData;+16
> j  
> org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec$InputProcessor.$anonfun$processNewData$1(Lorg/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec$InputProcessor;Lscala/Tuple2;)Lscala/collection/GenTraversableOnce;+45
> j  
> org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec$InputProcessor$$Lambda$3237.apply(Ljava/lang/Object;)Ljava/lang/Object;+8
> J 5928 C2 scala.collection.Iterator$$anon$11.hasNext()Z (35 bytes) @ 
> 0x00010e9a6a58 [0x00010e9a6620+0x438]
> j  org.apache.spark.util.CompletionIterator.hasNext()Z+4
> j  scala.collection.Iterator$ConcatIterator.hasNext()Z+22
> j  org.apache.spark.util.CompletionIterator.hasNext()Z+4
> J 2875 C1 scala.collection.It

[jira] [Assigned] (SPARK-43022) protobuf functions

2023-05-17 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-43022:


Assignee: Yang Jie

> protobuf functions
> --
>
> Key: SPARK-43022
> URL: https://issues.apache.org/jira/browse/SPARK-43022
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43022) protobuf functions

2023-05-17 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-43022.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40654
[https://github.com/apache/spark/pull/40654]

> protobuf functions
> --
>
> Key: SPARK-43022
> URL: https://issues.apache.org/jira/browse/SPARK-43022
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43542) Define a new error class and apply for the case where streaming query fails due to concurrent run of streaming query with same checkpoint

2023-05-17 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-43542:
-
Affects Version/s: 3.5.0
   (was: 1.6.3)

> Define a new error class and apply for the case where streaming query fails 
> due to concurrent run of streaming query with same checkpoint
> -
>
> Key: SPARK-43542
> URL: https://issues.apache.org/jira/browse/SPARK-43542
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Eric Marnadi
>Priority: Major
>
> We are migrating to a new error framework in order to surface errors in a 
> friendlier way to customers. This PR defines a new error class specifically 
> for when there are concurrent updates to the log for the same batch ID.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43547) Update "Supported Pandas API" page to point out the proper pandas docs

2023-05-17 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43547:
---

 Summary: Update "Supported Pandas API" page to point out the 
proper pandas docs
 Key: SPARK-43547
 URL: https://issues.apache.org/jira/browse/SPARK-43547
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


[https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/supported_pandas_api.html#supported-pandas-api]
 currently points to the wrong pandas version; it should point to the proper pandas docs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43544) Fix nested MapType behavior in Pandas UDF

2023-05-17 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-43544:
-
Summary: Fix nested MapType behavior in Pandas UDF  (was: Standardize 
nested non-atomic input type support in Pandas UDF)

> Fix nested MapType behavior in Pandas UDF
> -
>
> Key: SPARK-43544
> URL: https://issues.apache.org/jira/browse/SPARK-43544
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43509) Support creating multiple sessions for Spark Connect in PySpark

2023-05-17 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723755#comment-17723755
 ] 

Ignite TC Bot commented on SPARK-43509:
---

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/41206

> Support creating multiple sessions for Spark Connect in PySpark
> ---
>
> Key: SPARK-43509
> URL: https://issues.apache.org/jira/browse/SPARK-43509
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43546) Complete Pandas UDF parity tests

2023-05-17 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43546:


 Summary: Complete Pandas UDF parity tests
 Key: SPARK-43546
 URL: https://issues.apache.org/jira/browse/SPARK-43546
 Project: Spark
  Issue Type: Test
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng


Tests as shown below should be added to Connect.

test_pandas_udf_grouped_agg.py
test_pandas_udf_scalar.py
test_pandas_udf_window.py



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43545) Remove outdated UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION

2023-05-17 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43545:


 Summary: Remove outdated UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION
 Key: SPARK-43545
 URL: https://issues.apache.org/jira/browse/SPARK-43545
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43544) Standardize nested non-atomic input type support in Pandas UDF

2023-05-17 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43544:


 Summary: Standardize nested non-atomic input type support in 
Pandas UDF
 Key: SPARK-43544
 URL: https://issues.apache.org/jira/browse/SPARK-43544
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43543) Standardize Nested Complex DataTypes Support

2023-05-17 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43543:


 Summary: Standardize Nested Complex DataTypes Support
 Key: SPARK-43543
 URL: https://issues.apache.org/jira/browse/SPARK-43543
 Project: Spark
  Issue Type: Umbrella
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43436) Upgrade rocksdbjni to 8.1.1.1

2023-05-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43436.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41122
[https://github.com/apache/spark/pull/41122]

> Upgrade rocksdbjni to 8.1.1.1
> -
>
> Key: SPARK-43436
> URL: https://issues.apache.org/jira/browse/SPARK-43436
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>
> https://github.com/facebook/rocksdb/releases/tag/v8.1.1



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43436) Upgrade rocksdbjni to 8.1.1.1

2023-05-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43436:
-

Assignee: Yang Jie

> Upgrade rocksdbjni to 8.1.1.1
> -
>
> Key: SPARK-43436
> URL: https://issues.apache.org/jira/browse/SPARK-43436
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> https://github.com/facebook/rocksdb/releases/tag/v8.1.1



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43542) Define a new error class and apply for the case where streaming query fails due to concurrent run of streaming query with same checkpoint

2023-05-17 Thread Eric Marnadi (Jira)
Eric Marnadi created SPARK-43542:


 Summary: Define a new error class and apply for the case where 
streaming query fails due to concurrent run of streaming query with same 
checkpoint
 Key: SPARK-43542
 URL: https://issues.apache.org/jira/browse/SPARK-43542
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 1.6.3
Reporter: Eric Marnadi


We are migrating to a new error framework in order to surface errors in a 
friendlier way to customers. This PR defines a new error class specifically for 
when there are concurrent updates to the log for the same batch ID.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43541) Incorrect column resolution on FULL OUTER JOIN with USING

2023-05-17 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-43541:
-
Description: 
This was tested on Spark 3.3.2 and Spark 3.4.0.

{code}
Causes [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with 
name `aws_dbr_a`.`key` cannot be resolved. Did you mean one of the following? 
[`key`].; line 4, pos 7
{code}


FULL OUTER JOIN with USING and/or the WHERE seems relevant since I can get the 
query to work with any of these modifications. 


{code}
# -- FULL OUTER JOIN
   WITH
   aws_dbr_a AS (select key from values ('a') t(key)),
   gcp_pro_b AS (select key from values ('a') t(key))
   SELECT aws_dbr_a.key
   FROM aws_dbr_a
   FULL OUTER JOIN gcp_pro_b USING (key)
   WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
`aws_dbr_a`.`key` cannot be resolved. Did you mean one of the following? 
[`key`].; line 4 pos 7
# -- INNER JOIN
   WITH
   aws_dbr_a AS (select key from values ('a') t(key)),
   gcp_pro_b AS (select key from values ('a') t(key))
   SELECT aws_dbr_a.key
   FROM aws_dbr_a
   JOIN gcp_pro_b USING (key)
   WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
+-+
| key |
|-|
| a   |
+-+
1 row in set
Time: 0.507s
# -- NO Filter
   WITH
   aws_dbr_a AS (select key from values ('a') t(key)),
   gcp_pro_b AS (select key from values ('a') t(key))
   SELECT aws_dbr_a.key
   FROM aws_dbr_a
   FULL OUTER JOIN gcp_pro_b USING (key);
+-+
| key |
|-|
| a   |
+-+
1 row in set
Time: 1.021s
# -- ON instead of USING
   WITH
   aws_dbr_a AS (select key from values ('a') t(key)),
   gcp_pro_b AS (select key from values ('a') t(key))
   SELECT aws_dbr_a.key
   FROM aws_dbr_a
   FULL OUTER JOIN gcp_pro_b ON aws_dbr_a.key = gcp_pro_b.key
   WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
+-+
| key |
|-|
| a   |
+-+
1 row in set
Time: 0.514s
{code}


  was:
This was tested on Spark 3.3.2 and Spark 3.4.0.

{code}
Causes [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with 
name `aws_dbr_a`.`key` cannot be resolved. Did you mean one of the following? 
[`key`].; line 4, pos 7
{code}


FULL OUTER JOIN with USING and/or the WHERE seems relevant since I can get the 
query to work with any of these modifications. 


{code}
# WITH
   aws_dbr_a AS (select key from values ('a') t(key)),
   gcp_pro_b AS (select key from values ('a') t(key))
   SELECT aws_dbr_a.key
   FROM aws_dbr_a
   FULL OUTER JOIN gcp_pro_b USING (key)
   WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
`aws_dbr_a`.`key` cannot be resolved. Did you mean one of the following? 
[`key`].; line 4 pos 7
# -- INNER JOIN
   WITH
   aws_dbr_a AS (select key from values ('a') t(key)),
   gcp_pro_b AS (select key from values ('a') t(key))
   SELECT aws_dbr_a.key
   FROM aws_dbr_a
   JOIN gcp_pro_b USING (key)
   WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
+-+
| key |
|-|
| a   |
+-+
1 row in set
Time: 0.507s
# -- NO Filter
   WITH
   aws_dbr_a AS (select key from values ('a') t(key)),
   gcp_pro_b AS (select key from values ('a') t(key))
   SELECT aws_dbr_a.key
   FROM aws_dbr_a
   FULL OUTER JOIN gcp_pro_b USING (key);
+-+
| key |
|-|
| a   |
+-+
1 row in set
Time: 1.021s
# -- ON instead of USING
   WITH
   aws_dbr_a AS (select key from values ('a') t(key)),
   gcp_pro_b AS (select key from values ('a') t(key))
   SELECT aws_dbr_a.key
   FROM aws_dbr_a
   FULL OUTER JOIN gcp_pro_b ON aws_dbr_a.key = gcp_pro_b.key
   WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
+-+
| key |
|-|
| a   |
+-+
1 row in set
Time: 0.514s
{code}



> Incorrect column resolution on FULL OUTER JOIN with USING
> -
>
> Key: SPARK-43541
> URL: https://issues.apache.org/jira/browse/SPARK-43541
> Project: Spark
>  Issue Type: Bug
>

[jira] [Created] (SPARK-43541) Incorrect column resolution on FULL OUTER JOIN with USING

2023-05-17 Thread Max Gekk (Jira)
Max Gekk created SPARK-43541:


 Summary: Incorrect column resolution on FULL OUTER JOIN with USING
 Key: SPARK-43541
 URL: https://issues.apache.org/jira/browse/SPARK-43541
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0, 3.3.2
Reporter: Max Gekk
Assignee: Max Gekk


This was tested on Spark 3.3.2 and Spark 3.4.0.

{code}
Causes [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with 
name `aws_dbr_a`.`key` cannot be resolved. Did you mean one of the following? 
[`key`].; line 4, pos 7
{code}


FULL OUTER JOIN with USING and/or the WHERE seems relevant since I can get the 
query to work with any of these modifications. 


{code}
# WITH
   aws_dbr_a AS (select key from values ('a') t(key)),
   gcp_pro_b AS (select key from values ('a') t(key))
   SELECT aws_dbr_a.key
   FROM aws_dbr_a
   FULL OUTER JOIN gcp_pro_b USING (key)
   WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
`aws_dbr_a`.`key` cannot be resolved. Did you mean one of the following? 
[`key`].; line 4 pos 7
# -- INNER JOIN
   WITH
   aws_dbr_a AS (select key from values ('a') t(key)),
   gcp_pro_b AS (select key from values ('a') t(key))
   SELECT aws_dbr_a.key
   FROM aws_dbr_a
   JOIN gcp_pro_b USING (key)
   WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
+-+
| key |
|-|
| a   |
+-+
1 row in set
Time: 0.507s
# -- NO Filter
   WITH
   aws_dbr_a AS (select key from values ('a') t(key)),
   gcp_pro_b AS (select key from values ('a') t(key))
   SELECT aws_dbr_a.key
   FROM aws_dbr_a
   FULL OUTER JOIN gcp_pro_b USING (key);
+-+
| key |
|-|
| a   |
+-+
1 row in set
Time: 1.021s
# -- ON instead of USING
   WITH
   aws_dbr_a AS (select key from values ('a') t(key)),
   gcp_pro_b AS (select key from values ('a') t(key))
   SELECT aws_dbr_a.key
   FROM aws_dbr_a
   FULL OUTER JOIN gcp_pro_b ON aws_dbr_a.key = gcp_pro_b.key
   WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
+-+
| key |
|-|
| a   |
+-+
1 row in set
Time: 0.514s
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43537) Upgrade the asm deps in the tools module to 9.4

2023-05-17 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-43537.
--
Fix Version/s: 3.5.0
 Assignee: Yang Jie
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/41198

> Upgrade the asm deps in the tools module to 9.4
> ---
>
> Key: SPARK-43537
> URL: https://issues.apache.org/jira/browse/SPARK-43537
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43537) Upgrade the asm deps in the tools module to 9.4

2023-05-17 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-43537:
-
Priority: Minor  (was: Major)

> Upgrade the asm deps in the tools module to 9.4
> ---
>
> Key: SPARK-43537
> URL: https://issues.apache.org/jira/browse/SPARK-43537
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43540) Add working directory into classpath on the driver in K8S cluster mode

2023-05-17 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43540:
-
Summary: Add working directory into classpath on the driver in K8S cluster 
mode  (was: Add current working directory into classpath on the driver in K8S 
cluster mode)

> Add working directory into classpath on the driver in K8S cluster mode
> --
>
> Key: SPARK-43540
> URL: https://issues.apache.org/jira/browse/SPARK-43540
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> In Yarn cluster modes, the passed files/jars are able to be accessed in the 
> classloader. Looks like this is not the case in Kubernetes cluster mode.
> After SPARK-33782, spark.files and spark.jars are placed under the current 
> working directory on the driver in K8S cluster mode, but they do not seem to be 
> accessible to the classloader.
>  
> We need to add the current working directory into the classpath.
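>  
> As a stop-gap sketch (an assumption, not the proposed fix; the master URL, app 
> jar and main class below are hypothetical), the working directory can already be 
> put on the driver classpath explicitly at submit time through the existing 
> spark.driver.extraClassPath setting, for example via SparkLauncher:
> {code:java}
> import org.apache.spark.launcher.SparkLauncher;
> 
> SparkLauncher launcher = new SparkLauncher()
>     .setMaster("k8s://https://kubernetes.default.svc")    // hypothetical API server
>     .setDeployMode("cluster")
>     .setAppResource("local:///opt/spark/app/my-app.jar")  // hypothetical app jar
>     .setMainClass("com.example.Main")                     // hypothetical main class
>     .setConf("spark.driver.extraClassPath", ".");         // "." = driver working directory
> // launcher.launch() then submits with the working directory on the classpath,
> // which is what this issue proposes to make the default.
> {code}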



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43540) Add current working directory into classpath on the driver in K8S cluster mode

2023-05-17 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43540:
-
Description: 
In Yarn cluster modes, the passed files/jars are able to be accessed in the 
classloader. Looks like this is not the case in Kubernetes cluster mode.

After SPARK-33782, spark.files and spark.jars are placed under the current 
working directory on the driver in K8S cluster mode, but they do not seem to be 
accessible to the classloader.

 

We need to add the current working directory into the classpath.

  was:
In Yarn cluster modes, the passed files/jars are able to be accessed in the 
classloader. Looks like this is not the case in Kubernetes cluster mode.

After SPARK-33782, spark.files and spark.jars are placed under the current 
working directory on the driver in K8S cluster mode, but they do not seem to be 
accessible to the classloader.

 

We need to add the current working directory to the classpath.


> Add current working directory into classpath on the driver in K8S cluster mode
> --
>
> Key: SPARK-43540
> URL: https://issues.apache.org/jira/browse/SPARK-43540
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> In Yarn cluster modes, the passed files/jars are able to be accessed in the 
> classloader. Looks like this is not the case in Kubernetes cluster mode.
> After SPARK-33782, spark.files and spark.jars are placed under the current 
> working directory on the driver in K8S cluster mode, but they do not seem to be 
> accessible to the classloader.
>  
> We need to add the current working directory into the classpath.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43540) Add current working directory into classpath on the driver in K8S cluster mode

2023-05-17 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43540:
-
Description: 
In Yarn cluster modes, the passed files/jars are able to be accessed in the 
classloader. Looks like this is not the case in Kubernetes cluster mode.

After SPARK-33782, spark.files and spark.jars are placed under the current 
working directory on the driver in K8S cluster mode, but they do not seem to be 
accessible to the classloader.

 

We need to add the current working directory to the classpath.

  was:
In Yarn cluster modes, the passed files/jars are able to be accessed in the 
classloader. Looks like this is not the case in Kubernetes cluster mode.

After SPARK-33782, for  Kubernetes cluster mode, it places 


> Add current working directory into classpath on the driver in K8S cluster mode
> --
>
> Key: SPARK-43540
> URL: https://issues.apache.org/jira/browse/SPARK-43540
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> In Yarn cluster modes, the passed files/jars are able to be accessed in the 
> classloader. Looks like this is not the case in Kubernetes cluster mode.
> After SPARK-33782, spark.files and spark.jars are placed under the current 
> working directory on the driver in K8S cluster mode, but they do not seem to be 
> accessible to the classloader.
>  
> We need to add the current working directory to the classpath.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43540) Add current working directory into classpath on the driver in K8S cluster mode

2023-05-17 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43540:
-
Description: 
In Yarn cluster modes, the passed files/jars are able to be accessed in the 
classloader. Looks like this is not the case in Kubernetes cluster mode.

After SPARK-33782, for  Kubernetes cluster mode, it places 

  was:
In Yarn cluster modes, the passed files/jars are able to be accessed in the 
classloader. Looks like this is not the case in Kubernetes cluster mode.

 


> Add current working directory into classpath on the driver in K8S cluster mode
> --
>
> Key: SPARK-43540
> URL: https://issues.apache.org/jira/browse/SPARK-43540
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> In Yarn cluster modes, the passed files/jars are able to be accessed in the 
> classloader. Looks like this is not the case in Kubernetes cluster mode.
> After SPARK-33782, for  Kubernetes cluster mode, it places 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43540) Add current working directory into classpath on the driver in K8S cluster mode

2023-05-17 Thread Fei Wang (Jira)
Fei Wang created SPARK-43540:


 Summary: Add current working directory into classpath on the 
driver in K8S cluster mode
 Key: SPARK-43540
 URL: https://issues.apache.org/jira/browse/SPARK-43540
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Fei Wang


In Yarn cluster modes, the passed files/jars are able to be accessed in the 
classloader. Looks like this is not the case in Kubernetes cluster mode.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43537) Upgrade the asm deps in the tools module to 9.4

2023-05-17 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723478#comment-17723478
 ] 

GridGain Integration commented on SPARK-43537:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/41198

> Upgrade the asm deps in the tools module to 9.4
> ---
>
> Key: SPARK-43537
> URL: https://issues.apache.org/jira/browse/SPARK-43537
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43535) Adjust the ImportOrderChecker rule to resolve long-standing import issues

2023-05-17 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-43535.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41184
[https://github.com/apache/spark/pull/41184]

> Adjust the ImportOrderChecker rule to resolve long-standing import issues
> -
>
> Key: SPARK-43535
> URL: https://issues.apache.org/jira/browse/SPARK-43535
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43535) Adjust the ImportOrderChecker rule to resolve long-standing import issues

2023-05-17 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-43535:


Assignee: BingKun Pan

> Adjust the ImportOrderChecker rule to resolve long-standing import issues
> -
>
> Key: SPARK-43535
> URL: https://issues.apache.org/jira/browse/SPARK-43535
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43538) Spark Homebrew Formulae currently depends on non-officially-supported Java 20

2023-05-17 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723433#comment-17723433
 ] 

Yuming Wang edited comment on SPARK-43538 at 5/17/23 12:36 PM:
---

Yes. I think so: https://github.com/Homebrew/homebrew-core/pull/131189


was (Author: q79969786):
Yes. I think so.

> Spark Homebrew Formulae currently depends on non-officially-supported Java 20
> -
>
> Key: SPARK-43538
> URL: https://issues.apache.org/jira/browse/SPARK-43538
> Project: Spark
>  Issue Type: Request
>  Components: Java API
>Affects Versions: 3.2.4, 3.3.2, 3.4.0
> Environment: Homebrew (e.g., macOS)
>Reporter: Ghislain Fourny
>Priority: Minor
>
> I am not sure whether Homebrew-related issues can also be reported here. The 
> Homebrew formula for apache-spark runs on (latest) openjdk 20.
> [https://formulae.brew.sh/formula/apache-spark]
> However, Apache Spark is documented to work with Java 8/11/17:
> [https://spark.apache.org/docs/latest/]
> Is this an oversight, or is Java 20 officially supported, too?
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43538) Spark Homebrew Formulae currently depends on non-officially-supported Java 20

2023-05-17 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723433#comment-17723433
 ] 

Yuming Wang commented on SPARK-43538:
-

Yes. I think so.

> Spark Homebrew Formulae currently depends on non-officially-supported Java 20
> -
>
> Key: SPARK-43538
> URL: https://issues.apache.org/jira/browse/SPARK-43538
> Project: Spark
>  Issue Type: Request
>  Components: Java API
>Affects Versions: 3.2.4, 3.3.2, 3.4.0
> Environment: Homebrew (e.g., macOS)
>Reporter: Ghislain Fourny
>Priority: Minor
>
> I am not sure whether Homebrew-related issues can also be reported here. The 
> Homebrew formula for apache-spark runs on (latest) openjdk 20.
> [https://formulae.brew.sh/formula/apache-spark]
> However, Apache Spark is documented to work with Java 8/11/17:
> [https://spark.apache.org/docs/latest/]
> Is this an oversight, or is Java 20 officially supported, too?
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43538) Spark Homebrew Formulae currently depends on non-officially-supported Java 20

2023-05-17 Thread Ghislain Fourny (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723430#comment-17723430
 ] 

Ghislain Fourny edited comment on SPARK-43538 at 5/17/23 12:04 PM:
---

Thanks, Yuming Wang! Does it mean the apache-spark Homebrew Formulae should 
then be adapted to openjdk@17 (or 8 or 11) to avoid unpredictable behavior?


was (Author: JIRAUSER300463):
Thanks, Yuming Wang! Does it mean the Homebrew Formulae should then be adapted 
to openjdk@17 (or 8 or 11) to avoid unpredictable behavior?

> Spark Homebrew Formulae currently depends on non-officially-supported Java 20
> -
>
> Key: SPARK-43538
> URL: https://issues.apache.org/jira/browse/SPARK-43538
> Project: Spark
>  Issue Type: Request
>  Components: Java API
>Affects Versions: 3.2.4, 3.3.2, 3.4.0
> Environment: Homebrew (e.g., macOS)
>Reporter: Ghislain Fourny
>Priority: Minor
>
> I am not sure whether Homebrew-related issues can also be reported here. The 
> Homebrew formula for apache-spark runs on (latest) openjdk 20.
> [https://formulae.brew.sh/formula/apache-spark]
> However, Apache Spark is documented to work with Java 8/11/17:
> [https://spark.apache.org/docs/latest/]
> Is this an oversight, or is Java 20 officially supported, too?
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43538) Spark Homebrew Formulae currently depends on non-officially-supported Java 20

2023-05-17 Thread Ghislain Fourny (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723430#comment-17723430
 ] 

Ghislain Fourny commented on SPARK-43538:
-

Thanks, Yuming Wang! Does it mean the Homebrew Formulae should then be adapted 
to openjdk@17 (or 8 or 11) to avoid unpredictable behavior?

> Spark Homebrew Formulae currently depends on non-officially-supported Java 20
> -
>
> Key: SPARK-43538
> URL: https://issues.apache.org/jira/browse/SPARK-43538
> Project: Spark
>  Issue Type: Request
>  Components: Java API
>Affects Versions: 3.2.4, 3.3.2, 3.4.0
> Environment: Homebrew (e.g., macOS)
>Reporter: Ghislain Fourny
>Priority: Minor
>
> I am not sure whether Homebrew-related issues can also be reported here. The 
> Homebrew formula for apache-spark runs on (latest) openjdk 20.
> [https://formulae.brew.sh/formula/apache-spark]
> However, Apache Spark is documented to work with Java 8/11/17:
> [https://spark.apache.org/docs/latest/]
> Is this an oversight, or is Java 20 officially supported, too?
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43538) Spark Homebrew Formulae currently depends on non-officially-supported Java 20

2023-05-17 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723429#comment-17723429
 ] 

Yuming Wang commented on SPARK-43538:
-

We have not tested on Java 20, because Java 20 is not LTS.

> Spark Homebrew Formulae currently depends on non-officially-supported Java 20
> -
>
> Key: SPARK-43538
> URL: https://issues.apache.org/jira/browse/SPARK-43538
> Project: Spark
>  Issue Type: Request
>  Components: Java API
>Affects Versions: 3.2.4, 3.3.2, 3.4.0
> Environment: Homebrew (e.g., macOS)
>Reporter: Ghislain Fourny
>Priority: Minor
>
> I am not sure whether Homebrew-related issues can also be reported here. The 
> Homebrew formula for apache-spark runs on (latest) openjdk 20.
> [https://formulae.brew.sh/formula/apache-spark]
> However, Apache Spark is documented to work with Java 8/11/17:
> [https://spark.apache.org/docs/latest/]
> Is this an oversight, or is Java 20 officially supported, too?
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43539) Assign a name to the error class _LEGACY_ERROR_TEMP_0003

2023-05-17 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43539:
---

 Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_0003
 Key: SPARK-43539
 URL: https://issues.apache.org/jira/browse/SPARK-43539
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43536) Statsd sink reporter reports incorrect counter metrics.

2023-05-17 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723415#comment-17723415
 ] 

Ignite TC Bot commented on SPARK-43536:
---

User 'venkateshbalaji99' has created a pull request for this issue:
https://github.com/apache/spark/pull/41199

> Statsd sink reporter reports incorrect counter metrics.
> ---
>
> Key: SPARK-43536
> URL: https://issues.apache.org/jira/browse/SPARK-43536
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.3
>Reporter: Abhishek Modi
>Priority: Major
>
> There is a mismatch in the definition of counter metrics between Dropwizard 
> (which is used by Spark) and statsD. While Dropwizard interprets counters as 
> cumulative metrics, statsD interprets them as delta metrics. This causes double 
> aggregation on the statsD side, resulting in inconsistent metrics.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43538) Spark Homebrew Formulae currently depends on non-officially-supported Java 20

2023-05-17 Thread Ghislain Fourny (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ghislain Fourny updated SPARK-43538:

Summary: Spark Homebrew Formulae currently depends on 
non-officially-supported Java 20  (was: Spark Homebrew recipe currently depends 
on non-officially-supported Java 20)

> Spark Homebrew Formulae currently depends on non-officially-supported Java 20
> -
>
> Key: SPARK-43538
> URL: https://issues.apache.org/jira/browse/SPARK-43538
> Project: Spark
>  Issue Type: Request
>  Components: Java API
>Affects Versions: 3.2.4, 3.3.2, 3.4.0
> Environment: Homebrew (e.g., macOS)
>Reporter: Ghislain Fourny
>Priority: Minor
>
> I am not sure whether Homebrew-related issues can also be reported here. The 
> Homebrew formula for apache-spark runs on (latest) openjdk 20.
> [https://formulae.brew.sh/formula/apache-spark]
> However, Apache Spark is documented to work with Java 8/11/17:
> [https://spark.apache.org/docs/latest/]
> Is this an oversight, or is Java 20 officially supported, too?
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43538) Spark Homebrew recipe currently depends on non-officially-supported Java 20

2023-05-17 Thread Ghislain Fourny (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ghislain Fourny updated SPARK-43538:

Summary: Spark Homebrew recipe currently depends on 
non-officially-supported Java 20  (was: Spark Homebrew recipe falls back to 
non-officially-supported Java 20)

> Spark Homebrew recipe currently depends on non-officially-supported Java 20
> ---
>
> Key: SPARK-43538
> URL: https://issues.apache.org/jira/browse/SPARK-43538
> Project: Spark
>  Issue Type: Request
>  Components: Java API
>Affects Versions: 3.2.4, 3.3.2, 3.4.0
> Environment: Homebrew (e.g., macOS)
>Reporter: Ghislain Fourny
>Priority: Minor
>
> I am not sure whether Homebrew-related issues can also be reported here. The 
> Homebrew formula for apache-spark runs on (latest) openjdk 20.
> [https://formulae.brew.sh/formula/apache-spark]
> However, Apache Spark is documented to work with Java 8/11/17:
> [https://spark.apache.org/docs/latest/]
> Is this an oversight, or is Java 20 officially supported, too?
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43538) Spark Homebrew recipe falls back to non-officially-supported Java 20

2023-05-17 Thread Ghislain Fourny (Jira)
Ghislain Fourny created SPARK-43538:
---

 Summary: Spark Homebrew recipe falls back to 
non-officially-supported Java 20
 Key: SPARK-43538
 URL: https://issues.apache.org/jira/browse/SPARK-43538
 Project: Spark
  Issue Type: Request
  Components: Java API
Affects Versions: 3.4.0, 3.3.2, 3.2.4
 Environment: Homebrew (e.g., macOS)
Reporter: Ghislain Fourny


I am not sure whether Homebrew-related issues can also be reported here. The 
Homebrew formula for apache-spark runs on (latest) openjdk 20.

[https://formulae.brew.sh/formula/apache-spark]

However, Apache Spark is documented to work with Java 8/11/17:

[https://spark.apache.org/docs/latest/]

Is this an oversight, or is Java 20 officially supported, too?

Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43537) Upgrade the asm deps in the tools module to 9.4

2023-05-17 Thread Yang Jie (Jira)
Yang Jie created SPARK-43537:


 Summary: Upgrade the asm deps in the tools module to 9.4
 Key: SPARK-43537
 URL: https://issues.apache.org/jira/browse/SPARK-43537
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Affects Versions: 3.5.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43536) Statsd sink reporter reports incorrect counter metrics.

2023-05-17 Thread Abhishek Modi (Jira)
Abhishek Modi created SPARK-43536:
-

 Summary: Statsd sink reporter reports incorrect counter metrics.
 Key: SPARK-43536
 URL: https://issues.apache.org/jira/browse/SPARK-43536
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.3
Reporter: Abhishek Modi


There is a mismatch in the definition of counter metrics between Dropwizard 
(which is used by Spark) and statsD. While Dropwizard interprets counters as 
cumulative metrics, statsD interprets them as delta metrics. This causes double 
aggregation on the statsD side, resulting in inconsistent metrics.
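
A minimal illustration of the mismatch (a sketch of one possible fix direction, 
not Spark's statsd sink code): a reporter targeting statsD needs to emit the 
delta since the previous report rather than the cumulative Dropwizard value, 
otherwise statsD re-aggregates numbers that are already cumulative:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import com.codahale.metrics.Counter;

class DeltaCounters {
    // Last cumulative value reported per metric name.
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    // Returns the increment since the previous report, which is what statsD
    // expects for a counter; Dropwizard's Counter.getCount() is cumulative.
    long deltaFor(String name, Counter counter) {
        long current = counter.getCount();
        long previous = lastSeen.getOrDefault(name, 0L);
        lastSeen.put(name, current);
        return current - previous;
    }
}
{code}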



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43535) Adjust the ImportOrderChecker rule to resolve long-standing import issues

2023-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723389#comment-17723389
 ] 

ASF GitHub Bot commented on SPARK-43535:


User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/41184

> Adjust the ImportOrderChecker rule to resolve long-standing import issues
> -
>
> Key: SPARK-43535
> URL: https://issues.apache.org/jira/browse/SPARK-43535
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43535) Adjust the ImportOrderChecker rule to resolve long-standing import issues

2023-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723390#comment-17723390
 ] 

ASF GitHub Bot commented on SPARK-43535:


User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/41184

> Adjust the ImportOrderChecker rule to resolve long-standing import issues
> -
>
> Key: SPARK-43535
> URL: https://issues.apache.org/jira/browse/SPARK-43535
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-17 Thread caican (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723388#comment-17723388
 ] 

caican edited comment on SPARK-43526 at 5/17/23 9:03 AM:
-

[~yumwang] 
Tpcds tests show performance gains for most queries and we plan to use 
shuffledHashJoin preferentially to eliminate sort consumption when the small 
table meets a certain threshold, but q95 in tpcds has a serious performance 
regression and we are not sure if it can be turned on by default.

with shuffledHashJoin:

!image-2023-05-17-16-53-42-302.png|width=691,height=344!

sortMergeJoin is preferred:

!image-2023-05-17-16-54-59-053.png|width=722,height=319!


was (Author: JIRAUSER280464):
[~yumwang] 
Tpcds tests show performance gains for most queries and we plan to use 
shuffledHashJoin preferentially to eliminate sort consumption when the small 
table meets a certain threshold, but q95 in tpcds has a serious performance 
regression and we are not sure if it can be turned on by default.

with shuffledHashJoin:

!image-2023-05-17-16-53-42-302.png|width=691,height=344!

without shuffledHashJoin:

!image-2023-05-17-16-54-59-053.png|width=722,height=319!

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, 
> image-2023-05-17-16-54-59-053.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it? Thanks!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-17 Thread caican (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723388#comment-17723388
 ] 

caican edited comment on SPARK-43526 at 5/17/23 9:02 AM:
-

[~yumwang] 
Tpcds tests show performance gains for most queries and we plan to use 
shuffledHashJoin preferentially to eliminate sort consumption when the small 
table meets a certain threshold, but q95 in tpcds has a serious performance 
regression and we are not sure if it can be turned on by default.

with shuffledHashJoin:

!image-2023-05-17-16-53-42-302.png|width=691,height=344!

without shuffledHashJoin:

!image-2023-05-17-16-54-59-053.png|width=722,height=319!


was (Author: JIRAUSER280464):
[~yumwang] 
We plan to use shuffledHashJoin preferentially to eliminate sort consumption 
when the small table meets a certain threshold, but q95 in tpcds has a serious 
performance regression and we are not sure if it can be turned on by default.



with shuffledHashJoin:

!image-2023-05-17-16-53-42-302.png|width=691,height=344!

without shuffledHashJoin:

!image-2023-05-17-16-54-59-053.png|width=722,height=319!

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, 
> image-2023-05-17-16-54-59-053.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it? Thanks!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43523) Memory leak in Spark UI

2023-05-17 Thread Amine Bagdouri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amine Bagdouri updated SPARK-43523:
---
Description: 
We have a distributed Spark application running on Azure HDInsight using Spark 
version 2.4.4.

After a few days of active processing on our application, we have noticed that 
the GC CPU time ratio of the driver is close to 100%. We suspected a memory 
leak. Thus, we have produced a heap dump and analyzed it using Eclipse Memory 
Analyzer.

Here is some interesting data from the driver's heap dump (heap size is 8 GB):
 * The estimated retained heap size of String objects (~5M instances) is 3.3 
GB. It seems that most of these instances correspond to spark events.
 * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
 * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
there shouldn't be more than 16 live running jobs since we use a fixed size 
thread pool of 16 threads to run spark queries.
 * The number of LiveTask objects is 485K.
 * The AsyncEventQueue instance associated with the AppStatusListener has a value 
of 854 for the dropped events count and a value of 10001 for the total events count, 
knowing that the dropped events counter is reset every minute and that the 
queue's default capacity is 10000 (spark.scheduler.listenerbus.eventqueue.capacity).

We think that there is a memory leak in Spark UI. Here is our analysis of the 
root cause of this leak:
 * AppStatusListener is notified of Spark events using a bounded queue in 
AsyncEventQueue.
 * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
liveJobs, ...) based on the received events. For example, onTaskStart adds a 
task to liveTasks map and onTaskEnd removes the task from liveTasks map.
 * When the rate of events is very high, the bounded queue in AsyncEventQueue 
is full, some events are dropped and don't make it to AppStatusListener.
 * Dropped events that signal the end of a processing unit prevent the state of 
AppStatusListener from being cleaned. For example, a dropped onTaskEnd event, 
will prevent the task from being removed from liveTasks map, and the task will 
remain in the heap until the driver's JVM is stopped.

We were able to confirm our analysis by reducing the capacity of the 
AsyncEventQueue (spark.scheduler.listenerbus.eventqueue.capacity=10). After 
having launched many spark queries using this config, we observed that the 
number of active jobs in Spark UI increased rapidly and remained high even 
though all submitted queries have completed. We have also noticed that some 
executor task counters in Spark UI were negative, which confirms that 
AppStatusListener state does not accurately reflect the reality and that it can 
be a victim of event drops.

Suggested fix:
There are some limits today on the number of "dead" objects in 
AppStatusListener's maps (for example: spark.ui.retainedJobs). We suggest 
enforcing another configurable limit on the number of total objects in 
AppStatusListener's maps and kvstore. This should limit the leak in the case of 
high events rate, but AppStatusListener stats will remain inaccurate.
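
For reference, a sketch of the configuration knobs involved (values are 
illustrative, not a recommendation): the existing retention limits only bound 
"dead" objects, not the live ones that leak when end-of-unit events are dropped, 
which is why an additional limit on the total number of tracked objects is 
suggested above.
{code:java}
import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    // Small capacity used above to reproduce the event drops (default is 10000).
    .set("spark.scheduler.listenerbus.eventqueue.capacity", "10")
    // Existing limits on completed ("dead") jobs/stages/tasks kept by the UI;
    // they do not cap live objects that are never cleaned up after a dropped event.
    .set("spark.ui.retainedJobs", "1000")
    .set("spark.ui.retainedStages", "1000")
    .set("spark.ui.retainedTasks", "100000");
{code}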

  was:
We have a distributed Spark application running on Azure HDInsight using Spark 
version 2.4.4.

After a few days of active processing on our application, we have noticed that 
the GC CPU time ratio of the driver is close to 100%. We suspected a memory 
leak. Thus, we have produced a heap dump and analyzed it using Eclipse Memory 
Analyzer.

Here is some interesting data from the driver's heap dump (heap size is 8 GB):
 * The estimated retained heap size of String objects (~5M instances) is 3.3 
GB. It seems that most of these instances correspond to spark events.
 * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
 * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
there shouldn't be more than 16 live running jobs since we use a fixed thread 
pool of 16 threads to run spark queries.
 * The number of LiveTask objects is 485K.
 * The AsyncEventQueue instance associated with the AppStatusListener has a value 
of 854 for the dropped events count and a value of 10001 for the total events count, 
knowing that the dropped events counter is reset every minute and that the 
queue's default capacity is 10000 (spark.scheduler.listenerbus.eventqueue.capacity).

We think that there is a memory leak in Spark UI. Here is our analysis of the 
root cause of this leak:
 * AppStatusListener is notified of Spark events using a bounded queue in 
AsyncEventQueue.
 * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
liveJobs, ...) based on the received events. For example, onTaskStart adds a 
task to liveTasks map and onTaskEnd removes the task from liveTasks map.
 * When the rate of events is very high, the bounded queue in AsyncEventQueue 
is full, some events are dropped and don't make it to AppStatusListener.
 * Dropped events that signal the end of a processing unit prevent the state of

[jira] [Commented] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-17 Thread caican (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723388#comment-17723388
 ] 

caican commented on SPARK-43526:


[~yumwang] 
We plan to prefer shuffledHashJoin, to eliminate the sort overhead, when the 
small table is below a certain threshold, but q95 in TPC-DS shows a serious 
performance regression, so we are not sure whether it can be turned on by default.



with shuffledHashJoin:

!image-2023-05-17-16-53-42-302.png|width=691,height=344!

without shuffledHashJoin:

!image-2023-05-17-16-54-59-053.png|width=722,height=319!
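
For reference, a minimal sketch of the two setups being compared 
(spark.sql.join.preferSortMergeJoin is the real switch; dfA and dfB are 
hypothetical DataFrames; our threshold-based policy itself is not shown):
{code:scala}
// Let the planner pick shuffled hash join when one side is small enough
// to build a hash map (the "with shuffledHashJoin" runs above).
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

// Default behaviour: the planner prefers sort merge join.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")

// A per-query alternative is a join hint, which avoids flipping a global switch:
val joined = dfA.hint("SHUFFLE_HASH").join(dfB, Seq("id"))
{code}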

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, 
> image-2023-05-17-16-54-59-053.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, GC pressure is very high,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it? Thanks!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-17 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: image-2023-05-17-16-54-59-053.png

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, 
> image-2023-05-17-16-54-59-053.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, GC pressure is very high,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it? Thanks!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-17 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: image-2023-05-17-16-53-42-302.png

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, GC pressure is very high,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it? Thanks!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions

2023-05-17 Thread Svyatoslav Semenyuk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723216#comment-17723216
 ] 

Svyatoslav Semenyuk edited comment on SPARK-43514 at 5/17/23 8:35 AM:
--

-We applied "current workaround" to application code and this does not solve 
the issue.-
UPD: the issue was resolved in the application by calling the `.cache()` DataFrame 
method; a rough sketch follows below.
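
A rough sketch of that workaround applied to the failing example from the 
description, assuming spark.implicits._ and org.apache.spark.sql.functions._ are 
in scope as in the examples there (whether one or both inputs need caching is an 
assumption):
{code:scala}
// Cache (materialize) the filtered inputs before they go through the LSH
// pipeline; this is the workaround that resolved the error in our application.
val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 2).cache()
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2).cache()

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}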


was (Author: JIRAUSER300434):
~We applied "current workaround" to application code and this does not solve 
the issue.~
UPD: issue was resolved in application by calling `.cache()` DF method.

> Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML 
> features caused by certain SQL functions
> --
>
> Key: SPARK-43514
> URL: https://issues.apache.org/jira/browse/SPARK-43514
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.3.2, 3.4.0
> Environment: Scala version: 2.12.17
> Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0.
> Spark 3.3.2 deployed on cluster was used to check the issue on real data.
>Reporter: Svyatoslav Semenyuk
>Priority: Major
>  Labels: ml, sql
>
> We designed a function that joins two DFs on common column with some 
> similarity. All next code will be on Scala 2.12.
> I've added {{show}} calls for demonstration purposes.
> {code:scala}
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, 
> RegexTokenizer, MinHashLSHModel}
> import org.apache.spark.sql.{DataFrame, Column}
> /**
>  * Joins two data frames on a string column using LSH algorithm
>  * for similarity computation.
>  *
>  * If input data frames have columns with identical names,
>  * the resulting dataframe will have columns from them both
>  * with prefixes `datasetA` and `datasetB` respectively.
>  *
>  * For example, if both dataframes have a column with name `myColumn`,
>  * then the result will have columns `datasetAMyColumn` and 
> `datasetBMyColumn`.
>  */
> def similarityJoin(
> df: DataFrame,
> anotherDf: DataFrame,
> joinExpr: String,
> threshold: Double = 0.8,
> ): DataFrame = {
> df.show(false)
> anotherDf.show(false)
> val pipeline = new Pipeline().setStages(Array(
> new RegexTokenizer()
> .setPattern("")
> .setMinTokenLength(1)
> .setInputCol(joinExpr)
> .setOutputCol("tokens"),
> new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
> new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
> new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
> )
> )
> val model = pipeline.fit(df)
> val storedHashed = model.transform(df)
> val landedHashed = model.transform(anotherDf)
> val commonColumns = df.columns.toSet & anotherDf.columns.toSet
> /**
>  * Converts column name from a data frame to the column of resulting 
> dataset.
>  */
> def convertColumn(datasetName: String)(columnName: String): Column = {
> val newName =
> if (commonColumns.contains(columnName)) 
> s"$datasetName${columnName.capitalize}"
> else columnName
> col(s"$datasetName.$columnName") as newName
> }
> val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++
>   anotherDf.columns.map(convertColumn("datasetB"))
> val result = model
> .stages
> .last
> .asInstanceOf[MinHashLSHModel]
> .approxSimilarityJoin(storedHashed, landedHashed, threshold, 
> "confidence")
> .select(columnsToSelect.toSeq: _*)
> result.show(false)
> result
> }
> {code}
> Now consider such simple example:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example runs with no errors and outputs 3 empty DFs. Let's add 
> {{distinct}} method to one data frame:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 
> 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example outputs two empty DFs and then fails at {{result.show(false)}}. 
> Error:
> {code:none}
> org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
> defined function (LSHModel$$Lambda$3769/0x000101804840: 
> (struct,values:array>) => 
> array,values:array>>).
>   ... many elided
> Caus

[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions

2023-05-17 Thread Svyatoslav Semenyuk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Svyatoslav Semenyuk updated SPARK-43514:

Description: 
We designed a function that joins two DFs on a common column using approximate 
string similarity. All code below is Scala 2.12.

I've added {{show}} calls for demonstration purposes.

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, 
RegexTokenizer, MinHashLSHModel}
import org.apache.spark.sql.{DataFrame, Column}
import org.apache.spark.sql.functions.col

/**
 * Joins two data frames on a string column using LSH algorithm
 * for similarity computation.
 *
 * If input data frames have columns with identical names,
 * the resulting dataframe will have columns from them both
 * with prefixes `datasetA` and `datasetB` respectively.
 *
 * For example, if both dataframes have a column with name `myColumn`,
 * then the result will have columns `datasetAMyColumn` and `datasetBMyColumn`.
 */
def similarityJoin(
df: DataFrame,
anotherDf: DataFrame,
joinExpr: String,
threshold: Double = 0.8,
): DataFrame = {
df.show(false)
anotherDf.show(false)

val pipeline = new Pipeline().setStages(Array(
new RegexTokenizer()
.setPattern("")
.setMinTokenLength(1)
.setInputCol(joinExpr)
.setOutputCol("tokens"),
new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
)
)

val model = pipeline.fit(df)

val storedHashed = model.transform(df)
val landedHashed = model.transform(anotherDf)

val commonColumns = df.columns.toSet & anotherDf.columns.toSet

/**
 * Converts column name from a data frame to the column of resulting 
dataset.
 */
def convertColumn(datasetName: String)(columnName: String): Column = {
val newName =
if (commonColumns.contains(columnName)) 
s"$datasetName${columnName.capitalize}"
else columnName

col(s"$datasetName.$columnName") as newName
}

val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++
  anotherDf.columns.map(convertColumn("datasetB"))

val result = model
.stages
.last
.asInstanceOf[MinHashLSHModel]
.approxSimilarityJoin(storedHashed, landedHashed, threshold, 
"confidence")
.select(columnsToSelect.toSeq: _*)

result.show(false)

result
}
{code}


Now consider this simple example:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example runs with no errors and outputs 3 empty DFs. Let's add the 
{{distinct}} method to one data frame:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 
2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example outputs two empty DFs and then fails at {{result.show(false)}}. 
Error:

{code:none}
org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
defined function (LSHModel$$Lambda$3769/0x000101804840: 
(struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => 
array<struct<type:tinyint,size:int,indices:array<int>,values:array<double>>>).
  ... many elided
Caused by: java.lang.IllegalArgumentException: requirement failed: Must have at 
least 1 non zero entry.
  at scala.Predef$.require(Predef.scala:281)
  at 
org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61)
  at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99)
  ... many more
{code}



Now let's take a look at an example that is close to our application code. 
Define some helper functions:

{code:scala}
import org.apache.spark.sql.functions._


def process1(df: DataFrame): Unit = {
val companies = df.select($"id", $"name")

val directors = df
.select(explode($"directors"))
.select($"col.name", $"col.id")
.dropDuplicates("id")

val toBeMatched1 = companies
.filter(length($"name") > 2)
.select(
$"name",
$"id" as "sourceLegalEntityId",
)

val toBeMatched2 = directors
.filter(length($"name") > 2)
.select(
$"name",
$"id" as "directorId",
)

similarityJoin(toBeMatched1, toBeMatched2, "name", 0.6)
}

def process2(df: DataFrame): Unit = {
def process_financials(column: Column): Column = {
transform(
column,
x => x.withField("date", to_timestamp(x("date"), "dd MMM ")),
)
}

   

[jira] [Comment Edited] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions

2023-05-17 Thread Svyatoslav Semenyuk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723216#comment-17723216
 ] 

Svyatoslav Semenyuk edited comment on SPARK-43514 at 5/17/23 8:34 AM:
--

~We applied "current workaround" to application code and this does not solve 
the issue.~
UPD: issue was resolved in application by calling `.cache()` DF method.


was (Author: JIRAUSER300434):
We applied "current workaround" to application code and this does not solve the 
issue.

> Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML 
> features caused by certain SQL functions
> --
>
> Key: SPARK-43514
> URL: https://issues.apache.org/jira/browse/SPARK-43514
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.3.2, 3.4.0
> Environment: Scala version: 2.12.17
> Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0.
> Spark 3.3.2 deployed on cluster was used to check the issue on real data.
>Reporter: Svyatoslav Semenyuk
>Priority: Major
>  Labels: ml, sql
>
> We designed a function that joins two DFs on common column with some 
> similarity. All next code will be on Scala 2.12.
> I've added {{show}} calls for demonstration purposes.
> {code:scala}
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, 
> RegexTokenizer, MinHashLSHModel}
> import org.apache.spark.sql.{DataFrame, Column}
> /**
>  * Joins two data frames on a string column using LSH algorithm
>  * for similarity computation.
>  *
>  * If input data frames have columns with identical names,
>  * the resulting dataframe will have columns from them both
>  * with prefixes `datasetA` and `datasetB` respectively.
>  *
>  * For example, if both dataframes have a column with name `myColumn`,
>  * then the result will have columns `datasetAMyColumn` and 
> `datasetBMyColumn`.
>  */
> def similarityJoin(
> df: DataFrame,
> anotherDf: DataFrame,
> joinExpr: String,
> threshold: Double = 0.8,
> ): DataFrame = {
> df.show(false)
> anotherDf.show(false)
> val pipeline = new Pipeline().setStages(Array(
> new RegexTokenizer()
> .setPattern("")
> .setMinTokenLength(1)
> .setInputCol(joinExpr)
> .setOutputCol("tokens"),
> new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
> new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
> new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
> )
> )
> val model = pipeline.fit(df)
> val storedHashed = model.transform(df)
> val landedHashed = model.transform(anotherDf)
> val commonColumns = df.columns.toSet & anotherDf.columns.toSet
> /**
>  * Converts column name from a data frame to the column of resulting 
> dataset.
>  */
> def convertColumn(datasetName: String)(columnName: String): Column = {
> val newName =
> if (commonColumns.contains(columnName)) 
> s"$datasetName${columnName.capitalize}"
> else columnName
> col(s"$datasetName.$columnName") as newName
> }
> val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++
>   anotherDf.columns.map(convertColumn("datasetB"))
> val result = model
> .stages
> .last
> .asInstanceOf[MinHashLSHModel]
> .approxSimilarityJoin(storedHashed, landedHashed, threshold, 
> "confidence")
> .select(columnsToSelect.toSeq: _*)
> result.show(false)
> result
> }
> {code}
> Now consider such simple example:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example runs with no errors and outputs 3 empty DFs. Let's add 
> {{distinct}} method to one data frame:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 
> 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example outputs two empty DFs and then fails at {{result.show(false)}}. 
> Error:
> {code:none}
> org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
> defined function (LSHModel$$Lambda$3769/0x000101804840: 
> (struct,values:array>) => 
> array,values:array>>).
>   ... many elided
> Caused by: java.lang.IllegalArgumentException: requirement failed: Must have 

[jira] [Commented] (SPARK-43534) Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided

2023-05-17 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723373#comment-17723373
 ] 

Yuming Wang commented on SPARK-43534:
-

https://github.com/apache/spark/pull/41195

> Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided
> --
>
> Key: SPARK-43534
> URL: https://issues.apache.org/jira/browse/SPARK-43534
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: hadoop log jars.png, log4j-1.2-api-2.20.0.jar, 
> log4j-slf4j2-impl-2.20.0.jar
>
>
> Build Spark:
> {code:sh}
> ./dev/make-distribution.sh --name default --tgz -Phive -Phive-thriftserver 
> -Pyarn -Phadoop-provided
> tar -zxf spark-3.5.0-SNAPSHOT-bin-default.tgz {code}
> Remove the following jars from spark-3.5.0-SNAPSHOT-bin-default:
> {noformat}
> jars/log4j-1.2-api-2.20.0.jar
> jars/log4j-slf4j2-impl-2.20.0.jar
> {noformat}
> Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-default/conf:
> {code:none}
> rootLogger.level = info
> rootLogger.appenderRef.file.ref = File
> rootLogger.appenderRef.stderr.ref = console
> appender.console.type = Console
> appender.console.name = console
> appender.console.target = SYSTEM_ERR
> appender.console.layout.type = PatternLayout
> appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L 
> : %m%n
> appender.file.type = RollingFile
> appender.file.name = File
> appender.file.fileName = /tmp/spark/logs/spark.log
> appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log
> appender.file.append = true
> appender.file.layout.type = PatternLayout
> appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : 
> %m%n
> appender.file.policies.type = Policies
> appender.file.policies.time.type = TimeBasedTriggeringPolicy
> appender.file.policies.time.interval = 1
> appender.file.policies.time.modulate = true
> appender.file.policies.size.type = SizeBasedTriggeringPolicy
> appender.file.policies.size.size = 256M
> appender.file.strategy.type = DefaultRolloverStrategy
> appender.file.strategy.max = 100
> {code}
> Start Spark thriftserver:
> {code:java}
> sbin/start-thriftserver.sh
> {code}
> Check the log:
> {code:sh}
> cat /tmp/spark/logs/spark.log
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43534) Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided

2023-05-17 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-43534:

Attachment: hadoop log jars.png

> Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided
> --
>
> Key: SPARK-43534
> URL: https://issues.apache.org/jira/browse/SPARK-43534
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: hadoop log jars.png, log4j-1.2-api-2.20.0.jar, 
> log4j-slf4j2-impl-2.20.0.jar
>
>
> Build Spark:
> {code:sh}
> ./dev/make-distribution.sh --name default --tgz -Phive -Phive-thriftserver 
> -Pyarn -Phadoop-provided
> tar -zxf spark-3.5.0-SNAPSHOT-bin-default.tgz {code}
> Remove the following jars from spark-3.5.0-SNAPSHOT-bin-default:
> {noformat}
> jars/log4j-1.2-api-2.20.0.jar
> jars/log4j-slf4j2-impl-2.20.0.jar
> {noformat}
> Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-default/conf:
> {code:none}
> rootLogger.level = info
> rootLogger.appenderRef.file.ref = File
> rootLogger.appenderRef.stderr.ref = console
> appender.console.type = Console
> appender.console.name = console
> appender.console.target = SYSTEM_ERR
> appender.console.layout.type = PatternLayout
> appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L 
> : %m%n
> appender.file.type = RollingFile
> appender.file.name = File
> appender.file.fileName = /tmp/spark/logs/spark.log
> appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log
> appender.file.append = true
> appender.file.layout.type = PatternLayout
> appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : 
> %m%n
> appender.file.policies.type = Policies
> appender.file.policies.time.type = TimeBasedTriggeringPolicy
> appender.file.policies.time.interval = 1
> appender.file.policies.time.modulate = true
> appender.file.policies.size.type = SizeBasedTriggeringPolicy
> appender.file.policies.size.size = 256M
> appender.file.strategy.type = DefaultRolloverStrategy
> appender.file.strategy.max = 100
> {code}
> Start Spark thriftserver:
> {code:java}
> sbin/start-thriftserver.sh
> {code}
> Check the log:
> {code:sh}
> cat /tmp/spark/logs/spark.log
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43534) Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided

2023-05-17 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-43534:

Description: 
Build Spark:
{code:sh}
./dev/make-distribution.sh --name default --tgz -Phive -Phive-thriftserver 
-Pyarn -Phadoop-provided
tar -zxf spark-3.5.0-SNAPSHOT-bin-default.tgz {code}
Remove the following jars from spark-3.5.0-SNAPSHOT-bin-default:
{noformat}
jars/log4j-1.2-api-2.20.0.jar
jars/log4j-slf4j2-impl-2.20.0.jar
{noformat}
Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-default/conf:
{code:none}
rootLogger.level = info
rootLogger.appenderRef.file.ref = File
rootLogger.appenderRef.stderr.ref = console

appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : 
%m%n

appender.file.type = RollingFile
appender.file.name = File
appender.file.fileName = /tmp/spark/logs/spark.log
appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log
appender.file.append = true
appender.file.layout.type = PatternLayout
appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n
appender.file.policies.type = Policies
appender.file.policies.time.type = TimeBasedTriggeringPolicy
appender.file.policies.time.interval = 1
appender.file.policies.time.modulate = true
appender.file.policies.size.type = SizeBasedTriggeringPolicy
appender.file.policies.size.size = 256M
appender.file.strategy.type = DefaultRolloverStrategy
appender.file.strategy.max = 100
{code}

Start Spark thriftserver:
{code:java}
sbin/start-thriftserver.sh
{code}

Check the log:
{code:sh}
cat /tmp/spark/logs/spark.log
{code}



  was:
Build Spark:
{code:sh}
./dev/make-distribution.sh --name default --tgz -Phive -Phive-thriftserver 
-Pyarn -Phadoop-provided
tar -zxf spark-3.5.0-SNAPSHOT-bin-default.tgz {code}
Remove the following jars to spark-3.5.0-SNAPSHOT-bin-default:
{noformat}
jars/log4j-1.2-api-2.20.0.jar
jars/log4j-slf4j2-impl-2.20.0.jar
{noformat}
Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-default/conf:
{code:none}
rootLogger.level = info
rootLogger.appenderRef.file.ref = File
rootLogger.appenderRef.stderr.ref = console

appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : 
%m%n

appender.file.type = RollingFile
appender.file.name = File
appender.file.fileName = /tmp/spark/logs/spark.log
appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log
appender.file.append = true
appender.file.layout.type = PatternLayout
appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n
appender.file.policies.type = Policies
appender.file.policies.time.type = TimeBasedTriggeringPolicy
appender.file.policies.time.interval = 1
appender.file.policies.time.modulate = true
appender.file.policies.size.type = SizeBasedTriggeringPolicy
appender.file.policies.size.size = 256M
appender.file.strategy.type = DefaultRolloverStrategy
appender.file.strategy.max = 100
{code}

Start Spark thriftserver:
{code:java}
sbin/start-thriftserver.sh
{code}

Check the log:
{code:sh}
cat /tmp/spark/logs/spark.log
{code}




> Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided
> --
>
> Key: SPARK-43534
> URL: https://issues.apache.org/jira/browse/SPARK-43534
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: log4j-1.2-api-2.20.0.jar, log4j-slf4j2-impl-2.20.0.jar
>
>
> Build Spark:
> {code:sh}
> ./dev/make-distribution.sh --name default --tgz -Phive -Phive-thriftserver 
> -Pyarn -Phadoop-provided
> tar -zxf spark-3.5.0-SNAPSHOT-bin-default.tgz {code}
> Remove the following jars from spark-3.5.0-SNAPSHOT-bin-default:
> {noformat}
> jars/log4j-1.2-api-2.20.0.jar
> jars/log4j-slf4j2-impl-2.20.0.jar
> {noformat}
> Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-default/conf:
> {code:none}
> rootLogger.level = info
> rootLogger.appenderRef.file.ref = File
> rootLogger.appenderRef.stderr.ref = console
> appender.console.type = Console
> appender.console.name = console
> appender.console.target = SYSTEM_ERR
> appender.console.layout.type = PatternLayout
> appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L 
> : %m%n
> appender.file.type = RollingFile
> appender.file.name = File
> appender.file.fileName = /tmp/spark/logs/spark.log
> appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log
> appender.file.append = true
> appender.file.layout.type = PatternLa

[jira] [Created] (SPARK-43535) Adjust the ImportOrderChecker rule to resolve long-standing import issues

2023-05-17 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43535:
---

 Summary: Adjust the ImportOrderChecker rule to resolve 
long-standing import issues
 Key: SPARK-43535
 URL: https://issues.apache.org/jira/browse/SPARK-43535
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43128) Streaming progress struct (especially in Scala)

2023-05-17 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-43128.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40892
[https://github.com/apache/spark/pull/40892]

> Streaming progress struct (especially in Scala)
> ---
>
> Key: SPARK-43128
> URL: https://issues.apache.org/jira/browse/SPARK-43128
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>
> Streaming spark connect transfers streaming progress as full “json”.
> This works ok for Python since it does not have any schema defined. 
> But in Scala, it is a full fledged class. We need to decide if we want to 
> match legacy Progress struct in spark-connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43128) Streaming progress struct (especially in Scala)

2023-05-17 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-43128:


Assignee: Yang Jie

> Streaming progress struct (especially in Scala)
> ---
>
> Key: SPARK-43128
> URL: https://issues.apache.org/jira/browse/SPARK-43128
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Assignee: Yang Jie
>Priority: Major
>
> Streaming spark connect transfers streaming progress as full “json”.
> This works ok for Python since it does not have any schema defined. 
> But in Scala, it is a full fledged class. We need to decide if we want to 
> match legacy Progress struct in spark-connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43534) Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided

2023-05-17 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-43534:

Description: 
Build Spark:
{code:sh}
./dev/make-distribution.sh --name default --tgz -Phive -Phive-thriftserver 
-Pyarn -Phadoop-provided
tar -zxf spark-3.5.0-SNAPSHOT-bin-default.tgz {code}
Remove the following jars to spark-3.5.0-SNAPSHOT-bin-default:
{noformat}
jars/log4j-1.2-api-2.20.0.jar
jars/log4j-slf4j2-impl-2.20.0.jar
{noformat}
Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-default/conf:
{code:none}
rootLogger.level = info
rootLogger.appenderRef.file.ref = File
rootLogger.appenderRef.stderr.ref = console

appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : 
%m%n

appender.file.type = RollingFile
appender.file.name = File
appender.file.fileName = /tmp/spark/logs/spark.log
appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log
appender.file.append = true
appender.file.layout.type = PatternLayout
appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n
appender.file.policies.type = Policies
appender.file.policies.time.type = TimeBasedTriggeringPolicy
appender.file.policies.time.interval = 1
appender.file.policies.time.modulate = true
appender.file.policies.size.type = SizeBasedTriggeringPolicy
appender.file.policies.size.size = 256M
appender.file.strategy.type = DefaultRolloverStrategy
appender.file.strategy.max = 100
{code}

Start Spark thriftserver:
{code:java}
sbin/start-thriftserver.sh
{code}

Check the log:
{code:sh}
cat /tmp/spark/logs/spark.log
{code}



  was:
Build Spark:
{code:sh}
./dev/make-distribution.sh --name provided --tgz -Phive -Phive-thriftserver 
-Pyarn -Phadoop-provided
tar -zxf spark-3.5.0-SNAPSHOT-bin-provided.tgz {code}
Copy the following jars to spark-3.5.0-SNAPSHOT-bin-provided/jars/:
{noformat}
guava-14.0.1.jar
hadoop-client-api-3.3.5.jar
hadoop-client-runtime-3.3.5.jar
hadoop-shaded-guava-1.1.1.jar
hadoop-yarn-server-web-proxy-3.3.5.jar
slf4j-api-2.0.7.jar
{noformat}
Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-provided/conf:
{code:none}
rootLogger.level = info
rootLogger.appenderRef.file.ref = File
rootLogger.appenderRef.stderr.ref = console

appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : 
%m%n

appender.file.type = RollingFile
appender.file.name = File
appender.file.fileName = /tmp/spark/logs/spark.log
appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log
appender.file.append = true
appender.file.layout.type = PatternLayout
appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n
appender.file.policies.type = Policies
appender.file.policies.time.type = TimeBasedTriggeringPolicy
appender.file.policies.time.interval = 1
appender.file.policies.time.modulate = true
appender.file.policies.size.type = SizeBasedTriggeringPolicy
appender.file.policies.size.size = 256M
appender.file.strategy.type = DefaultRolloverStrategy
appender.file.strategy.max = 100
{code}

Start Spark thriftserver:
{code:java}
sbin/start-thriftserver.sh
{code}

Check the log:
{code:sh}
cat /tmp/spark/logs/spark.log
{code}




> Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided
> --
>
> Key: SPARK-43534
> URL: https://issues.apache.org/jira/browse/SPARK-43534
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: log4j-1.2-api-2.20.0.jar, log4j-slf4j2-impl-2.20.0.jar
>
>
> Build Spark:
> {code:sh}
> ./dev/make-distribution.sh --name default --tgz -Phive -Phive-thriftserver 
> -Pyarn -Phadoop-provided
> tar -zxf spark-3.5.0-SNAPSHOT-bin-default.tgz {code}
> Remove the following jars to spark-3.5.0-SNAPSHOT-bin-default:
> {noformat}
> jars/log4j-1.2-api-2.20.0.jar
> jars/log4j-slf4j2-impl-2.20.0.jar
> {noformat}
> Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-default/conf:
> {code:none}
> rootLogger.level = info
> rootLogger.appenderRef.file.ref = File
> rootLogger.appenderRef.stderr.ref = console
> appender.console.type = Console
> appender.console.name = console
> appender.console.target = SYSTEM_ERR
> appender.console.layout.type = PatternLayout
> appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L 
> : %m%n
> appender.file.type = RollingFile
> appender.file.name = File
> appender.file.fileName = /tmp/spark/logs/spark.log
> appender.file.filePattern = /tmp/

[jira] [Updated] (SPARK-43534) Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided

2023-05-17 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-43534:

Attachment: log4j-slf4j2-impl-2.20.0.jar

> Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided
> --
>
> Key: SPARK-43534
> URL: https://issues.apache.org/jira/browse/SPARK-43534
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: log4j-1.2-api-2.20.0.jar, log4j-slf4j2-impl-2.20.0.jar
>
>
> Build Spark:
> {code:sh}
> ./dev/make-distribution.sh --name provided --tgz -Phive -Phive-thriftserver 
> -Pyarn -Phadoop-provided
> tar -zxf spark-3.5.0-SNAPSHOT-bin-provided.tgz {code}
> Copy the following jars to spark-3.5.0-SNAPSHOT-bin-provided/jars/:
> {noformat}
> guava-14.0.1.jar
> hadoop-client-api-3.3.5.jar
> hadoop-client-runtime-3.3.5.jar
> hadoop-shaded-guava-1.1.1.jar
> hadoop-yarn-server-web-proxy-3.3.5.jar
> slf4j-api-2.0.7.jar
> {noformat}
> Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-provided/conf:
> {code:none}
> rootLogger.level = info
> rootLogger.appenderRef.file.ref = File
> rootLogger.appenderRef.stderr.ref = console
> appender.console.type = Console
> appender.console.name = console
> appender.console.target = SYSTEM_ERR
> appender.console.layout.type = PatternLayout
> appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L 
> : %m%n
> appender.file.type = RollingFile
> appender.file.name = File
> appender.file.fileName = /tmp/spark/logs/spark.log
> appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log
> appender.file.append = true
> appender.file.layout.type = PatternLayout
> appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : 
> %m%n
> appender.file.policies.type = Policies
> appender.file.policies.time.type = TimeBasedTriggeringPolicy
> appender.file.policies.time.interval = 1
> appender.file.policies.time.modulate = true
> appender.file.policies.size.type = SizeBasedTriggeringPolicy
> appender.file.policies.size.size = 256M
> appender.file.strategy.type = DefaultRolloverStrategy
> appender.file.strategy.max = 100
> {code}
> Start Spark thriftserver:
> {code:java}
> sbin/start-thriftserver.sh
> {code}
> Check the log:
> {code:sh}
> cat /tmp/spark/logs/spark.log
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43534) Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided

2023-05-17 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-43534:

Attachment: log4j-1.2-api-2.20.0.jar

> Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided
> --
>
> Key: SPARK-43534
> URL: https://issues.apache.org/jira/browse/SPARK-43534
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: log4j-1.2-api-2.20.0.jar, log4j-slf4j2-impl-2.20.0.jar
>
>
> Build Spark:
> {code:sh}
> ./dev/make-distribution.sh --name provided --tgz -Phive -Phive-thriftserver 
> -Pyarn -Phadoop-provided
> tar -zxf spark-3.5.0-SNAPSHOT-bin-provided.tgz {code}
> Copy the following jars to spark-3.5.0-SNAPSHOT-bin-provided/jars/:
> {noformat}
> guava-14.0.1.jar
> hadoop-client-api-3.3.5.jar
> hadoop-client-runtime-3.3.5.jar
> hadoop-shaded-guava-1.1.1.jar
> hadoop-yarn-server-web-proxy-3.3.5.jar
> slf4j-api-2.0.7.jar
> {noformat}
> Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-provided/conf:
> {code:none}
> rootLogger.level = info
> rootLogger.appenderRef.file.ref = File
> rootLogger.appenderRef.stderr.ref = console
> appender.console.type = Console
> appender.console.name = console
> appender.console.target = SYSTEM_ERR
> appender.console.layout.type = PatternLayout
> appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L 
> : %m%n
> appender.file.type = RollingFile
> appender.file.name = File
> appender.file.fileName = /tmp/spark/logs/spark.log
> appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log
> appender.file.append = true
> appender.file.layout.type = PatternLayout
> appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : 
> %m%n
> appender.file.policies.type = Policies
> appender.file.policies.time.type = TimeBasedTriggeringPolicy
> appender.file.policies.time.interval = 1
> appender.file.policies.time.modulate = true
> appender.file.policies.size.type = SizeBasedTriggeringPolicy
> appender.file.policies.size.size = 256M
> appender.file.strategy.type = DefaultRolloverStrategy
> appender.file.strategy.max = 100
> {code}
> Start Spark thriftserver:
> {code:java}
> sbin/start-thriftserver.sh
> {code}
> Check the log:
> {code:sh}
> cat /tmp/spark/logs/spark.log
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43534) Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided

2023-05-17 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-43534:

Description: 
Build Spark:
{code:sh}
./dev/make-distribution.sh --name provided --tgz -Phive -Phive-thriftserver 
-Pyarn -Phadoop-provided
tar -zxf spark-3.5.0-SNAPSHOT-bin-provided.tgz {code}
Copy the following jars to spark-3.5.0-SNAPSHOT-bin-provided/jars/:
{noformat}
guava-14.0.1.jar
hadoop-client-api-3.3.5.jar
hadoop-client-runtime-3.3.5.jar
hadoop-shaded-guava-1.1.1.jar
hadoop-yarn-server-web-proxy-3.3.5.jar
slf4j-api-2.0.7.jar
{noformat}
Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-provided/conf:
{code:none}
rootLogger.level = info
rootLogger.appenderRef.file.ref = File
rootLogger.appenderRef.stderr.ref = console

appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : 
%m%n

appender.file.type = RollingFile
appender.file.name = File
appender.file.fileName = /tmp/spark/logs/spark.log
appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log
appender.file.append = true
appender.file.layout.type = PatternLayout
appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n
appender.file.policies.type = Policies
appender.file.policies.time.type = TimeBasedTriggeringPolicy
appender.file.policies.time.interval = 1
appender.file.policies.time.modulate = true
appender.file.policies.size.type = SizeBasedTriggeringPolicy
appender.file.policies.size.size = 256M
appender.file.strategy.type = DefaultRolloverStrategy
appender.file.strategy.max = 100
{code}

Start Spark thriftserver:
{code:java}
sbin/start-thriftserver.sh
{code}

Check the log:
{code:sh}
cat /tmp/spark/logs/spark.log
{code}



  was:
Build Spark:
{code:sh}
./dev/make-distribution.sh --name provided --tgz -Phive -Phive-thriftserver 
-Pyarn -Phadoop-provided
tar -zxf spark-3.5.0-SNAPSHOT-bin-provided.tgz {code}
Copy the fellowing jars to spark-3.5.0-SNAPSHOT-bin-provided/jars/:
{noformat}
guava-14.0.1.jar
hadoop-client-api-3.3.5.jar
hadoop-client-runtime-3.3.5.jar
hadoop-shaded-guava-1.1.1.jar
hadoop-yarn-server-web-proxy-3.3.5.jar
slf4j-api-2.0.7.jar
{noformat}
Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-provided/conf:
{code:none}
rootLogger.level = info
rootLogger.appenderRef.file.ref = File
rootLogger.appenderRef.stderr.ref = console

appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : 
%m%n

appender.file.type = RollingFile
appender.file.name = File
appender.file.fileName = /tmp/spark/logs/spark.log
appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log
appender.file.append = true
appender.file.layout.type = PatternLayout
appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n
appender.file.policies.type = Policies
appender.file.policies.time.type = TimeBasedTriggeringPolicy
appender.file.policies.time.interval = 1
appender.file.policies.time.modulate = true
appender.file.policies.size.type = SizeBasedTriggeringPolicy
appender.file.policies.size.size = 256M
appender.file.strategy.type = DefaultRolloverStrategy
appender.file.strategy.max = 100
{code}

Start Spark thriftserver:
{code:java}
sbin/start-thriftserver.sh
{code}

Check the log:
{code:sh}
cat /tmp/spark/logs/spark.log
{code}




> Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided
> --
>
> Key: SPARK-43534
> URL: https://issues.apache.org/jira/browse/SPARK-43534
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> Build Spark:
> {code:sh}
> ./dev/make-distribution.sh --name provided --tgz -Phive -Phive-thriftserver 
> -Pyarn -Phadoop-provided
> tar -zxf spark-3.5.0-SNAPSHOT-bin-provided.tgz {code}
> Copy the following jars to spark-3.5.0-SNAPSHOT-bin-provided/jars/:
> {noformat}
> guava-14.0.1.jar
> hadoop-client-api-3.3.5.jar
> hadoop-client-runtime-3.3.5.jar
> hadoop-shaded-guava-1.1.1.jar
> hadoop-yarn-server-web-proxy-3.3.5.jar
> slf4j-api-2.0.7.jar
> {noformat}
> Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-provided/conf:
> {code:none}
> rootLogger.level = info
> rootLogger.appenderRef.file.ref = File
> rootLogger.appenderRef.stderr.ref = console
> appender.console.type = Console
> appender.console.name = console
> appender.console.target = SYSTEM_ERR
> appender.console.layout.type = PatternLayout
> appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L 
> : %m%n
> a

[jira] [Created] (SPARK-43534) Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided

2023-05-17 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-43534:
---

 Summary: Add log4j-1.2-api and log4j-slf4j2-impl to classpath if 
active hadoop-provided
 Key: SPARK-43534
 URL: https://issues.apache.org/jira/browse/SPARK-43534
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.4.0
Reporter: Yuming Wang


Build Spark:
{code:sh}
./dev/make-distribution.sh --name provided --tgz -Phive -Phive-thriftserver 
-Pyarn -Phadoop-provided
tar -zxf spark-3.5.0-SNAPSHOT-bin-provided.tgz {code}
Copy the fellowing jars to spark-3.5.0-SNAPSHOT-bin-provided/jars/:
{noformat}
guava-14.0.1.jar
hadoop-client-api-3.3.5.jar
hadoop-client-runtime-3.3.5.jar
hadoop-shaded-guava-1.1.1.jar
hadoop-yarn-server-web-proxy-3.3.5.jar
slf4j-api-2.0.7.jar
{noformat}
Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-provided/conf:
{code:none}
rootLogger.level = info
rootLogger.appenderRef.file.ref = File
rootLogger.appenderRef.stderr.ref = console

appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : 
%m%n

appender.file.type = RollingFile
appender.file.name = File
appender.file.fileName = /tmp/spark/logs/spark.log
appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log
appender.file.append = true
appender.file.layout.type = PatternLayout
appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : %m%n
appender.file.policies.type = Policies
appender.file.policies.time.type = TimeBasedTriggeringPolicy
appender.file.policies.time.interval = 1
appender.file.policies.time.modulate = true
appender.file.policies.size.type = SizeBasedTriggeringPolicy
appender.file.policies.size.size = 256M
appender.file.strategy.type = DefaultRolloverStrategy
appender.file.strategy.max = 100
{code}

Start Spark thriftserver:
{code:java}
sbin/start-thriftserver.sh
{code}

Check the log:
{code:sh}
cat /tmp/spark/logs/spark.log
{code}





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org