[jira] [Updated] (SPARK-38162) Optimize one row plan in normal and AQE Optimizer

2022-02-09 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-38162:
--
Summary: Optimize one row plan in normal and AQE Optimizer  (was: Optimize 
one max row plan in normal and AQE Optimizer)

> Optimize one row plan in normal and AQE Optimizer
> -
>
> Key: SPARK-38162
> URL: https://issues.apache.org/jira/browse/SPARK-38162
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> Optimize the plan if its max row count is equal to or less than 1, in these cases (a small illustration follows this list):
>  * if a sort's max rows is less than or equal to 1, remove the sort
>  * if a local sort's max rows per partition is less than or equal to 1, remove 
> the local sort
>  * if an aggregate's max rows is less than or equal to 1 and it is grouping only, 
> remove the aggregate
>  * if an aggregate's max rows is less than or equal to 1, set distinct to false in 
> all aggregate expressions
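A minimal, hedged illustration of the kind of plan this targets (the query and names below are assumptions for illustration, not from the ticket): a global aggregate can produce at most one row, so an ORDER BY above it does no useful work.

{code:java}
// Sketch only: assumes a running SparkSession named `spark` (e.g. in spark-shell).
// The inner global aggregate has maxRows <= 1, so the Sort added by orderBy is
// redundant; with such a rule the optimized plan should contain no Sort node.
val oneRow = spark.range(100).selectExpr("max(id) AS m")
val sorted = oneRow.orderBy("m")
sorted.explain(true)
{code}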






[jira] [Updated] (SPARK-38162) Optimize one max row plan in normal and AQE Optimizer

2022-02-09 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-38162:
--
Description: 
Optimize the plan if its max row count is equal to or less than 1, in these cases:
 * if a sort's max rows is less than or equal to 1, remove the sort
 * if a local sort's max rows per partition is less than or equal to 1, remove the 
local sort
 * if an aggregate's max rows is less than or equal to 1 and it is grouping only, 
remove the aggregate
 * if an aggregate's max rows is less than or equal to 1, set distinct to false in 
all aggregate expressions

  was:We cannot propagate empty relations through an aggregate that has no 
grouping expressions. But for an aggregate that contains a distinct aggregate 
expression, we can remove the distinct if its child is empty.


> Optimize one max row plan in normal and AQE Optimizer
> -
>
> Key: SPARK-38162
> URL: https://issues.apache.org/jira/browse/SPARK-38162
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> Optimize the plan if its max row count is equal to or less than 1, in these cases:
>  * if a sort's max rows is less than or equal to 1, remove the sort
>  * if a local sort's max rows per partition is less than or equal to 1, remove 
> the local sort
>  * if an aggregate's max rows is less than or equal to 1 and it is grouping only, 
> remove the aggregate
>  * if an aggregate's max rows is less than or equal to 1, set distinct to false in 
> all aggregate expressions






[jira] [Updated] (SPARK-38162) Optimize one max row plan in normal and AQE Optimizer

2022-02-09 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-38162:
--
Summary: Optimize one max row plan in normal and AQE Optimizer  (was: 
Remove distinct in aggregate if its child is empty)

> Optimize one max row plan in normal and AQE Optimizer
> -
>
> Key: SPARK-38162
> URL: https://issues.apache.org/jira/browse/SPARK-38162
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> We cannot propagate empty relations through an aggregate that has no grouping 
> expressions. But for an aggregate that contains a distinct aggregate 
> expression, we can remove the distinct if its child is empty.






[jira] [Assigned] (SPARK-38157) Fix /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite under ANSI mode

2022-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-38157:
---

Assignee: Xinyi Yu

> Fix 
> /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite
>  under ANSI mode
> 
>
> Key: SPARK-38157
> URL: https://issues.apache.org/jira/browse/SPARK-38157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Assignee: Xinyi Yu
>Priority: Major
>
> ThriftServerQueryTestSuite fails on {{timestampNTZ/timestamp.sql}} when ANSI 
> mode is on by default. According to its golden result file, 
> {{timestampNTZ/timestamp.sql}} is only meant to run with ANSI off, but neither 
> ThriftServerQueryTestSuite nor the timestamp.sql test overrides the default 
> ANSI setting.
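For context, a minimal standalone sketch of what overriding the default ANSI setting looks like (an assumed illustration, not the suite's actual code):

{code:java}
import org.apache.spark.sql.SparkSession

object AnsiOffSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ansi-off").getOrCreate()
    // Pin ANSI off for queries whose golden results were produced without ANSI.
    spark.conf.set("spark.sql.ansi.enabled", "false")
    // Returns NULL here; under ANSI mode the same cast would raise an error.
    spark.sql("SELECT CAST('abc' AS INT)").show()
    spark.stop()
  }
}
{code}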






[jira] [Resolved] (SPARK-38157) Fix /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite under ANSI mode

2022-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38157.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35471
[https://github.com/apache/spark/pull/35471]

> Fix 
> /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite
>  under ANSI mode
> 
>
> Key: SPARK-38157
> URL: https://issues.apache.org/jira/browse/SPARK-38157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Assignee: Xinyi Yu
>Priority: Major
> Fix For: 3.3.0
>
>
> ThriftServerQueryTestSuite fails on {{timestampNTZ/timestamp.sql}} when ANSI 
> mode is on by default. According to its golden result file, 
> {{timestampNTZ/timestamp.sql}} is only meant to run with ANSI off, but neither 
> ThriftServerQueryTestSuite nor the timestamp.sql test overrides the default 
> ANSI setting.






[jira] [Commented] (SPARK-38161) when clean data hope to spilt one dataframe or dataset to two dataframe

2022-02-09 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489973#comment-17489973
 ] 

Hyukjin Kwon commented on SPARK-38161:
--

No, I meant the API in Scala itself:

{code}
scala> Seq(1, 2, 3).partition(_ % 1 == 0)
res3: (Seq[Int], Seq[Int]) = (List(1, 2, 3),List())
{code}

> when clean data hope to spilt one dataframe or dataset  to two dataframe
> 
>
> Key: SPARK-38161
> URL: https://issues.apache.org/jira/browse/SPARK-38161
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: gaokui
>Priority: Major
>
> While doing data cleaning, I run into the following scenario:
> one column needs to be checked against an empty-or-null condition,
> so right now I do something similar to the following:
> df1 = dataframe.filter("column is null")
> df2 = dataframe.filter("column is not null")
> and then write df1 and df2 into HDFS parquet files.
> But when I have a thousand conditions, every job needs more stages.
> I hope a DataFrame can be filtered by one condition once (not twice) and 
> produce two DataFrames (see the sketch after this description).
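One way to get both halves written in a single pass, sketched under assumed column and path names (a workaround idea, not an existing Spark splitting API): derive a flag column and partition the write by it.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SplitByFlag {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("split-by-flag").getOrCreate()
    import spark.implicits._
    // Stand-in for the real input; the column name "c" and output path are assumed.
    val dataframe = Seq("a", null, "b").toDF("c")
    // Null and non-null rows land in separate directories
    // (c_is_null=true / c_is_null=false) from a single scan of the source.
    val flagged = dataframe.withColumn("c_is_null", col("c").isNull.cast("string"))
    flagged.write.mode("overwrite").partitionBy("c_is_null").parquet("/tmp/split_output")
    spark.stop()
  }
}
{code}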






[jira] [Commented] (SPARK-38161) when clean data hope to spilt one dataframe or dataset to two dataframe

2022-02-09 Thread gaokui (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489971#comment-17489971
 ] 

gaokui commented on SPARK-38161:


Do you mean 'write.mode('parquet').partitionBy('col')' or df.partition?

Could you provide more detail?

Thanks!

> when clean data hope to spilt one dataframe or dataset  to two dataframe
> 
>
> Key: SPARK-38161
> URL: https://issues.apache.org/jira/browse/SPARK-38161
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: gaokui
>Priority: Major
>
> While doing data cleaning, I run into the following scenario:
> one column needs to be checked against an empty-or-null condition,
> so right now I do something similar to the following:
> df1 = dataframe.filter("column is null")
> df2 = dataframe.filter("column is not null")
> and then write df1 and df2 into HDFS parquet files.
> But when I have a thousand conditions, every job needs more stages.
> I hope a DataFrame can be filtered by one condition once (not twice) and 
> produce two DataFrames.






[jira] [Resolved] (SPARK-38168) LikeSimplification handles escape character

2022-02-09 Thread Dooyoung Hwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dooyoung Hwang resolved SPARK-38168.

Resolution: Won't Fix

> LikeSimplification handles escape character
> ---
>
> Key: SPARK-38168
> URL: https://issues.apache.org/jira/browse/SPARK-38168
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Dooyoung Hwang
>Priority: Major
>
> Currently, the LikeSimplification rule in Catalyst is skipped if the pattern 
> contains an escape character.
> {noformat}
> SELECT * FROM tbl WHERE c_1 LIKE '%100\%'
> ...
> == Optimized Logical Plan ==
> Filter (isnotnull(c_1#0) && c_1#0 LIKE %100\%)
> +- Relation[c_1#0,c_2#1,c_3#2] ...
> {noformat}
> The filter LIKE '%100\%' in this query is not optimized into an 'EndsWith' on 
> the StringType column.
> The LikeSimplification rule could treat a special character (a wildcard % or _, 
> or the escape character itself) as a plain character whenever it follows an 
> escape character.
> By doing that, the rule could optimize the filter as below (a small sketch of 
> the equivalence follows this description).
> {noformat}
> SELECT * FROM tbl WHERE c_1 LIKE '%100\%'
> ...
> == Optimized Logical Plan ==
> Filter (isnotnull(c_1#0) && EndsWith(c_1#0, 100%))
> +- Relation[c_1#0,c_2#1,c_3#2] 
> {noformat}
>  
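A plain-Scala sketch of the equivalence being proposed (an assumed illustration, not the Catalyst rule itself): with '\' as the escape character, the pattern '%100\%' matches exactly the values that end in the literal string "100%".

{code:java}
// Runs as-is in a Scala REPL; matchesPattern is a hypothetical helper name.
val literalSuffix = "100%"   // '%100\%' = any prefix, then the escaped '%' taken literally
def matchesPattern(s: String): Boolean = s.endsWith(literalSuffix)

assert(matchesPattern("price is 100%"))
assert(!matchesPattern("price is 100"))
{code}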






[jira] [Commented] (SPARK-38165) private classes fail at runtime in scala 2.12.13+

2022-02-09 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489965#comment-17489965
 ] 

Hyukjin Kwon commented on SPARK-38165:
--

Okay, so I guess the problem is here: 
https://github.com/everson/spark-codegen-bug/blob/main/src/test/scala/everson/sparkcodegen/FunctionsSpec.scala#L45

The companion object should be accessible from the instance, but I guess it 
fails. I think it's a corner case in any event.

> private classes fail at runtime in scala 2.12.13+
> -
>
> Key: SPARK-38165
> URL: https://issues.apache.org/jira/browse/SPARK-38165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
> Environment: Tested using JVM 8 and 11 on Scala versions 2.12.12 
> (works), 2.12.13 to 2.12.15, and 2.13.7 to 2.13.8
>Reporter: Johnny Everson
>Priority: Major
>
> h2. reproduction steps
> {code:java}
> git clone g...@github.com:everson/spark-codegen-bug.git
> sbt +test
> {code}
> h2. problem
> Starting with Scala 2.12.13, Spark code (tried 3.1.x and 3.2.x versions) 
> referring to case class members fails at runtime.
> See the discussion on [https://github.com/scala/bug/issues/12533] for the exact 
> internal details from the Scala contributors, but the gist is that starting 
> with Scala 2.12.13, inner class visibility rules changed via 
> https://github.com/scala/scala/pull/9131 and it appears that Spark codegen 
> assumes they are public.
> In a complex project, the error looks like:
> {code:java}
> [error]
> Success(SparkFailures(NonEmpty[Unknown(org.apache.spark.SparkException: Job 
> aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent 
> failure: Lost task 1.0 in stage 2.0 (TID 3) (192.168.0.80 executor driver): 
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 63, Column 8: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 63, Column 8: Private member cannot be accessed from type 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection".
> [error]   at 
> org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
> [error]   at 
> org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
> [error]   at 
> org.sparkproject.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> [error]   at 
> org.sparkproject.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.waitForValue(LocalCache.java:3620)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache$Segment.waitForLoadingValue(LocalCache.java:2362)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2349)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> [error]   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1351)
> [error]   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:205)
> [error]   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:39)
> [error]   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1277)
> [error]   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1274)
> [error]   at 
> org.apache.spark.sql.execution.ObjectOperator$.deserializeRowToObject(objects.scala:147)
> [error]   at 
> org.apache.spark.sql.execution.AppendColumnsExec.$anonfun$doExecute$12(objects.scala:326)
> [error]   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
> [error]   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
> [error]   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> [error]   at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
> [error]   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
> [error]   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> [error]   at 
> 

[jira] [Updated] (SPARK-38165) private classes fail at runtime in scala 2.12.13+

2022-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38165:
-
Priority: Minor  (was: Major)

> private classes fail at runtime in scala 2.12.13+
> -
>
> Key: SPARK-38165
> URL: https://issues.apache.org/jira/browse/SPARK-38165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
> Environment: Tested using JVM 8 and 11 on Scala versions 2.12.12 
> (works), 2.12.13 to 2.12.15, and 2.13.7 to 2.13.8
>Reporter: Johnny Everson
>Priority: Minor
>
> h2. reproduction steps
> {code:java}
> git clone g...@github.com:everson/spark-codegen-bug.git
> sbt +test
> {code}
> h2. problem
> Starting with Scala 2.12.13, Spark code (tried 3.1.x and 3.2.x versions) 
> referring to case class members fails at runtime.
> See the discussion on [https://github.com/scala/bug/issues/12533] for the exact 
> internal details from the Scala contributors, but the gist is that starting 
> with Scala 2.12.13, inner class visibility rules changed via 
> https://github.com/scala/scala/pull/9131 and it appears that Spark codegen 
> assumes they are public.
> In a complex project, the error looks like:
> {code:java}
> [error]
> Success(SparkFailures(NonEmpty[Unknown(org.apache.spark.SparkException: Job 
> aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent 
> failure: Lost task 1.0 in stage 2.0 (TID 3) (192.168.0.80 executor driver): 
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 63, Column 8: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 63, Column 8: Private member cannot be accessed from type 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection".
> [error]   at 
> org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
> [error]   at 
> org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
> [error]   at 
> org.sparkproject.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> [error]   at 
> org.sparkproject.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.waitForValue(LocalCache.java:3620)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache$Segment.waitForLoadingValue(LocalCache.java:2362)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2349)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> [error]   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1351)
> [error]   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:205)
> [error]   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:39)
> [error]   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1277)
> [error]   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1274)
> [error]   at 
> org.apache.spark.sql.execution.ObjectOperator$.deserializeRowToObject(objects.scala:147)
> [error]   at 
> org.apache.spark.sql.execution.AppendColumnsExec.$anonfun$doExecute$12(objects.scala:326)
> [error]   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
> [error]   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
> [error]   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> [error]   at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
> [error]   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
> [error]   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> [error]   at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
> [error]   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
> [error]   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> [error]   at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
> [error]   at 

[jira] [Commented] (SPARK-38161) when clean data hope to spilt one dataframe or dataset to two dataframe

2022-02-09 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489956#comment-17489956
 ] 

Hyukjin Kwon commented on SPARK-38161:
--

I guess you want an API like partition. Even if we implemented this within 
Spark, the actual implementation would be exactly the same as (a cached variant 
is sketched below):

{code}
df1 = dataframe.filter("column is null")
df2 = dataframe.filter("column is not null")
{code}
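If the main concern is scanning the source once per condition, a hedged sketch of a common workaround (not a new Spark API) is to cache the input and derive both frames from the cached copy:

{code}
// Assumes an existing DataFrame `dataframe` with a nullable column "c".
// Caching avoids re-reading the source for each of the two filters.
val cached = dataframe.cache()
val nulls    = cached.filter(cached("c").isNull)
val nonNulls = cached.filter(cached("c").isNotNull)
{code}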



> when clean data hope to spilt one dataframe or dataset  to two dataframe
> 
>
> Key: SPARK-38161
> URL: https://issues.apache.org/jira/browse/SPARK-38161
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: gaokui
>Priority: Major
>
> While doing data cleaning, I run into the following scenario:
> one column needs to be checked against an empty-or-null condition,
> so right now I do something similar to the following:
> df1 = dataframe.filter("column is null")
> df2 = dataframe.filter("column is not null")
> and then write df1 and df2 into HDFS parquet files.
> But when I have a thousand conditions, every job needs more stages.
> I hope a DataFrame can be filtered by one condition once (not twice) and 
> produce two DataFrames.






[jira] [Updated] (SPARK-38161) when clean data hope to spilt one dataframe or dataset to two dataframe

2022-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38161:
-
Component/s: SQL
 (was: Block Manager)

> when clean data hope to spilt one dataframe or dataset  to two dataframe
> 
>
> Key: SPARK-38161
> URL: https://issues.apache.org/jira/browse/SPARK-38161
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: gaokui
>Priority: Major
>
> While doing data cleaning, I run into the following scenario:
> one column needs to be checked against an empty-or-null condition,
> so right now I do something similar to the following:
> df1 = dataframe.filter("column is null")
> df2 = dataframe.filter("column is not null")
> and then write df1 and df2 into HDFS parquet files.
> But when I have a thousand conditions, every job needs more stages.
> I hope a DataFrame can be filtered by one condition once (not twice) and 
> produce two DataFrames.






[jira] [Updated] (SPARK-38063) Support SQL split_part function

2022-02-09 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-38063:
-
Description: 
`split_part()` is a commonly supported function by other systems such as 
Postgres and some other systems. The Spark equivalent  is 
`element_at(split(arg, delim), part)`



h5. Function Specification

h6. Syntax

{code:java}
split_part(str, delimiter, partNum)
{code}

h6. Arguments
{code:java}
str: string type
delimiter: string type
partNum: Integer type
{code}

h6. Note
{code:java}
1. This function splits `str` by `delimiter` and returns the requested part of 
the split (1-based). 
2. If any input parameter is NULL, returns NULL.
3. If the index is out of range of the split parts, returns NULL.
4. If `partNum` is 0, throws an error.
5. If `partNum` is negative, the parts are counted backward from the end of the 
string.
6. When `delimiter` is empty, `str` is considered not split, so there is just 
one split part. 
{code}

h6. Examples
{code:java}
> SELECT _FUNC_('11.12.13', '.', 3);
13
> SELECT _FUNC_(NULL, '.', 3);
NULL
> SELECT _FUNC_('11.12.13', '', 1);
'11.12.13'
{code}






  was:
`split_part()` is a commonly supported function by other systems such as 
Postgres and some other systems. The Spark equivalent  is 
`element_at(split(arg, delim), part)`



h5. Function Specificaiton

h6. Syntax

{code:java}
split_part(str, delimiter, partNum)
{code}

h6. Arguments
{code:java}
str: string type
delimiter: string type
partNum: Integer type
{code}

h6. Note
{code:java}
1. This function splits `str` by `delimiter` and return requested part of the 
split (1-based). 
2. If any input parameter is NULL, return NULL.
3. If  the index is out of range of split parts, returns null.
4. If `partNum` is 0, throws an error.
5. If `partNum` is negative, the parts are counted backward from the end of the 
string
6. when delimiter is empty, str is considered not split thus there is just 1 
split part. 
{code}

> SELECT _FUNC_('11.12.13', '.', 3);
13
> SELECT _FUNC_(NULL, '.', 3);
NULL
> SELECT _FUNC_('11.12.13', '', 1);
'11.12.13'
{code}







> Support SQL split_part function
> ---
>
> Key: SPARK-38063
> URL: https://issues.apache.org/jira/browse/SPARK-38063
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rui Wang
>Priority: Major
>
> `split_part()` is a function commonly supported by other systems such as 
> Postgres. The Spark equivalent is 
> `element_at(split(arg, delim), part)` (a DataFrame sketch of this form follows 
> the quoted description).
> h5. Function Specification
> h6. Syntax
> {code:java}
> split_part(str, delimiter, partNum)
> {code}
> h6. Arguments
> {code:java}
> str: string type
> delimiter: string type
> partNum: Integer type
> {code}
> h6. Note
> {code:java}
> 1. This function splits `str` by `delimiter` and returns the requested part of 
> the split (1-based). 
> 2. If any input parameter is NULL, returns NULL.
> 3. If the index is out of range of the split parts, returns NULL.
> 4. If `partNum` is 0, throws an error.
> 5. If `partNum` is negative, the parts are counted backward from the end of 
> the string.
> 6. When `delimiter` is empty, `str` is considered not split, so there is just 
> one split part. 
> {code}
> h6. Examples
> {code:java}
> > SELECT _FUNC_('11.12.13', '.', 3);
> 13
> > SELECT _FUNC_(NULL, '.', 3);
> NULL
> > SELECT _FUNC_('11.12.13', '', 1);
> '11.12.13'
> {code}
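Since the ticket points to the existing equivalent `element_at(split(arg, delim), part)`, here is a minimal sketch of that form in the DataFrame API (the DataFrame, column name, and object name are assumed for illustration):

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, element_at, split}

object SplitPartEquivalent {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("split-part").getOrCreate()
    import spark.implicits._
    val df = Seq("11.12.13").toDF("s")
    // split() takes a regex, hence the escaped dot; element_at on arrays is
    // 1-based, matching the 1-based partNum of the proposed split_part.
    df.select(element_at(split(col("s"), "\\."), 3).as("part3")).show()  // 13
    spark.stop()
  }
}
{code}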






[jira] [Updated] (SPARK-38063) Support SQL split_part function

2022-02-09 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-38063:
-
Description: 
`split_part()` is a commonly supported function by other systems such as 
Postgres and some other systems. The Spark equivalent  is 
`element_at(split(arg, delim), part)`



h5. Function Specificaiton

h6. Syntax

{code:java}
split_part(str, delimiter, partNum)
{code}

h6. Arguments
{code:java}
str: string type
delimiter: string type
partNum: Integer type
{code}

h6. Note
{code:java}
1. This function splits `str` by `delimiter` and return requested part of the 
split (1-based). 
2. If any input parameter is NULL, return NULL.
3. If  the index is out of range of split parts, returns null.
4. If `partNum` is 0, throws an error.
5. If `partNum` is negative, the parts are counted backward from the end of the 
string
6. when delimiter is empty, str is considered not split thus there is just 1 
split part. 
{code}

h6.Examples:
{code:java}
  > SELECT _FUNC_('11.12.13', '.', 3);
   13
  > SELECT _FUNC_(NULL, '.', 3);
  NULL
  > SELECT _FUNC_('11.12.13', '', 1);
  '11.12.13'
{code}






  was:
`split_part()` is a commonly supported function by other systems such as 
Postgres and some other systems. The Spark equivalent  is 
`element_at(split(arg, delim), part)`



h5. Function Specificaiton

h6. Syntax

{code:java}
`split_part(str, delimiter, partNum)`
{code}

h6. Arguments
{code:java}
str: string type
delimiter: string type
partNum: Integer type
{code}

h6. Note
{code:java}
1. This function splits `str` by `delimiter` and return requested part of the 
split (1-based). 
2. If any input parameter is NULL, return NULL.
3. If  the index is out of range of split parts, returns null.
4. If `partNum` is 0, throws an error.
5. If `partNum` is negative, the parts are counted backward from the
  end of the string
6. when delimiter is empty, str is considered not split thus there is just 1 
split part. 
{code}

h6.Examples:
{code:java}
  > SELECT _FUNC_('11.12.13', '.', 3);
   13
  > SELECT _FUNC_(NULL, '.', 3);
  NULL
  > SELECT _FUNC_('11.12.13', '', 1);
  '11.12.13'
{code}







> Support SQL split_part function
> ---
>
> Key: SPARK-38063
> URL: https://issues.apache.org/jira/browse/SPARK-38063
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rui Wang
>Priority: Major
>
> `split_part()` is a commonly supported function by other systems such as 
> Postgres and some other systems. The Spark equivalent  is 
> `element_at(split(arg, delim), part)`
> h5. Function Specificaiton
> h6. Syntax
> {code:java}
> split_part(str, delimiter, partNum)
> {code}
> h6. Arguments
> {code:java}
> str: string type
> delimiter: string type
> partNum: Integer type
> {code}
> h6. Note
> {code:java}
> 1. This function splits `str` by `delimiter` and return requested part of the 
> split (1-based). 
> 2. If any input parameter is NULL, return NULL.
> 3. If  the index is out of range of split parts, returns null.
> 4. If `partNum` is 0, throws an error.
> 5. If `partNum` is negative, the parts are counted backward from the end of 
> the string
> 6. when delimiter is empty, str is considered not split thus there is just 1 
> split part. 
> {code}
> h6.Examples:
> {code:java}
>   > SELECT _FUNC_('11.12.13', '.', 3);
>13
>   > SELECT _FUNC_(NULL, '.', 3);
>   NULL
>   > SELECT _FUNC_('11.12.13', '', 1);
>   '11.12.13'
> {code}






[jira] [Updated] (SPARK-38063) Support SQL split_part function

2022-02-09 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-38063:
-
Description: 
`split_part()` is a commonly supported function by other systems such as 
Postgres and some other systems. The Spark equivalent  is 
`element_at(split(arg, delim), part)`



h5. Function Specificaiton

h6. Syntax

{code:java}
split_part(str, delimiter, partNum)
{code}

h6. Arguments
{code:java}
str: string type
delimiter: string type
partNum: Integer type
{code}

h6. Note
{code:java}
1. This function splits `str` by `delimiter` and return requested part of the 
split (1-based). 
2. If any input parameter is NULL, return NULL.
3. If  the index is out of range of split parts, returns null.
4. If `partNum` is 0, throws an error.
5. If `partNum` is negative, the parts are counted backward from the end of the 
string
6. when delimiter is empty, str is considered not split thus there is just 1 
split part. 
{code}

> SELECT _FUNC_('11.12.13', '.', 3);
13
> SELECT _FUNC_(NULL, '.', 3);
NULL
> SELECT _FUNC_('11.12.13', '', 1);
'11.12.13'
{code}






  was:
`split_part()` is a commonly supported function by other systems such as 
Postgres and some other systems. The Spark equivalent  is 
`element_at(split(arg, delim), part)`



h5. Function Specificaiton

h6. Syntax

{code:java}
split_part(str, delimiter, partNum)
{code}

h6. Arguments
{code:java}
str: string type
delimiter: string type
partNum: Integer type
{code}

h6. Note
{code:java}
1. This function splits `str` by `delimiter` and return requested part of the 
split (1-based). 
2. If any input parameter is NULL, return NULL.
3. If  the index is out of range of split parts, returns null.
4. If `partNum` is 0, throws an error.
5. If `partNum` is negative, the parts are counted backward from the end of the 
string
6. when delimiter is empty, str is considered not split thus there is just 1 
split part. 
{code}

h6.Examples:
{code:java}
  > SELECT _FUNC_('11.12.13', '.', 3);
   13
  > SELECT _FUNC_(NULL, '.', 3);
  NULL
  > SELECT _FUNC_('11.12.13', '', 1);
  '11.12.13'
{code}







> Support SQL split_part function
> ---
>
> Key: SPARK-38063
> URL: https://issues.apache.org/jira/browse/SPARK-38063
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rui Wang
>Priority: Major
>
> `split_part()` is a commonly supported function by other systems such as 
> Postgres and some other systems. The Spark equivalent  is 
> `element_at(split(arg, delim), part)`
> h5. Function Specificaiton
> h6. Syntax
> {code:java}
> split_part(str, delimiter, partNum)
> {code}
> h6. Arguments
> {code:java}
> str: string type
> delimiter: string type
> partNum: Integer type
> {code}
> h6. Note
> {code:java}
> 1. This function splits `str` by `delimiter` and return requested part of the 
> split (1-based). 
> 2. If any input parameter is NULL, return NULL.
> 3. If  the index is out of range of split parts, returns null.
> 4. If `partNum` is 0, throws an error.
> 5. If `partNum` is negative, the parts are counted backward from the end of 
> the string
> 6. when delimiter is empty, str is considered not split thus there is just 1 
> split part. 
> {code}
> > SELECT _FUNC_('11.12.13', '.', 3);
> 13
> > SELECT _FUNC_(NULL, '.', 3);
> NULL
> > SELECT _FUNC_('11.12.13', '', 1);
> '11.12.13'
> {code}






[jira] [Updated] (SPARK-38063) Support SQL split_part function

2022-02-09 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-38063:
-
Description: 
`split_part()` is a commonly supported function by other systems such as 
Postgres and some other systems. The Spark equivalent  is 
`element_at(split(arg, delim), part)`



h5. Function Specificaiton

h6. Syntax

{code:java}
`split_part(str, delimiter, partNum)`
{code}

h6. Arguments
{code:java}
str: string type
delimiter: string type
partNum: Integer type
{code}

h6. Note
{code:java}
1. This function splits `str` by `delimiter` and return requested part of the 
split (1-based). 
2. If any input parameter is NULL, return NULL.
3. If  the index is out of range of split parts, returns null.
4. If `partNum` is 0, throws an error.
5. If `partNum` is negative, the parts are counted backward from the
  end of the string
6. when delimiter is empty, str is considered not split thus there is just 1 
split part. 
{code}

h6.Examples:
{code:java}
  > SELECT _FUNC_('11.12.13', '.', 3);
   13
  > SELECT _FUNC_(NULL, '.', 3);
  NULL
  > SELECT _FUNC_('11.12.13', '', 1);
  '11.12.13'
{code}






  was:
`split_part()` is a commonly supported function by other systems such as 
Postgres and some other systems. The Spark equivalent  is 
`element_at(split(arg, delim), part)`



h5. Function Specificaiton


{code:java}

`split_part(str, delimiter, partNum)`

str: string type
delimiter: string type
partNum: Integer type

1. This function splits `str` by `delimiter` and return requested part of the 
split (1-based). 
2. If any input parameter is NULL, return NULL.
3. If  the index is out of range of split parts, returns null.
4. If `partNum` is 0, throws an error.
5. If `partNum` is negative, the parts are counted backward from the
  end of the string
6. when delimiter is empty, str is considered not split thus there is just 1 
split part. 

Examples:
```
  > SELECT _FUNC_('11.12.13', '.', 3);
   13
  > SELECT _FUNC_(NULL, '.', 3);
  NULL
  > SELECT _FUNC_('11.12.13', '', 1);
  '11.12.13'
```
{code}






> Support SQL split_part function
> ---
>
> Key: SPARK-38063
> URL: https://issues.apache.org/jira/browse/SPARK-38063
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rui Wang
>Priority: Major
>
> `split_part()` is a commonly supported function by other systems such as 
> Postgres and some other systems. The Spark equivalent  is 
> `element_at(split(arg, delim), part)`
> h5. Function Specificaiton
> h6. Syntax
> {code:java}
> `split_part(str, delimiter, partNum)`
> {code}
> h6. Arguments
> {code:java}
> str: string type
> delimiter: string type
> partNum: Integer type
> {code}
> h6. Note
> {code:java}
> 1. This function splits `str` by `delimiter` and return requested part of the 
> split (1-based). 
> 2. If any input parameter is NULL, return NULL.
> 3. If  the index is out of range of split parts, returns null.
> 4. If `partNum` is 0, throws an error.
> 5. If `partNum` is negative, the parts are counted backward from the
>   end of the string
> 6. when delimiter is empty, str is considered not split thus there is just 1 
> split part. 
> {code}
> h6.Examples:
> {code:java}
>   > SELECT _FUNC_('11.12.13', '.', 3);
>13
>   > SELECT _FUNC_(NULL, '.', 3);
>   NULL
>   > SELECT _FUNC_('11.12.13', '', 1);
>   '11.12.13'
> {code}






[jira] [Updated] (SPARK-38063) Support SQL split_part function

2022-02-09 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-38063:
-
Description: 
`split_part()` is a commonly supported function by other systems such as 
Postgres and some other systems. The Spark equivalent  is 
`element_at(split(arg, delim), part)`



h5. Function Specificaiton


{code:java}

`split_part(str, delimiter, partNum)`

str: string type
delimiter: string type
partNum: Integer type

1. This function splits `str` by `delimiter` and return requested part of the 
split (1-based). 
2. If any input parameter is NULL, return NULL.
3. If  the index is out of range of split parts, returns null.
4. If `partNum` is 0, throws an error.
5. If `partNum` is negative, the parts are counted backward from the
  end of the string
6. when delimiter is empty, str is considered not split thus there is just 1 
split part. 

Examples:
```
  > SELECT _FUNC_('11.12.13', '.', 3);
   13
  > SELECT _FUNC_(NULL, '.', 3);
  NULL
  > SELECT _FUNC_('11.12.13', '', 1);
  '11.12.13'
```
{code}





  was:
`split_part()` is a commonly supported function by other systems such as 
Postgres and some other systems. The Spark equivalent  is 
`element_at(split(arg, delim), part)`



h5. Function Specificaiton


{code:java}

`split_part(str, delimiter, partNum)`

str: string type
delimiter: string type
partNum: Integer type

1. This function splits `str` by `delimiter` and return requested part of the 
split (1-based). 
2. If any input parameter is NULL, return NULL.
3. If  the index is out of range of split parts, returns null.
4. If `partNum` is 0, throws an error.
5. If `partNum` is negative, the parts are counted backward from the
  end of the string
6. when delimiter is empty, str is considered not split thus there is just 1 
split part. 
{code}


Examples:
```
  > SELECT _FUNC_('11.12.13', '.', 3);
   13
  > SELECT _FUNC_(NULL, '.', 3);
  NULL
  > SELECT _FUNC_('11.12.13', '', 1);
  '11.12.13'
```



> Support SQL split_part function
> ---
>
> Key: SPARK-38063
> URL: https://issues.apache.org/jira/browse/SPARK-38063
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rui Wang
>Priority: Major
>
> `split_part()` is a commonly supported function by other systems such as 
> Postgres and some other systems. The Spark equivalent  is 
> `element_at(split(arg, delim), part)`
> h5. Function Specificaiton
> {code:java}
> `split_part(str, delimiter, partNum)`
> str: string type
> delimiter: string type
> partNum: Integer type
> 1. This function splits `str` by `delimiter` and return requested part of the 
> split (1-based). 
> 2. If any input parameter is NULL, return NULL.
> 3. If  the index is out of range of split parts, returns null.
> 4. If `partNum` is 0, throws an error.
> 5. If `partNum` is negative, the parts are counted backward from the
>   end of the string
> 6. when delimiter is empty, str is considered not split thus there is just 1 
> split part. 
> Examples:
> ```
>   > SELECT _FUNC_('11.12.13', '.', 3);
>13
>   > SELECT _FUNC_(NULL, '.', 3);
>   NULL
>   > SELECT _FUNC_('11.12.13', '', 1);
>   '11.12.13'
> ```
> {code}






[jira] [Updated] (SPARK-38063) Support SQL split_part function

2022-02-09 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-38063:
-
Description: 
`split_part()` is a commonly supported function by other systems such as 
Postgres and some other systems. The Spark equivalent  is 
`element_at(split(arg, delim), part)`



h5. Function Specificaiton


{code:java}

`split_part(str, delimiter, partNum)`

str: string type
delimiter: string type
partNum: Integer type

1. This function splits `str` by `delimiter` and return requested part of the 
split (1-based). 
2. If any input parameter is NULL, return NULL.
3. If  the index is out of range of split parts, returns null.
4. If `partNum` is 0, throws an error.
5. If `partNum` is negative, the parts are counted backward from the
  end of the string
6. when delimiter is empty, str is considered not split thus there is just 1 
split part. 
{code}


Examples:
```
  > SELECT _FUNC_('11.12.13', '.', 3);
   13
  > SELECT _FUNC_(NULL, '.', 3);
  NULL
  > SELECT _FUNC_('11.12.13', '', 1);
  '11.12.13'
```


  was:
`split_part()` is a commonly supported function by other systems such as 
Postgres and some other systems. The Spark equivalent  is 
`element_at(split(arg, delim), part)`



The following demonstrates more about the new function:


{code:java}

`split_part(str, delimiter, partNum)`

str: string type
delimiter: string type
partNum: Integer type

1. This function splits `str` by `delimiter` and return requested part of the 
split (1-based). 
2. If any input parameter is NULL, return NULL.
3. If  the index is out of range of split parts, returns null.
4. If `partNum` is 0, throws an error.
5. If `partNum` is negative, the parts are counted backward from the
  end of the string
6. when delimiter is empty, str is considered not split thus there is just 1 
split part. 
{code}


Examples:
```
  > SELECT _FUNC_('11.12.13', '.', 3);
   13
  > SELECT _FUNC_(NULL, '.', 3);
  NULL
  > SELECT _FUNC_('11.12.13', '', 1);
  '11.12.13'
```



> Support SQL split_part function
> ---
>
> Key: SPARK-38063
> URL: https://issues.apache.org/jira/browse/SPARK-38063
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rui Wang
>Priority: Major
>
> `split_part()` is a commonly supported function by other systems such as 
> Postgres and some other systems. The Spark equivalent  is 
> `element_at(split(arg, delim), part)`
> h5. Function Specificaiton
> {code:java}
> `split_part(str, delimiter, partNum)`
> str: string type
> delimiter: string type
> partNum: Integer type
> 1. This function splits `str` by `delimiter` and return requested part of the 
> split (1-based). 
> 2. If any input parameter is NULL, return NULL.
> 3. If  the index is out of range of split parts, returns null.
> 4. If `partNum` is 0, throws an error.
> 5. If `partNum` is negative, the parts are counted backward from the
>   end of the string
> 6. when delimiter is empty, str is considered not split thus there is just 1 
> split part. 
> {code}
> Examples:
> ```
>   > SELECT _FUNC_('11.12.13', '.', 3);
>13
>   > SELECT _FUNC_(NULL, '.', 3);
>   NULL
>   > SELECT _FUNC_('11.12.13', '', 1);
>   '11.12.13'
> ```






[jira] [Updated] (SPARK-38063) Support SQL split_part function

2022-02-09 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-38063:
-
Description: 
`split_part()` is a commonly supported function by other systems such as 
Postgres and some other systems. The Spark equivalent  is 
`element_at(split(arg, delim), part)`



The following demonstrates more about the new function:


{code:java}

`split_part(str, delimiter, partNum)`

str: string type
delimiter: string type
partNum: Integer type

1. This function splits `str` by `delimiter` and return requested part of the 
split (1-based). 
2. If any input parameter is NULL, return NULL.
3. If  the index is out of range of split parts, returns null.
4. If `partNum` is 0, throws an error.
5. If `partNum` is negative, the parts are counted backward from the
  end of the string
6. when delimiter is empty, str is considered not split thus there is just 1 
split part. 
{code}


Examples:
```
  > SELECT _FUNC_('11.12.13', '.', 3);
   13
  > SELECT _FUNC_(NULL, '.', 3);
  NULL
  > SELECT _FUNC_('11.12.13', '', 1);
  '11.12.13'
```


  was:
`split_part()` is a commonly supported function by other systems such as 
Postgres and some other systems. The Spark equivalent  is 
`element_at(split(arg, delim), part)`



The following demonstrates more about the new function:

`split_part(str, delimiter, partNum)`

This function splits `str` by `delimiter` and return requested part of the 
split (1-based). If any input parameter is NULL, return NULL.

`str` and `delimiter` are the same type as `string`. `partNum` is `integer` type

Examples:
```
  > SELECT _FUNC_('11.12.13', '.', 3);
   13
  > SELECT _FUNC_(NULL, '.', 3);
  NULL
```



> Support SQL split_part function
> ---
>
> Key: SPARK-38063
> URL: https://issues.apache.org/jira/browse/SPARK-38063
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rui Wang
>Priority: Major
>
> `split_part()` is a commonly supported function by other systems such as 
> Postgres and some other systems. The Spark equivalent  is 
> `element_at(split(arg, delim), part)`
> The following demonstrates more about the new function:
> {code:java}
> `split_part(str, delimiter, partNum)`
> str: string type
> delimiter: string type
> partNum: Integer type
> 1. This function splits `str` by `delimiter` and return requested part of the 
> split (1-based). 
> 2. If any input parameter is NULL, return NULL.
> 3. If  the index is out of range of split parts, returns null.
> 4. If `partNum` is 0, throws an error.
> 5. If `partNum` is negative, the parts are counted backward from the
>   end of the string
> 6. when delimiter is empty, str is considered not split thus there is just 1 
> split part. 
> {code}
> Examples:
> ```
>   > SELECT _FUNC_('11.12.13', '.', 3);
>13
>   > SELECT _FUNC_(NULL, '.', 3);
>   NULL
>   > SELECT _FUNC_('11.12.13', '', 1);
>   '11.12.13'
> ```






[jira] [Assigned] (SPARK-38170) Fix //sql/hive-thriftserver:org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite-hive-2.3__hadoop-2.7 in ANSI

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38170:


Assignee: Apache Spark

> Fix 
> //sql/hive-thriftserver:org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite-hive-2.3__hadoop-2.7
>  in ANSI
> ---
>
> Key: SPARK-38170
> URL: https://issues.apache.org/jira/browse/SPARK-38170
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rui Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-38170) Fix //sql/hive-thriftserver:org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite-hive-2.3__hadoop-2.7 in ANSI

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38170:


Assignee: (was: Apache Spark)

> Fix 
> //sql/hive-thriftserver:org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite-hive-2.3__hadoop-2.7
>  in ANSI
> ---
>
> Key: SPARK-38170
> URL: https://issues.apache.org/jira/browse/SPARK-38170
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rui Wang
>Priority: Major
>







[jira] [Commented] (SPARK-38170) Fix //sql/hive-thriftserver:org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite-hive-2.3__hadoop-2.7 in ANSI

2022-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489952#comment-17489952
 ] 

Apache Spark commented on SPARK-38170:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/35472

> Fix 
> //sql/hive-thriftserver:org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite-hive-2.3__hadoop-2.7
>  in ANSI
> ---
>
> Key: SPARK-38170
> URL: https://issues.apache.org/jira/browse/SPARK-38170
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rui Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-38146) UDAF fails to aggregate TIMESTAMP_NTZ column

2022-02-09 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-38146:
--

Assignee: Bruce Robbins

> UDAF fails to aggregate TIMESTAMP_NTZ column
> 
>
> Key: SPARK-38146
> URL: https://issues.apache.org/jira/browse/SPARK-38146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>
> When using a UDAF against unsafe rows containing a TIMESTAMP_NTZ column, 
> Spark throws the error:
> {noformat}
> 22/02/08 18:05:12 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.UnsupportedOperationException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.update(UnsafeRow.java:218)
>  ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15(udaf.scala:217)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15$adapted(udaf.scala:215)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.MutableAggregationBufferImpl.update(udaf.scala:272)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.$anonfun$update$1(:46)
>  ~[scala-library.jar:?]
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) 
> ~[scala-library.jar:?]
>   at 
> $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.update(:45)
>  ~[scala-library.jar:?]
>   at 
> org.apache.spark.sql.execution.aggregate.ScalaUDAF.update(udaf.scala:458) 
> ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2(AggregationIterator.scala:197)
>  ~[spark-sql_2.12-3.3.0-SNAPSHO
> {noformat}
> This is because {{BufferSetterGetterUtils#createSetters}} does not have a 
> case statement for {{TimestampNTZType}}, so it generates a function that 
> tries to call {{UnsafeRow.update}}, which throws an 
> {{UnsupportedOperationException}}.
> This reproduction example is mostly taken from {{AggregationQuerySuite}}:
> {noformat}
> import org.apache.spark.sql.expressions.{MutableAggregationBuffer, 
> UserDefinedAggregateFunction}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> class ScalaAggregateFunction(schema: StructType) extends 
> UserDefinedAggregateFunction {
>   def inputSchema: StructType = schema
>   def bufferSchema: StructType = schema
>   def dataType: DataType = schema
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> (0 until schema.length).foreach { i =>
>   buffer.update(i, null)
> }
>   }
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> if (!input.isNullAt(0) && input.getInt(0) == 50) {
>   (0 until schema.length).foreach { i =>
> buffer.update(i, input.get(i))
>   }
> }
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> if (!buffer2.isNullAt(0) && buffer2.getInt(0) == 50) {
>   (0 until schema.length).foreach { i =>
> buffer1.update(i, buffer2.get(i))
>   }
> }
>   }
>   def evaluate(buffer: Row): Any = {
> Row.fromSeq(buffer.toSeq)
>   }
> }
> import scala.util.Random
> import java.time.LocalDateTime
> val r = new Random(65676563L)
> val data = Seq.tabulate(50) { x =>
>   Row((x + 1).toInt, (x + 2).toDouble, (x + 2).toLong, 
> LocalDateTime.parse("2100-01-01T01:33:33.123").minusDays(x + 1))
> }
> val schema = StructType.fromDDL("id int, col1 double, col2 bigint, col3 
> timestamp_ntz")
> val rdd = spark.sparkContext.parallelize(data, 1)
> val df = spark.createDataFrame(rdd, schema)
> val udaf = new ScalaAggregateFunction(df.schema)
> val allColumns = df.schema.fields.map(f => col(f.name))
> df.groupBy().agg(udaf(allColumns: _*)).show(false)
> {noformat}






[jira] [Resolved] (SPARK-38146) UDAF fails to aggregate TIMESTAMP_NTZ column

2022-02-09 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-38146.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35470
[https://github.com/apache/spark/pull/35470]

> UDAF fails to aggregate TIMESTAMP_NTZ column
> 
>
> Key: SPARK-38146
> URL: https://issues.apache.org/jira/browse/SPARK-38146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
> Fix For: 3.3.0
>
>
> When using a UDAF against unsafe rows containing a TIMESTAMP_NTZ column, 
> Spark throws the error:
> {noformat}
> 22/02/08 18:05:12 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.UnsupportedOperationException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.update(UnsafeRow.java:218)
>  ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15(udaf.scala:217)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15$adapted(udaf.scala:215)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.MutableAggregationBufferImpl.update(udaf.scala:272)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.$anonfun$update$1(:46)
>  ~[scala-library.jar:?]
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) 
> ~[scala-library.jar:?]
>   at 
> $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.update(:45)
>  ~[scala-library.jar:?]
>   at 
> org.apache.spark.sql.execution.aggregate.ScalaUDAF.update(udaf.scala:458) 
> ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2(AggregationIterator.scala:197)
>  ~[spark-sql_2.12-3.3.0-SNAPSHO
> {noformat}
> This is because {{BufferSetterGetterUtils#createSetters}} does not have a 
> case statement for {{TimestampNTZType}}, so it generates a function that 
> tries to call {{UnsafeRow.update}}, which throws an 
> {{UnsupportedOperationException}}.
> This reproduction example is mostly taken from {{AggregationQuerySuite}}:
> {noformat}
> import org.apache.spark.sql.expressions.{MutableAggregationBuffer, 
> UserDefinedAggregateFunction}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> class ScalaAggregateFunction(schema: StructType) extends 
> UserDefinedAggregateFunction {
>   def inputSchema: StructType = schema
>   def bufferSchema: StructType = schema
>   def dataType: DataType = schema
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> (0 until schema.length).foreach { i =>
>   buffer.update(i, null)
> }
>   }
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> if (!input.isNullAt(0) && input.getInt(0) == 50) {
>   (0 until schema.length).foreach { i =>
> buffer.update(i, input.get(i))
>   }
> }
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> if (!buffer2.isNullAt(0) && buffer2.getInt(0) == 50) {
>   (0 until schema.length).foreach { i =>
> buffer1.update(i, buffer2.get(i))
>   }
> }
>   }
>   def evaluate(buffer: Row): Any = {
> Row.fromSeq(buffer.toSeq)
>   }
> }
> import scala.util.Random
> import java.time.LocalDateTime
> val r = new Random(65676563L)
> val data = Seq.tabulate(50) { x =>
>   Row((x + 1).toInt, (x + 2).toDouble, (x + 2).toLong, 
> LocalDateTime.parse("2100-01-01T01:33:33.123").minusDays(x + 1))
> }
> val schema = StructType.fromDDL("id int, col1 double, col2 bigint, col3 
> timestamp_ntz")
> val rdd = spark.sparkContext.parallelize(data, 1)
> val df = spark.createDataFrame(rdd, schema)
> val udaf = new ScalaAggregateFunction(df.schema)
> val allColumns = df.schema.fields.map(f => col(f.name))
> df.groupBy().agg(udaf(allColumns: _*)).show(false)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38170) Fix //sql/hive-thriftserver:org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite-hive-2.3__hadoop-2.7 in ANSI

2022-02-09 Thread Rui Wang (Jira)
Rui Wang created SPARK-38170:


 Summary: Fix 
//sql/hive-thriftserver:org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite-hive-2.3__hadoop-2.7
 in ANSI
 Key: SPARK-38170
 URL: https://issues.apache.org/jira/browse/SPARK-38170
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Rui Wang






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38165) private classes fail at runtime in scala 2.12.13+

2022-02-09 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489945#comment-17489945
 ] 

Hyukjin Kwon commented on SPARK-38165:
--

Interesting. Spark already uses Scala 2.12.15 even in Spark 3.2.1.

> private classes fail at runtime in scala 2.12.13+
> -
>
> Key: SPARK-38165
> URL: https://issues.apache.org/jira/browse/SPARK-38165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
> Environment: Tested using JVM 8 and 11 on Scala versions 2.12.12 
> (works), 2.12.13 to 2.12.15, and 2.13.7 to 2.13.8
>Reporter: Johnny Everson
>Priority: Major
>
> h2. reproduction steps
> {code:java}
> git clone g...@github.com:everson/spark-codegen-bug.git
> sbt +test
> {code}
> h2. problem
> Starting with Scala 2.12.13, Spark code (tried 3.1.x and 3.2.x versions) 
> referring to case class members fails at runtime.
> See the discussion on [https://github.com/scala/bug/issues/12533] for the exact 
> internal details from the Scala contributors, but the gist is that starting with 
> Scala 2.12.13, inner class visibility rules changed via 
> https://github.com/scala/scala/pull/9131 and it appears that Spark codegen 
> assumes they are public.
> In a complex project, the error looks like:
> {code:java}
> [error]
> Success(SparkFailures(NonEmpty[Unknown(org.apache.spark.SparkException: Job 
> aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent 
> failure: Lost task 1.0 in stage 2.0 (TID 3) (192.168.0.80 executor driver): 
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 63, Column 8: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 63, Column 8: Private member cannot be accessed from type 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection".
> [error]   at 
> org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
> [error]   at 
> org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
> [error]   at 
> org.sparkproject.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> [error]   at 
> org.sparkproject.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.waitForValue(LocalCache.java:3620)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache$Segment.waitForLoadingValue(LocalCache.java:2362)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2349)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> [error]   at 
> org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> [error]   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1351)
> [error]   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:205)
> [error]   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:39)
> [error]   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1277)
> [error]   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1274)
> [error]   at 
> org.apache.spark.sql.execution.ObjectOperator$.deserializeRowToObject(objects.scala:147)
> [error]   at 
> org.apache.spark.sql.execution.AppendColumnsExec.$anonfun$doExecute$12(objects.scala:326)
> [error]   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
> [error]   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
> [error]   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> [error]   at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
> [error]   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
> [error]   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> [error]   at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
> [error]   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
> [error]   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> [error]   at 
> 

[jira] [Resolved] (SPARK-38163) Preserve the error class of `AnalysisException` while constructing of function builder

2022-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38163.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35467
[https://github.com/apache/spark/pull/35467]

> Preserve the error class of `AnalysisException` while constructing of 
> function builder
> --
>
> Key: SPARK-38163
> URL: https://issues.apache.org/jira/browse/SPARK-38163
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.3.0
>
>
> When the cause exception is `AnalysisException` at
> https://github.com/apache/spark/blob/9c02dd4035c9412ca03e5a5f4721ee223953c004/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L132,
>  Spark loses the information about the error class. We need to preserve it.
> The example below demonstrates the issue:
> {code:scala}
> scala> try { sql("select format_string('%0$s', 'Hello')") } catch { case e: 
> org.apache.spark.sql.AnalysisException => println(e.getErrorClass) }
> null
> {code}
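
A conceptual sketch of what the fix needs to do follows. The exception type below is a stand-in invented purely for illustration, not Spark's {{AnalysisException}} API: the point is that when the builder re-wraps a failure, the original error class should be carried over rather than dropped.

{code:scala}
// Hypothetical exception type used only to illustrate the idea.
case class ErrorWithClass(message: String, errorClass: Option[String])
  extends Exception(message)

def buildOrRewrap[T](build: => T): T =
  try build catch {
    case e: ErrorWithClass =>
      // Preserve e.errorClass instead of discarding it while re-wrapping.
      throw ErrorWithClass(s"Invalid arguments: ${e.message}", e.errorClass)
  }
{code}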



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38113) Use error classes in the execution errors of pivoting

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38113:


Assignee: (was: Apache Spark)

> Use error classes in the execution errors of pivoting
> -
>
> Key: SPARK-38113
> URL: https://issues.apache.org/jira/browse/SPARK-38113
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * repeatedPivotsUnsupportedError
> * pivotNotAfterGroupByUnsupportedError
> to use error classes. Throw an implementation of SparkThrowable, and write 
> a test for every error in QueryExecutionErrorsSuite.
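
As a rough, hypothetical sketch of what "use error classes" means here (the names and shapes below are invented for illustration and are not Spark's actual SparkThrowable definitions): the thrown exception should expose a stable, machine-readable error class in addition to its message, so tests can assert on the class rather than on the message text.

{code:scala}
// Illustrative only: a home-grown trait standing in for the error-class contract.
trait HasErrorClass { def errorClass: String }

class RepeatedPivotsException
  extends UnsupportedOperationException("Repeated PIVOT clauses are not supported")
  with HasErrorClass {
  override val errorClass: String = "UNSUPPORTED_FEATURE"  // hypothetical class name
}
{code}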



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38113) Use error classes in the execution errors of pivoting

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38113:


Assignee: Apache Spark

> Use error classes in the execution errors of pivoting
> -
>
> Key: SPARK-38113
> URL: https://issues.apache.org/jira/browse/SPARK-38113
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * repeatedPivotsUnsupportedError
> * pivotNotAfterGroupByUnsupportedError
> to use error classes. Throw an implementation of SparkThrowable, and write 
> a test for every error in QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38113) Use error classes in the execution errors of pivoting

2022-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489920#comment-17489920
 ] 

Apache Spark commented on SPARK-38113:
--

User 'ivoson' has created a pull request for this issue:
https://github.com/apache/spark/pull/35466

> Use error classes in the execution errors of pivoting
> -
>
> Key: SPARK-38113
> URL: https://issues.apache.org/jira/browse/SPARK-38113
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * repeatedPivotsUnsupportedError
> * pivotNotAfterGroupByUnsupportedError
> to use error classes. Throw an implementation of SparkThrowable, and write 
> a test for every error in QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37406) Inline type hints for python/pyspark/ml/fpm.py

2022-02-09 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-37406.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35407
https://github.com/apache/spark/pull/35407

> Inline type hints for python/pyspark/ml/fpm.py
> --
>
> Key: SPARK-37406
> URL: https://issues.apache.org/jira/browse/SPARK-37406
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> Inline type hints from python/pyspark/ml/fpm.pyi to python/pyspark/ml/fpm.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38157) Fix /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite under ANSI mode

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38157:


Assignee: Apache Spark

> Fix 
> /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite
>  under ANSI mode
> 
>
> Key: SPARK-38157
> URL: https://issues.apache.org/jira/browse/SPARK-38157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Assignee: Apache Spark
>Priority: Major
>
> ThriftServerQueryTestSuite will fail on {{timestampNTZ/timestamp.sql}} when 
> ANSI mode is on by default. This is because {{timestampNTZ/timestamp.sql}} 
> only works with ANSI off according to the golden result file, but neither 
> ThriftServerQueryTestSuite nor the timestamp.sql test overrides the 
> default ANSI setting.
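
A minimal sketch of the kind of fix implied here (assuming a {{SparkSession}} named {{spark}} is in scope; the actual test-suite wiring is omitted): pin ANSI off around the run of this one input file so the golden result file stays valid regardless of the session default.

{code:scala}
// Save, force, and restore spark.sql.ansi.enabled around the ANSI-off-only test.
val previous = spark.conf.get("spark.sql.ansi.enabled", "false")
spark.conf.set("spark.sql.ansi.enabled", "false")
try {
  // run timestampNTZ/timestamp.sql and compare against its golden result file
} finally {
  spark.conf.set("spark.sql.ansi.enabled", previous)
}
{code}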



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38157) Fix /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite under ANSI mode

2022-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489910#comment-17489910
 ] 

Apache Spark commented on SPARK-38157:
--

User 'anchovYu' has created a pull request for this issue:
https://github.com/apache/spark/pull/35471

> Fix 
> /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite
>  under ANSI mode
> 
>
> Key: SPARK-38157
> URL: https://issues.apache.org/jira/browse/SPARK-38157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Priority: Major
>
> ThriftServerQueryTestSuite will fail on {{timestampNTZ/timestamp.sql}} when 
> ANSI mode is on by default. This is because {{timestampNTZ/timestamp.sql}} 
> only works with ANSI off according to the golden result file, but neither 
> ThriftServerQueryTestSuite nor the timestamp.sql test overrides the 
> default ANSI setting.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38157) Fix /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite under ANSI mode

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38157:


Assignee: (was: Apache Spark)

> Fix 
> /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite
>  under ANSI mode
> 
>
> Key: SPARK-38157
> URL: https://issues.apache.org/jira/browse/SPARK-38157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Priority: Major
>
> ThriftServerQueryTestSuite will fail on {{timestampNTZ/timestamp.sql}} when 
> ANSI mode is on by default. This is because {{timestampNTZ/timestamp.sql}} 
> only works with ANSI off according to the golden result file, but neither 
> ThriftServerQueryTestSuite nor the timestamp.sql test overrides the 
> default ANSI setting.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38157) Fix /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite under ANSI mode

2022-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489909#comment-17489909
 ] 

Apache Spark commented on SPARK-38157:
--

User 'anchovYu' has created a pull request for this issue:
https://github.com/apache/spark/pull/35471

> Fix 
> /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite
>  under ANSI mode
> 
>
> Key: SPARK-38157
> URL: https://issues.apache.org/jira/browse/SPARK-38157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Priority: Major
>
> ThriftServerQueryTestSuite will fail on {{timestampNTZ/timestamp.sql}} when 
> ANSI mode is on by default. This is because {{timestampNTZ/timestamp.sql}} 
> only works with ANSI off according to the golden result file, but neither 
> ThriftServerQueryTestSuite nor the timestamp.sql test overrides the 
> default ANSI setting.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37406) Inline type hints for python/pyspark/ml/fpm.py

2022-02-09 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-37406:
--

Assignee: Maciej Szymkiewicz

> Inline type hints for python/pyspark/ml/fpm.py
> --
>
> Key: SPARK-37406
> URL: https://issues.apache.org/jira/browse/SPARK-37406
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/fpm.pyi to python/pyspark/ml/fpm.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38164) New SQL function: try_subtract and try_multiply

2022-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38164.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35461
[https://github.com/apache/spark/pull/35461]

> New SQL function: try_subtract and try_multiply
> ---
>
> Key: SPARK-38164
> URL: https://issues.apache.org/jira/browse/SPARK-38164
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38139) ml.recommendation.ALS doctests failures

2022-02-09 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489884#comment-17489884
 ] 

Hyukjin Kwon commented on SPARK-38139:
--

yup agree

> ml.recommendation.ALS doctests failures
> ---
>
> Key: SPARK-38139
> URL: https://issues.apache.org/jira/browse/SPARK-38139
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> In my dev setups, the ml.recommendation:ALS test consistently converges to a value 
> lower than expected and fails with:
> {code:python}
> File "/path/to/spark/python/pyspark/ml/recommendation.py", line 322, in 
> __main__.ALS
> Failed example:
> predictions[0]
> Expected:
> Row(user=0, item=2, newPrediction=0.69291...)
> Got:
> Row(user=0, item=2, newPrediction=0.6929099559783936)
> {code}
> I can correct for that, but it creates some noise, so if anyone else 
> experiences this, we could drop a digit from the expected results:
> {code}
> diff --git a/python/pyspark/ml/recommendation.py 
> b/python/pyspark/ml/recommendation.py
> index f0628fb922..b8e2a6097d 100644
> --- a/python/pyspark/ml/recommendation.py
> +++ b/python/pyspark/ml/recommendation.py
> @@ -320,7 +320,7 @@ class ALS(JavaEstimator, _ALSParams, JavaMLWritable, 
> JavaMLReadable):
>  >>> test = spark.createDataFrame([(0, 2), (1, 0), (2, 0)], ["user", 
> "item"])
>  >>> predictions = sorted(model.transform(test).collect(), key=lambda r: 
> r[0])
>  >>> predictions[0]
> -Row(user=0, item=2, newPrediction=0.69291...)
> +Row(user=0, item=2, newPrediction=0.6929...)
>  >>> predictions[1]
>  Row(user=1, item=0, newPrediction=3.47356...)
>  >>> predictions[2]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37814) Migrating from log4j 1 to log4j 2

2022-02-09 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489880#comment-17489880
 ] 

Dongjoon Hyun commented on SPARK-37814:
---

[~kabhwan]. Yes, it could be. Let's wait and see how the actual Apache 
Hadoop release handles it.

> Migrating from log4j 1 to log4j 2
> -
>
> Key: SPARK-37814
> URL: https://issues.apache.org/jira/browse/SPARK-37814
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0
>
>
> This is umbrella ticket for all tasks related to migrating to log4j2.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38146) UDAF fails to aggregate TIMESTAMP_NTZ column

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38146:


Assignee: (was: Apache Spark)

> UDAF fails to aggregate TIMESTAMP_NTZ column
> 
>
> Key: SPARK-38146
> URL: https://issues.apache.org/jira/browse/SPARK-38146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
>
> When using a UDAF against unsafe rows containing a TIMESTAMP_NTZ column, 
> Spark throws the error:
> {noformat}
> 22/02/08 18:05:12 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.UnsupportedOperationException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.update(UnsafeRow.java:218)
>  ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15(udaf.scala:217)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15$adapted(udaf.scala:215)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.MutableAggregationBufferImpl.update(udaf.scala:272)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.$anonfun$update$1(:46)
>  ~[scala-library.jar:?]
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) 
> ~[scala-library.jar:?]
>   at 
> $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.update(:45)
>  ~[scala-library.jar:?]
>   at 
> org.apache.spark.sql.execution.aggregate.ScalaUDAF.update(udaf.scala:458) 
> ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2(AggregationIterator.scala:197)
>  ~[spark-sql_2.12-3.3.0-SNAPSHO
> {noformat}
> This is because {{BufferSetterGetterUtils#createSetters}} does not have a 
> case statement for {{TimestampNTZType}}, so it generates a function that 
> tries to call {{UnsafeRow.update}}, which throws an 
> {{UnsupportedOperationException}}.
> This reproduction example is mostly taken from {{AggregationQuerySuite}}:
> {noformat}
> import org.apache.spark.sql.expressions.{MutableAggregationBuffer, 
> UserDefinedAggregateFunction}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> class ScalaAggregateFunction(schema: StructType) extends 
> UserDefinedAggregateFunction {
>   def inputSchema: StructType = schema
>   def bufferSchema: StructType = schema
>   def dataType: DataType = schema
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> (0 until schema.length).foreach { i =>
>   buffer.update(i, null)
> }
>   }
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> if (!input.isNullAt(0) && input.getInt(0) == 50) {
>   (0 until schema.length).foreach { i =>
> buffer.update(i, input.get(i))
>   }
> }
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> if (!buffer2.isNullAt(0) && buffer2.getInt(0) == 50) {
>   (0 until schema.length).foreach { i =>
> buffer1.update(i, buffer2.get(i))
>   }
> }
>   }
>   def evaluate(buffer: Row): Any = {
> Row.fromSeq(buffer.toSeq)
>   }
> }
> import scala.util.Random
> import java.time.LocalDateTime
> val r = new Random(65676563L)
> val data = Seq.tabulate(50) { x =>
>   Row((x + 1).toInt, (x + 2).toDouble, (x + 2).toLong, 
> LocalDateTime.parse("2100-01-01T01:33:33.123").minusDays(x + 1))
> }
> val schema = StructType.fromDDL("id int, col1 double, col2 bigint, col3 
> timestamp_ntz")
> val rdd = spark.sparkContext.parallelize(data, 1)
> val df = spark.createDataFrame(rdd, schema)
> val udaf = new ScalaAggregateFunction(df.schema)
> val allColumns = df.schema.fields.map(f => col(f.name))
> df.groupBy().agg(udaf(allColumns: _*)).show(false)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38146) UDAF fails to aggregate TIMESTAMP_NTZ column

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38146:


Assignee: Apache Spark

> UDAF fails to aggregate TIMESTAMP_NTZ column
> 
>
> Key: SPARK-38146
> URL: https://issues.apache.org/jira/browse/SPARK-38146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>
> When using a UDAF against unsafe rows containing a TIMESTAMP_NTZ column, 
> Spark throws the error:
> {noformat}
> 22/02/08 18:05:12 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.UnsupportedOperationException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.update(UnsafeRow.java:218)
>  ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15(udaf.scala:217)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15$adapted(udaf.scala:215)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.MutableAggregationBufferImpl.update(udaf.scala:272)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.$anonfun$update$1(:46)
>  ~[scala-library.jar:?]
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) 
> ~[scala-library.jar:?]
>   at 
> $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.update(:45)
>  ~[scala-library.jar:?]
>   at 
> org.apache.spark.sql.execution.aggregate.ScalaUDAF.update(udaf.scala:458) 
> ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2(AggregationIterator.scala:197)
>  ~[spark-sql_2.12-3.3.0-SNAPSHO
> {noformat}
> This is because {{BufferSetterGetterUtils#createSetters}} does not have a 
> case statement for {{TimestampNTZType}}, so it generates a function that 
> tries to call {{UnsafeRow.update}}, which throws an 
> {{UnsupportedOperationException}}.
> This reproduction example is mostly taken from {{AggregationQuerySuite}}:
> {noformat}
> import org.apache.spark.sql.expressions.{MutableAggregationBuffer, 
> UserDefinedAggregateFunction}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> class ScalaAggregateFunction(schema: StructType) extends 
> UserDefinedAggregateFunction {
>   def inputSchema: StructType = schema
>   def bufferSchema: StructType = schema
>   def dataType: DataType = schema
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> (0 until schema.length).foreach { i =>
>   buffer.update(i, null)
> }
>   }
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> if (!input.isNullAt(0) && input.getInt(0) == 50) {
>   (0 until schema.length).foreach { i =>
> buffer.update(i, input.get(i))
>   }
> }
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> if (!buffer2.isNullAt(0) && buffer2.getInt(0) == 50) {
>   (0 until schema.length).foreach { i =>
> buffer1.update(i, buffer2.get(i))
>   }
> }
>   }
>   def evaluate(buffer: Row): Any = {
> Row.fromSeq(buffer.toSeq)
>   }
> }
> import scala.util.Random
> import java.time.LocalDateTime
> val r = new Random(65676563L)
> val data = Seq.tabulate(50) { x =>
>   Row((x + 1).toInt, (x + 2).toDouble, (x + 2).toLong, 
> LocalDateTime.parse("2100-01-01T01:33:33.123").minusDays(x + 1))
> }
> val schema = StructType.fromDDL("id int, col1 double, col2 bigint, col3 
> timestamp_ntz")
> val rdd = spark.sparkContext.parallelize(data, 1)
> val df = spark.createDataFrame(rdd, schema)
> val udaf = new ScalaAggregateFunction(df.schema)
> val allColumns = df.schema.fields.map(f => col(f.name))
> df.groupBy().agg(udaf(allColumns: _*)).show(false)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38146) UDAF fails to aggregate TIMESTAMP_NTZ column

2022-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489878#comment-17489878
 ] 

Apache Spark commented on SPARK-38146:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/35470

> UDAF fails to aggregate TIMESTAMP_NTZ column
> 
>
> Key: SPARK-38146
> URL: https://issues.apache.org/jira/browse/SPARK-38146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
>
> When using a UDAF against unsafe rows containing a TIMESTAMP_NTZ column, 
> Spark throws the error:
> {noformat}
> 22/02/08 18:05:12 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.UnsupportedOperationException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.update(UnsafeRow.java:218)
>  ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15(udaf.scala:217)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15$adapted(udaf.scala:215)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.MutableAggregationBufferImpl.update(udaf.scala:272)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.$anonfun$update$1(:46)
>  ~[scala-library.jar:?]
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) 
> ~[scala-library.jar:?]
>   at 
> $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.update(:45)
>  ~[scala-library.jar:?]
>   at 
> org.apache.spark.sql.execution.aggregate.ScalaUDAF.update(udaf.scala:458) 
> ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2(AggregationIterator.scala:197)
>  ~[spark-sql_2.12-3.3.0-SNAPSHO
> {noformat}
> This is because {{BufferSetterGetterUtils#createSetters}} does not have a 
> case statement for {{TimestampNTZType}}, so it generates a function that 
> tries to call {{UnsafeRow.update}}, which throws an 
> {{UnsupportedOperationException}}.
> This reproduction example is mostly taken from {{AggregationQuerySuite}}:
> {noformat}
> import org.apache.spark.sql.expressions.{MutableAggregationBuffer, 
> UserDefinedAggregateFunction}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> class ScalaAggregateFunction(schema: StructType) extends 
> UserDefinedAggregateFunction {
>   def inputSchema: StructType = schema
>   def bufferSchema: StructType = schema
>   def dataType: DataType = schema
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> (0 until schema.length).foreach { i =>
>   buffer.update(i, null)
> }
>   }
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> if (!input.isNullAt(0) && input.getInt(0) == 50) {
>   (0 until schema.length).foreach { i =>
> buffer.update(i, input.get(i))
>   }
> }
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> if (!buffer2.isNullAt(0) && buffer2.getInt(0) == 50) {
>   (0 until schema.length).foreach { i =>
> buffer1.update(i, buffer2.get(i))
>   }
> }
>   }
>   def evaluate(buffer: Row): Any = {
> Row.fromSeq(buffer.toSeq)
>   }
> }
> import scala.util.Random
> import java.time.LocalDateTime
> val r = new Random(65676563L)
> val data = Seq.tabulate(50) { x =>
>   Row((x + 1).toInt, (x + 2).toDouble, (x + 2).toLong, 
> LocalDateTime.parse("2100-01-01T01:33:33.123").minusDays(x + 1))
> }
> val schema = StructType.fromDDL("id int, col1 double, col2 bigint, col3 
> timestamp_ntz")
> val rdd = spark.sparkContext.parallelize(data, 1)
> val df = spark.createDataFrame(rdd, schema)
> val udaf = new ScalaAggregateFunction(df.schema)
> val allColumns = df.schema.fields.map(f => col(f.name))
> df.groupBy().agg(udaf(allColumns: _*)).show(false)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38146) UDAF fails to aggregate TIMESTAMP_NTZ column

2022-02-09 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-38146:
--
Summary: UDAF fails to aggregate TIMESTAMP_NTZ column  (was: UDAF fails 
with unsafe row buffer containing a TIMESTAMP_NTZ column)

> UDAF fails to aggregate TIMESTAMP_NTZ column
> 
>
> Key: SPARK-38146
> URL: https://issues.apache.org/jira/browse/SPARK-38146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
>
> When using a UDAF against unsafe rows containing a TIMESTAMP_NTZ column, 
> Spark throws the error:
> {noformat}
> 22/02/08 18:05:12 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.UnsupportedOperationException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.update(UnsafeRow.java:218)
>  ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15(udaf.scala:217)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15$adapted(udaf.scala:215)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.MutableAggregationBufferImpl.update(udaf.scala:272)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.$anonfun$update$1(:46)
>  ~[scala-library.jar:?]
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) 
> ~[scala-library.jar:?]
>   at 
> $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.update(:45)
>  ~[scala-library.jar:?]
>   at 
> org.apache.spark.sql.execution.aggregate.ScalaUDAF.update(udaf.scala:458) 
> ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2(AggregationIterator.scala:197)
>  ~[spark-sql_2.12-3.3.0-SNAPSHO
> {noformat}
> This is because {{BufferSetterGetterUtils#createSetters}} does not have a 
> case statement for {{TimestampNTZType}}, so it generates a function that 
> tries to call {{UnsafeRow.update}}, which throws an 
> {{UnsupportedOperationException}}.
> This reproduction example is mostly taken from {{AggregationQuerySuite}}:
> {noformat}
> import org.apache.spark.sql.expressions.{MutableAggregationBuffer, 
> UserDefinedAggregateFunction}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> class ScalaAggregateFunction(schema: StructType) extends 
> UserDefinedAggregateFunction {
>   def inputSchema: StructType = schema
>   def bufferSchema: StructType = schema
>   def dataType: DataType = schema
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> (0 until schema.length).foreach { i =>
>   buffer.update(i, null)
> }
>   }
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> if (!input.isNullAt(0) && input.getInt(0) == 50) {
>   (0 until schema.length).foreach { i =>
> buffer.update(i, input.get(i))
>   }
> }
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> if (!buffer2.isNullAt(0) && buffer2.getInt(0) == 50) {
>   (0 until schema.length).foreach { i =>
> buffer1.update(i, buffer2.get(i))
>   }
> }
>   }
>   def evaluate(buffer: Row): Any = {
> Row.fromSeq(buffer.toSeq)
>   }
> }
> import scala.util.Random
> import java.time.LocalDateTime
> val r = new Random(65676563L)
> val data = Seq.tabulate(50) { x =>
>   Row((x + 1).toInt, (x + 2).toDouble, (x + 2).toLong, 
> LocalDateTime.parse("2100-01-01T01:33:33.123").minusDays(x + 1))
> }
> val schema = StructType.fromDDL("id int, col1 double, col2 bigint, col3 
> timestamp_ntz")
> val rdd = spark.sparkContext.parallelize(data, 1)
> val df = spark.createDataFrame(rdd, schema)
> val udaf = new ScalaAggregateFunction(df.schema)
> val allColumns = df.schema.fields.map(f => col(f.name))
> df.groupBy().agg(udaf(allColumns: _*)).show(false)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37814) Migrating from log4j 1 to log4j 2

2022-02-09 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489857#comment-17489857
 ] 

Jungtaek Lim edited comment on SPARK-37814 at 2/9/22, 11:03 PM:


If we are open to maintenance releases for versions below 3.3.0, it may not 
be a crazy idea to look at and adopt reload4j for those versions, 
assuming they only change security-related things.


was (Author: kabhwan):
If we are open to maintenance releases for versions below 3.3.0, it may not 
be a crazy idea to look at and adopt reload4j for those versions.

> Migrating from log4j 1 to log4j 2
> -
>
> Key: SPARK-37814
> URL: https://issues.apache.org/jira/browse/SPARK-37814
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0
>
>
> This is umbrella ticket for all tasks related to migrating to log4j2.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37814) Migrating from log4j 1 to log4j 2

2022-02-09 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489857#comment-17489857
 ] 

Jungtaek Lim commented on SPARK-37814:
--

If we are open to maintenance releases for versions below 3.3.0, it may not 
be a crazy idea to look at and adopt reload4j for those versions.

> Migrating from log4j 1 to log4j 2
> -
>
> Key: SPARK-37814
> URL: https://issues.apache.org/jira/browse/SPARK-37814
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0
>
>
> This is umbrella ticket for all tasks related to migrating to log4j2.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38146) UDAF fails with unsafe row buffer containing a TIMESTAMP_NTZ column

2022-02-09 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-38146:
--
Summary: UDAF fails with unsafe row buffer containing a TIMESTAMP_NTZ 
column  (was: UDAF fails with unsafe rows containing a TIMESTAMP_NTZ column)

> UDAF fails with unsafe row buffer containing a TIMESTAMP_NTZ column
> ---
>
> Key: SPARK-38146
> URL: https://issues.apache.org/jira/browse/SPARK-38146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
>
> When using a UDAF against unsafe rows containing a TIMESTAMP_NTZ column, 
> Spark throws the error:
> {noformat}
> 22/02/08 18:05:12 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.UnsupportedOperationException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.update(UnsafeRow.java:218)
>  ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15(udaf.scala:217)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15$adapted(udaf.scala:215)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.MutableAggregationBufferImpl.update(udaf.scala:272)
>  ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.$anonfun$update$1(:46)
>  ~[scala-library.jar:?]
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) 
> ~[scala-library.jar:?]
>   at 
> $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.update(:45)
>  ~[scala-library.jar:?]
>   at 
> org.apache.spark.sql.execution.aggregate.ScalaUDAF.update(udaf.scala:458) 
> ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2(AggregationIterator.scala:197)
>  ~[spark-sql_2.12-3.3.0-SNAPSHO
> {noformat}
> This is because {{BufferSetterGetterUtils#createSetters}} does not have a 
> case statement for {{TimestampNTZType}}, so it generates a function that 
> tries to call {{UnsafeRow.update}}, which throws an 
> {{UnsupportedOperationException}}.
> This reproduction example is mostly taken from {{AggregationQuerySuite}}:
> {noformat}
> import org.apache.spark.sql.expressions.{MutableAggregationBuffer, 
> UserDefinedAggregateFunction}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> class ScalaAggregateFunction(schema: StructType) extends 
> UserDefinedAggregateFunction {
>   def inputSchema: StructType = schema
>   def bufferSchema: StructType = schema
>   def dataType: DataType = schema
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> (0 until schema.length).foreach { i =>
>   buffer.update(i, null)
> }
>   }
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> if (!input.isNullAt(0) && input.getInt(0) == 50) {
>   (0 until schema.length).foreach { i =>
> buffer.update(i, input.get(i))
>   }
> }
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> if (!buffer2.isNullAt(0) && buffer2.getInt(0) == 50) {
>   (0 until schema.length).foreach { i =>
> buffer1.update(i, buffer2.get(i))
>   }
> }
>   }
>   def evaluate(buffer: Row): Any = {
> Row.fromSeq(buffer.toSeq)
>   }
> }
> import scala.util.Random
> import java.time.LocalDateTime
> val r = new Random(65676563L)
> val data = Seq.tabulate(50) { x =>
>   Row((x + 1).toInt, (x + 2).toDouble, (x + 2).toLong, 
> LocalDateTime.parse("2100-01-01T01:33:33.123").minusDays(x + 1))
> }
> val schema = StructType.fromDDL("id int, col1 double, col2 bigint, col3 
> timestamp_ntz")
> val rdd = spark.sparkContext.parallelize(data, 1)
> val df = spark.createDataFrame(rdd, schema)
> val udaf = new ScalaAggregateFunction(df.schema)
> val allColumns = df.schema.fields.map(f => col(f.name))
> df.groupBy().agg(udaf(allColumns: _*)).show(false)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37814) Migrating from log4j 1 to log4j 2

2022-02-09 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489821#comment-17489821
 ] 

Dongjoon Hyun commented on SPARK-37814:
---

Thank you for sharing, [~ste...@apache.org]. And, thank you for confirming, 
[~sldr].

> Migrating from log4j 1 to log4j 2
> -
>
> Key: SPARK-37814
> URL: https://issues.apache.org/jira/browse/SPARK-37814
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0
>
>
> This is umbrella ticket for all tasks related to migrating to log4j2.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38151) Handle `Pacific/Kanton` in DateTimeUtilsSuite

2022-02-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38151:
--
Description: 
This issue aims to fix the flaky UT failures due to 
https://bugs.openjdk.java.net/browse/JDK-8274407 (Update Timezone Data to 
2021c) and its backport commits that renamed 'Pacific/Enderbury' to 
'Pacific/Kanton' in the latest Java 17.0.2, 11.0.14, and 8u311.

Rename Pacific/Enderbury to Pacific/Kanton.

**MASTER**
- https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true
{code}
[info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds)
[info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
Pacific/Kanton (DateTimeUtilsSuite.scala:783)
{code}

**BRANCH-3.2**
- https://github.com/apache/spark/runs/5122380604?check_suite_focus=true
{code}
[info] - daysToMicros and microsToDays *** FAILED *** (643 milliseconds)
[info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
Pacific/Kanton (DateTimeUtilsSuite.scala:771)
{code}

  was:
**MASTER**
- https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true
{code}
[info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds)
[info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
Pacific/Kanton (DateTimeUtilsSuite.scala:783)
{code}

**BRANCH-3.2**
- https://github.com/apache/spark/runs/5122380604?check_suite_focus=true
{code}
[info] - daysToMicros and microsToDays *** FAILED *** (643 milliseconds)
[info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
Pacific/Kanton (DateTimeUtilsSuite.scala:771)
{code}


> Handle `Pacific/Kanton` in DateTimeUtilsSuite
> -
>
> Key: SPARK-38151
> URL: https://issues.apache.org/jira/browse/SPARK-38151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.3, 3.3.0, 3.2.2
>
>
> This issue aims to fix the flaky UT failures due to 
> https://bugs.openjdk.java.net/browse/JDK-8274407 (Update Timezone Data to 
> 2021c) and its backport commits that renamed 'Pacific/Enderbury' to 
> 'Pacific/Kanton' in the latest Java 17.0.2, 11.0.14, and 8u311.
> Rename Pacific/Enderbury to Pacific/Kanton.
> **MASTER**
> - 
> https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:783)
> {code}
> **BRANCH-3.2**
> - https://github.com/apache/spark/runs/5122380604?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (643 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:771)
> {code}
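
A quick way to see whether a given JVM is affected (plain {{java.time}} API, nothing Spark-specific) is to check whether its tzdata already knows the renamed zone. On 17.0.2, 11.0.14, and 8u311 the check below prints true; on older updates it prints false, which is why a hard-coded zone name makes the test flaky across JDK versions.

{code:scala}
import java.time.ZoneId

// tzdata 2021b+ renamed Pacific/Enderbury to Pacific/Kanton.
println(ZoneId.getAvailableZoneIds.contains("Pacific/Kanton"))
{code}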



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38151) Handle `Pacific/Kanton` in DateTimeUtilsSuite

2022-02-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38151:
--
Summary: Handle `Pacific/Kanton` in DateTimeUtilsSuite  (was: Flaky Test: 
DateTimeUtilsSuite.`daysToMicros and microsToDays`)

> Handle `Pacific/Kanton` in DateTimeUtilsSuite
> -
>
> Key: SPARK-38151
> URL: https://issues.apache.org/jira/browse/SPARK-38151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.3, 3.3.0, 3.2.2
>
>
> **MASTER**
> - 
> https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:783)
> {code}
> **BRANCH-3.2**
> - https://github.com/apache/spark/runs/5122380604?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (643 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:771)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38151) Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`

2022-02-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38151.
---
Fix Version/s: 3.3.0
   3.2.2
   3.1.3
   Resolution: Fixed

Issue resolved by pull request 35468
[https://github.com/apache/spark/pull/35468]

> Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`
> --
>
> Key: SPARK-38151
> URL: https://issues.apache.org/jira/browse/SPARK-38151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.3
>
>
> **MASTER**
> - 
> https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:783)
> {code}
> **BRANCH-3.2**
> - https://github.com/apache/spark/runs/5122380604?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (643 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:771)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38151) Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`

2022-02-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38151:
-

Assignee: Dongjoon Hyun

> Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`
> --
>
> Key: SPARK-38151
> URL: https://issues.apache.org/jira/browse/SPARK-38151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> **MASTER**
> - 
> https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:783)
> {code}
> **BRANCH-3.2**
> - https://github.com/apache/spark/runs/5122380604?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (643 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:771)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38061) security scan issue with htrace-core4-4.1.0-incubating

2022-02-09 Thread Sujit Biswas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489793#comment-17489793
 ] 

Sujit Biswas commented on SPARK-38061:
--

[~abhinavofficial] fyi,

I was able to build the htrace-core* project locally with the right jackson-*
dependency and replace the existing htrace-core4-4.1.0-incubating jar before
building the Docker image.

This will fix most of the critical CVEs.

> security scan issue with htrace-core4-4.1.0-incubating
> --
>
> Key: SPARK-38061
> URL: https://issues.apache.org/jira/browse/SPARK-38061
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Security
>Affects Versions: 3.2.0, 3.2.1
>Reporter: Sujit Biswas
>Priority: Major
> Attachments: image-2022-02-03-08-02-29-071.png, 
> scan-security-report-spark-3.2.0-jre-11.csv, 
> scan-security-report-spark-3.2.1-jre-11.csv
>
>
> Hi,
> running into security scan issues with a Docker image built on
> spark-3.2.0-bin-hadoop3.2; is there a way to resolve them?
> Most issues relate to https://issues.apache.org/jira/browse/HDFS-15333.
> Attaching the CVE report.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38155) Disallow distinct aggregate in lateral subqueries with unsupported correlated predicates

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38155:


Assignee: (was: Apache Spark)

> Disallow distinct aggregate in lateral subqueries with unsupported correlated 
> predicates
> 
>
> Key: SPARK-38155
> URL: https://issues.apache.org/jira/browse/SPARK-38155
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Priority: Major
>
> Block lateral subqueries in CheckAnalysis that contain DISTINCT aggregate and 
> correlated non-equality predicates. This can lead to incorrect results as 
> DISTINCT will be rewritten as Aggregate during the optimization phase.
> For example
> {code:java}
> CREATE VIEW t1(c1, c2) AS VALUES (0, 1)
> CREATE VIEW t2(c1, c2) AS VALUES (1, 2), (2, 2)
> SELECT * FROM t1 JOIN LATERAL (SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1)
> {code}
> The correct results should be (0, 1, 2) but currently, it gives [(0, 1,
> 2), (0, 1, 2)].
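
If it helps, the statements above can be run directly from a spark-shell on a
build that supports JOIN LATERAL; the sketch below is only a convenience
wrapper around the ticket's own SQL.

{code:scala}
// Wrapper around the SQL from the ticket; run in spark-shell on an affected build.
spark.sql("CREATE VIEW t1(c1, c2) AS VALUES (0, 1)")
spark.sql("CREATE VIEW t2(c1, c2) AS VALUES (1, 2), (2, 2)")

spark.sql(
  "SELECT * FROM t1 JOIN LATERAL (SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1)"
).show()
// Expected: a single row [0, 1, 2]; affected builds print the row twice.
{code}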



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38155) Disallow distinct aggregate in lateral subqueries with unsupported correlated predicates

2022-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489747#comment-17489747
 ] 

Apache Spark commented on SPARK-38155:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/35469

> Disallow distinct aggregate in lateral subqueries with unsupported correlated 
> predicates
> 
>
> Key: SPARK-38155
> URL: https://issues.apache.org/jira/browse/SPARK-38155
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Priority: Major
>
> Block lateral subqueries in CheckAnalysis that contain DISTINCT aggregate and 
> correlated non-equality predicates. This can lead to incorrect results as 
> DISTINCT will be rewritten as Aggregate during the optimization phase.
> For example
> {code:java}
> CREATE VIEW t1(c1, c2) AS VALUES (0, 1)
> CREATE VIEW t2(c1, c2) AS VALUES (1, 2), (2, 2)
> SELECT * FROM t1 JOIN LATERAL (SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1)
> {code}
> The correct results should be (0, 1, 2) but currently, it gives [(0, 1,
> 2), (0, 1, 2)].



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38155) Disallow distinct aggregate in lateral subqueries with unsupported correlated predicates

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38155:


Assignee: Apache Spark

> Disallow distinct aggregate in lateral subqueries with unsupported correlated 
> predicates
> 
>
> Key: SPARK-38155
> URL: https://issues.apache.org/jira/browse/SPARK-38155
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Assignee: Apache Spark
>Priority: Major
>
> Block lateral subqueries in CheckAnalysis that contain DISTINCT aggregate and 
> correlated non-equality predicates. This can lead to incorrect results as 
> DISTINCT will be rewritten as Aggregate during the optimization phase.
> For example
> {code:java}
> CREATE VIEW t1(c1, c2) AS VALUES (0, 1)
> CREATE VIEW t2(c1, c2) AS VALUES (1, 2), (2, 2)
> SELECT * FROM t1 JOIN LATERAL (SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1)
> {code}
> The correct results should be (0, 1, 2) but currently, it gives [(0, 1,
> 2), (0, 1, 2)].



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38155) Disallow distinct aggregate in lateral subqueries with unsupported correlated predicates

2022-02-09 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-38155:
-
Description: 
Block lateral subqueries in CheckAnalysis that contain DISTINCT aggregate and 
correlated non-equality predicates. This can lead to incorrect results as 
DISTINCT will be rewritten as Aggregate during the optimization phase.

For example
{code:java}
CREATE VIEW t1(c1, c2) AS VALUES (0, 1)
CREATE VIEW t2(c1, c2) AS VALUES (1, 2), (2, 2)
SELECT * FROM t1 JOIN LATERAL (SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1)
{code}
The correct results should be (0, 1, 2) but currently, it gives [(0, 1, 2),
(0, 1, 2)].

  was:
Block lateral subqueries in CheckAnalysis that contains DISTINCT aggregate and 
correlated non-equality predicates. This can lead to incorrect results as 
DISTINCT will be rewritten as Aggregate during the optimization phase.

For example
{code:java}
CREATE VIEW t1(c1, c2) AS VALUES (0, 1)
CREATE VIEW t2(c1, c2) AS VALUES (1, 2), (2, 2)
SELECT * FROM t1 JOIN LATERAL (SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1)
{code}
The correct results should be (0, 1, 2) but currently, it gives  be[(0, 1, 2), 
(0, 1, 2)].


> Disallow distinct aggregate in lateral subqueries with unsupported correlated 
> predicates
> 
>
> Key: SPARK-38155
> URL: https://issues.apache.org/jira/browse/SPARK-38155
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Priority: Major
>
> Block lateral subqueries in CheckAnalysis that contain DISTINCT aggregate and 
> correlated non-equality predicates. This can lead to incorrect results as 
> DISTINCT will be rewritten as Aggregate during the optimization phase.
> For example
> {code:java}
> CREATE VIEW t1(c1, c2) AS VALUES (0, 1)
> CREATE VIEW t2(c1, c2) AS VALUES (1, 2), (2, 2)
> SELECT * FROM t1 JOIN LATERAL (SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1)
> {code}
> The correct results should be (0, 1, 2) but currently, it gives [(0, 1,
> 2), (0, 1, 2)].



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37814) Migrating from log4j 1 to log4j 2

2022-02-09 Thread Stephen L. De Rudder (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489742#comment-17489742
 ] 

Stephen L. De Rudder commented on SPARK-37814:
--

Thank you for the responses, [~dongjoon] and [~ste...@apache.org]. I have just 
tested replacing log4j-1.2.17.jar with reload4j-1.2.19.jar in the jars 
directory for Spark and it all just worked.

This is an excellent workaround for anyone who wants the log4j 1.2.17 CVEs
fixed now instead of waiting for Spark 3.3.0.

> Migrating from log4j 1 to log4j 2
> -
>
> Key: SPARK-37814
> URL: https://issues.apache.org/jira/browse/SPARK-37814
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0
>
>
> This is umbrella ticket for all tasks related to migrating to log4j2.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38151) Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`

2022-02-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38151:
--
Affects Version/s: 3.1.3

> Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`
> --
>
> Key: SPARK-38151
> URL: https://issues.apache.org/jira/browse/SPARK-38151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Priority: Major
>
> **MASTER**
> - 
> https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:783)
> {code}
> **BRANCH-3.2**
> - https://github.com/apache/spark/runs/5122380604?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (643 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:771)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38151) Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`

2022-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489732#comment-17489732
 ] 

Apache Spark commented on SPARK-38151:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35468

> Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`
> --
>
> Key: SPARK-38151
> URL: https://issues.apache.org/jira/browse/SPARK-38151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Priority: Major
>
> **MASTER**
> - 
> https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:783)
> {code}
> **BRANCH-3.2**
> - https://github.com/apache/spark/runs/5122380604?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (643 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:771)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38151) Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38151:


Assignee: Apache Spark

> Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`
> --
>
> Key: SPARK-38151
> URL: https://issues.apache.org/jira/browse/SPARK-38151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> **MASTER**
> - 
> https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:783)
> {code}
> **BRANCH-3.2**
> - https://github.com/apache/spark/runs/5122380604?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (643 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:771)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38151) Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`

2022-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489731#comment-17489731
 ] 

Apache Spark commented on SPARK-38151:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35468

> Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`
> --
>
> Key: SPARK-38151
> URL: https://issues.apache.org/jira/browse/SPARK-38151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Priority: Major
>
> **MASTER**
> - 
> https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:783)
> {code}
> **BRANCH-3.2**
> - https://github.com/apache/spark/runs/5122380604?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (643 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:771)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38151) Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38151:


Assignee: (was: Apache Spark)

> Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`
> --
>
> Key: SPARK-38151
> URL: https://issues.apache.org/jira/browse/SPARK-38151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Priority: Major
>
> **MASTER**
> - 
> https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:783)
> {code}
> **BRANCH-3.2**
> - https://github.com/apache/spark/runs/5122380604?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (643 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:771)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38151) Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`

2022-02-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38151:
--
Description: 
**MASTER**
- https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true
{code}
[info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds)
[info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
Pacific/Kanton (DateTimeUtilsSuite.scala:783)
{code}

**BRANCH-3.2**
- https://github.com/apache/spark/runs/5122380604?check_suite_focus=true
{code}
[info] - daysToMicros and microsToDays *** FAILED *** (643 milliseconds)
[info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
Pacific/Kanton (DateTimeUtilsSuite.scala:771)
{code}

  was:
**MASTER**
- https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true
{code}
[info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds)
[info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
Pacific/Kanton (DateTimeUtilsSuite.scala:783)
{code}

**BRANCH-3.2**
- https://github.com/apache/spark/runs/5122380604?check_suite_focus=true


> Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`
> --
>
> Key: SPARK-38151
> URL: https://issues.apache.org/jira/browse/SPARK-38151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Priority: Major
>
> **MASTER**
> - 
> https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:783)
> {code}
> **BRANCH-3.2**
> - https://github.com/apache/spark/runs/5122380604?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (643 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:771)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38151) Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`

2022-02-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38151:
--
Affects Version/s: 3.2.2

> Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`
> --
>
> Key: SPARK-38151
> URL: https://issues.apache.org/jira/browse/SPARK-38151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - 
> https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:783)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38151) Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`

2022-02-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38151:
--
Description: 
**MASTER**
- https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true
{code}
[info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds)
[info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
Pacific/Kanton (DateTimeUtilsSuite.scala:783)
{code}

**BRANCH-3.2**
- https://github.com/apache/spark/runs/5122380604?check_suite_focus=true

  was:
- https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true
{code}
[info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds)
[info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
Pacific/Kanton (DateTimeUtilsSuite.scala:783)
{code}


> Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`
> --
>
> Key: SPARK-38151
> URL: https://issues.apache.org/jira/browse/SPARK-38151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Priority: Major
>
> **MASTER**
> - 
> https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true
> {code}
> [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds)
> [info]   9131 did not equal 9130 Round trip of 9130 did not work in tz 
> Pacific/Kanton (DateTimeUtilsSuite.scala:783)
> {code}
> **BRANCH-3.2**
> - https://github.com/apache/spark/runs/5122380604?check_suite_focus=true



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38169) Use OffHeap memory if configured in vectorized DeltaByteArray reader

2022-02-09 Thread Parth Chandra (Jira)
Parth Chandra created SPARK-38169:
-

 Summary: Use OffHeap memory if configured in vectorized 
DeltaByteArray reader
 Key: SPARK-38169
 URL: https://issues.apache.org/jira/browse/SPARK-38169
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.1
Reporter: Parth Chandra


The VectorizedDeltaByteArray reader allocates some vectors for internal use, and
these are currently always allocated on the heap. Depending on the configuration,
the vectors should instead be either on-heap or off-heap.
To support off-heap vectors, the vectorized readers must be made closable, and
the memory allocated for the vectors must be freed on close.
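
As a rough illustration of the pattern described above (a sketch only, not the
actual reader code; the class and field names are made up), the reader picks its
internal vector type from the memory mode and releases it in close():

{code:scala}
import org.apache.spark.sql.execution.vectorized.{OffHeapColumnVector, OnHeapColumnVector, WritableColumnVector}
import org.apache.spark.sql.types.BinaryType

// Hypothetical stand-in for the internal state of a vectorized delta reader.
class ClosableDeltaReader(capacity: Int, useOffHeap: Boolean) extends AutoCloseable {

  // Internal scratch vector: off-heap only when the memory mode asks for it.
  private val scratch: WritableColumnVector =
    if (useOffHeap) new OffHeapColumnVector(capacity, BinaryType)
    else new OnHeapColumnVector(capacity, BinaryType)

  // Freeing on close is what matters for off-heap memory, which the garbage
  // collector does not reclaim.
  override def close(): Unit = scratch.close()
}
{code}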



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38155) Disallow distinct aggregate in lateral subqueries with unsupported correlated predicates

2022-02-09 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-38155:
-
Description: 
Block lateral subqueries in CheckAnalysis that contains DISTINCT aggregate and 
correlated non-equality predicates. This can lead to incorrect results as 
DISTINCT will be rewritten as Aggregate during the optimization phase.

For example
{code:java}
CREATE VIEW t1(c1, c2) AS VALUES (0, 1)
CREATE VIEW t2(c1, c2) AS VALUES (1, 2), (2, 2)
SELECT * FROM t1 JOIN LATERAL (SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1)
{code}
The correct results should be (0, 1, 2) but currently, it gives [(0, 1, 2),
(0, 1, 2)].

  was:
Block lateral subqueries in CheckAnalysis that contains DISTINCT aggregate and 
correlated non-equality predicates. This can lead to incorrect results as 
DISTINCT will be rewritten as Aggregate during the optimization phase.

For example

 
{code:java}
CREATE VIEW t1(c1, c2) AS VALUES (0, 1)
CREATE VIEW t2(c1, c2) AS VALUES (1, 2), (2, 2)
SELECT * FROM t1 JOIN LATERAL (SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1)
{code}
 

The correct results should be (0, 1, 2) but currently, it gives  [(0, 1, 2), 
(0, 1, 2)].


> Disallow distinct aggregate in lateral subqueries with unsupported correlated 
> predicates
> 
>
> Key: SPARK-38155
> URL: https://issues.apache.org/jira/browse/SPARK-38155
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Priority: Major
>
> Block lateral subqueries in CheckAnalysis that contains DISTINCT aggregate 
> and correlated non-equality predicates. This can lead to incorrect results as 
> DISTINCT will be rewritten as Aggregate during the optimization phase.
> For example
> {code:java}
> CREATE VIEW t1(c1, c2) AS VALUES (0, 1)
> CREATE VIEW t2(c1, c2) AS VALUES (1, 2), (2, 2)
> SELECT * FROM t1 JOIN LATERAL (SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1)
> {code}
> The correct results should be (0, 1, 2) but currently, it gives [(0, 1,
> 2), (0, 1, 2)].



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38061) security scan issue with htrace-core4-4.1.0-incubating

2022-02-09 Thread Sujit Biswas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489711#comment-17489711
 ] 

Sujit Biswas edited comment on SPARK-38061 at 2/9/22, 5:56 PM:
---

[~abhinavofficial] there are multiple vulnerabilities; as you can see in the
attachments, htrace-core4-4.1.0-incubating is the jar causing the most.

[~hyukjin.kwon] as the jar is shaded, which Jackson-databind classes is the
htrace code using?


was (Author: JIRAUSER284395):
[~abhinavofficial] there are multiple vulnerabilities, as you can see the 
attachments, htrace-core4-4.1.0-incubating is the jar which is causing the most

 

[~hyukjin.kwon] as the jar is shaded , what is the Jackson-databind classes, 
htrace class is using?

> security scan issue with htrace-core4-4.1.0-incubating
> --
>
> Key: SPARK-38061
> URL: https://issues.apache.org/jira/browse/SPARK-38061
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Security
>Affects Versions: 3.2.0, 3.2.1
>Reporter: Sujit Biswas
>Priority: Major
> Attachments: image-2022-02-03-08-02-29-071.png, 
> scan-security-report-spark-3.2.0-jre-11.csv, 
> scan-security-report-spark-3.2.1-jre-11.csv
>
>
> Hi,
> running into security scan issues with a Docker image built on
> spark-3.2.0-bin-hadoop3.2; is there a way to resolve them?
> Most issues relate to https://issues.apache.org/jira/browse/HDFS-15333.
> Attaching the CVE report.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38061) security scan issue with htrace-core4-4.1.0-incubating

2022-02-09 Thread Sujit Biswas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489711#comment-17489711
 ] 

Sujit Biswas commented on SPARK-38061:
--

[~abhinavofficial] there are multiple vulnerabilities; as you can see in the
attachments, htrace-core4-4.1.0-incubating is the jar causing the most.

[~hyukjin.kwon] as the jar is shaded, which Jackson-databind classes is the
htrace class using?

> security scan issue with htrace-core4-4.1.0-incubating
> --
>
> Key: SPARK-38061
> URL: https://issues.apache.org/jira/browse/SPARK-38061
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Security
>Affects Versions: 3.2.0, 3.2.1
>Reporter: Sujit Biswas
>Priority: Major
> Attachments: image-2022-02-03-08-02-29-071.png, 
> scan-security-report-spark-3.2.0-jre-11.csv, 
> scan-security-report-spark-3.2.1-jre-11.csv
>
>
> Hi,
> running into security scan issues with a Docker image built on
> spark-3.2.0-bin-hadoop3.2; is there a way to resolve them?
> Most issues relate to https://issues.apache.org/jira/browse/HDFS-15333.
> Attaching the CVE report.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38056) Structured streaming not working in history server when using LevelDB

2022-02-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38056:
--
Fix Version/s: 3.1.3
   3.2.2

> Structured streaming not working in history server when using LevelDB
> -
>
> Key: SPARK-38056
> URL: https://issues.apache.org/jira/browse/SPARK-38056
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.1.2, 3.2.0
>Reporter: wy
>Assignee: wy
>Priority: Major
> Fix For: 3.1.3, 3.3.0, 3.2.2
>
> Attachments: local-1643373518829
>
>
> In 
> [SPARK-31953|https://github.com/apache/spark/commit/4f9667035886a67e6c9a4e8fad2efa390e87ca68],
>  structured streaming support is added to history server. However this does 
> not work when spark.history.store.path is set to save app info using LevelDB.
> This is because one of the keys of StreamingQueryData, runId,  is UUID type, 
> which is not supported by LevelDB. When replaying event log file in history 
> server, StreamingQueryStatusListener will throw an exception when writing 
> info to the store, saying "java.lang.IllegalArgumentException: Type 
> java.util.UUID not allowed as key.".
> Example event log is provided in attachments. When opening it in history 
> server with spark.history.store.path set to somewhere, no structured 
> streaming info is available.
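
A minimal sketch of the kind of change this implies (the names are illustrative,
not the actual Spark classes): keep the UUID at the API surface but store it as
a String, so a LevelDB-backed KVStore can use it as a key.

{code:scala}
import java.util.UUID

// Illustrative stand-in for StreamingQueryData: the store-facing field is a
// String because the LevelDB-backed store rejects java.util.UUID keys.
case class StreamingQueryRecord(name: String, runId: String)

val runId: UUID = UUID.randomUUID()
val record = StreamingQueryRecord("my-query", runId.toString) // write side
val restored: UUID = UUID.fromString(record.runId)            // read side
{code}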



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38056) Structured streaming not working in history server when using LevelDB

2022-02-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38056:
--
Target Version/s:   (was: 3.1.2, 3.2.0)

> Structured streaming not working in history server when using LevelDB
> -
>
> Key: SPARK-38056
> URL: https://issues.apache.org/jira/browse/SPARK-38056
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.1.2, 3.2.0
>Reporter: wy
>Assignee: wy
>Priority: Major
> Fix For: 3.1.3, 3.3.0, 3.2.2
>
> Attachments: local-1643373518829
>
>
> In 
> [SPARK-31953|https://github.com/apache/spark/commit/4f9667035886a67e6c9a4e8fad2efa390e87ca68],
>  structured streaming support is added to history server. However this does 
> not work when spark.history.store.path is set to save app info using LevelDB.
> This is because one of the keys of StreamingQueryData, runId,  is UUID type, 
> which is not supported by LevelDB. When replaying event log file in history 
> server, StreamingQueryStatusListener will throw an exception when writing 
> info to the store, saying "java.lang.IllegalArgumentException: Type 
> java.util.UUID not allowed as key.".
> Example event log is provided in attachments. When opening it in history 
> server with spark.history.store.path set to somewhere, no structured 
> streaming info is available.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38120) HiveExternalCatalog.listPartitions is failing when partition column name is upper case and dot in partition value

2022-02-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38120:
--
Fix Version/s: 3.1.3

> HiveExternalCatalog.listPartitions is failing when partition column name is 
> upper case and dot in partition value
> -
>
> Key: SPARK-38120
> URL: https://issues.apache.org/jira/browse/SPARK-38120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.1
>Reporter: Khalid Mammadov
>Assignee: Khalid Mammadov
>Priority: Minor
> Fix For: 3.1.3, 3.3.0, 3.2.2
>
>
> The HiveExternalCatalog.listPartitions method call fails when a partition
> column name is upper case and the partition value contains a dot. It's related
> to this change 
> [https://github.com/apache/spark/commit/f18b905f6cace7686ef169fda7de474079d0af23]
> The test case in that PR does not reproduce the issue because the partition
> column name is lower case.
>  
> Below is how to reproduce the issue:
> scala> import org.apache.spark.sql.catalyst.TableIdentifier
> import org.apache.spark.sql.catalyst.TableIdentifier
> scala> spark.sql("CREATE TABLE customer(id INT, name STRING) PARTITIONED BY 
> (partCol1 STRING, partCol2 STRING)")
> scala> spark.sql("INSERT INTO customer PARTITION (partCol1 = 'CA', partCol2 = 
> 'i.j') VALUES (100, 'John')")                               
> scala> spark.sessionState.catalog.listPartitions(TableIdentifier("customer"), 
> Some(Map("partCol2" -> "i.j"))).foreach(println)
> java.util.NoSuchElementException: key not found: partcol2
>   at scala.collection.immutable.Map$Map2.apply(Map.scala:227)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1(ExternalCatalogUtils.scala:205)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1$adapted(ExternalCatalogUtils.scala:202)
>   at scala.collection.immutable.Map$Map1.forall(Map.scala:196)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.isPartialPartitionSpec(ExternalCatalogUtils.scala:202)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6(HiveExternalCatalog.scala:1312)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6$adapted(HiveExternalCatalog.scala:1312)
>   at 
> scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
>   at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
>   at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
>   at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
>   at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$1(HiveExternalCatalog.scala:1312)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClientWrappingException(HiveExternalCatalog.scala:114)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:103)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitions(HiveExternalCatalog.scala:1296)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitions(ExternalCatalogWithListener.scala:254)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitions(SessionCatalog.scala:1251)
>   ... 47 elided
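
A standalone illustration of the mismatch (plain Scala, not Spark code): the
partial spec arrives with a lower-cased key while the stored spec keeps the
original column case, so a plain Map lookup fails just like the stack trace
above; comparing case-insensitively avoids it.

{code:scala}
// Stored partition spec keeps the original column case; the partial spec
// passed to listPartitions has been lower-cased along the way.
val storedSpec  = Map("partCol1" -> "CA", "partCol2" -> "i.j")
val partialSpec = Map("partcol2" -> "i.j")

// Shape of the failing check:
// partialSpec.forall { case (k, v) => storedSpec(k) == v }
//   => java.util.NoSuchElementException: key not found: partcol2

// A case-insensitive comparison side-steps it:
val lowered = storedSpec.map { case (k, v) => (k.toLowerCase, v) }
partialSpec.forall { case (k, v) => lowered.get(k).contains(v) }  // true
{code}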



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition

2022-02-09 Thread Willi Raschkowski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Willi Raschkowski updated SPARK-38166:
--
Description: 
We're seeing duplicates after running the following
{code:java}
def compute_shipments(shipments):
shipments = shipments.dropDuplicates(["ship_trck_num"])
shipments = shipments.repartition(4)
return shipments
{code}
and observing lost executors (OOMs) and task retries in the repartition stage.

We're seeing this reliably in one of our pipelines. But I haven't managed to 
reproduce outside of that pipeline. I'll attach driver logs - maybe you have 
ideas.

  was:
We're seeing duplicates after running the following 

{code}
def compute_shipments(shipments):
shipments = shipments.dropDuplicates(["ship_trck_num"])
shipments = shipments.repartition(4)
return shipments
{code}

and observing lost executors (OOMs) and task retries in the repartition stage.

We're seeing this reliably in one of our pipelines. But I haven't managed to 
reproduce outside of that pipeline. I'll attach driver logs and the 
notionalized input data - maybe you have ideas.


> Duplicates after task failure in dropDuplicates and repartition
> ---
>
> Key: SPARK-38166
> URL: https://issues.apache.org/jira/browse/SPARK-38166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
> Environment: Cluster runs on K8s. AQE is enabled.
>Reporter: Willi Raschkowski
>Priority: Major
>  Labels: correctness
> Attachments: driver.log
>
>
> We're seeing duplicates after running the following
> {code:java}
> def compute_shipments(shipments):
> shipments = shipments.dropDuplicates(["ship_trck_num"])
> shipments = shipments.repartition(4)
> return shipments
> {code}
> and observing lost executors (OOMs) and task retries in the repartition stage.
> We're seeing this reliably in one of our pipelines. But I haven't managed to 
> reproduce outside of that pipeline. I'll attach driver logs - maybe you have 
> ideas.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38163) Preserve the error class of `AnalysisException` while constructing of function builder

2022-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489670#comment-17489670
 ] 

Apache Spark commented on SPARK-38163:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/35467

> Preserve the error class of `AnalysisException` while constructing of 
> function builder
> --
>
> Key: SPARK-38163
> URL: https://issues.apache.org/jira/browse/SPARK-38163
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> When the cause exception is `AnalysisException` at
> https://github.com/apache/spark/blob/9c02dd4035c9412ca03e5a5f4721ee223953c004/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L132,
>  Spark loses info about the error class. Need to preserve the info.
> The example below demonstrates the issue:
> {code:scala}
> scala> try { sql("select format_string('%0$s', 'Hello')") } catch { case e: 
> org.apache.spark.sql.AnalysisException => println(e.getErrorClass) }
> null
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38163) Preserve the error class of `AnalysisException` while constructing of function builder

2022-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489668#comment-17489668
 ] 

Apache Spark commented on SPARK-38163:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/35467

> Preserve the error class of `AnalysisException` while constructing of 
> function builder
> --
>
> Key: SPARK-38163
> URL: https://issues.apache.org/jira/browse/SPARK-38163
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> When the cause exception is `AnalysisException` at
> https://github.com/apache/spark/blob/9c02dd4035c9412ca03e5a5f4721ee223953c004/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L132,
>  Spark loses info about the error class. Need to preserve the info.
> The example below demonstrates the issue:
> {code:scala}
> scala> try { sql("select format_string('%0$s', 'Hello')") } catch { case e: 
> org.apache.spark.sql.AnalysisException => println(e.getErrorClass) }
> null
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38163) Preserve the error class of `AnalysisException` while constructing of function builder

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38163:


Assignee: Apache Spark  (was: Max Gekk)

> Preserve the error class of `AnalysisException` while constructing of 
> function builder
> --
>
> Key: SPARK-38163
> URL: https://issues.apache.org/jira/browse/SPARK-38163
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> When the cause exception is `AnalysisException` at
> https://github.com/apache/spark/blob/9c02dd4035c9412ca03e5a5f4721ee223953c004/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L132,
>  Spark loses info about the error class. Need to preserve the info.
> The example below demonstrates the issue:
> {code:scala}
> scala> try { sql("select format_string('%0$s', 'Hello')") } catch { case e: 
> org.apache.spark.sql.AnalysisException => println(e.getErrorClass) }
> null
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38163) Preserve the error class of `AnalysisException` while constructing of function builder

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38163:


Assignee: Max Gekk  (was: Apache Spark)

> Preserve the error class of `AnalysisException` while constructing of 
> function builder
> --
>
> Key: SPARK-38163
> URL: https://issues.apache.org/jira/browse/SPARK-38163
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> When the cause exception is `AnalysisException` at
> https://github.com/apache/spark/blob/9c02dd4035c9412ca03e5a5f4721ee223953c004/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L132,
>  Spark loses info about the error class. Need to preserve the info.
> The example below demonstrates the issue:
> {code:scala}
> scala> try { sql("select format_string('%0$s', 'Hello')") } catch { case e: 
> org.apache.spark.sql.AnalysisException => println(e.getErrorClass) }
> null
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38163) Preserve the error class of `AnalysisException` while constructing of function builder

2022-02-09 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-38163:
-
Description: 
When the cause exception is `AnalysisException` at
https://github.com/apache/spark/blob/9c02dd4035c9412ca03e5a5f4721ee223953c004/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L132,
 Spark loses info about the error class. Need to preserve the info.

The example below demonstrates the issue:

{code:scala}
scala> try { sql("select format_string('%0$s', 'Hello')") } catch { case e: 
org.apache.spark.sql.AnalysisException => println(e.getErrorClass) }
null
{code}


  was:
When the cause exception is `AnalysisException` at
https://github.com/apache/spark/blob/9c02dd4035c9412ca03e5a5f4721ee223953c004/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L132,
 Spark loses info about the error class. Need to preserve the info.


> Preserve the error class of `AnalysisException` while constructing of 
> function builder
> --
>
> Key: SPARK-38163
> URL: https://issues.apache.org/jira/browse/SPARK-38163
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> When the cause exception is `AnalysisException` at
> https://github.com/apache/spark/blob/9c02dd4035c9412ca03e5a5f4721ee223953c004/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L132,
>  Spark loses info about the error class. Need to preserve the info.
> The example below demonstrates the issue:
> {code:scala}
> scala> try { sql("select format_string('%0$s', 'Hello')") } catch { case e: 
> org.apache.spark.sql.AnalysisException => println(e.getErrorClass) }
> null
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38168) LikeSimplification handles escape character

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38168:


Assignee: Apache Spark

> LikeSimplification handles escape character
> ---
>
> Key: SPARK-38168
> URL: https://issues.apache.org/jira/browse/SPARK-38168
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Dooyoung Hwang
>Assignee: Apache Spark
>Priority: Major
>
> Currently, LikeSimplification rule of catalyst is skipped if the pattern 
> contains escape character.
> {noformat}
> SELECT * FROM tbl WHERE c_1 LIKE '%100\%'
> ...
> == Optimized Logical Plan ==
> Filter (isnotnull(c_1#0) && c_1#0 LIKE %100\%)
> +- Relation[c_1#0,c_2#1,c_3#2] ...
> {noformat}
> The filter LIKE '%100\%' in this query is not optimized into 'EndsWith' of 
> StringType.
> LikeSimplification rule can consider a special character(wildcard(%, _) or 
> escape character) as a plain character if the character follows an escape 
> character.
> By doing that, LikeSimplification rule can optimize the filter like below.
> {noformat}
> SELECT * FROM tbl WHERE c_1 LIKE '%100\%'
> ...
> == Optimized Logical Plan ==
> Filter (isnotnull(c_1#0) && EndsWith(c_1#0, 100%))
> +- Relation[c_1#0,c_2#1,c_3#2] 
> {noformat}
>  
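
For anyone who wants to check whether the rule fires, a small spark-shell sketch
(the table and data below are made up): look at the optimized plan and see
whether the LIKE survives or is rewritten to EndsWith.

{code:scala}
// Hypothetical table with a string column c_1, just to have something to plan.
spark.range(5)
  .selectExpr("concat('row', cast(id AS string), '100%') AS c_1")
  .createOrReplaceTempView("tbl")

// Triple quotes keep the backslash intact, so the SQL pattern is %100\%.
val df = spark.sql("""SELECT * FROM tbl WHERE c_1 LIKE '%100\%'""")
println(df.queryExecution.optimizedPlan)
// Today:    ... c_1 LIKE %100\%       (rule skipped because of the escape)
// Proposed: ... EndsWith(c_1, 100%)   (as shown in the ticket)
{code}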



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38168) LikeSimplification handles escape character

2022-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489611#comment-17489611
 ] 

Apache Spark commented on SPARK-38168:
--

User 'Dooyoung-Hwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35465

> LikeSimplification handles escape character
> ---
>
> Key: SPARK-38168
> URL: https://issues.apache.org/jira/browse/SPARK-38168
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Dooyoung Hwang
>Priority: Major
>
> Currently, LikeSimplification rule of catalyst is skipped if the pattern 
> contains escape character.
> {noformat}
> SELECT * FROM tbl WHERE c_1 LIKE '%100\%'
> ...
> == Optimized Logical Plan ==
> Filter (isnotnull(c_1#0) && c_1#0 LIKE %100\%)
> +- Relation[c_1#0,c_2#1,c_3#2] ...
> {noformat}
> The filter LIKE '%100\%' in this query is not optimized into 'EndsWith' of 
> StringType.
> LikeSimplification rule can consider a special character(wildcard(%, _) or 
> escape character) as a plain character if the character follows an escape 
> character.
> By doing that, LikeSimplification rule can optimize the filter like below.
> {noformat}
> SELECT * FROM tbl WHERE c_1 LIKE '%100\%'
> ...
> == Optimized Logical Plan ==
> Filter (isnotnull(c_1#0) && EndsWith(c_1#0, 100%))
> +- Relation[c_1#0,c_2#1,c_3#2] 
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38168) LikeSimplification handles escape character

2022-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489612#comment-17489612
 ] 

Apache Spark commented on SPARK-38168:
--

User 'Dooyoung-Hwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35465

> LikeSimplification handles escape character
> ---
>
> Key: SPARK-38168
> URL: https://issues.apache.org/jira/browse/SPARK-38168
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Dooyoung Hwang
>Priority: Major
>
> Currently, LikeSimplification rule of catalyst is skipped if the pattern 
> contains escape character.
> {noformat}
> SELECT * FROM tbl WHERE c_1 LIKE '%100\%'
> ...
> == Optimized Logical Plan ==
> Filter (isnotnull(c_1#0) && c_1#0 LIKE %100\%)
> +- Relation[c_1#0,c_2#1,c_3#2] ...
> {noformat}
> The filter LIKE '%100\%' in this query is not optimized into 'EndsWith' of 
> StringType.
> LikeSimplification rule can consider a special character(wildcard(%, _) or 
> escape character) as a plain character if the character follows an escape 
> character.
> By doing that, LikeSimplification rule can optimize the filter like below.
> {noformat}
> SELECT * FROM tbl WHERE c_1 LIKE '%100\%'
> ...
> == Optimized Logical Plan ==
> Filter (isnotnull(c_1#0) && EndsWith(c_1#0, 100%))
> +- Relation[c_1#0,c_2#1,c_3#2] 
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38168) LikeSimplification handles escape character

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38168:


Assignee: (was: Apache Spark)

> LikeSimplification handles escape character
> ---
>
> Key: SPARK-38168
> URL: https://issues.apache.org/jira/browse/SPARK-38168
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Dooyoung Hwang
>Priority: Major
>
> Currently, LikeSimplification rule of catalyst is skipped if the pattern 
> contains escape character.
> {noformat}
> SELECT * FROM tbl WHERE c_1 LIKE '%100\%'
> ...
> == Optimized Logical Plan ==
> Filter (isnotnull(c_1#0) && c_1#0 LIKE %100\%)
> +- Relation[c_1#0,c_2#1,c_3#2] ...
> {noformat}
> The filter LIKE '%100\%' in this query is not optimized into 'EndsWith' of 
> StringType.
> LikeSimplification rule can consider a special character(wildcard(%, _) or 
> escape character) as a plain character if the character follows an escape 
> character.
> By doing that, LikeSimplification rule can optimize the filter like below.
> {noformat}
> SELECT * FROM tbl WHERE c_1 LIKE '%100\%'
> ...
> == Optimized Logical Plan ==
> Filter (isnotnull(c_1#0) && EndsWith(c_1#0, 100%))
> +- Relation[c_1#0,c_2#1,c_3#2] 
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38061) security scan issue with htrace-core4-4.1.0-incubating

2022-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38061.
--
Resolution: Incomplete

> security scan issue with htrace-core4-4.1.0-incubating
> --
>
> Key: SPARK-38061
> URL: https://issues.apache.org/jira/browse/SPARK-38061
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Security
>Affects Versions: 3.2.0, 3.2.1
>Reporter: Sujit Biswas
>Priority: Major
> Attachments: image-2022-02-03-08-02-29-071.png, 
> scan-security-report-spark-3.2.0-jre-11.csv, 
> scan-security-report-spark-3.2.1-jre-11.csv
>
>
> Hi,
> running into security scan issues with a Docker image built on
> spark-3.2.0-bin-hadoop3.2; is there a way to resolve them?
> Most issues relate to https://issues.apache.org/jira/browse/HDFS-15333.
> Attaching the CVE report.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38061) security scan issue with htrace-core4-4.1.0-incubating

2022-02-09 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489608#comment-17489608
 ] 

Hyukjin Kwon commented on SPARK-38061:
--

Again, this doesn't directly affect Spark since Spark doesn't use shaded 
Jackson from htrace. If this JIRA focuses on htrace alone, please change the 
JIRA to be dedicated to htrace instead of bundling other dependency issues. I 
explained all the reasons and what to do. Please don't reopen without taking 
these actions.

> security scan issue with htrace-core4-4.1.0-incubating
> --
>
> Key: SPARK-38061
> URL: https://issues.apache.org/jira/browse/SPARK-38061
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Security
>Affects Versions: 3.2.0, 3.2.1
>Reporter: Sujit Biswas
>Priority: Major
> Attachments: image-2022-02-03-08-02-29-071.png, 
> scan-security-report-spark-3.2.0-jre-11.csv, 
> scan-security-report-spark-3.2.1-jre-11.csv
>
>
> Hi,
> running into security scan issues with a Docker image built on
> spark-3.2.0-bin-hadoop3.2; is there a way to resolve them?
> Most issues relate to https://issues.apache.org/jira/browse/HDFS-15333.
> Attaching the CVE report.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38056) Structured streaming not working in history server when using LevelDB

2022-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489591#comment-17489591
 ] 

Apache Spark commented on SPARK-38056:
--

User 'kuwii' has created a pull request for this issue:
https://github.com/apache/spark/pull/35464

> Structured streaming not working in history server when using LevelDB
> -
>
> Key: SPARK-38056
> URL: https://issues.apache.org/jira/browse/SPARK-38056
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.1.2, 3.2.0
>Reporter: wy
>Assignee: wy
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: local-1643373518829
>
>
> In 
> [SPARK-31953|https://github.com/apache/spark/commit/4f9667035886a67e6c9a4e8fad2efa390e87ca68],
>  structured streaming support was added to the history server. However, this does 
> not work when spark.history.store.path is set to save app info using LevelDB.
> This is because one of the keys of StreamingQueryData, runId, is of UUID type, 
> which is not supported by LevelDB. When replaying an event log file in the history 
> server, StreamingQueryStatusListener throws an exception when writing 
> info to the store, saying "java.lang.IllegalArgumentException: Type 
> java.util.UUID not allowed as key.".
> An example event log is provided in the attachments. When opening it in the history 
> server with spark.history.store.path set to somewhere, no structured 
> streaming info is available.
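For context, a hedged sketch of the constraint (illustrative names only, not necessarily 
what the linked pull requests do): disk-backed KVStore backends such as LevelDB reject 
java.util.UUID as a key type, so keeping the run id in string form sidesteps the error 
quoted above.

{code:scala}
// Sketch only: keep the LevelDB-indexed key a String and convert at the edges.
import java.util.UUID

case class StreamingQueryDataSketch(
    queryName: String,
    runId: String,          // stored as a String so the store can index it
    startTimestamp: Long) {
  def runUuid: UUID = UUID.fromString(runId)   // convert back when a UUID is needed
}

object StreamingQueryDataSketch {
  def fromUuid(queryName: String, runId: UUID, startTimestamp: Long): StreamingQueryDataSketch =
    StreamingQueryDataSketch(queryName, runId.toString, startTimestamp)
}
{code}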



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38056) Structured streaming not working in history server when using LevelDB

2022-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489588#comment-17489588
 ] 

Apache Spark commented on SPARK-38056:
--

User 'kuwii' has created a pull request for this issue:
https://github.com/apache/spark/pull/35463

> Structured streaming not working in history server when using LevelDB
> -
>
> Key: SPARK-38056
> URL: https://issues.apache.org/jira/browse/SPARK-38056
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.1.2, 3.2.0
>Reporter: wy
>Assignee: wy
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: local-1643373518829
>
>
> In 
> [SPARK-31953|https://github.com/apache/spark/commit/4f9667035886a67e6c9a4e8fad2efa390e87ca68],
>  structured streaming support was added to the history server. However, this does 
> not work when spark.history.store.path is set to save app info using LevelDB.
> This is because one of the keys of StreamingQueryData, runId, is of UUID type, 
> which is not supported by LevelDB. When replaying an event log file in the history 
> server, StreamingQueryStatusListener throws an exception when writing 
> info to the store, saying "java.lang.IllegalArgumentException: Type 
> java.util.UUID not allowed as key.".
> An example event log is provided in the attachments. When opening it in the history 
> server with spark.history.store.path set to somewhere, no structured 
> streaming info is available.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38056) Structured streaming not working in history server when using LevelDB

2022-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489589#comment-17489589
 ] 

Apache Spark commented on SPARK-38056:
--

User 'kuwii' has created a pull request for this issue:
https://github.com/apache/spark/pull/35463

> Structured streaming not working in history server when using LevelDB
> -
>
> Key: SPARK-38056
> URL: https://issues.apache.org/jira/browse/SPARK-38056
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.1.2, 3.2.0
>Reporter: wy
>Assignee: wy
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: local-1643373518829
>
>
> In 
> [SPARK-31953|https://github.com/apache/spark/commit/4f9667035886a67e6c9a4e8fad2efa390e87ca68],
>  structured streaming support was added to the history server. However, this does 
> not work when spark.history.store.path is set to save app info using LevelDB.
> This is because one of the keys of StreamingQueryData, runId, is of UUID type, 
> which is not supported by LevelDB. When replaying an event log file in the history 
> server, StreamingQueryStatusListener throws an exception when writing 
> info to the store, saying "java.lang.IllegalArgumentException: Type 
> java.util.UUID not allowed as key.".
> An example event log is provided in the attachments. When opening it in the history 
> server with spark.history.store.path set to somewhere, no structured 
> streaming info is available.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38168) LikeSimplification handles escape character

2022-02-09 Thread Dooyoung Hwang (Jira)
Dooyoung Hwang created SPARK-38168:
--

 Summary: LikeSimplification handles escape character
 Key: SPARK-38168
 URL: https://issues.apache.org/jira/browse/SPARK-38168
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0, 3.3.0
Reporter: Dooyoung Hwang


Currently, the LikeSimplification rule of catalyst is skipped if the pattern 
contains an escape character.

{noformat}
SELECT * FROM tbl WHERE c_1 LIKE '%100\%'

...
== Optimized Logical Plan ==
Filter (isnotnull(c_1#0) && c_1#0 LIKE %100\%)
+- Relation[c_1#0,c_2#1,c_3#2] ...
{noformat}
The filter LIKE '%100\%' in this query is not optimized into an 'EndsWith' of 
StringType.

The LikeSimplification rule can treat a special character (a wildcard % or _, or 
the escape character itself) as a plain character when it immediately follows an 
escape character.
By doing that, the LikeSimplification rule can optimize the filter as shown below.

{noformat}
SELECT * FROM tbl WHERE c_1 LIKE '%100\%'

...
== Optimized Logical Plan ==
Filter (isnotnull(c_1#0) && EndsWith(c_1#0, 100%))
+- Relation[c_1#0,c_2#1,c_3#2] 
{noformat}
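As a rough illustration of the proposed handling (a sketch only, not the catalyst 
implementation; the helper name and structure are made up), a pattern scanner can treat 
whatever follows the escape character as a plain character, so '%100\%' reduces to a 
leading wildcard plus the literal suffix "100%", which maps to EndsWith:

{code:scala}
// Sketch: reduce a LIKE pattern of the form '%<literal with escaped wildcards>'
// to the literal suffix an EndsWith predicate would use.
def simplifyToEndsWith(pattern: String, escape: Char = '\\'): Option[String] = {
  val suffix = new StringBuilder
  var i = 0
  var leadingWildcard = false
  while (i < pattern.length) {
    pattern.charAt(i) match {
      case `escape` if i + 1 < pattern.length =>
        suffix.append(pattern.charAt(i + 1)); i += 2   // escaped char is plain
      case '%' if i == 0 =>
        leadingWildcard = true; i += 1                 // single leading wildcard
      case '%' | '_' =>
        return None                                    // other wildcards: not an EndsWith
      case c =>
        suffix.append(c); i += 1
    }
  }
  if (leadingWildcard) Some(suffix.toString) else None
}

// simplifyToEndsWith("%100\\%") == Some("100%")  -> EndsWith(c_1, "100%")
{code}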
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37969) Hive Serde insert should check schema before execution

2022-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37969:
---

Assignee: angerszhu

> Hive Serde insert should check schema before execution
> --
>
> Key: SPARK-37969
> URL: https://issues.apache.org/jira/browse/SPARK-37969
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> {code:java}
> [info]   Cause: org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 0.0 (TID 0) (10.12.188.15 executor driver): 
> java.lang.IllegalArgumentException: Error: : expected at the position 19 of 
> 'struct' but '(' is found.
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:384)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:355)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:507)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:329)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:814)
> [info]at 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:112)
> [info]at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:122)
> [info]at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:105)
> [info]at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
> [info]at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:146)
> [info]at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:313)
> [info]at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$20(FileFormatWriter.scala:252)
> [info]at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> [info]at org.apache.spark.scheduler.Task.run(Task.scala:136)
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
> [info]at 
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1475)
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [info]at java.lang.Thread.run(Thread.java:748)
> [info]
> [info]   Cause: java.lang.IllegalArgumentException: field ended by ';': 
> expected ';' but got 'IF' at line 2:   optional int32 (IF
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.check(MessageTypeParser.java:239)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:208)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:113)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:101)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:94)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:84)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.getSchema(DataWritableWriteSupport.java:43)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.init(DataWritableWriteSupport.java:48)
> [info]   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:476)
> [info]   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:430)
> [info]   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:425)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.(ParquetRecordWriterWrapper.java:70)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getParquerRecordWriterWrapper(MapredParquetOutputFormat.java:137)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:126)
> [info]   at 
> 
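The gist of the check named in the title, as a hedged sketch (not the code merged for 
this ticket; the character set and helper name are assumptions): reject field names the 
Hive/Parquet serde parsers cannot handle before the write job starts, instead of failing 
inside OrcSerde or MessageTypeParser on the executors.

{code:scala}
// Sketch: validate column names up front so the failure is a clear driver-side
// error rather than a task-time parser exception.
def checkColumnNamesForHiveSerde(fieldNames: Seq[String]): Unit = {
  val forbidden = " ,;{}()\n\t="   // characters the parsers choke on (assumed set)
  val invalid = fieldNames.filter(name => name.exists(c => forbidden.indexOf(c) >= 0))
  if (invalid.nonEmpty) {
    throw new IllegalArgumentException(
      s"Column name(s) ${invalid.mkString(", ")} contain characters that the " +
        "Hive serde cannot handle; rename them before inserting.")
  }
}
{code}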

[jira] [Resolved] (SPARK-37969) Hive Serde insert should check schema before execution

2022-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37969.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35258
[https://github.com/apache/spark/pull/35258]

> Hive Serde insert should check schema before execution
> --
>
> Key: SPARK-37969
> URL: https://issues.apache.org/jira/browse/SPARK-37969
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> {code:java}
> [info]   Cause: org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 0.0 (TID 0) (10.12.188.15 executor driver): 
> java.lang.IllegalArgumentException: Error: : expected at the position 19 of 
> 'struct' but '(' is found.
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:384)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:355)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:507)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:329)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:814)
> [info]at 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:112)
> [info]at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:122)
> [info]at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:105)
> [info]at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
> [info]at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:146)
> [info]at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:313)
> [info]at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$20(FileFormatWriter.scala:252)
> [info]at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> [info]at org.apache.spark.scheduler.Task.run(Task.scala:136)
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
> [info]at 
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1475)
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [info]at java.lang.Thread.run(Thread.java:748)
> [info]
> [info]   Cause: java.lang.IllegalArgumentException: field ended by ';': 
> expected ';' but got 'IF' at line 2:   optional int32 (IF
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.check(MessageTypeParser.java:239)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:208)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:113)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:101)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:94)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:84)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.getSchema(DataWritableWriteSupport.java:43)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.init(DataWritableWriteSupport.java:48)
> [info]   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:476)
> [info]   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:430)
> [info]   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:425)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.(ParquetRecordWriterWrapper.java:70)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getParquerRecordWriterWrapper(MapredParquetOutputFormat.java:137)
> [info]   at 
> 

[jira] [Commented] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition

2022-02-09 Thread Willi Raschkowski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489579#comment-17489579
 ] 

Willi Raschkowski commented on SPARK-38166:
---

Linking SPARK-23207 (which is closed but looks very related) and SPARK-25342 
(which is open but I understand would only explain this if we were operating on 
RDDs).

> Duplicates after task failure in dropDuplicates and repartition
> ---
>
> Key: SPARK-38166
> URL: https://issues.apache.org/jira/browse/SPARK-38166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
> Environment: Cluster runs on K8s. AQE is enabled.
>Reporter: Willi Raschkowski
>Priority: Major
>  Labels: correctness
> Attachments: driver.log
>
>
> We're seeing duplicates after running the following 
> {code}
> def compute_shipments(shipments):
>     shipments = shipments.dropDuplicates(["ship_trck_num"])
>     shipments = shipments.repartition(4)
>     return shipments
> {code}
> and observing lost executors (OOMs) and task retries in the repartition stage.
> We're seeing this reliably in one of our pipelines, but I haven't managed to 
> reproduce it outside of that pipeline. I'll attach driver logs and the 
> notionalized input data - maybe you have ideas.
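A commonly suggested mitigation while the root cause is investigated (a sketch assuming 
the Scala Dataset API and the column name from the snippet above; it does not fix the 
underlying retry behaviour): hash-partition on a column instead of using round-robin 
repartition, so a retried task reproduces the same row-to-partition assignment.

{code:scala}
// Sketch: use a deterministic partitioning key instead of round-robin repartition(4).
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def computeShipments(shipments: DataFrame): DataFrame = {
  val deduped = shipments.dropDuplicates(Seq("ship_trck_num"))
  deduped.repartition(4, col("ship_trck_num"))   // rows with the same key always land together
}
{code}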



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37585) DSV2 InputMetrics are not getting update in corner case

2022-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37585:
---

Assignee: Sandeep Katta

> DSV2 InputMetrics are not getting update in corner case
> ---
>
> Key: SPARK-37585
> URL: https://issues.apache.org/jira/browse/SPARK-37585
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.2
>Reporter: Sandeep Katta
>Assignee: Sandeep Katta
>Priority: Major
>
> In some corner cases, DSV2 is not updating the input metrics.
>  
> This is a very special case where the number of records read is less than 1000 
> and *hasNext* is not called for the last element (because input.hasNext returns 
> false, so MetricsIterator.hasNext is not called).
>  
> The hasNext implementation of MetricsIterator:
>  
> {code:java}
> override def hasNext: Boolean = {
>   if (iter.hasNext) {
>     true
>   } else {
>     metricsHandler.updateMetrics(0, force = true)
>     false
>   }
> } {code}
>  
> You can reproduce this issue easily in spark-shell by running the code below:
> {code:java}
> import scala.collection.mutable
> import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
>
> spark.conf.set("spark.sql.sources.useV1SourceList", "")
> val dir = "Users/tmp1"
> spark.range(0, 100).write.format("parquet").mode("overwrite").save(dir)
> val df = spark.read.format("parquet").load(dir)
>
> val bytesReads = new mutable.ArrayBuffer[Long]()
> val recordsRead = new mutable.ArrayBuffer[Long]()
> val bytesReadListener = new SparkListener() {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead
>     recordsRead += taskEnd.taskMetrics.inputMetrics.recordsRead
>   }
> }
>
> spark.sparkContext.addSparkListener(bytesReadListener)
> try {
>   df.limit(10).collect()
>   assert(recordsRead.sum > 0)
>   assert(bytesReads.sum > 0)
> } finally {
>   spark.sparkContext.removeSparkListener(bytesReadListener)
> }
> {code}
> This code generally fails at *assert(bytesReads.sum > 0)*, which confirms that 
> the updateMetrics API is not called.
>  
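One way to picture a fix, as a hedged sketch (the class name and wiring are illustrative; 
the actual change may differ): stop relying on a trailing hasNext call and flush the 
metrics from a task-completion listener, which runs even when the consumer stops early.

{code:scala}
// Sketch: flush input metrics when the task finishes, independently of hasNext.
import org.apache.spark.TaskContext

class FlushOnCompletionIterator[T](iter: Iterator[T], flush: () => Unit)
    extends Iterator[T] {

  Option(TaskContext.get()).foreach { ctx =>
    ctx.addTaskCompletionListener[Unit](_ => flush())  // runs even if hasNext is never
  }                                                    // called after the last element

  override def hasNext: Boolean = iter.hasNext
  override def next(): T = iter.next()
}
{code}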



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37585) DSV2 InputMetrics are not getting update in corner case

2022-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37585.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35432
[https://github.com/apache/spark/pull/35432]

> DSV2 InputMetrics are not getting update in corner case
> ---
>
> Key: SPARK-37585
> URL: https://issues.apache.org/jira/browse/SPARK-37585
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.2
>Reporter: Sandeep Katta
>Assignee: Sandeep Katta
>Priority: Major
> Fix For: 3.3.0
>
>
> In some corner cases, DSV2 is not updating the input metrics.
>  
> This is a very special case where the number of records read is less than 1000 
> and *hasNext* is not called for the last element (because input.hasNext returns 
> false, so MetricsIterator.hasNext is not called).
>  
> The hasNext implementation of MetricsIterator:
>  
> {code:java}
> override def hasNext: Boolean = {
>   if (iter.hasNext) {
>     true
>   } else {
>     metricsHandler.updateMetrics(0, force = true)
>     false
>   }
> } {code}
>  
> You can reproduce this issue easily in spark-shell by running the code below:
> {code:java}
> import scala.collection.mutable
> import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
>
> spark.conf.set("spark.sql.sources.useV1SourceList", "")
> val dir = "Users/tmp1"
> spark.range(0, 100).write.format("parquet").mode("overwrite").save(dir)
> val df = spark.read.format("parquet").load(dir)
>
> val bytesReads = new mutable.ArrayBuffer[Long]()
> val recordsRead = new mutable.ArrayBuffer[Long]()
> val bytesReadListener = new SparkListener() {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead
>     recordsRead += taskEnd.taskMetrics.inputMetrics.recordsRead
>   }
> }
>
> spark.sparkContext.addSparkListener(bytesReadListener)
> try {
>   df.limit(10).collect()
>   assert(recordsRead.sum > 0)
>   assert(bytesReads.sum > 0)
> } finally {
>   spark.sparkContext.removeSparkListener(bytesReadListener)
> }
> {code}
> This code generally fails at *assert(bytesReads.sum > 0)*, which confirms that 
> the updateMetrics API is not called.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37652) Support optimize skewed join through union

2022-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37652.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34908
[https://github.com/apache/spark/pull/34908]

> Support optimize skewed join through union
> --
>
> Key: SPARK-37652
> URL: https://issues.apache.org/jira/browse/SPARK-37652
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: mcdull_zhang
>Priority: Minor
> Fix For: 3.3.0
>
>
> The `OptimizeSkewedJoin` rule takes effect only when the plan has two 
> ShuffleQueryStageExec nodes.
> With `Union`, this assumption can break. For example, consider the following plans.
> *scenario 1*
> {noformat}
> Union
>   SMJ
>     ShuffleQueryStage
>     ShuffleQueryStage
>   SMJ
>     ShuffleQueryStage
>     ShuffleQueryStage
> {noformat}
> *scenario 2*
> {noformat}
> Union
>   SMJ
>     ShuffleQueryStage
>     ShuffleQueryStage
>   HashAggregate
> {noformat}
> When one or more of the SMJs in the above plans have skewed data, they cannot be 
> optimized at present.
> It would be better to support partial optimization under Union.
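A rough sketch of the idea (illustrative only; `optimizeSkewedJoin` stands in for the 
real AQE rule and the wiring is an assumption): apply the skew handling to each Union 
child independently, so a child whose sort-merge join does have two shuffle query stages 
can still be optimized even when a sibling cannot.

{code:scala}
// Sketch: optimize each Union child on its own rather than requiring the whole
// plan to match the two-ShuffleQueryStage shape.
import org.apache.spark.sql.execution.{SparkPlan, UnionExec}

def optimizePerUnionChild(
    plan: SparkPlan,
    optimizeSkewedJoin: SparkPlan => SparkPlan): SparkPlan = plan match {
  case u: UnionExec => u.withNewChildren(u.children.map(optimizeSkewedJoin))
  case other        => optimizeSkewedJoin(other)
}
{code}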



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38167) CSV parsing error when using escape='"'

2022-02-09 Thread Marnix van den Broek (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marnix van den Broek updated SPARK-38167:
-
Description: 
hi all,

When reading CSV files with Spark, I ran into a parsing bug.

{*}The summary{*}:

When
 # reading a comma separated, double-quote quoted CSV file using the csv reader 
options _escape='"'_ and {_}header=True{_},
 # with a row containing a quoted empty field
 # followed by a quoted field starting with a comma and followed by one or more 
characters

selecting columns from the dataframe at or after the field described in 3) 
gives incorrect and inconsistent results

{*}In detail{*}:

When I instruct Spark to read this CSV file:

 
{code:java}
col1,col2
"",",a"
{code}
 

using the CSV reader options escape='"' (unnecessary for the example, necessary 
for the files I'm processing) and header=True, I expect the following result:

 
{code:java}
spark.read.csv(path, escape='"', header=True).show()
 
+++
|col1|col2|
+++
|null|  ,a|
+++   {code}
 
 Spark does yield this result, so far so good. However, when I select col2 from 
the dataframe, Spark yields an incorrect result:

 
{code:java}
spark.read.csv(path, escape='"', header=True).select('col2').show()
 
++
|col2|
++
|  a"|
++{code}
 
If you run this example with more columns in the file, and more commas in the 
field, e.g. ",,,a", the problem compounds, as Spark shifts many values to 
the right, causing unexpected and incorrect results. The inconsistency between 
the two methods surprised me, as it implies the parsing is evaluated differently 
in each. 

I expect the bug to be located in the quote-balancing and un-escaping methods 
of the csv parser, but I can't find where that code is located in the code 
base. I'd be happy to take a look at it if anyone can point me to where it is. 

  was:
hi all,

When reading CSV files with Spark, I ran into a parsing bug.

{*}The summary{*}:

When
 # reading a comma separated, double-quote quoted CSV file using the csv reader 
options _escape='"'_ and {_}header=True{_},
 # with a row containing a quoted empty field
 # followed by a quoted field starting with a comma and followed by one or more 
characters

selecting columns from the dataframe at or after the field described in 3) 
gives incorrect and inconsistent results

{*}In detail{*}:

When I instruct Spark to read this CSV file:
{quote}col1,col2

{{"",",a"}}
{quote}
using the CSV reader options escape='"' (unnecessary for the example, necessary 
for the files I'm processing) and header=True, I expect the following result:
{quote}spark.read.csv(path, escape='"', header=True).show()
 
|*col1*|*col2*|
|null|,a|
{quote}
 Spark does yield this result, so far so good. However, when I select col2 from 
the dataframe, Spark yields an incorrect result:
{quote}spark.read.csv(path, escape='"', header=True).select('col2').show()
 
|*col2*|
|a"|
{quote}
If you run this example with more columns in the file, and more commas in the 
field, e.g. ",,,a", the problem compounds, as Spark shifts many values to 
the right, causing unexpected and incorrect results. The inconsistency between 
both methods surprised me, as it implies the parsing is evaluated differently 
between both methods. 

I expect the bug to be located in the quote-balancing and un-escaping methods 
of the csv parser, but I can't find where that code is located in the code 
base. I'd be happy to take a look at it if anyone can point me where it is. 


> CSV parsing error when using escape='"' 
> 
>
> Key: SPARK-38167
> URL: https://issues.apache.org/jira/browse/SPARK-38167
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.2.1
> Environment: Pyspark on a single-node Databricks managed Spark 3.1.2 
> cluster.
>Reporter: Marnix van den Broek
>Priority: Major
>  Labels: correctness, csv, csvparser, data-integrity
>
> hi all,
> When reading CSV files with Spark, I ran into a parsing bug.
> {*}The summary{*}:
> When
>  # reading a comma separated, double-quote quoted CSV file using the csv 
> reader options _escape='"'_ and {_}header=True{_},
>  # with a row containing a quoted empty field
>  # followed by a quoted field starting with a comma and followed by one or 
> more characters
> selecting columns from the dataframe at or after the field described in 3) 
> gives incorrect and inconsistent results
> {*}In detail{*}:
> When I instruct Spark to read this CSV file:
>  
> {code:java}
> col1,col2
> "",",a"
> {code}
>  
> using the CSV reader options escape='"' (unnecessary for the example, 
> necessary for the files I'm processing) and header=True, I expect the 
> following result:
>  
> {code:java}
> spark.read.csv(path, escape='"', 

[jira] [Updated] (SPARK-38167) CSV parsing error when using escape='"'

2022-02-09 Thread Marnix van den Broek (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marnix van den Broek updated SPARK-38167:
-
Description: 
hi all,

When reading CSV files with Spark, I ran into a parsing bug.

{*}The summary{*}:

When
 # reading a comma separated, double-quote quoted CSV file using the csv reader 
options _escape='"'_ and {_}header=True{_},
 # with a row containing a quoted empty field
 # followed by a quoted field starting with a comma and followed by one or more 
characters

selecting columns from the dataframe at or after the field described in 3) 
gives incorrect and inconsistent results

{*}In detail{*}:

When I instruct Spark to read this CSV file:
{quote}col1,col2

{{"",",a"}}
{quote}
using the CSV reader options escape='"' (unnecessary for the example, necessary 
for the files I'm processing) and header=True, I expect the following result:
{quote}spark.read.csv(path, escape='"', header=True).show()
 
|*col1*|*col2*|
|null|,a|
{quote}
 Spark does yield this result, so far so good. However, when I select col2 from 
the dataframe, Spark yields an incorrect result:
{quote}spark.read.csv(path, escape='"', header=True).select('col2').show()
 
|*col2*|
|a"|
{quote}
If you run this example with more columns in the file, and more commas in the 
field, e.g. ",,,a", the problem compounds, as Spark shifts many values to 
the right, causing unexpected and incorrect results. The inconsistency between 
both methods surprised me, as it implies the parsing is evaluated differently 
between both methods. 

I expect the bug to be located in the quote-balancing and un-escaping methods 
of the csv parser, but I can't find where that code is located in the code 
base. I'd be happy to take a look at it if anyone can point me where it is. 

  was:
hi all,

When reading CSV files with Spark, I ran into a parsing bug.

{*}The summary{*}:

When
 # reading a comma separated, double-quote quoted CSV file using the csv reader 
options _escape='"'_ and {_}header=True{_},
 # with a row containing a quoted empty field
 # followed by a quoted field starting with a comma and followed by one or more 
characters

selecting columns from the dataframe at or after the field described in 3) 
gives incorrect and inconsistent results

{*}In detail{*}:

When I instruct Spark to read this CSV file:
{quote}{{col1,col2}}

{{"",",a"}}
{quote}
using the CSV reader options escape='"' (unnecessary for the example, necessary 
for the files I'm processing) and header=True, I expect the following result:
{quote}spark.read.csv(path, escape='"', header=True).show()

 
|*col1*|*col2*|
|null|,a|

 
{quote}
Spark does yield this result, so far so good. However, when I select col2 from 
the dataframe, Spark yields an incorrect result:
{quote}spark.read.csv(path, escape='"', header=True).select('col2').show()

 
|*col2*|
|a"|
{quote}
If you run this example with more columns in the file, and more commas in the 
field, e.g. ",,,a", the problem compounds, as Spark shifts many values to 
the right, causing unexpected and incorrect results. The inconsistency between 
both methods surprised me, as it implies the parsing is evaluated differently 
between both methods. 

I expect the bug to be located in the quote-balancing and un-escaping methods 
of the csv parser, but I can't find where that code is located in the code 
base. I'd be happy to take a look at it if anyone can point me where it is. 


> CSV parsing error when using escape='"' 
> 
>
> Key: SPARK-38167
> URL: https://issues.apache.org/jira/browse/SPARK-38167
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.2.1
> Environment: Pyspark on a single-node Databricks managed Spark 3.1.2 
> cluster.
>Reporter: Marnix van den Broek
>Priority: Major
>  Labels: correctness, csv, csvparser, data-integrity
>
> hi all,
> When reading CSV files with Spark, I ran into a parsing bug.
> {*}The summary{*}:
> When
>  # reading a comma separated, double-quote quoted CSV file using the csv 
> reader options _escape='"'_ and {_}header=True{_},
>  # with a row containing a quoted empty field
>  # followed by a quoted field starting with a comma and followed by one or 
> more characters
> selecting columns from the dataframe at or after the field described in 3) 
> gives incorrect and inconsistent results
> {*}In detail{*}:
> When I instruct Spark to read this CSV file:
> {quote}col1,col2
> {{"",",a"}}
> {quote}
> using the CSV reader options escape='"' (unnecessary for the example, 
> necessary for the files I'm processing) and header=True, I expect the 
> following result:
> {quote}spark.read.csv(path, escape='"', header=True).show()
>  
> |*col1*|*col2*|
> |null|,a|
> {quote}
>  Spark does yield this result, so far so 

[jira] [Created] (SPARK-38167) CSV parsing error when using escape='"'

2022-02-09 Thread Marnix van den Broek (Jira)
Marnix van den Broek created SPARK-38167:


 Summary: CSV parsing error when using escape='"' 
 Key: SPARK-38167
 URL: https://issues.apache.org/jira/browse/SPARK-38167
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 3.2.1
 Environment: Pyspark on a single-node Databricks managed Spark 3.1.2 
cluster.
Reporter: Marnix van den Broek


hi all,

When reading CSV files with Spark, I ran into a parsing bug.

{*}The summary{*}:

When
 # reading a comma separated, double-quote quoted CSV file using the csv reader 
options _escape='"'_ and {_}header=True{_},
 # with a row containing a quoted empty field
 # followed by a quoted field starting with a comma and followed by one or more 
characters

selecting columns from the dataframe at or after the field described in 3) 
gives incorrect and inconsistent results

{*}In detail{*}:

When I instruct Spark to read this CSV file:
{quote}{{col1,col2}}

{{"",",a"}}
{quote}
using the CSV reader options escape='"' (unnecessary for the example, necessary 
for the files I'm processing) and header=True, I expect the following result:
{quote}spark.read.csv(path, escape='"', header=True).show()

 
|*col1*|*col2*|
|null|,a|

 
{quote}
Spark does yield this result, so far so good. However, when I select col2 from 
the dataframe, Spark yields an incorrect result:
{quote}spark.read.csv(path, escape='"', header=True).select('col2').show()

 
|*col2*|
|a"|
{quote}
If you run this example with more columns in the file, and more commas in the 
field, e.g. ",,,a", the problem compounds, as Spark shifts many values to 
the right, causing unexpected and incorrect results. The inconsistency between 
the two methods surprised me, as it implies the parsing is evaluated differently 
in each. 

I expect the bug to be located in the quote-balancing and un-escaping methods 
of the csv parser, but I can't find where that code is located in the code 
base. I'd be happy to take a look at it if anyone can point me to where it is. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34511) Current Security vulnerabilities in spark libraries

2022-02-09 Thread Abhinav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489544#comment-17489544
 ] 

Abhinav Kumar commented on SPARK-34511:
---

Updating to Spark 3.2.1 does solve most of the issues. The critical ones left in 
3.2.1 are log4j 1.2.17 and htrace-core4-4.1.0-incubating (SPARK-38061).

We still have medium-severity vulnerabilities.

> Current Security vulnerabilities in spark libraries
> ---
>
> Key: SPARK-34511
> URL: https://issues.apache.org/jira/browse/SPARK-34511
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.1.1
>Reporter: eoin
>Priority: Major
>  Labels: security
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The following libraries have the following vulnerabilities that will fail 
> Nexus security scans. They are deemed as threats of level 7 and higher on the 
> Sonatype/Nexus scale. Many of them can be fixed by upgrading the dependencies, 
> as they are fixed in subsequent releases.
>   
> [Update - still present]com.fasterxml.woodstox : woodstox-core : 5.0.3 * 
> [https://github.com/FasterXML/woodstox/issues/50]
>  * [https://github.com/FasterXML/woodstox/issues/51]
>  * [https://github.com/FasterXML/woodstox/issues/61]
> [Update - still present]com.nimbusds : nimbus-jose-jwt : 4.41.1 * 
> [https://bitbucket.org/connect2id/nimbus-jose-jwt/src/master/SECURITY-CHANGELOG.txt]
>  * [https://connect2id.com/blog/nimbus-jose-jwt-7-9]
> [Update - still present]Log4j : log4j : 1.2.17
>  SocketServer class that is vulnerable to deserialization of untrusted data: 
> * https://issues.apache.org/jira/browse/LOG4J2-1863
>  * 
> [https://lists.apache.org/thread.html/84cc4266238e057b95eb95dfd8b29d46a2592e7672c12c92f68b2917%40%3Cannounce.apache.org%3E]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=1785616]
>           Dynamic-link Library (DLL) Preloading:
>  * [https://bz.apache.org/bugzilla/show_bug.cgi?id=50323]
>  
> [Fixed]-apache-xerces : xercesImpl : 2.9.1 * hash table collisions -> 
> https://issues.apache.org/jira/browse/XERCESJ-1685-
>  * 
> -[https://mail-archives.apache.org/mod_mbox/xerces-j-dev/201410.mbox/%3cof3b40f5f7.e6552a8b-on85257d73.00699ed7-85257d73.006a9...@ca.ibm.com%3E]-
>  * [-https://bugzilla.redhat.com/show_bug.cgi?id=1019176-]
>  
> [Update - still present]com.fasterxml.jackson.core : jackson-databind : 
> 2.10.0 * [https://github.com/FasterXML/jackson-databind/issues/2589]
>  
> [Update - still present ]commons-beanutils : commons-beanutils : 1.9.3 * 
> [http://www.rapid7.com/db/modules/exploit/multi/http/struts_code_exec_classloader]
>  * https://issues.apache.org/jira/browse/BEANUTILS-463
>  
> [Update - still present ]commons-io : commons-io : 2.5 * 
> [https://github.com/apache/commons-io/pull/52]
>  * https://issues.apache.org/jira/browse/IO-556
>  * https://issues.apache.org/jira/browse/IO-559
>  
> [Upgraded to 4.1.51.Final still with vulnerabilities, see new below]-io.netty 
> : netty-all : 4.1.47.Final * [https://github.com/netty/netty/issues/10351]-
>  * [-https://github.com/netty/netty/pull/10560-]
>  
> [Update - still present]org.apache.commons : commons-compress : 1.18 * 
> [https://commons.apache.org/proper/commons-compress/security-reports.html#Apache_Commons_Compress_Security_Vulnerabilities]
>  
> [Update - changed to
> org.apache.hadoop : hadoop-hdfs-client : 3.2.0 see new below
> ]-org.apache.hadoop : hadoop-hdfs : 2.7.4 * 
> [https://lists.apache.org/thread.html/rca4516b00b55b347905df45e5d0432186248223f30497db87aba8710@%3Cannounce.apache.org%3E]-
>  * 
> -[https://lists.apache.org/thread.html/caacbbba2dcc1105163f76f3dfee5fbd22e0417e0783212787086378@%3Cgeneral.hadoop.apache.org%3E]-
>  * -[https://hadoop.apache.org/cve_list.html]-
>  * -[https://www.openwall.com/lists/oss-security/2019/01/24/3]-
>   --  
>  -org.apache.hadoop : hadoop-mapreduce-client-core : 2.7.4 * 
> [https://bugzilla.redhat.com/show_bug.cgi?id=1516399]-
>  * 
> -[https://lists.apache.org/thread.html/2e16689b44bdd1976b6368c143a4017fc7159d1f2d02a5d54fe9310f@%3Cgeneral.hadoop.apache.org%3E]-
>  
> [Update - still present]org.codehaus.jackson : jackson-mapper-asl : 1.9.13 * 
> [https://github.com/FasterXML/jackson-databind/issues/1599]
>  * [https://blog.sonatype.com/jackson-databind-remote-code-execution]
>  * [https://blog.sonatype.com/jackson-databind-the-end-of-the-blacklist]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2017-7525]
>  * [https://access.redhat.com/security/cve/cve-2019-10172]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=1715075]
>  * [https://nvd.nist.gov/vuln/detail/CVE-2019-10172]
>  
> [Update - still present]org.eclipse.jetty : jetty-http : 9.3.24.v20180605: * 
> [https://bugs.eclipse.org/bugs/show_bug.cgi?id=538096]
>  
> [Update -still present]org.eclipse.jetty : 

[jira] [Commented] (SPARK-38061) security scan issue with htrace-core4-4.1.0-incubating

2022-02-09 Thread Abhinav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489537#comment-17489537
 ] 

Abhinav Kumar commented on SPARK-38061:
---

[~hyukjin.kwon] [~sujitbiswas] Are we agreeing to track the vulnerability fix 
for htrace-core4-4.1.0-incubating (building it with Jackson 2.12.3 or later)? 
By the way, even 2.12.3 shows up with a medium-criticality vulnerability, but 
that is a battle for another day.

Also, [~hyukjin.kwon] I was hoping to see if we can release another version of 
Spark, say 3.2.3, with vulnerability fixes. The issue is that we are using Spark 
in our company and management is getting concerned about these vulnerabilities. 
What do you think?

> security scan issue with htrace-core4-4.1.0-incubating
> --
>
> Key: SPARK-38061
> URL: https://issues.apache.org/jira/browse/SPARK-38061
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Security
>Affects Versions: 3.2.0, 3.2.1
>Reporter: Sujit Biswas
>Priority: Major
> Attachments: image-2022-02-03-08-02-29-071.png, 
> scan-security-report-spark-3.2.0-jre-11.csv, 
> scan-security-report-spark-3.2.1-jre-11.csv
>
>
> Hi,
> running into a security scan issue with a docker image built on 
> spark-3.2.0-bin-hadoop3.2; is there a way to resolve this?
>  
> Most issues are related to https://issues.apache.org/jira/browse/HDFS-15333; 
> attaching the CVE report.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36145) Remove Python 3.6 support in codebase and CI

2022-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36145:
-
Labels: release-notes  (was: )

> Remove Python 3.6 support in codebase and CI
> 
>
> Key: SPARK-36145
> URL: https://issues.apache.org/jira/browse/SPARK-36145
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Critical
>  Labels: release-notes
>
> Python 3.6 was deprecated via SPARK-35938 in Apache Spark 3.2. We should 
> remove it in Spark 3.3.
> This JIRA also targets all the changes in CI and development, not only 
> user-facing changes.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


