[jira] [Updated] (SPARK-38584) Unify the data validation

2022-03-17 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-38584:
-
Description: 
1, Input vector validation is missing in most algorithms. When the input 
dataset contains invalid values (NaN/Infinity):
 * the training may run successfully and return a model containing invalid 
coefficients, like LinearSVC
 * the training may fail with an irrelevant message, like KMeans

 
{code:java}
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.clustering._
val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)),
  LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0)))).toDF()

val svc = new LinearSVC()
val model = svc.fit(df)

scala> model.intercept
res0: Double = NaN

scala> model.coefficients
res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN]

val km = new KMeans().setK(2)
scala> km.fit(df)
22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 113)
java.lang.IllegalArgumentException: requirement failed: Both norms should be 
greater or equal to 0.0, found norm1=NaN, norm2=Infinity
    at scala.Predef$.require(Predef.scala:281)
    at 
org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543)
{code}
 

We should make ML algorithms fail fast if the input dataset is invalid.
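A minimal sketch of what such a fail-fast check could look like (the helper name {{validateVectors}} below is an assumption for illustration, not the actual Spark implementation):
{code:java}
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.ml.linalg.Vector

// Hypothetical helper: scan the feature column and fail fast on NaN/Infinity.
def validateVectors(df: DataFrame, featuresCol: String): Unit = {
  val hasInvalid = df.select(featuresCol).rdd.map {
    case Row(v: Vector) => v.toArray.exists(x => x.isNaN || x.isInfinite)
  }.fold(false)(_ || _)
  require(!hasInvalid,
    s"Column $featuresCol contains NaN or Infinity values; aborting training.")
}
{code}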

 

2, There exist several methods to validate input labels and weights, spread 
across different files:
 * {{org.apache.spark.ml.functions}}
 * {{org.apache.spark.ml.util.DatasetUtils}}
 * {{org.apache.spark.ml.util.MetadataUtils}}
 * {{org.apache.spark.ml.Predictor}}
 * etc.

 

I think it is time to unify the related methods into one source file.
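For illustration, one possible shape for such a unified helper (the object name {{DatasetValidation}} and the method signatures are assumptions for this sketch, not an existing Spark API):
{code:java}
import org.apache.spark.sql.{Column, functions => F}

// Hypothetical single home for checks currently spread across several files.
object DatasetValidation {
  // 0/1 labels for binary classifiers.
  def checkBinaryLabels(labelCol: String): Column = {
    val c = F.col(labelCol)
    F.when(c === 0.0 || c === 1.0, c)
      .otherwise(F.raise_error(F.lit(s"Invalid binary label in column $labelCol")))
  }

  // Non-negative, non-NaN weights.
  def checkNonNegativeWeights(weightCol: String): Column = {
    val c = F.col(weightCol)
    F.when(c >= 0.0 && !c.isNaN, c)
      .otherwise(F.raise_error(F.lit(s"Invalid weight in column $weightCol")))
  }
}
{code}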

 

  was:
1, Input vector validation is missing in most algorithms. When the input 
dataset contains invalid values (NaN/Infinity):
 * the training may run successfully and return a model containing invalid 
coefficients, like LinearSVC
 * the training may fail with an irrelevant message, like KMeans

 
{code:java}
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.clustering._
val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)),
  LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0)))).toDF()

val svc = new LinearSVC()
val model = svc.fit(df)

scala> model.intercept
res0: Double = NaN

scala> model.coefficients
res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN]

val km = new KMeans().setK(2)
scala> km.fit(df)
22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 113)
java.lang.IllegalArgumentException: requirement failed: Both norms should be 
greater or equal to 0.0, found norm1=NaN, norm2=Infinity
    at scala.Predef$.require(Predef.scala:281)
    at 
org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543)
{code}
 

2, Related methods to validate the input dataset (like labels/weights) exist in 
different files:

{{org.apache.spark.ml.functions}}, {{org.apache.spark.ml.util.DatasetUtils}}, 
{{org.apache.spark.ml.util.MetadataUtils}},

{{org.apache.spark.ml.Predictor}}, etc.

 

I think it is time to unify the related methods into one source file.

 


> Unify the data validation
> -
>
> Key: SPARK-38584
> URL: https://issues.apache.org/jira/browse/SPARK-38584
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>
> 1, Input vector validation is missing in most algorithms. When the input 
> dataset contains invalid values (NaN/Infinity):
>  * the training may run successfully and return a model containing invalid 
> coefficients, like LinearSVC
>  * the training may fail with an irrelevant message, like KMeans
>  
> {code:java}
> import org.apache.spark.ml.feature._
> import org.apache.spark.ml.linalg._
> import org.apache.spark.ml.classification._
> import org.apache.spark.ml.clustering._
> val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, 
> Double.NaN)), LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 
> 2.0)))).toDF()
> val svc = new LinearSVC()
> val model = svc.fit(df)
> scala> model.intercept
> res0: Double = NaN
> scala> model.coefficients
> res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN]
> val km = new KMeans().setK(2)
> scala> km.fit(df)
> 22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 
> 113)
> java.lang.IllegalArgumentException: requirement failed: Both norms should be 
> greater or equal to 0.0, found norm1=NaN, norm2=Infinity
>     at scala.Predef$.require(Predef.scala:281)
>     at 
> org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543)
> {code}
>  
> 

[jira] [Updated] (SPARK-38521) Throw Exception if overwriting hive partition table with dynamic and staticPartitionOverwriteMode

2022-03-17 Thread Jackey Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jackey Lee updated SPARK-38521:
---
Affects Version/s: 3.4.0
   (was: 3.3.0)

> Throw Exception if overwriting hive partition table with dynamic and 
> staticPartitionOverwriteMode
> -
>
> Key: SPARK-38521
> URL: https://issues.apache.org/jira/browse/SPARK-38521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jackey Lee
>Assignee: Jackey Lee
>Priority: Major
> Fix For: 3.3.0
>
>
> The `spark.sql.sources.partitionOverwriteMode` setting allows us to overwrite 
> the existing data of a table through static mode, but for a Hive table this is 
> disastrous: it may delete all data in the Hive partitioned table when writing 
> with dynamic overwrite and `partitionOverwriteMode=STATIC`.
> Here we add a check for this and throw an exception if this happens.
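As a hedged illustration of the kind of write this check guards against (the table name below is hypothetical):
{code:java}
// df: a DataFrame whose schema matches the target table (assumed).
// Static overwrite mode configured globally...
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")

// ...while the write relies on dynamic partition values from the data itself.
// Without a guard, the overwrite can drop every existing partition of the table.
df.write
  .mode("overwrite")
  .insertInto("db.hive_partitioned_table")
{code}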






[jira] [Updated] (SPARK-38521) Throw Exception if overwriting hive partition table with dynamic and staticPartitionOverwriteMode

2022-03-17 Thread Jackey Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jackey Lee updated SPARK-38521:
---
Fix Version/s: 3.4.0
   (was: 3.3.0)

> Throw Exception if overwriting hive partition table with dynamic and 
> staticPartitionOverwriteMode
> -
>
> Key: SPARK-38521
> URL: https://issues.apache.org/jira/browse/SPARK-38521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jackey Lee
>Assignee: Jackey Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> The `spark.sql.sources.partitionOverwriteMode` setting allows us to overwrite 
> the existing data of a table through static mode, but for a Hive table this is 
> disastrous: it may delete all data in the Hive partitioned table when writing 
> with dynamic overwrite and `partitionOverwriteMode=STATIC`.
> Here we add a check for this and throw an exception if this happens.






[jira] [Updated] (SPARK-38427) DataFilter pushed down with PartitionFilter for Orc

2022-03-17 Thread Jackey Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jackey Lee updated SPARK-38427:
---
Affects Version/s: 3.4.0
   (was: 3.3.0)

> DataFilter pushed down with PartitionFilter for Orc
> ---
>
> Key: SPARK-38427
> URL: https://issues.apache.org/jira/browse/SPARK-38427
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jackey Lee
>Priority: Major
>
> At present, for the ORC data source, a Filter is split into a DataFilter and 
> a PartitionFilter when it is pushed down. Because the PartitionFilter is 
> stripped from the pushed-down Filter, every partition has to evaluate all of 
> the DataFilter conditions, which may cause a full data scan.
> Based on SPARK-38041, we can push the DataFilter down to ORC together with the 
> PartitionFilter, and remove the PartitionFilter at runtime.
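For context, a hedged sketch of the query shape involved (the table layout is hypothetical; the predicate mirrors the plans shown in SPARK-38041):
{code:java}
// Hypothetical ORC table partitioned by column c.
spark.range(0, 1000)
  .selectExpr("id as a", "id * 2 as b", "id % 5 as c")
  .write
  .partitionBy("c")
  .format("orc")
  .saveAsTable("t")

// Mixed predicate: part constrains the partition column c, part the data column a.
val q = spark.table("t")
  .where("(a < 10 AND c = 0) OR (a >= 10 AND c >= 1 AND c < 3)")

// Without the combined pushdown, the data filter handed to ORC loses the c
// conditions, so every surviving partition evaluates the weaker filter.
q.explain()
{code}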






[jira] [Updated] (SPARK-38440) DataFilter pushed down with PartitionFilter for Parquet V1 Datasource

2022-03-17 Thread Jackey Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jackey Lee updated SPARK-38440:
---
Affects Version/s: 3.4.0
   (was: 3.3.0)

> DataFilter pushed down with PartitionFilter for Parquet V1 Datasource
> -
>
> Key: SPARK-38440
> URL: https://issues.apache.org/jira/browse/SPARK-38440
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jackey Lee
>Priority: Major
>
> Based on SPARK-38041, we can push the DataFilter down to the Parquet V1 
> datasource together with the PartitionFilter, and remove the PartitionFilter 
> at runtime.






[jira] [Updated] (SPARK-38433) Add Shell Code Style Check Action

2022-03-17 Thread Jackey Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jackey Lee updated SPARK-38433:
---
Affects Version/s: 3.4.0
   (was: 3.3.0)

> Add Shell Code Style Check Action
> -
>
> Key: SPARK-38433
> URL: https://issues.apache.org/jira/browse/SPARK-38433
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Jackey Lee
>Priority: Major
>
> There is no shell code check in the current Spark GitHub Actions. Some shell 
> scripts are written incorrectly and behave abnormally in special cases. They 
> also fail the checks of the shellcheck plugin, especially in IDEA or in 
> shellcheck actions.






[jira] [Updated] (SPARK-37933) Limit push down for parquet datasource v2

2022-03-17 Thread Jackey Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jackey Lee updated SPARK-37933:
---
Affects Version/s: 3.4.0
   (was: 3.3.0)

> Limit push down for parquet datasource v2
> -
>
> Key: SPARK-37933
> URL: https://issues.apache.org/jira/browse/SPARK-37933
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jackey Lee
>Assignee: Jackey Lee
>Priority: Major
> Fix For: 3.3.0
>
>
> Based on SPARK-37020, we can support limit push down to the Parquet datasource 
> v2 reader. It can stop scanning Parquet early and reduce network and disk IO.
> Current plan for a limit query on Parquet:
> {code:java}
> == Parsed Logical Plan ==
> GlobalLimit 10
> +- LocalLimit 10
>    +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down
> == Analyzed Logical Plan ==
> a: int, b: int
> GlobalLimit 10
> +- LocalLimit 10
>    +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down
> == Optimized Logical Plan ==
> GlobalLimit 10
> +- LocalLimit 10
>    +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down
> == Physical Plan ==
> CollectLimit 10
> +- *(1) ColumnarToRow
>    +- BatchScan[a#0, b#1] ParquetScan DataFilters: [], Format: parquet, 
> Location: InMemoryFileIndex(1 
> paths)[file:/datasources.db/test_push_down/par..., PartitionFilters: [], 
> PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ReadSchema: 
> struct, PushedFilters: [], PushedAggregation: [], PushedGroupBy: 
> [] RuntimeFilters: [] {code}
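For reference, a hedged sketch of the kind of query behind a plan like the one above (reusing the path shown in the plan):
{code:java}
// Read a Parquet table through the DataSource V2 reader and take 10 rows.
// With limit pushdown, the scan could stop early instead of reading every file.
val limited = spark.read
  .parquet("file:/datasources.db/test_push_down")
  .limit(10)

limited.explain()
{code}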






[jira] [Updated] (SPARK-38041) DataFilter pushed down with PartitionFilter

2022-03-17 Thread Jackey Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jackey Lee updated SPARK-38041:
---
Affects Version/s: 3.4.0
   (was: 3.3.0)

> DataFilter pushed down with PartitionFilter
> ---
>
> Key: SPARK-38041
> URL: https://issues.apache.org/jira/browse/SPARK-38041
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jackey Lee
>Priority: Major
>
> At present, a Filter is split into a DataFilter and a PartitionFilter when it 
> is pushed down. Because the PartitionFilter is stripped from the pushed-down 
> Filter, every partition has to evaluate all of the DataFilter conditions, 
> which may cause a full data scan.
> Here is an example.
> before
> {code:java}
> == Physical Plan ==
> *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 
> 1)) AND (c#42 < 3)))
> +- *(1) ColumnarToRow
>    +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L 
> < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], 
> Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: 
> [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], 
> PushedFilters: [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedGroupBy: 
> [], ReadSchema: struct, PushedFilters: 
> [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedAggregation: [], 
> PushedGroupBy: [] RuntimeFilters: []
> {code}
> after
> {code:java}
> == Physical Plan ==
> *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 
> 1)) AND (c#42 < 3)))
> +- *(1) ColumnarToRow
>    +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L 
> < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], 
> Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: 
> [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], 
> PushedFilters: 
> [Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),Le...,
>  PushedGroupBy: [], ReadSchema: struct, PushedFilters: 
> [Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),LessThan(c,3)))],
>  PushedAggregation: [], PushedGroupBy: [] RuntimeFilters: [] {code}






[jira] [Updated] (SPARK-37919) SQL UI shows accurate Metrics with stage retries

2022-03-17 Thread Jackey Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jackey Lee updated SPARK-37919:
---
Affects Version/s: 3.4.0
   (was: 3.3.0)

> SQL UI shows accurate Metrics with stage retries
> 
>
> Key: SPARK-37919
> URL: https://issues.apache.org/jira/browse/SPARK-37919
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Jackey Lee
>Priority: Major
>
> At present, the SQL UI page can show us the metrics of a SQL execution. However, 
> the metrics are not accurate when there are task or stage retries. This change 
> provides accurate metrics on the UI page, mainly based on SPARK-37831.






[jira] [Updated] (SPARK-37831) Add task partition id in metrics

2022-03-17 Thread Jackey Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jackey Lee updated SPARK-37831:
---
Affects Version/s: (was: 3.2.1)

> Add task partition id in metrics
> 
>
> Key: SPARK-37831
> URL: https://issues.apache.org/jira/browse/SPARK-37831
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Jackey Lee
>Priority: Major
>
> There is no partition id in the current metrics, which makes it difficult to 
> trace stage metrics, such as stage shuffle read, especially when there are 
> stage retries. It is also impossible to compare task metrics between different 
> applications.
> {code:java}
> class TaskData private[spark](
> val taskId: Long,
> val index: Int,
> val attempt: Int,
> val launchTime: Date,
> val resultFetchStart: Option[Date],
> @JsonDeserialize(contentAs = classOf[JLong])
> val duration: Option[Long],
> val executorId: String,
> val host: String,
> val status: String,
> val taskLocality: String,
> val speculative: Boolean,
> val accumulatorUpdates: Seq[AccumulableInfo],
> val errorMessage: Option[String] = None,
> val taskMetrics: Option[TaskMetrics] = None,
> val executorLogs: Map[String, String],
> val schedulerDelay: Long,
> val gettingResultTime: Long) {code}
> Adding partitionId to TaskData not only makes it easy to trace task metrics, 
> but also makes it possible to collect metrics for the actual stage outputs, 
> especially when a stage retries.
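A minimal sketch of the proposed addition (the standalone class name is hypothetical and the surrounding fields are abbreviated from the class shown above):
{code:java}
import java.util.Date

// Trimmed-down sketch of TaskData with the proposed partitionId field added;
// the remaining fields of the real class are omitted for brevity.
case class TaskDataSketch(
    taskId: Long,
    index: Int,
    attempt: Int,
    partitionId: Int, // proposed addition: the RDD partition this task computed
    launchTime: Date,
    executorId: String,
    host: String,
    status: String)
{code}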






[jira] [Resolved] (SPARK-38563) Upgrade to Py4J 0.10.9.5

2022-03-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38563.
--
Fix Version/s: 3.3.0
   3.2.2
   Resolution: Fixed

Issue resolved by pull request 35907
[https://github.com/apache/spark/pull/35907]

> Upgrade to Py4J 0.10.9.5
> 
>
> Key: SPARK-38563
> URL: https://issues.apache.org/jira/browse/SPARK-38563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 3.3.0, 3.2.2
>
>
> There is a resource leak bug, see https://github.com/py4j/py4j/pull/471. We 
> should upgrade Py4J to 0.10.9.4 to fix this






[jira] [Commented] (SPARK-38593) Incorporate numRowsDroppedByWatermark metric from SessionWindowStateStoreRestoreExec into StateOperatorProgress

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508568#comment-17508568
 ] 

Apache Spark commented on SPARK-38593:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/35909

> Incorporate numRowsDroppedByWatermark metric from 
> SessionWindowStateStoreRestoreExec into StateOperatorProgress
> ---
>
> Key: SPARK-38593
> URL: https://issues.apache.org/jira/browse/SPARK-38593
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> Although we added `numRowsDroppedByWatermark` to 
> `SessionWindowStateStoreRestoreExec`, currently only `StateStoreWriter` 
> metrics are collected into `StateOperatorProgress`. So if we want 
> `numRowsDroppedByWatermark` from `SessionWindowStateStoreRestoreExec` to be 
> available to the streaming listener, we need to incorporate 
> `SessionWindowStateStoreRestoreExec` into `StateOperatorProgress`.






[jira] [Commented] (SPARK-38204) All state operators are at a risk of inconsistency between state partitioning and operator partitioning

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508542#comment-17508542
 ] 

Apache Spark commented on SPARK-38204:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/35908

> All state operators are at a risk of inconsistency between state partitioning 
> and operator partitioning
> ---
>
> Key: SPARK-38204
> URL: https://issues.apache.org/jira/browse/SPARK-38204
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.3, 2.3.4, 2.4.8, 3.0.3, 3.1.2, 3.2.1, 3.3.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Blocker
>  Labels: correctness, releasenotes
> Fix For: 3.3.0
>
>
> Except stream-stream join, all stateful operators use ClusteredDistribution 
> as a requirement of child distribution.
> ClusteredDistribution is a very relaxed one - any output partitioning can 
> satisfy the distribution as long as it ensures all tuples having the same 
> grouping keys are placed in the same partition.
> To illustrate with an example, suppose we do streaming aggregation like the code below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> In the code, streaming aggregation operator will be involved in physical 
> plan, which would have ClusteredDistribution("group1", "group2", "window").
> The problem is, various output partitionings can satisfy this distribution:
>  * RangePartitioning
>  ** This accepts exact and subset of the grouping key, with any order of keys 
> (combination), with any sort order (asc/desc)
>  * HashPartitioning
>  ** This accepts exact and subset of the grouping key, with any order of keys 
> (combination)
>  * (upcoming Spark 3.3.0+) DataSourcePartitioning
>  ** output partitioning provided by data source will be able to satisfy 
> ClusteredDistribution, which will make things worse (assuming data source can 
> provide different output partitioning relatively easier)
> e.g. even if we only consider HashPartitioning, there are HashPartitioning("group1"), 
> HashPartitioning("group2"), HashPartitioning("group1", "group2"), 
> HashPartitioning("group2", "group1"), HashPartitioning("group1", "group2", 
> "window"), etc.
> The requirement on state partitioning is much stricter, since we must not 
> change the partitioning once the state is partitioned and built. *It should 
> ensure that all tuples having the same grouping keys are placed in the same 
> partition (same partition ID) across the query lifetime.*
> *The mismatch between the ClusteredDistribution requirement and the state 
> partitioning silently leads to correctness issues.*
> For example, let's assume we have a streaming query like below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .repartition("group2")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> repartition("group2") satisfies ClusteredDistribution("group1", "group2", 
> "window"), so Spark won't introduce additional shuffle there, and state 
> partitioning would be HashPartitioning("group2").
> We run this query for a while, stop the query, and change the manual 
> partitioning like below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .repartition("group1")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> repartition("group1") also satisfies ClusteredDistribution("group1", 
> "group2", "window"), so Spark won't introduce additional shuffle there. That 
> said, child output partitioning of streaming aggregation operator would be 
> HashPartitioning("group1"), whereas state partitioning is 
> HashPartitioning("group2").
> [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query]
> In the SS guide doc we enumerate the unsupported modifications of a query 
> during the lifetime of a streaming query, but there is no mention of this one.
> Making this worse, Spark doesn't store any information on state partitioning 
> (that said, there is no way to validate it), so *Spark simply allows this 
> change and silently introduces a correctness issue while the streaming query 
> runs as if there were no problem at all.* The only way to notice the issue is 
> from the result of the query.
> We have no idea whether end users already suffer from this in their queries 
> or not. *The only way to find out is to list all state rows, apply the hash 
> function with the expected grouping keys, and confirm every row maps to the 
> exact partition ID it is stored in.* If it turns 

[jira] [Commented] (SPARK-38204) All state operators are at a risk of inconsistency between state partitioning and operator partitioning

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508541#comment-17508541
 ] 

Apache Spark commented on SPARK-38204:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/35908

> All state operators are at a risk of inconsistency between state partitioning 
> and operator partitioning
> ---
>
> Key: SPARK-38204
> URL: https://issues.apache.org/jira/browse/SPARK-38204
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.3, 2.3.4, 2.4.8, 3.0.3, 3.1.2, 3.2.1, 3.3.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Blocker
>  Labels: correctness, releasenotes
> Fix For: 3.3.0
>
>
> Except stream-stream join, all stateful operators use ClusteredDistribution 
> as a requirement of child distribution.
> ClusteredDistribution is a very relaxed one - any output partitioning can 
> satisfy the distribution as long as it ensures all tuples having the same 
> grouping keys are placed in the same partition.
> To illustrate with an example, suppose we do streaming aggregation like the code below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> In the code, streaming aggregation operator will be involved in physical 
> plan, which would have ClusteredDistribution("group1", "group2", "window").
> The problem is, various output partitionings can satisfy this distribution:
>  * RangePartitioning
>  ** This accepts exact and subset of the grouping key, with any order of keys 
> (combination), with any sort order (asc/desc)
>  * HashPartitioning
>  ** This accepts exact and subset of the grouping key, with any order of keys 
> (combination)
>  * (upcoming Spark 3.3.0+) DataSourcePartitioning
>  ** output partitioning provided by data source will be able to satisfy 
> ClusteredDistribution, which will make things worse (assuming data source can 
> provide different output partitioning relatively easier)
> e.g. even if we only consider HashPartitioning, there are HashPartitioning("group1"), 
> HashPartitioning("group2"), HashPartitioning("group1", "group2"), 
> HashPartitioning("group2", "group1"), HashPartitioning("group1", "group2", 
> "window"), etc.
> The requirement on state partitioning is much stricter, since we must not 
> change the partitioning once the state is partitioned and built. *It should 
> ensure that all tuples having the same grouping keys are placed in the same 
> partition (same partition ID) across the query lifetime.*
> *The mismatch between the ClusteredDistribution requirement and the state 
> partitioning silently leads to correctness issues.*
> For example, let's assume we have a streaming query like below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .repartition("group2")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> repartition("group2") satisfies ClusteredDistribution("group1", "group2", 
> "window"), so Spark won't introduce additional shuffle there, and state 
> partitioning would be HashPartitioning("group2").
> We run this query for a while, stop the query, and change the manual 
> partitioning like below:
> {code:java}
> df
>   .withWatermark("timestamp", "30 minutes")
>   .repartition("group1")
>   .groupBy("group1", "group2", window("timestamp", "10 minutes"))
>   .agg(count("*")) {code}
> repartition("group1") also satisfies ClusteredDistribution("group1", 
> "group2", "window"), so Spark won't introduce additional shuffle there. That 
> said, child output partitioning of streaming aggregation operator would be 
> HashPartitioning("group1"), whereas state partitioning is 
> HashPartitioning("group2").
> [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query]
> In the SS guide doc we enumerate the unsupported modifications of a query 
> during the lifetime of a streaming query, but there is no mention of this one.
> Making this worse, Spark doesn't store any information on state partitioning 
> (that said, there is no way to validate it), so *Spark simply allows this 
> change and silently introduces a correctness issue while the streaming query 
> runs as if there were no problem at all.* The only way to notice the issue is 
> from the result of the query.
> We have no idea whether end users already suffer from this in their queries 
> or not. *The only way to find out is to list all state rows, apply the hash 
> function with the expected grouping keys, and confirm every row maps to the 
> exact partition ID it is stored in.* If it turns 

[jira] [Commented] (SPARK-38592) Column name contains back tick `

2022-03-17 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508536#comment-17508536
 ] 

Hyukjin Kwon commented on SPARK-38592:
--

[~JuzDDM] mind showing the fully self-contained reproducer?

> Column name contains back tick `
> 
>
> Key: SPARK-38592
> URL: https://issues.apache.org/jira/browse/SPARK-38592
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Dennis Du
>Priority: Major
>
> Try to modify the data frame to ensure column names have no special 
> characters.
> {code:java}
> df.columns.map { columnName =>
>df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
> }
> {code}
> *surroundingBackTickedName()* will enclose column name with backticks.
> However, col() keeps failing on column names that contain a backtick, because 
> *parseAttributeName()* only accepts backticks that appear in pairs. I am 
> wondering if there is a workaround.
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name 
> "`COLNAME`2`" among (COLID, COLNAME!4, COLNAME#6, COLNAME$7, COLNAME%8, 
> COLNAME'25, COLNAME(11, COLNAME)12, COLNAME*10, COLNAME+16, COLNAME,26, 
> COLNAME-13, COLNAME/30, COLNAME:24, COLNAME;23, COLNAME<27, COLNAME=15, 
> COLNAME>29, COLNAME?31, COLNAME@5, COLNAME`2){code}
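A hedged sketch of one possible workaround, assuming Spark's rule that a literal backtick inside a backquoted identifier is written as two backticks (the helper names below are hypothetical):
{code:java}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: quote a column name with backticks, doubling any
// embedded backtick so parseAttributeName sees backticks in pairs.
def surroundWithBackticks(columnName: String): String =
  "`" + columnName.replace("`", "``") + "`"

// Usage sketch: select every column under a normalized alias.
def normalizeColumns(df: DataFrame): DataFrame =
  df.select(df.columns.map { c =>
    df.col(surroundWithBackticks(c)).as(c.replaceAll("[^A-Za-z0-9_]", "_"))
  }: _*)
{code}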






[jira] [Updated] (SPARK-38592) Column name contains back tick `

2022-03-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38592:
-
Labels:   (was: bulk-closed)

> Column name contains back tick `
> 
>
> Key: SPARK-38592
> URL: https://issues.apache.org/jira/browse/SPARK-38592
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Dennis Du
>Priority: Major
>
> Try to modify the data frame to ensure column names have no special 
> characters.
> {code:java}
> df.columns.map { columnName =>
>df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
> }
> {code}
> *surroundingBackTickedName()* will enclose column name with backticks.
> However, col() keeps failing on column names that contain a backtick, because 
> *parseAttributeName()* only accepts backticks that appear in pairs. I am 
> wondering if there is a workaround.
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name 
> "`COLNAME`2`" among (COLID, COLNAME!4, COLNAME#6, COLNAME$7, COLNAME%8, 
> COLNAME'25, COLNAME(11, COLNAME)12, COLNAME*10, COLNAME+16, COLNAME,26, 
> COLNAME-13, COLNAME/30, COLNAME:24, COLNAME;23, COLNAME<27, COLNAME=15, 
> COLNAME>29, COLNAME?31, COLNAME@5, COLNAME`2){code}






[jira] [Assigned] (SPARK-38563) Upgrade to Py4J 0.10.9.5

2022-03-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38563:


Assignee: Hyukjin Kwon  (was: Apache Spark)

> Upgrade to Py4J 0.10.9.5
> 
>
> Key: SPARK-38563
> URL: https://issues.apache.org/jira/browse/SPARK-38563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>
> There is a resource leak bug, see https://github.com/py4j/py4j/pull/471. We 
> should upgrade Py4J to 0.10.9.4 to fix this






[jira] [Assigned] (SPARK-38563) Upgrade to Py4J 0.10.9.5

2022-03-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38563:


Assignee: Apache Spark  (was: Hyukjin Kwon)

> Upgrade to Py4J 0.10.9.5
> 
>
> Key: SPARK-38563
> URL: https://issues.apache.org/jira/browse/SPARK-38563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Critical
>
> There is a resource leak bug, see https://github.com/py4j/py4j/pull/471. We 
> should upgrade Py4J to 0.10.9.4 to fix this






[jira] [Updated] (SPARK-38584) Unify the data validation

2022-03-17 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-38584:
-
Description: 
1, Input vector validation is missing in most algorithms. When the input 
dataset contains invalid values (NaN/Infinity):
 * the training may run successfully and return a model containing invalid 
coefficients, like LinearSVC
 * the training may fail with an irrelevant message, like KMeans

 
{code:java}
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.clustering._
val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)),
  LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0)))).toDF()

val svc = new LinearSVC()
val model = svc.fit(df)

scala> model.intercept
res0: Double = NaN

scala> model.coefficients
res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN]

val km = new KMeans().setK(2)
scala> km.fit(df)
22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 113)
java.lang.IllegalArgumentException: requirement failed: Both norms should be 
greater or equal to 0.0, found norm1=NaN, norm2=Infinity
    at scala.Predef$.require(Predef.scala:281)
    at 
org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543)
{code}
 

2, Related methods to validate the input dataset (like labels/weights) exist in 
different files:

{{org.apache.spark.ml.functions}}, {{org.apache.spark.ml.util.DatasetUtils}}, 
{{org.apache.spark.ml.util.MetadataUtils}},

{{org.apache.spark.ml.Predictor}}, etc.

 

I think it is time to unify the related methods into one source file.

 

  was:
1, Input vector validation is missing in most algorithms. When the input 
dataset contains invalid values (NaN/Infinity):
 * the training may run successfully and return a model with invalid coefficients, 
like LinearSVC
 * the training will fail with an irrelevant message, like KMeans

 
{code:java}
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.clustering._
val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)),
  LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0)))).toDF()

val svc = new LinearSVC()
val model = svc.fit(df)

scala> model.intercept
res0: Double = NaN

scala> model.coefficients
res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN]

val km = new KMeans().setK(2)
scala> km.fit(df)
22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 113)
java.lang.IllegalArgumentException: requirement failed: Both norms should be 
greater or equal to 0.0, found norm1=NaN, norm2=Infinity
    at scala.Predef$.require(Predef.scala:281)
    at 
org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543)
{code}
 

2, Related methods to validate the input dataset (like labels/weights) exist in 
different files:

{{org.apache.spark.ml.functions}}, {{org.apache.spark.ml.util.DatasetUtils}}, 
{{org.apache.spark.ml.util.MetadataUtils}},

{{org.apache.spark.ml.Predictor}}, etc.

 

I think it is time to unify the related methods into one source file.

 


> Unify the data validation
> -
>
> Key: SPARK-38584
> URL: https://issues.apache.org/jira/browse/SPARK-38584
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>
> 1, Input vector validation is missing in most algorithms. When the input 
> dataset contains invalid values (NaN/Infinity):
>  * the training may run successfully and return a model containing invalid 
> coefficients, like LinearSVC
>  * the training may fail with an irrelevant message, like KMeans
>  
> {code:java}
> import org.apache.spark.ml.feature._
> import org.apache.spark.ml.linalg._
> import org.apache.spark.ml.classification._
> import org.apache.spark.ml.clustering._
> val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, 
> Double.NaN)), LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 
> 2.0)))).toDF()
> val svc = new LinearSVC()
> val model = svc.fit(df)
> scala> model.intercept
> res0: Double = NaN
> scala> model.coefficients
> res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN]
> val km = new KMeans().setK(2)
> scala> km.fit(df)
> 22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 
> 113)
> java.lang.IllegalArgumentException: requirement failed: Both norms should be 
> greater or equal to 0.0, found norm1=NaN, norm2=Infinity
>     at scala.Predef$.require(Predef.scala:281)
>     at 
> org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543)
> {code}
>  
> 2, relative methods to validate input dataset (like labels/weights) exists in 
> 

[jira] [Reopened] (SPARK-38563) Upgrade to Py4J 0.10.9.5

2022-03-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-38563:
--

> Upgrade to Py4J 0.10.9.5
> 
>
> Key: SPARK-38563
> URL: https://issues.apache.org/jira/browse/SPARK-38563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 3.2.2
>
>
> There is a resource leak bug, see https://github.com/py4j/py4j/pull/471. We 
> should upgrade Py4J to 0.10.9.4 to fix this






[jira] [Commented] (SPARK-38563) Upgrade to Py4J 0.10.9.5

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508524#comment-17508524
 ] 

Apache Spark commented on SPARK-38563:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35907

> Upgrade to Py4J 0.10.9.5
> 
>
> Key: SPARK-38563
> URL: https://issues.apache.org/jira/browse/SPARK-38563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 3.2.2
>
>
> There is a resource leak bug, see https://github.com/py4j/py4j/pull/471. We 
> should upgrade Py4J to 0.10.9.4 to fix this






[jira] [Updated] (SPARK-38563) Upgrade to Py4J 0.10.9.5

2022-03-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38563:
-
Fix Version/s: (was: 3.2.2)

> Upgrade to Py4J 0.10.9.5
> 
>
> Key: SPARK-38563
> URL: https://issues.apache.org/jira/browse/SPARK-38563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>
> There is a resource leak bug, see https://github.com/py4j/py4j/pull/471. We 
> should upgrade Py4J to 0.10.9.4 to fix this






[jira] [Updated] (SPARK-38563) Upgrade to Py4J 0.10.9.5

2022-03-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38563:
-
Summary: Upgrade to Py4J 0.10.9.5  (was: Upgrade to Py4J 0.10.9.4)

> Upgrade to Py4J 0.10.9.5
> 
>
> Key: SPARK-38563
> URL: https://issues.apache.org/jira/browse/SPARK-38563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 3.2.2
>
>
> There is a resource leak bug, see https://github.com/py4j/py4j/pull/471. We 
> should upgrade Py4J to 0.10.9.4 to fix this






[jira] [Commented] (SPARK-33236) Enable Push-based shuffle service to store state in NM level DB for work preserving restart

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508501#comment-17508501
 ] 

Apache Spark commented on SPARK-33236:
--

User 'zhouyejoe' has created a pull request for this issue:
https://github.com/apache/spark/pull/35906

> Enable Push-based shuffle service to store state in NM level DB for work 
> preserving restart
> ---
>
> Key: SPARK-33236
> URL: https://issues.apache.org/jira/browse/SPARK-33236
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Major
>







[jira] [Commented] (SPARK-33236) Enable Push-based shuffle service to store state in NM level DB for work preserving restart

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508499#comment-17508499
 ] 

Apache Spark commented on SPARK-33236:
--

User 'zhouyejoe' has created a pull request for this issue:
https://github.com/apache/spark/pull/35906

> Enable Push-based shuffle service to store state in NM level DB for work 
> preserving restart
> ---
>
> Key: SPARK-33236
> URL: https://issues.apache.org/jira/browse/SPARK-33236
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Major
>







[jira] [Commented] (SPARK-33236) Enable Push-based shuffle service to store state in NM level DB for work preserving restart

2022-03-17 Thread Ye Zhou (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508498#comment-17508498
 ] 

Ye Zhou commented on SPARK-33236:
-

WIP PR posted [https://github.com/apache/spark/pull/35906.] 

> Enable Push-based shuffle service to store state in NM level DB for work 
> preserving restart
> ---
>
> Key: SPARK-33236
> URL: https://issues.apache.org/jira/browse/SPARK-33236
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Major
>







[jira] [Commented] (SPARK-38563) Upgrade to Py4J 0.10.9.4

2022-03-17 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508492#comment-17508492
 ] 

Dongjoon Hyun commented on SPARK-38563:
---

This is reverted from `master` and `branch-3.3`.

> Upgrade to Py4J 0.10.9.4
> 
>
> Key: SPARK-38563
> URL: https://issues.apache.org/jira/browse/SPARK-38563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 3.2.2
>
>
> There is a resource leak bug, see https://github.com/py4j/py4j/pull/471. We 
> should upgrade Py4J to 0.10.9.4 to fix this






[jira] [Updated] (SPARK-38563) Upgrade to Py4J 0.10.9.4

2022-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38563:
--
Fix Version/s: (was: 3.3.0)
   (was: 3.2.2)

> Upgrade to Py4J 0.10.9.4
> 
>
> Key: SPARK-38563
> URL: https://issues.apache.org/jira/browse/SPARK-38563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Critical
>
> There is a resource leak bug, see https://github.com/py4j/py4j/pull/471. We 
> should upgrade Py4J to 0.10.9.4 to fix this






[jira] [Resolved] (SPARK-38563) Upgrade to Py4J 0.10.9.4

2022-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38563.
---
Fix Version/s: 3.2.2
 Assignee: Hyukjin Kwon
   Resolution: Fixed

> Upgrade to Py4J 0.10.9.4
> 
>
> Key: SPARK-38563
> URL: https://issues.apache.org/jira/browse/SPARK-38563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 3.2.2
>
>
> There is a resource leak bug, see https://github.com/py4j/py4j/pull/471. We 
> should upgrade Py4J to 0.10.9.4 to fix this






[jira] [Reopened] (SPARK-38563) Upgrade to Py4J 0.10.9.4

2022-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-38563:
---
  Assignee: (was: Hyukjin Kwon)

> Upgrade to Py4J 0.10.9.4
> 
>
> Key: SPARK-38563
> URL: https://issues.apache.org/jira/browse/SPARK-38563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Critical
> Fix For: 3.3.0, 3.2.2
>
>
> There is a resource leak bug, see https://github.com/py4j/py4j/pull/471. We 
> should upgrade Py4J to 0.10.9.4 to fix this






[jira] [Updated] (SPARK-38592) Column name contains back tick `

2022-03-17 Thread Dennis Du (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Du updated SPARK-38592:
--
Description: 
Try to modify the data frame to ensure column names have no special characters.
{code:java}
df.columns.map { columnName =>
   df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
}
{code}
*surroundingBackTickedName()* will enclose column name with backticks.

However, col() keeps failing on column names that contain a backtick, because 
*parseAttributeName()* only accepts backticks that appear in pairs. I am 
wondering if there is a workaround.
{code:java}
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"`COLNAME`2`" among (COLID, COLNAME!4, COLNAME#6, COLNAME$7, COLNAME%8, 
COLNAME'25, COLNAME(11, COLNAME)12, COLNAME*10, COLNAME+16, COLNAME,26, 
COLNAME-13, COLNAME/30, COLNAME:24, COLNAME;23, COLNAME<27, COLNAME=15, 
COLNAME>29, COLNAME?31, COLNAME@5, COLNAME`2){code}

  was:
Try to modify the data frame to ensure column names have no special characters.
{code:java}
val newdf = df.select(
df.columns.map { columnName =>
df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
}: _*
){code}
*surroundingBackTickedName()* will enclose column name with backticks.

However, col() keeps failing on column names that contain a backtick, because 
*parseAttributeName()* only accepts backticks that appear in pairs. I am 
wondering if there is a workaround.
{code:java}
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"`COLNAME`2`" among (COLID, COLNAME!4, COLNAME#6, COLNAME$7, COLNAME%8, 
COLNAME'25, COLNAME(11, COLNAME)12, COLNAME*10, COLNAME+16, COLNAME,26, 
COLNAME-13, COLNAME/30, COLNAME:24, COLNAME;23, COLNAME<27, COLNAME=15, 
COLNAME>29, COLNAME?31, COLNAME@5, COLNAME`2){code}


> Column name contains back tick `
> 
>
> Key: SPARK-38592
> URL: https://issues.apache.org/jira/browse/SPARK-38592
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Dennis Du
>Priority: Major
>  Labels: bulk-closed
>
> Try to modify the data frame to ensure column names have no special 
> characters.
> {code:java}
> df.columns.map { columnName =>
>df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
> }
> {code}
> *surroundingBackTickedName()* will enclose column name with backticks.
> However, col() keeps failing on column names that contain a backtick, because 
> *parseAttributeName()* only accepts backticks that appear in pairs. I am 
> wondering if there is a workaround.
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name 
> "`COLNAME`2`" among (COLID, COLNAME!4, COLNAME#6, COLNAME$7, COLNAME%8, 
> COLNAME'25, COLNAME(11, COLNAME)12, COLNAME*10, COLNAME+16, COLNAME,26, 
> COLNAME-13, COLNAME/30, COLNAME:24, COLNAME;23, COLNAME<27, COLNAME=15, 
> COLNAME>29, COLNAME?31, COLNAME@5, COLNAME`2){code}






[jira] [Commented] (SPARK-38593) Incorporate numRowsDroppedByWatermark metric from SessionWindowStateStoreRestoreExec into StateOperatorProgress

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508471#comment-17508471
 ] 

Apache Spark commented on SPARK-38593:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/35905

> Incorporate numRowsDroppedByWatermark metric from 
> SessionWindowStateStoreRestoreExec into StateOperatorProgress
> ---
>
> Key: SPARK-38593
> URL: https://issues.apache.org/jira/browse/SPARK-38593
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Although we added `numRowsDroppedByWatermark` to 
> `SessionWindowStateStoreRestoreExec`, currently only `StateStoreWriter` 
> metrics are collected into `StateOperatorProgress`. So if we want 
> `numRowsDroppedByWatermark` from `SessionWindowStateStoreRestoreExec` to be 
> available to the streaming listener, we need to incorporate 
> `SessionWindowStateStoreRestoreExec` into `StateOperatorProgress`.






[jira] [Assigned] (SPARK-38593) Incorporate numRowsDroppedByWatermark metric from SessionWindowStateStoreRestoreExec into StateOperatorProgress

2022-03-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38593:


Assignee: Apache Spark

> Incorporate numRowsDroppedByWatermark metric from 
> SessionWindowStateStoreRestoreExec into StateOperatorProgress
> ---
>
> Key: SPARK-38593
> URL: https://issues.apache.org/jira/browse/SPARK-38593
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> Although we added `numRowsDroppedByWatermark` to 
> `SessionWindowStateStoreRestoreExec`, currently only `StateStoreWriter` 
> metrics are collected into `StateOperatorProgress`. So if we want 
> `numRowsDroppedByWatermark` from `SessionWindowStateStoreRestoreExec` to be 
> available to the streaming listener, we need to incorporate 
> `SessionWindowStateStoreRestoreExec` into `StateOperatorProgress`.






[jira] [Assigned] (SPARK-38593) Incorporate numRowsDroppedByWatermark metric from SessionWindowStateStoreRestoreExec into StateOperatorProgress

2022-03-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38593:


Assignee: (was: Apache Spark)

> Incorporate numRowsDroppedByWatermark metric from 
> SessionWindowStateStoreRestoreExec into StateOperatorProgress
> ---
>
> Key: SPARK-38593
> URL: https://issues.apache.org/jira/browse/SPARK-38593
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Although we added `numRowsDroppedByWatermark` to 
> `SessionWindowStateStoreRestoreExec`, currently only `StateStoreWriter` 
> operators have their metrics collected into `StateOperatorProgress`. So if we 
> want `numRowsDroppedByWatermark` from `SessionWindowStateStoreRestoreExec` to 
> be exposed to the streaming listener, we need to incorporate 
> `SessionWindowStateStoreRestoreExec` into `StateOperatorProgress`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38593) Incorporate numRowsDroppedByWatermark metric from SessionWindowStateStoreRestoreExec into StateOperatorProgress

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508470#comment-17508470
 ] 

Apache Spark commented on SPARK-38593:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/35905

> Incorporate numRowsDroppedByWatermark metric from 
> SessionWindowStateStoreRestoreExec into StateOperatorProgress
> ---
>
> Key: SPARK-38593
> URL: https://issues.apache.org/jira/browse/SPARK-38593
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Although we added `numRowsDroppedByWatermark` to 
> `SessionWindowStateStoreRestoreExec`, currently only `StateStoreWriter` 
> operators have their metrics collected into `StateOperatorProgress`. So if we 
> want `numRowsDroppedByWatermark` from `SessionWindowStateStoreRestoreExec` to 
> be exposed to the streaming listener, we need to incorporate 
> `SessionWindowStateStoreRestoreExec` into `StateOperatorProgress`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38593) Incorporate numRowsDroppedByWatermark metric from SessionWindowStateStoreRestoreExec into StateOperatorProgress

2022-03-17 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-38593:
---

 Summary: Incorporate numRowsDroppedByWatermark metric from 
SessionWindowStateStoreRestoreExec into StateOperatorProgress
 Key: SPARK-38593
 URL: https://issues.apache.org/jira/browse/SPARK-38593
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.3.0
Reporter: L. C. Hsieh


Although we added `numRowsDroppedByWatermark` to 
`SessionWindowStateStoreRestoreExec`, currently only `StateStoreWriter` 
operators have their metrics collected into `StateOperatorProgress`. So if we 
want `numRowsDroppedByWatermark` from `SessionWindowStateStoreRestoreExec` to 
be exposed to the streaming listener, we need to incorporate 
`SessionWindowStateStoreRestoreExec` into `StateOperatorProgress`.
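
Not part of the original report, but as context: a minimal Scala sketch (assuming Spark 3.1+ and an active SparkSession named `spark`) of a streaming listener that reads `numRowsDroppedByWatermark` from `StateOperatorProgress`; once the restore operator is incorporated, its dropped rows would surface through the same field. The class name `DroppedRowsListener` is illustrative.

{code:java}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Sums numRowsDroppedByWatermark across all state operators reported in each
// progress update and prints one line per micro-batch.
class DroppedRowsListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val dropped = event.progress.stateOperators.map(_.numRowsDroppedByWatermark).sum
    println(s"batch=${event.progress.batchId} rowsDroppedByWatermark=$dropped")
  }
}

// Registration:
// spark.streams.addListener(new DroppedRowsListener)
{code}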



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37425) Inline type hints for python/pyspark/mllib/recommendation.py

2022-03-17 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-37425.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 35766
[https://github.com/apache/spark/pull/35766]

> Inline type hints for python/pyspark/mllib/recommendation.py
> 
>
> Key: SPARK-37425
> URL: https://issues.apache.org/jira/browse/SPARK-37425
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: dch nguyen
>Priority: Major
> Fix For: 3.4.0
>
>
> Inline type hints from python/pyspark/mllib/recommendation.pyi to 
> python/pyspark/mllib/recommendation.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37425) Inline type hints for python/pyspark/mllib/recommendation.py

2022-03-17 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-37425:
--

Assignee: dch nguyen

> Inline type hints for python/pyspark/mllib/recommendation.py
> 
>
> Key: SPARK-37425
> URL: https://issues.apache.org/jira/browse/SPARK-37425
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: dch nguyen
>Priority: Major
>
> Inline type hints from python/pyspark/mllib/recommendation.pyi to 
> python/pyspark/mllib/recommendation.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2489) Unsupported parquet datatype optional fixed_len_byte_array

2022-03-17 Thread Nicolas Luiz Ribeiro Veiga (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508438#comment-17508438
 ] 

Nicolas Luiz Ribeiro Veiga commented on SPARK-2489:
---

Can we reopen this issue? I rebased and submitted the change from aws-awinstan.

> Unsupported parquet datatype optional fixed_len_byte_array
> --
>
> Key: SPARK-2489
> URL: https://issues.apache.org/jira/browse/SPARK-2489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 2.2.0
>Reporter: Pei-Lun Lee
>Priority: Major
>  Labels: bulk-closed
>
> tested against commit 9fe693b5
> {noformat}
> scala> sqlContext.parquetFile("/tmp/foo")
> java.lang.RuntimeException: Unsupported parquet datatype optional 
> fixed_len_byte_array(4) b
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:58)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:109)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:282)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:279)
> {noformat}
> example avro schema
> {noformat}
> protocol Test {
> fixed Bytes4(4);
> record Foo {
> union {null, Bytes4} b;
> }
> }
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37244) Build and test on Python 3.10

2022-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37244:
--
Labels: releasenotes  (was: )

> Build and test on Python 3.10
> -
>
> Key: SPARK-37244
> URL: https://issues.apache.org/jira/browse/SPARK-37244
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0
>
>
> Python 3.10 introduced breaking changes. We need to update PySpark.
> - https://docs.python.org/3/whatsnew/3.10.html
> For example, the following.
> {code}
> >>> from collections import Callable
> <stdin>:1: DeprecationWarning: Using or importing the ABCs from 'collections' 
> instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 
> it will stop working
> {code}
> This is targeting Apache Spark 3.3.0 in 2022.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38563) Upgrade to Py4J 0.10.9.4

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508433#comment-17508433
 ] 

Apache Spark commented on SPARK-38563:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35904

> Upgrade to Py4J 0.10.9.4
> 
>
> Key: SPARK-38563
> URL: https://issues.apache.org/jira/browse/SPARK-38563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 3.3.0, 3.2.2
>
>
> There is a resource leak bug, see https://github.com/py4j/py4j/pull/471. We 
> should upgrade Py4J to 0.10.9.4 to fix this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38563) Upgrade to Py4J 0.10.9.4

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508429#comment-17508429
 ] 

Apache Spark commented on SPARK-38563:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35903

> Upgrade to Py4J 0.10.9.4
> 
>
> Key: SPARK-38563
> URL: https://issues.apache.org/jira/browse/SPARK-38563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 3.3.0, 3.2.2
>
>
> There is a resource leak bug, see https://github.com/py4j/py4j/pull/471. We 
> should upgrade Py4J to 0.10.9.4 to fix this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38563) Upgrade to Py4J 0.10.9.4

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508430#comment-17508430
 ] 

Apache Spark commented on SPARK-38563:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35903

> Upgrade to Py4J 0.10.9.4
> 
>
> Key: SPARK-38563
> URL: https://issues.apache.org/jira/browse/SPARK-38563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 3.3.0, 3.2.2
>
>
> There is a resource leak bug, see https://github.com/py4j/py4j/pull/471. We 
> should upgrade Py4J to 0.10.9.4 to fix this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2489) Unsupported parquet datatype optional fixed_len_byte_array

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508407#comment-17508407
 ] 

Apache Spark commented on SPARK-2489:
-

User 'nicolaslrveiga' has created a pull request for this issue:
https://github.com/apache/spark/pull/35902

> Unsupported parquet datatype optional fixed_len_byte_array
> --
>
> Key: SPARK-2489
> URL: https://issues.apache.org/jira/browse/SPARK-2489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 2.2.0
>Reporter: Pei-Lun Lee
>Priority: Major
>  Labels: bulk-closed
>
> tested against commit 9fe693b5
> {noformat}
> scala> sqlContext.parquetFile("/tmp/foo")
> java.lang.RuntimeException: Unsupported parquet datatype optional 
> fixed_len_byte_array(4) b
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:58)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:109)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:282)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:279)
> {noformat}
> example avro schema
> {noformat}
> protocol Test {
> fixed Bytes4(4);
> record Foo {
> union {null, Bytes4} b;
> }
> }
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2489) Unsupported parquet datatype optional fixed_len_byte_array

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508406#comment-17508406
 ] 

Apache Spark commented on SPARK-2489:
-

User 'nicolaslrveiga' has created a pull request for this issue:
https://github.com/apache/spark/pull/35902

> Unsupported parquet datatype optional fixed_len_byte_array
> --
>
> Key: SPARK-2489
> URL: https://issues.apache.org/jira/browse/SPARK-2489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 2.2.0
>Reporter: Pei-Lun Lee
>Priority: Major
>  Labels: bulk-closed
>
> tested against commit 9fe693b5
> {noformat}
> scala> sqlContext.parquetFile("/tmp/foo")
> java.lang.RuntimeException: Unsupported parquet datatype optional 
> fixed_len_byte_array(4) b
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:58)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:109)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:282)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:279)
> {noformat}
> example avro schema
> {noformat}
> protocol Test {
> fixed Bytes4(4);
> record Foo {
> union {null, Bytes4} b;
> }
> }
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38194) Make memory overhead factor configurable

2022-03-17 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508405#comment-17508405
 ] 

Dongjoon Hyun commented on SPARK-38194:
---

This is reverted from branch-3.3 via https://github.com/apache/spark/pull/35900

> Make memory overhead factor configurable
> 
>
> Key: SPARK-38194
> URL: https://issues.apache.org/jira/browse/SPARK-38194
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Mesos, YARN
>Affects Versions: 3.4.0
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently if the memory overhead is not provided for a Yarn job, it defaults 
> to 10% of the respective driver/executor memory. This 10% is hard-coded and 
> the only way to increase memory overhead is to set the exact memory overhead. 
> We have seen more than 10% memory being used, and it would be helpful to be 
> able to set the default overhead factor so that the overhead doesn't need to 
> be pre-calculated for any driver/executor memory size. 
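
As an aside (not from the issue itself): a small Scala sketch of the default rule being described, under the assumption that the hard-coded behaviour is the commonly documented max(0.10 x container memory, 384 MiB) formula. It illustrates why a configurable factor would avoid pre-computing the absolute overhead for every memory size.

{code:java}
// Sketch of the default overhead rule; treat the exact constants as an assumption.
def defaultMemoryOverheadMiB(
    containerMemoryMiB: Long,
    factor: Double = 0.10,      // today effectively hard-coded
    minOverheadMiB: Long = 384L): Long =
  math.max((containerMemoryMiB * factor).toLong, minOverheadMiB)

// A 20 GiB executor gets max(2048, 384) = 2048 MiB of overhead by default.
println(defaultMemoryOverheadMiB(20480))        // 2048
// With a configurable factor, 20% could be requested without computing the
// absolute overhead up front.
println(defaultMemoryOverheadMiB(20480, 0.20))  // 4096
{code}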



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38194) Make memory overhead factor configurable

2022-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38194:
--
Summary: Make memory overhead factor configurable  (was: Make Yarn memory 
overhead factor configurable)

> Make memory overhead factor configurable
> 
>
> Key: SPARK-38194
> URL: https://issues.apache.org/jira/browse/SPARK-38194
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.4.0
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently if the memory overhead is not provided for a Yarn job, it defaults 
> to 10% of the respective driver/executor memory. This 10% is hard-coded and 
> the only way to increase memory overhead is to set the exact memory overhead. 
> We have seen more than 10% memory being used, and it would be helpful to be 
> able to set the default overhead factor so that the overhead doesn't need to 
> be pre-calculated for any driver/executor memory size. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38194) Make Yarn memory overhead factor configurable

2022-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38194:
--
Affects Version/s: 3.4.0
   (was: 3.2.1)

> Make Yarn memory overhead factor configurable
> -
>
> Key: SPARK-38194
> URL: https://issues.apache.org/jira/browse/SPARK-38194
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.4.0
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently if the memory overhead is not provided for a Yarn job, it defaults 
> to 10% of the respective driver/executor memory. This 10% is hard-coded and 
> the only way to increase memory overhead is to set the exact memory overhead. 
> We have seen more than 10% memory being used, and it would be helpful to be 
> able to set the default overhead factor so that the overhead doesn't need to 
> be pre-calculated for any driver/executor memory size. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38194) Make memory overhead factor configurable

2022-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38194:
--
Component/s: Kubernetes
 Mesos

> Make memory overhead factor configurable
> 
>
> Key: SPARK-38194
> URL: https://issues.apache.org/jira/browse/SPARK-38194
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Mesos, YARN
>Affects Versions: 3.4.0
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently if the memory overhead is not provided for a Yarn job, it defaults 
> to 10% of the respective driver/executor memory. This 10% is hard-coded and 
> the only way to increase memory overhead is to set the exact memory overhead. 
> We have seen more than 10% memory being used, and it would be helpful to be 
> able to set the default overhead factor so that the overhead doesn't need to 
> be pre-calculated for any driver/executor memory size. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38194) Make Yarn memory overhead factor configurable

2022-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38194:
--
Fix Version/s: 3.4.0
   (was: 3.3.0)

> Make Yarn memory overhead factor configurable
> -
>
> Key: SPARK-38194
> URL: https://issues.apache.org/jira/browse/SPARK-38194
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.2.1
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently if the memory overhead is not provided for a Yarn job, it defaults 
> to 10% of the respective driver/executor memory. This 10% is hard-coded and 
> the only way to increase memory overhead is to set the exact memory overhead. 
> We have seen more than 10% memory being used, and it would be helpful to be 
> able to set the default overhead factor so that the overhead doesn't need to 
> be pre-calculated for any driver/executor memory size. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38194) Make Yarn memory overhead factor configurable

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508370#comment-17508370
 ] 

Apache Spark commented on SPARK-38194:
--

User 'Kimahriman' has created a pull request for this issue:
https://github.com/apache/spark/pull/35901

> Make Yarn memory overhead factor configurable
> -
>
> Key: SPARK-38194
> URL: https://issues.apache.org/jira/browse/SPARK-38194
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.2.1
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently if the memory overhead is not provided for a Yarn job, it defaults 
> to 10% of the respective driver/executor memory. This 10% is hard-coded and 
> the only way to increase memory overhead is to set the exact memory overhead. 
> We have seen more than 10% memory being used, and it would be helpful to be 
> able to set the default overhead factor so that the overhead doesn't need to 
> be pre-calculated for any driver/executor memory size. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38194) Make Yarn memory overhead factor configurable

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508369#comment-17508369
 ] 

Apache Spark commented on SPARK-38194:
--

User 'Kimahriman' has created a pull request for this issue:
https://github.com/apache/spark/pull/35901

> Make Yarn memory overhead factor configurable
> -
>
> Key: SPARK-38194
> URL: https://issues.apache.org/jira/browse/SPARK-38194
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.2.1
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently if the memory overhead is not provided for a Yarn job, it defaults 
> to 10% of the respective driver/executor memory. This 10% is hard-coded and 
> the only way to increase memory overhead is to set the exact memory overhead. 
> We have seen more than 10% memory being used, and it would be helpful to be 
> able to set the default overhead factor so that the overhead doesn't need to 
> be pre-calculated for any driver/executor memory size. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38194) Make Yarn memory overhead factor configurable

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508358#comment-17508358
 ] 

Apache Spark commented on SPARK-38194:
--

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/35900

> Make Yarn memory overhead factor configurable
> -
>
> Key: SPARK-38194
> URL: https://issues.apache.org/jira/browse/SPARK-38194
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.2.1
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently if the memory overhead is not provided for a Yarn job, it defaults 
> to 10% of the respective driver/executor memory. This 10% is hard-coded and 
> the only way to increase memory overhead is to set the exact memory overhead. 
> We have seen more than 10% memory being used, and it would be helpful to be 
> able to set the default overhead factor so that the overhead doesn't need to 
> be pre-calculated for any driver/executor memory size. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38592) Column name contains back tick `

2022-03-17 Thread Dennis Du (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Du updated SPARK-38592:
--
Description: 
Trying to modify the data frame to ensure that column names have no special characters:
{code:java}
val newdf = df.select(
df.columns.map { columnName =>
df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
}: _*
){code}
*surroundingBackTickedName()* encloses the column name in backticks.

However, col() keeps failing when the column name contains a backtick, because 
{*}parseAttributeName{*}() only accepts backticks that appear in pairs. I am 
wondering if there is a workaround.
{code:java}
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"`COLNAME`2`" among (COLID, COLNAME!4, COLNAME#6, COLNAME$7, COLNAME%8, 
COLNAME'25, COLNAME(11, COLNAME)12, COLNAME*10, COLNAME+16, COLNAME,26, 
COLNAME-13, COLNAME/30, COLNAME:24, COLNAME;23, COLNAME<27, COLNAME=15, 
COLNAME>29, COLNAME?31, COLNAME@5, COLNAME`2){code}

  was:
Trying to modify the data frame to ensure that column names have no special characters:


{code:java}
val newdf = df.select(
df.columns.map { columnName =>
df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
}: _*
){code}
col() keeps failing when the column name contains a backtick:
{code:java}
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"`COLNAME`2`" among (COLID, COLNAME!4, COLNAME#6, COLNAME$7, COLNAME%8, 
COLNAME'25, COLNAME(11, COLNAME)12, COLNAME*10, COLNAME+16, COLNAME,26, 
COLNAME-13, COLNAME/30, COLNAME:24, COLNAME;23, COLNAME<27, COLNAME=15, 
COLNAME>29, COLNAME?31, COLNAME@5, COLNAME`2){code}


> Column name contains back tick `
> 
>
> Key: SPARK-38592
> URL: https://issues.apache.org/jira/browse/SPARK-38592
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Dennis Du
>Priority: Major
>  Labels: bulk-closed
>
> Trying to modify the data frame to ensure that column names have no special 
> characters:
> {code:java}
> val newdf = df.select(
> df.columns.map { columnName =>
> df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
> }: _*
> ){code}
> *surroundingBackTickedName()* encloses the column name in backticks.
> However, col() keeps failing when the column name contains a backtick, because 
> {*}parseAttributeName{*}() only accepts backticks that appear in pairs. I am 
> wondering if there is a workaround.
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name 
> "`COLNAME`2`" among (COLID, COLNAME!4, COLNAME#6, COLNAME$7, COLNAME%8, 
> COLNAME'25, COLNAME(11, COLNAME)12, COLNAME*10, COLNAME+16, COLNAME,26, 
> COLNAME-13, COLNAME/30, COLNAME:24, COLNAME;23, COLNAME<27, COLNAME=15, 
> COLNAME>29, COLNAME?31, COLNAME@5, COLNAME`2){code}
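
Editor's note (not from the reporter): one workaround that appears consistent with parseAttributeName's escaping rules is to double any backtick embedded in the name before wrapping it in backticks, so a column literally named COLNAME`2 is referenced as `COLNAME``2`. A hedged Scala sketch, where quoteForCol and the regex-based rename are illustrative stand-ins for the reporter's surroundingBackTickedName/normalizeName helpers:

{code:java}
import org.apache.spark.sql.DataFrame

// Quote a name for Dataset.col: wrap it in backticks and escape embedded
// backticks by doubling them.
def quoteForCol(name: String): String = "`" + name.replace("`", "``") + "`"

def normalizeColumns(df: DataFrame): DataFrame =
  df.select(df.columns.map { c =>
    // Illustrative normalization: replace anything non-alphanumeric with '_'.
    df.col(quoteForCol(c)).as(c.replaceAll("[^A-Za-z0-9_]", "_"))
  }: _*)
{code}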



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38592) Column name contains back tick `

2022-03-17 Thread Dennis Du (Jira)
Dennis Du created SPARK-38592:
-

 Summary: Column name contains back tick `
 Key: SPARK-38592
 URL: https://issues.apache.org/jira/browse/SPARK-38592
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1
Reporter: Dennis Du


Trying to modify the data frame to ensure that column names have no special characters:


{code:java}
val newdf = df.select(
df.columns.map { columnName =>
df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
}: _*
){code}
col() keeps failing when the column name contains a backtick:
{code:java}
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"`COLNAME`2`" among (COLID, COLNAME!4, COLNAME#6, COLNAME$7, COLNAME%8, 
COLNAME'25, COLNAME(11, COLNAME)12, COLNAME*10, COLNAME+16, COLNAME,26, 
COLNAME-13, COLNAME/30, COLNAME:24, COLNAME;23, COLNAME<27, COLNAME=15, 
COLNAME>29, COLNAME?31, COLNAME@5, COLNAME`2){code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times

2022-03-17 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated SPARK-35066:

Description: 
Hi,

The following code snippet runs 4-5 times slower when it's used in Apache Spark 
or PySpark 3.1.1 compared to Apache Spark or PySpark 3.0.2:

 
{code:java}
spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

Toys = spark.read \
  .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k =
all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)

top50k.show()
{code}
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
partitions are respected in a way that all 12 tasks are being processed 
altogether. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
them finish immediately and only 2 are being processed. (I've tried to disable 
a couple of configs related to something similar, but none of them worked)

Screenshot of spark 3.1.1 task:

!image-2022-03-17-17-18-36-793.png|width=1073,height=652!

!image-2022-03-17-17-19-11-655.png!

 

Screenshot of spark 3.0.2 task:

 

!image-2022-03-17-17-19-34-906.png!

For a longer discussion: [Spark User List 
|https://lists.apache.org/thread/1hlg9fpxnw8dzx8bd2fvffmk7yozoszf]

 

You can reproduce this big difference of performance between Spark 3.1.1 and 
Spark 3.0.2 by using the shared code with any dataset that is large enough to 
take longer than a minute. Not sure if this is related to SQL, any Spark config 
being enabled in 3.x but not really into action before 3.1.1, or it's about 
.transform in Spark ML.

  was:
Hi,

The following code snippet runs 4-5 times slower when it's used in Apache Spark 
or PySpark 3.1.1 compared to Apache Spark or PySpark 3.0.2:

 
{code:java}
spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

Toys = spark.read \
  .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k =
all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)

top50k.show()
{code}
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
partitions are respected in a way that all 12 tasks are being processed 
altogether. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
them finish immediately and only 2 are being processed. (I've tried to disable 
a couple of configs related to something similar, but none of them worked)

Screenshot of spark 3.1.1 task:

!image-2022-03-17-17-18-36-793.png|width=1073,height=652!

!image-2022-03-17-17-19-11-655.png!

 

Screenshot of spark 3.0.2 task:

 

!image-2022-03-17-17-19-34-906.png!

For a longer discussion: [Spark User List 
|https://lists.apache.org/thread/1bslwjdwnr5tw7wjkv0672vj41x4g2f1]

 

You can reproduce this big difference of performance between Spark 3.1.1 and 
Spark 3.0.2 by using the shared code with any dataset that is large enough to 
take longer than a minute. Not sure if this is related to SQL, any Spark config 
being enabled in 3.x but not really into action before 3.1.1, or it's about 
.transform in Spark ML.


> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> -
>
> Key: SPARK-35066
> URL: https://issues.apache.org/jira/browse/SPARK-35066
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.1.1
> Environment: Spark/PySpark: 3.1.1
> Language: Python 3.7.x / Scala 2.12
> OS: macOS, Linux, and Windows
> Cloud: Databricks 7.3 for 3.0.1 

[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times

2022-03-17 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated SPARK-35066:

Description: 
Hi,

The following code snippet runs 4-5 times slower when it's used in Apache Spark 
or PySpark 3.1.1 compared to Apache Spark or PySpark 3.0.2:

 
{code:java}
spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

Toys = spark.read \
  .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k =
all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)

top50k.show()
{code}
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
partitions are respected in a way that all 12 tasks are being processed 
altogether. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
them finish immediately and only 2 are being processed. (I've tried to disable 
a couple of configs related to something similar, but none of them worked)

Screenshot of spark 3.1.1 task:

!image-2022-03-17-17-18-36-793.png|width=1073,height=652!

!image-2022-03-17-17-19-11-655.png!

 

Screenshot of spark 3.0.2 task:

 

!image-2022-03-17-17-19-34-906.png!

For a longer discussion: [Spark User List 
|https://lists.apache.org/thread/1bslwjdwnr5tw7wjkv0672vj41x4g2f1]

 

You can reproduce this big difference of performance between Spark 3.1.1 and 
Spark 3.0.2 by using the shared code with any dataset that is large enough to 
take longer than a minute. Not sure if this is related to SQL, any Spark config 
being enabled in 3.x but not really into action before 3.1.1, or it's about 
.transform in Spark ML.

  was:
Hi,

The following code snippet runs 4-5 times slower when it's used in Apache Spark 
or PySpark 3.1.1 compared to Apache Spark or PySpark 3.0.2:

 
{code:java}
spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

Toys = spark.read \
  .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k =
all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)

top50k.show()
{code}
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
partitions are respected in a way that all 12 tasks are being processed 
altogether. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
them finish immediately and only 2 are being processed. (I've tried to disable 
a couple of configs related to something similar, but none of them worked)

Screenshot of spark 3.1.1 task:

!image-2022-03-17-17-18-36-793.png|width=1073,height=652!

!image-2022-03-17-17-19-11-655.png!

 

Screenshot of spark 3.0.2 task:

 

!image-2022-03-17-17-19-34-906.png!

For a longer discussion: [Spark User List 
|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html](no
 longer exists!)

 

You can reproduce this big difference of performance between Spark 3.1.1 and 
Spark 3.0.2 by using the shared code with any dataset that is large enough to 
take longer than a minute. Not sure if this is related to SQL, any Spark config 
being enabled in 3.x but not really into action before 3.1.1, or it's about 
.transform in Spark ML.


> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> -
>
> Key: SPARK-35066
> URL: https://issues.apache.org/jira/browse/SPARK-35066
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.1.1
> Environment: Spark/PySpark: 3.1.1
> Language: Python 3.7.x / Scala 2.12
> 

[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times

2022-03-17 Thread Maziyar PANAHI (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-35066:
---
Description: 
Hi,

The following code snippet runs 4-5 times slower when it's used in Apache Spark 
or PySpark 3.1.1 compared to Apache Spark or PySpark 3.0.2:

 
{code:java}
spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

Toys = spark.read \
  .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k =
all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)

top50k.show()
{code}
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
partitions are respected in a way that all 12 tasks are being processed 
altogether. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
them finish immediately and only 2 are being processed. (I've tried to disable 
a couple of configs related to something similar, but none of them worked)

Screenshot of spark 3.1.1 task:

!image-2022-03-17-17-18-36-793.png|width=1073,height=652!

!image-2022-03-17-17-19-11-655.png!

 

Screenshot of spark 3.0.2 task:

 

!image-2022-03-17-17-19-34-906.png!

For a longer discussion: [Spark User List 
|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html](no
 longer exists!)

 

You can reproduce this big difference of performance between Spark 3.1.1 and 
Spark 3.0.2 by using the shared code with any dataset that is large enough to 
take longer than a minute. Not sure if this is related to SQL, any Spark config 
being enabled in 3.x but not really into action before 3.1.1, or it's about 
.transform in Spark ML.

  was:
Hi,

The following code snippet runs 4-5 times slower when it's used in Apache Spark 
or PySpark 3.1.1 compared to Apache Spark or PySpark 3.0.2:

 
{code:java}
spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

Toys = spark.read \
  .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k =
all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)

top50k.show()
{code}
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
partitions are respected in a way that all 12 tasks are being processed 
altogether. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
them finish immediately and only 2 are being processed. (I've tried to disable 
a couple of configs related to something similar, but none of them worked)

Screenshot of spark 3.1.1 task:

!image-2022-03-17-17-18-36-793.png|width=1073,height=652!

!image-2022-03-17-17-19-11-655.png!

 

Screenshot of spark 3.0.2 task:

 

 

!image-2022-03-17-17-19-34-906.png!

For a longer discussion: [Spark User List 
|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html](no
 longer exists!)

 

You can reproduce this big difference of performance between Spark 3.1.1 and 
Spark 3.0.2 by using the shared code with any dataset that is large enough to 
take longer than a minute. Not sure if this is related to SQL, any Spark config 
being enabled in 3.x but not really into action before 3.1.1, or it's about 
.transform in Spark ML.


> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> -
>
> Key: SPARK-35066
> URL: https://issues.apache.org/jira/browse/SPARK-35066
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.1.1
> 

[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times

2022-03-17 Thread Maziyar PANAHI (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-35066:
---
Description: 
Hi,

The following code snippet runs 4-5 times slower when it's used in Apache Spark 
or PySpark 3.1.1 compared to Apache Spark or PySpark 3.0.2:

 
{code:java}
spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

Toys = spark.read \
  .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k =
all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)

top50k.show()
{code}
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
partitions are respected in a way that all 12 tasks are being processed 
altogether. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
them finish immediately and only 2 are being processed. (I've tried to disable 
a couple of configs related to something similar, but none of them worked)

Screenshot of spark 3.1.1 task:

!image-2022-03-17-17-18-36-793.png|width=1073,height=652!

!image-2022-03-17-17-19-11-655.png!

 

Screenshot of spark 3.0.2 task:

 

 

!image-2022-03-17-17-19-34-906.png!

For a longer discussion: [Spark User List 
|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html](no
 longer exists!)

 

You can reproduce this big difference of performance between Spark 3.1.1 and 
Spark 3.0.2 by using the shared code with any dataset that is large enough to 
take longer than a minute. Not sure if this is related to SQL, any Spark config 
being enabled in 3.x but not really into action before 3.1.1, or it's about 
.transform in Spark ML.

  was:
Hi,

The following code snippet runs 4-5 times slower when it's used in Apache Spark 
or PySpark 3.1.1 compared to Apache Spark or PySpark 3.0.2:

 
{code:java}
spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

Toys = spark.read \
  .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k =
all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)

top50k.show()
{code}
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
partitions are respected in a way that all 12 tasks are being processed 
altogether. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
them finish immediately and only 2 are being processed. (I've tried to disable 
a couple of configs related to something similar, but none of them worked)

Screenshot of spark 3.1.1 task:

!image-2022-03-17-17-18-36-793.png!

!image-2022-03-17-17-19-11-655.png!

 

Screenshot of spark 3.0.2 task:

 

 

 

!image-2022-03-17-17-19-34-906.png!

For a longer discussion: [Spark User 
List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html]

 

You can reproduce this big difference of performance between Spark 3.1.1 and 
Spark 3.0.2 by using the shared code with any dataset that is large enough to 
take longer than a minute. Not sure if this is related to SQL, any Spark config 
being enabled in 3.x but not really into action before 3.1.1, or it's about 
.transform in Spark ML.


> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> -
>
> Key: SPARK-35066
> URL: https://issues.apache.org/jira/browse/SPARK-35066
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.1.1
> Environment: Spark/PySpark: 3.1.1
> 

[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times

2022-03-17 Thread Maziyar PANAHI (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-35066:
---
Description: 
Hi,

The following code snippet runs 4-5 times slower when it's used in Apache Spark 
or PySpark 3.1.1 compared to Apache Spark or PySpark 3.0.2:

 
{code:java}
spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

Toys = spark.read \
  .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k =
all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)

top50k.show()
{code}
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
partitions are respected in a way that all 12 tasks are being processed 
altogether. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
them finish immediately and only 2 are being processed. (I've tried to disable 
a couple of configs related to something similar, but none of them worked)

Screenshot of spark 3.1.1 task:

!image-2022-03-17-17-18-36-793.png!

!image-2022-03-17-17-19-11-655.png!

 

Screenshot of spark 3.0.2 task:

 

 

 

!image-2022-03-17-17-19-34-906.png!

For a longer discussion: [Spark User 
List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html]

 

You can reproduce this big difference of performance between Spark 3.1.1 and 
Spark 3.0.2 by using the shared code with any dataset that is large enough to 
take longer than a minute. Not sure if this is related to SQL, any Spark config 
being enabled in 3.x but not really into action before 3.1.1, or it's about 
.transform in Spark ML.

  was:
Hi,

The following code snippet runs 4-5 times slower when it's used in Apache Spark 
or PySpark 3.1.1 compared to Apache Spark or PySpark 3.0.2:

 
{code:java}
spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

Toys = spark.read \
  .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k =
all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)

top50k.show()
{code}
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
partitions are respected in a way that all 12 tasks are being processed all 
together. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
them finish immediately and only 2 are being processed. (I've tried to disable 
a couple of configs related to something similar, but none of them worked)

Screenshot of spark 3.1.1 task: [Spark UI 
3.1.1|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png]

Screenshot of spark 3.0.2 task: [Spark UI 
3.0.2|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png]

For a longer discussion: [Spark User 
List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html]

 

You can reproduce this big difference of performance between Spark 3.1.1 and 
Spark 3.0.2 by using the shared code with any dataset that is large enough to 
take longer than a minute. Not sure if this is related to SQL, any Spark config 
being enabled in 3.x but not really into action before 3.1.1, or it's about 
.transform in Spark ML.


> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> -
>
> Key: SPARK-35066
> URL: https://issues.apache.org/jira/browse/SPARK-35066
> Project: Spark
>  Issue Type: Bug
> 

[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times

2022-03-17 Thread Maziyar PANAHI (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-35066:
---
Attachment: image-2022-03-17-17-19-34-906.png

> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> -
>
> Key: SPARK-35066
> URL: https://issues.apache.org/jira/browse/SPARK-35066
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.1.1
> Environment: Spark/PySpark: 3.1.1
> Language: Python 3.7.x / Scala 2.12
> OS: macOS, Linux, and Windows
> Cloud: Databricks 7.3 for 3.0.1 and 8 for 3.1.1
>Reporter: Maziyar PANAHI
>Priority: Major
> Attachments: Screenshot 2021-04-07 at 11.15.48.png, Screenshot 
> 2021-04-08 at 15.08.09.png, Screenshot 2021-04-08 at 15.13.19-1.png, 
> Screenshot 2021-04-08 at 15.13.19.png, image-2022-03-17-17-18-36-793.png, 
> image-2022-03-17-17-19-11-655.png, image-2022-03-17-17-19-34-906.png
>
>
> Hi,
> The following code snippet runs 4-5 times slower when used in Apache Spark 
> or PySpark 3.1.1 compared to Apache Spark or PySpark 3.0.2:
>  
> {code:java}
> spark = SparkSession.builder \
>         .master("local[*]") \
>         .config("spark.driver.memory", "16G") \
>         .config("spark.driver.maxResultSize", "0") \
>         .config("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer") \
>         .config("spark.kryoserializer.buffer.max", "2000m") \
>         .getOrCreate()
> Toys = spark.read \
>   .parquet('./toys-cleaned').repartition(12)
> # tokenize the text
> regexTokenizer = RegexTokenizer(inputCol="reviewText",
> outputCol="all_words", pattern="\\W")
> toys_with_words = regexTokenizer.transform(Toys)
> # remove stop words
> remover = StopWordsRemover(inputCol="all_words", outputCol="words")
> toys_with_tokens = remover.transform(toys_with_words).drop("all_words")
> all_words = toys_with_tokens.select(explode("words").alias("word"))
> # group by, sort and limit to 50k
> top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)
> top50k.show()
> {code}
>  
> Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
> partitions are respected in a way that all 12 tasks are being processed all 
> together. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
> them finish immediately and only 2 are being processed. (I've tried to 
> disable a couple of configs related to something similar, but none of them 
> worked)
> Screenshot of spark 3.1.1 task: [Spark UI 
> 3.1.1|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png]
> Screenshot of spark 3.0.2 task: [Spark UI 
> 3.0.2|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png]
> For a longer discussion: [Spark User 
> List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html]
>  
> You can reproduce this big performance difference between Spark 3.1.1 and 
> Spark 3.0.2 by using the shared code with any dataset that is large enough to 
> take longer than a minute. I am not sure whether this is related to SQL, to 
> some Spark config that was enabled in 3.x but did not really take effect 
> before 3.1.1, or to .transform in Spark ML.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times

2022-03-17 Thread Maziyar PANAHI (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-35066:
---
Attachment: image-2022-03-17-17-19-11-655.png

> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> -
>
> Key: SPARK-35066
> URL: https://issues.apache.org/jira/browse/SPARK-35066
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.1.1
> Environment: Spark/PySpark: 3.1.1
> Language: Python 3.7.x / Scala 12
> OS: macOS, Linux, and Windows
> Cloud: Databricks 7.3 for 3.0.1 and 8 for 3.1.1
>Reporter: Maziyar PANAHI
>Priority: Major
> Attachments: Screenshot 2021-04-07 at 11.15.48.png, Screenshot 
> 2021-04-08 at 15.08.09.png, Screenshot 2021-04-08 at 15.13.19-1.png, 
> Screenshot 2021-04-08 at 15.13.19.png, image-2022-03-17-17-18-36-793.png, 
> image-2022-03-17-17-19-11-655.png, image-2022-03-17-17-19-34-906.png
>
>
> Hi,
> The following code snippet runs 4-5 times slower when used in Apache Spark 
> or PySpark 3.1.1 compared to Apache Spark or PySpark 3.0.2:
>  
> {code:java}
> spark = SparkSession.builder \
>         .master("local[*]") \
>         .config("spark.driver.memory", "16G") \
>         .config("spark.driver.maxResultSize", "0") \
>         .config("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer") \
>         .config("spark.kryoserializer.buffer.max", "2000m") \
>         .getOrCreate()
> Toys = spark.read \
>   .parquet('./toys-cleaned').repartition(12)
> # tokenize the text
> regexTokenizer = RegexTokenizer(inputCol="reviewText",
> outputCol="all_words", pattern="\\W")
> toys_with_words = regexTokenizer.transform(Toys)
> # remove stop words
> remover = StopWordsRemover(inputCol="all_words", outputCol="words")
> toys_with_tokens = remover.transform(toys_with_words).drop("all_words")
> all_words = toys_with_tokens.select(explode("words").alias("word"))
> # group by, sort and limit to 50k
> top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)
> top50k.show()
> {code}
>  
> Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
> partitions are respected in a way that all 12 tasks are being processed all 
> together. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
> them finish immediately and only 2 are being processed. (I've tried to 
> disable a couple of configs related to something similar, but none of them 
> worked)
> Screenshot of spark 3.1.1 task: [Spark UI 
> 3.1.1|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png]
> Screenshot of spark 3.0.2 task: [Spark UI 
> 3.0.2|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png]
> For a longer discussion: [Spark User 
> List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html]
>  
> You can reproduce this big performance difference between Spark 3.1.1 and 
> Spark 3.0.2 by using the shared code with any dataset that is large enough to 
> take longer than a minute. I am not sure whether this is related to SQL, to 
> some Spark config that was enabled in 3.x but did not really take effect 
> before 3.1.1, or to .transform in Spark ML.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times

2022-03-17 Thread Maziyar PANAHI (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-35066:
---
Attachment: Screenshot 2021-04-08 at 15.13.19-1.png

> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> -
>
> Key: SPARK-35066
> URL: https://issues.apache.org/jira/browse/SPARK-35066
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.1.1
> Environment: Spark/PySpark: 3.1.1
> Language: Python 3.7.x / Scala 12
> OS: macOS, Linux, and Windows
> Cloud: Databricks 7.3 for 3.0.1 and 8 for 3.1.1
>Reporter: Maziyar PANAHI
>Priority: Major
> Attachments: Screenshot 2021-04-07 at 11.15.48.png, Screenshot 
> 2021-04-08 at 15.08.09.png, Screenshot 2021-04-08 at 15.13.19-1.png, 
> Screenshot 2021-04-08 at 15.13.19.png, image-2022-03-17-17-18-36-793.png
>
>
> Hi,
> The following code snippet runs 4-5 times slower when used in Apache Spark 
> or PySpark 3.1.1 compared to Apache Spark or PySpark 3.0.2:
>  
> {code:java}
> spark = SparkSession.builder \
>         .master("local[*]") \
>         .config("spark.driver.memory", "16G") \
>         .config("spark.driver.maxResultSize", "0") \
>         .config("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer") \
>         .config("spark.kryoserializer.buffer.max", "2000m") \
>         .getOrCreate()
> Toys = spark.read \
>   .parquet('./toys-cleaned').repartition(12)
> # tokenize the text
> regexTokenizer = RegexTokenizer(inputCol="reviewText",
> outputCol="all_words", pattern="\\W")
> toys_with_words = regexTokenizer.transform(Toys)
> # remove stop words
> remover = StopWordsRemover(inputCol="all_words", outputCol="words")
> toys_with_tokens = remover.transform(toys_with_words).drop("all_words")
> all_words = toys_with_tokens.select(explode("words").alias("word"))
> # group by, sort and limit to 50k
> top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)
> top50k.show()
> {code}
>  
> Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
> partitions are respected in a way that all 12 tasks are being processed all 
> together. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
> them finish immediately and only 2 are being processed. (I've tried to 
> disable a couple of configs related to something similar, but none of them 
> worked)
> Screenshot of spark 3.1.1 task: [Spark UI 
> 3.1.1|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png]
> Screenshot of spark 3.0.2 task: [Spark UI 
> 3.0.2|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png]
> For a longer discussion: [Spark User 
> List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html]
>  
> You can reproduce this big performance difference between Spark 3.1.1 and 
> Spark 3.0.2 by using the shared code with any dataset that is large enough to 
> take longer than a minute. I am not sure whether this is related to SQL, to 
> some Spark config that was enabled in 3.x but did not really take effect 
> before 3.1.1, or to .transform in Spark ML.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times

2022-03-17 Thread Maziyar PANAHI (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-35066:
---
Attachment: image-2022-03-17-17-18-36-793.png

> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> -
>
> Key: SPARK-35066
> URL: https://issues.apache.org/jira/browse/SPARK-35066
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.1.1
> Environment: Spark/PySpark: 3.1.1
> Language: Python 3.7.x / Scala 12
> OS: macOS, Linux, and Windows
> Cloud: Databricks 7.3 for 3.0.1 and 8 for 3.1.1
>Reporter: Maziyar PANAHI
>Priority: Major
> Attachments: Screenshot 2021-04-07 at 11.15.48.png, Screenshot 
> 2021-04-08 at 15.08.09.png, Screenshot 2021-04-08 at 15.13.19-1.png, 
> Screenshot 2021-04-08 at 15.13.19.png, image-2022-03-17-17-18-36-793.png
>
>
> Hi,
> The following code snippet runs 4-5 times slower when used in Apache Spark 
> or PySpark 3.1.1 compared to Apache Spark or PySpark 3.0.2:
>  
> {code:java}
> spark = SparkSession.builder \
>         .master("local[*]") \
>         .config("spark.driver.memory", "16G") \
>         .config("spark.driver.maxResultSize", "0") \
>         .config("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer") \
>         .config("spark.kryoserializer.buffer.max", "2000m") \
>         .getOrCreate()
> Toys = spark.read \
>   .parquet('./toys-cleaned').repartition(12)
> # tokenize the text
> regexTokenizer = RegexTokenizer(inputCol="reviewText",
> outputCol="all_words", pattern="\\W")
> toys_with_words = regexTokenizer.transform(Toys)
> # remove stop words
> remover = StopWordsRemover(inputCol="all_words", outputCol="words")
> toys_with_tokens = remover.transform(toys_with_words).drop("all_words")
> all_words = toys_with_tokens.select(explode("words").alias("word"))
> # group by, sort and limit to 50k
> top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)
> top50k.show()
> {code}
>  
> Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
> partitions are respected in a way that all 12 tasks are being processed all 
> together. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
> them finish immediately and only 2 are being processed. (I've tried to 
> disable a couple of configs related to something similar, but none of them 
> worked)
> Screenshot of spark 3.1.1 task: [Spark UI 
> 3.1.1|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png]
> Screenshot of spark 3.0.2 task: [Spark UI 
> 3.0.2|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png]
> For a longer discussion: [Spark User 
> List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html]
>  
> You can reproduce this big performance difference between Spark 3.1.1 and 
> Spark 3.0.2 by using the shared code with any dataset that is large enough to 
> take longer than a minute. I am not sure whether this is related to SQL, to 
> some Spark config that was enabled in 3.x but did not really take effect 
> before 3.1.1, or to .transform in Spark ML.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38577) Interval types are not truncated to the expected endField when creating a DataFrame via Duration

2022-03-17 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508269#comment-17508269
 ] 

Robert Joseph Evans commented on SPARK-38577:
-

This is especially problematic because it is really inconsistent.
{code:scala}
import java.time.Duration
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

val data = Seq(
  Row(Duration.ofDays(1).plusSeconds(1)),
  Row(Duration.ofDays(2).plusMinutes(2)))

val schema = StructType(Array(StructField("dur",
  DayTimeIntervalType(DayTimeIntervalType.DAY, DayTimeIntervalType.DAY))))

val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

df.selectExpr("dur", "CAST(dur AS long)",
  "CAST('1970-1-1' as timestamp) + dur as ts").show()
+----------------+---+-------------------+
|             dur|dur|                 ts|
+----------------+---+-------------------+
|INTERVAL '1' DAY|  1|1970-01-02 00:00:01|
|INTERVAL '2' DAY|  2|1970-01-03 00:02:00|
+----------------+---+-------------------+

df.select(col("dur"),
  col("dur").cast(DayTimeIntervalType()).alias("default_dur")).show(truncate = false)
+----------------+-----------------------------------+
|dur             |default_dur                        |
+----------------+-----------------------------------+
|INTERVAL '1' DAY|INTERVAL '1 00:00:01' DAY TO SECOND|
|INTERVAL '2' DAY|INTERVAL '2 00:02:00' DAY TO SECOND|
+----------------+-----------------------------------+
{code}
Casting the values to a different type truncates them when dropping precision, 
but increasing precision or doing math with them does not.

Saving the data to Parquet keeps it exactly as it was input, but saving it to 
CSV truncates it.
{code:scala}
df.write.parquet("./tmp")

val df2 = spark.read.parquet("./tmp")

df2.selectExpr("dur", "CAST(dur AS long)",
  "CAST('1970-1-1' as timestamp) + dur as ts").show()
+----------------+---+-------------------+
|             dur|dur|                 ts|
+----------------+---+-------------------+
|INTERVAL '1' DAY|  1|1970-01-02 00:00:01|
|INTERVAL '2' DAY|  2|1970-01-03 00:02:00|
+----------------+---+-------------------+

df.write.csv("./tmp_csv")

val df3 = spark.read.schema(schema).csv("./tmp_csv")

df3.selectExpr("dur", "CAST(dur AS long)",
  "CAST('1970-1-1' as timestamp) + dur as ts").show()
+----------------+---+-------------------+
|             dur|dur|                 ts|
+----------------+---+-------------------+
|INTERVAL '2' DAY|  2|1970-01-03 00:00:00|
|INTERVAL '1' DAY|  1|1970-01-02 00:00:00|
+----------------+---+-------------------+
{code}
All of this is also true in the Python API.

 

I would expect to get an error when importing the data, or to have Spark 
truncate/fix the data when it is imported, so I don't get inconsistent and 
confusing results.
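In the meantime, a workaround sketch, assuming it is acceptable to truncate on 
the caller's side before the DataFrame is built:

{code:scala}
// Truncate each Duration to whole days up front, so the microseconds Spark
// stores actually match the declared DAY TO DAY interval type.
import java.time.Duration

def truncateToDays(d: Duration): Duration = Duration.ofDays(d.toDays)

val truncated = Seq(
  Duration.ofDays(1).plusSeconds(1),
  Duration.ofDays(2).plusMinutes(2)
).map(truncateToDays)
// truncated: List(PT24H, PT48H) -- the stray second and minutes are gone
{code}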

 

If this works as intended, then I would like to see better documentation of 
what is happening.

> Interval types are not truncated to the expected endField when creating a 
> DataFrame via Duration
> 
>
> Key: SPARK-38577
> URL: https://issues.apache.org/jira/browse/SPARK-38577
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
> Environment: Spark 3.3.0 snapshot version
>  
>Reporter: chong
>Priority: Major
>
> *Problem:*
> ANSI interval types are stored as long values internally.
> The long value is not truncated to the expected endField when creating a 
> DataFrame via Duration.
>  
> *Reproduce:*
> Create a "day to day" interval; the seconds are not truncated, see the code below.
> The internal long value is not 86400 * 1000000 (one day in microseconds), but 
> (86400 + 1) * 1000000.
>  
> {code:java}
>   test("my test") {
> val data = Seq(Row(Duration.ofDays(1).plusSeconds(1)))
> val schema = StructType(Array(
>   StructField("t", DayTimeIntervalType(DayTimeIntervalType.DAY, 
> DayTimeIntervalType.DAY))
> ))
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), 
> schema)
> df.show()
>   } {code}
>  
>  
> After debugging, the {{endField}} is always {{SECOND}} in 
> {{durationToMicros}}, see below:
>  
> {code:java}
>   // IntervalUtils class
>   def durationToMicros(duration: Duration): Long = {
> durationToMicros(duration, DT.SECOND)   // always SECOND
>   }
>   def durationToMicros(duration: Duration, endField: Byte)
> {code}
> It seems a different endField should be used, which could be one of [DAY, HOUR, MINUTE, SECOND].
> Alternatively, Spark could throw an exception to avoid truncating.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38591) Add flatMapSortedGroups to KeyValueGroupedDataset

2022-03-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38591:


Assignee: (was: Apache Spark)

>  Add flatMapSortedGroups to KeyValueGroupedDataset
> --
>
> Key: SPARK-38591
> URL: https://issues.apache.org/jira/browse/SPARK-38591
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Enrico Minack
>Priority: Major
>
> The existing method {{KeyValueGroupedDataset.flatMapGroups}} provides an 
> iterator of rows for each group key. If user code requires those rows 
> in a particular order, that iterator has to be sorted first, which defeats 
> the purpose of an iterator in the first place. For groups that do not 
> fit into the memory of one executor, this approach does not work.
> [org.apache.spark.sql.KeyValueGroupedDataset|https://github.com/apache/spark/blob/47485a3c2df3201c838b939e82d5b26332e2d858/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L134-L137]:
> {noformat}
> Internally, the implementation will spill to disk if any given group is too 
> large to fit into
> memory. However, users must take care to avoid materializing the whole 
> iterator for a group
> (for example, by calling `toList`) unless they are sure that this is possible 
> given the memory
> constraints of their cluster.
> {noformat}
> The implementation of {{KeyValueGroupedDataset.flatMapGroups}} already sorts 
> each partition according to the group key. By additionally sorting by some 
> data columns, the iterator can be guaranteed to provide some order.
> A new method {{KeyValueGroupedDataset.flatMapSortedGroups}} could allow users 
> to define an order within the groups.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38591) Add flatMapSortedGroups to KeyValueGroupedDataset

2022-03-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38591:


Assignee: Apache Spark

>  Add flatMapSortedGroups to KeyValueGroupedDataset
> --
>
> Key: SPARK-38591
> URL: https://issues.apache.org/jira/browse/SPARK-38591
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Enrico Minack
>Assignee: Apache Spark
>Priority: Major
>
> The existing method {{KeyValueGroupedDataset.flatMapGroups}} provides an 
> iterator of rows for each group key. If user code requires those rows 
> in a particular order, that iterator has to be sorted first, which defeats 
> the purpose of an iterator in the first place. For groups that do not 
> fit into the memory of one executor, this approach does not work.
> [org.apache.spark.sql.KeyValueGroupedDataset|https://github.com/apache/spark/blob/47485a3c2df3201c838b939e82d5b26332e2d858/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L134-L137]:
> {noformat}
> Internally, the implementation will spill to disk if any given group is too 
> large to fit into
> memory. However, users must take care to avoid materializing the whole 
> iterator for a group
> (for example, by calling `toList`) unless they are sure that this is possible 
> given the memory
> constraints of their cluster.
> {noformat}
> The implementation of {{KeyValueGroupedDataset.flatMapGroups}} already sorts 
> each partition according to the group key. By additionally sorting by some 
> data columns, the iterator can be guaranteed to provide some order.
> A new method {{KeyValueGroupedDataset.flatMapSortedGroups}} could allow users 
> to define an order within the groups.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38591) Add flatMapSortedGroups to KeyValueGroupedDataset

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508265#comment-17508265
 ] 

Apache Spark commented on SPARK-38591:
--

User 'EnricoMi' has created a pull request for this issue:
https://github.com/apache/spark/pull/35899

>  Add flatMapSortedGroups to KeyValueGroupedDataset
> --
>
> Key: SPARK-38591
> URL: https://issues.apache.org/jira/browse/SPARK-38591
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Enrico Minack
>Priority: Major
>
> The existing method {{KeyValueGroupedDataset.flatMapGroups}} provides an 
> iterator of rows for each group key. If user code requires those rows 
> in a particular order, that iterator has to be sorted first, which defeats 
> the purpose of an iterator in the first place. For groups that do not 
> fit into the memory of one executor, this approach does not work.
> [org.apache.spark.sql.KeyValueGroupedDataset|https://github.com/apache/spark/blob/47485a3c2df3201c838b939e82d5b26332e2d858/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L134-L137]:
> {noformat}
> Internally, the implementation will spill to disk if any given group is too 
> large to fit into
> memory. However, users must take care to avoid materializing the whole 
> iterator for a group
> (for example, by calling `toList`) unless they are sure that this is possible 
> given the memory
> constraints of their cluster.
> {noformat}
> The implementation of {{KeyValueGroupedDataset.flatMapGroups}} already sorts 
> each partition according to the group key. By additionally sorting by some 
> data columns, the iterator can be guaranteed to provide some order.
> A new method {{KeyValueGroupedDataset.flatMapSortedGroups}} could allow users 
> to define an order within the groups.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38591) Add flatMapSortedGroups to KeyValueGroupedDataset

2022-03-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38591:


Assignee: Apache Spark

>  Add flatMapSortedGroups to KeyValueGroupedDataset
> --
>
> Key: SPARK-38591
> URL: https://issues.apache.org/jira/browse/SPARK-38591
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Enrico Minack
>Assignee: Apache Spark
>Priority: Major
>
> The existing method {{KeyValueGroupedDataset.flatMapGroups}} provides an 
> iterator of rows for each group key. If user code requires those rows 
> in a particular order, that iterator has to be sorted first, which defeats 
> the purpose of an iterator in the first place. For groups that do not 
> fit into the memory of one executor, this approach does not work.
> [org.apache.spark.sql.KeyValueGroupedDataset|https://github.com/apache/spark/blob/47485a3c2df3201c838b939e82d5b26332e2d858/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L134-L137]:
> {noformat}
> Internally, the implementation will spill to disk if any given group is too 
> large to fit into
> memory. However, users must take care to avoid materializing the whole 
> iterator for a group
> (for example, by calling `toList`) unless they are sure that this is possible 
> given the memory
> constraints of their cluster.
> {noformat}
> The implementation of {{KeyValueGroupedDataset.flatMapGroups}} already sorts 
> each partition according to the group key. By additionally sorting by some 
> data columns, the iterator can be guaranteed to provide some order.
> A new method {{KeyValueGroupedDataset.flatMapSortedGroups}} could allow users 
> to define an order within the groups.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38591) Add flatMapSortedGroups to KeyValueGroupedDataset

2022-03-17 Thread Enrico Minack (Jira)
Enrico Minack created SPARK-38591:
-

 Summary:  Add flatMapSortedGroups to KeyValueGroupedDataset
 Key: SPARK-38591
 URL: https://issues.apache.org/jira/browse/SPARK-38591
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.0
Reporter: Enrico Minack


The existing method {{KeyValueGroupedDataset.flatMapGroups}} provides an 
iterator of rows for each group key. If user code requires those rows in a 
particular order, that iterator has to be sorted first, which defeats the 
purpose of an iterator in the first place. For groups that do not fit into the 
memory of one executor, this approach does not work.

[org.apache.spark.sql.KeyValueGroupedDataset|https://github.com/apache/spark/blob/47485a3c2df3201c838b939e82d5b26332e2d858/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L134-L137]:
{noformat}
Internally, the implementation will spill to disk if any given group is too 
large to fit into
memory. However, users must take care to avoid materializing the whole iterator 
for a group
(for example, by calling `toList`) unless they are sure that this is possible 
given the memory
constraints of their cluster.
{noformat}


The implementation of {{KeyValueGroupedDataset.flatMapGroups}} already sorts 
each partition according to the group key. By additionally sorting by some data 
columns, the iterator can be guaranteed to provide some order.

A new method {{KeyValueGroupedDataset.flatMapSortedGroups}} could allow users 
to define an order within the groups, as sketched below.
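A hypothetical usage sketch, assuming the new method mirrors 
{{flatMapGroups}} and additionally accepts the columns to sort each group by:

{code:scala}
// Hypothetical usage: rows inside each group arrive already sorted by "time",
// so the earliest event per key can be emitted without sorting in memory.
import spark.implicits._

case class Event(id: Int, time: Long, value: Double)

val ds = Seq(Event(1, 2L, 1.0), Event(1, 1L, 2.0), Event(2, 5L, 3.0)).toDS()

val firstPerKey = ds
  .groupByKey(_.id)
  .flatMapSortedGroups($"time") { (id: Int, events: Iterator[Event]) =>
    events.take(1)
  }
{code}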



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38544) Upgrade log4j2 to 2.17.2

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508195#comment-17508195
 ] 

Apache Spark commented on SPARK-38544:
--

User 'jackylee-ch' has created a pull request for this issue:
https://github.com/apache/spark/pull/35898

> Upgrade log4j2 to 2.17.2
> 
>
> Key: SPARK-38544
> URL: https://issues.apache.org/jira/browse/SPARK-38544
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> Log4j 2.17.2 has been released; the main changes are:
>  * Over 50 improvements and fixes to the Log4j 1.x support. Continued testing 
> has shown it is a suitable replacement for Log4j 1.x in most cases.
>  * Scripting now requires a system property be specified naming the languages 
> the user wishes to allow. The scripting engine will not load if the property 
> isn't set.
>  * By default, the only remote protocol allowed for loading configuration 
> files is HTTPS. Users can specify a system property to allow others or 
> prevent remote loading entirely.
>  * Variable resolution has been modified so that only properties defined as 
> properties in the configuration file can be recursive. All other Lookups are 
> now non-recursive. This addresses issues users were having resolving lookups 
> specified in property definitions for use in the RoutingAppender and 
> RollingFileAppender due to restrictions put in place in 2.17.1.
>  * Many other fixes and improvements.
>  
> The change report is as follows:
> https://logging.apache.org/log4j/2.x/changes-report.html#a2.17.2



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38590) New SQL function: try_to_binary

2022-03-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38590:


Assignee: Apache Spark  (was: Gengliang Wang)

> New SQL function: try_to_binary
> ---
>
> Key: SPARK-38590
> URL: https://issues.apache.org/jira/browse/SPARK-38590
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38590) New SQL function: try_to_binary

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508158#comment-17508158
 ] 

Apache Spark commented on SPARK-38590:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35897

> New SQL function: try_to_binary
> ---
>
> Key: SPARK-38590
> URL: https://issues.apache.org/jira/browse/SPARK-38590
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38590) New SQL function: try_to_binary

2022-03-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38590:


Assignee: Gengliang Wang  (was: Apache Spark)

> New SQL function: try_to_binary
> ---
>
> Key: SPARK-38590
> URL: https://issues.apache.org/jira/browse/SPARK-38590
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38590) New SQL function: try_to_binary

2022-03-17 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-38590:
--

 Summary: New SQL function: try_to_binary
 Key: SPARK-38590
 URL: https://issues.apache.org/jira/browse/SPARK-38590
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38589) New SQL function: try_avg

2022-03-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38589:


Assignee: Apache Spark  (was: Gengliang Wang)

> New SQL function: try_avg
> -
>
> Key: SPARK-38589
> URL: https://issues.apache.org/jira/browse/SPARK-38589
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38589) New SQL function: try_avg

2022-03-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38589:


Assignee: Gengliang Wang  (was: Apache Spark)

> New SQL function: try_avg
> -
>
> Key: SPARK-38589
> URL: https://issues.apache.org/jira/browse/SPARK-38589
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38589) New SQL function: try_avg

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508156#comment-17508156
 ] 

Apache Spark commented on SPARK-38589:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35896

> New SQL function: try_avg
> -
>
> Key: SPARK-38589
> URL: https://issues.apache.org/jira/browse/SPARK-38589
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38589) New SQL function: try_avg

2022-03-17 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-38589:
--

 Summary: New SQL function: try_avg
 Key: SPARK-38589
 URL: https://issues.apache.org/jira/browse/SPARK-38589
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38587) Validating new location for rename command should use formatted names

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508135#comment-17508135
 ] 

Apache Spark commented on SPARK-38587:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/35895

> Validating new location for rename command should use formatted names
> -
>
> Key: SPARK-38587
> URL: https://issues.apache.org/jira/browse/SPARK-38587
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.1, 3.3.0
>Reporter: Kent Yao
>Priority: Major
>
> ```
> {code:java}
> [info] - ALTER TABLE .. RENAME using V1 catalog V1 command: newName *** 
> FAILED *** (61 milliseconds)
> [info]   org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: 
> Database 'CaseUpperCaseLower' not found
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:42)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists$(ExternalCatalog.scala:40)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.requireDbExists(InMemoryCatalog.scala:47)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.getDatabase(InMemoryCatalog.scala:171)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getDatabase(ExternalCatalogWithListener.scala:65)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateNewLocationOfRename(SessionCatalog.scala:1863)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.renameTable(SessionCatalog.scala:739)
> [info]   at 
> org.apache.spark.sql.execution.command.AlterTableRenameCommand.run(tables.scala:209
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:491)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:491)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:467)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
> [info]   at org.apache.spark.sql.Dataset.(Dataset.scala:220)
> [info]   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
> [info]   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
> [info]   at 
> 

[jira] [Assigned] (SPARK-38587) Validating new location for rename command should use formatted names

2022-03-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38587:


Assignee: (was: Apache Spark)

> Validating new location for rename command should use formatted names
> -
>
> Key: SPARK-38587
> URL: https://issues.apache.org/jira/browse/SPARK-38587
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.1, 3.3.0
>Reporter: Kent Yao
>Priority: Major
>
> ```
> {code:java}
> [info] - ALTER TABLE .. RENAME using V1 catalog V1 command: newName *** 
> FAILED *** (61 milliseconds)
> [info]   org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: 
> Database 'CaseUpperCaseLower' not found
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:42)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists$(ExternalCatalog.scala:40)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.requireDbExists(InMemoryCatalog.scala:47)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.getDatabase(InMemoryCatalog.scala:171)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getDatabase(ExternalCatalogWithListener.scala:65)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateNewLocationOfRename(SessionCatalog.scala:1863)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.renameTable(SessionCatalog.scala:739)
> [info]   at 
> org.apache.spark.sql.execution.command.AlterTableRenameCommand.run(tables.scala:209
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:491)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:491)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:467)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
> [info]   at org.apache.spark.sql.Dataset.(Dataset.scala:220)
> [info]   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
> [info]   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
> [info]   at 
> org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:622)
> [info]   at 
> 

[jira] [Commented] (SPARK-38587) Validating new location for rename command should use formatted names

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508134#comment-17508134
 ] 

Apache Spark commented on SPARK-38587:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/35895

> Validating new location for rename command should use formatted names
> -
>
> Key: SPARK-38587
> URL: https://issues.apache.org/jira/browse/SPARK-38587
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.1, 3.3.0
>Reporter: Kent Yao
>Priority: Major
>
> ```
> {code:java}
> [info] - ALTER TABLE .. RENAME using V1 catalog V1 command: newName *** 
> FAILED *** (61 milliseconds)
> [info]   org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: 
> Database 'CaseUpperCaseLower' not found
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:42)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists$(ExternalCatalog.scala:40)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.requireDbExists(InMemoryCatalog.scala:47)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.getDatabase(InMemoryCatalog.scala:171)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getDatabase(ExternalCatalogWithListener.scala:65)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateNewLocationOfRename(SessionCatalog.scala:1863)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.renameTable(SessionCatalog.scala:739)
> [info]   at 
> org.apache.spark.sql.execution.command.AlterTableRenameCommand.run(tables.scala:209
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:491)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:491)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:467)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
> [info]   at org.apache.spark.sql.Dataset.(Dataset.scala:220)
> [info]   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
> [info]   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
> [info]   at 
> 

[jira] [Assigned] (SPARK-38587) Validating new location for rename command should use formatted names

2022-03-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38587:


Assignee: Apache Spark

> Validating new location for rename command should use formatted names
> -
>
> Key: SPARK-38587
> URL: https://issues.apache.org/jira/browse/SPARK-38587
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.1, 3.3.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> ```
> {code:java}
> [info] - ALTER TABLE .. RENAME using V1 catalog V1 command: newName *** 
> FAILED *** (61 milliseconds)
> [info]   org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: 
> Database 'CaseUpperCaseLower' not found
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:42)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists$(ExternalCatalog.scala:40)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.requireDbExists(InMemoryCatalog.scala:47)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.getDatabase(InMemoryCatalog.scala:171)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getDatabase(ExternalCatalogWithListener.scala:65)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateNewLocationOfRename(SessionCatalog.scala:1863)
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.renameTable(SessionCatalog.scala:739)
> [info]   at 
> org.apache.spark.sql.execution.command.AlterTableRenameCommand.run(tables.scala:209
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:491)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:491)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:467)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
> [info]   at org.apache.spark.sql.Dataset.(Dataset.scala:220)
> [info]   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
> [info]   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
> [info]   at 
> org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:622)
> [info] 

[jira] [Commented] (SPARK-38575) Deduplicate branch specification in GitHub Actions workflow

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508128#comment-17508128
 ] 

Apache Spark commented on SPARK-38575:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35894

> Deduplicate branch specification in GitHub Actions workflow
> ---
>
> Key: SPARK-38575
> URL: https://issues.apache.org/jira/browse/SPARK-38575
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently we have to make some changes every time we cut a branch, as in 
> https://github.com/apache/spark/pull/35876. We should ideally make this work 
> automatically without requiring such a change.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38588) Validate input dataset of LinearSVC

2022-03-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38588:


Assignee: (was: Apache Spark)

> Validate input dataset of LinearSVC
> ---
>
> Key: SPARK-38588
> URL: https://issues.apache.org/jira/browse/SPARK-38588
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Priority: Major
>
> LinearSVC should fail fast if the input dataset contains invalid values.
>  
> {code:java}
> import org.apache.spark.ml.feature._
> import org.apache.spark.ml.linalg._
> import org.apache.spark.ml.classification._
> import org.apache.spark.ml.clustering._
> val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, 
> Double.NaN)), LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 
> 2.0)))).toDF()
> val svc = new LinearSVC()
> val model = svc.fit(df)
> scala> model.intercept
> res0: Double = NaN
> scala> model.coefficients
> res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN] {code}
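
Until such a check ships in Spark itself, callers can guard their own input before fitting. Below is a minimal user-side sketch, assuming Spark 3.x (where org.apache.spark.ml.functions.vector_to_array and the higher-order exists function are available) and the "features" column name produced by the LabeledPoint schema above:

{code:java}
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.functions.{abs, col, exists, isnan, lit}

// Collect the rows whose features vector contains NaN or +/-Infinity ...
val invalid = df.filter(
  exists(vector_to_array(col("features")),
    v => isnan(v) || abs(v) === lit(Double.PositiveInfinity)))

// ... and fail loudly before fit() is ever called.
require(invalid.isEmpty, "features column contains NaN/Infinity values")
{code}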






[jira] [Commented] (SPARK-38588) Validate input dataset of LinearSVC

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508118#comment-17508118
 ] 

Apache Spark commented on SPARK-38588:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/35893

> Validate input dataset of LinearSVC
> ---
>
> Key: SPARK-38588
> URL: https://issues.apache.org/jira/browse/SPARK-38588
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Priority: Major
>
> LinearSVC should fail fast if the input dataset contains invalid values.
>  
> {code:java}
> import org.apache.spark.ml.feature._
> import org.apache.spark.ml.linalg._
> import org.apache.spark.ml.classification._
> import org.apache.spark.ml.clustering._
> val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, 
> Double.NaN)), LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 
> 2.0)))).toDF()
> val svc = new LinearSVC()
> val model = svc.fit(df)
> scala> model.intercept
> res0: Double = NaN
> scala> model.coefficients
> res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN] {code}






[jira] [Assigned] (SPARK-38588) Validate input dataset of LinearSVC

2022-03-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38588:


Assignee: Apache Spark

> Validate input dataset of LinearSVC
> ---
>
> Key: SPARK-38588
> URL: https://issues.apache.org/jira/browse/SPARK-38588
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>
> LinearSVC should fail fast if the input dataset contains invalid values.
>  
> {code:java}
> import org.apache.spark.ml.feature._
> import org.apache.spark.ml.linalg._
> import org.apache.spark.ml.classification._
> import org.apache.spark.ml.clustering._
> val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, 
> Double.NaN)), LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 
> 2.0)))).toDF()
> val svc = new LinearSVC()
> val model = svc.fit(df)
> scala> model.intercept
> res0: Double = NaN
> scala> model.coefficients
> res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN] {code}






[jira] [Commented] (SPARK-38588) Validate input dataset of LinearSVC

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508119#comment-17508119
 ] 

Apache Spark commented on SPARK-38588:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/35893

> Validate input dataset of LinearSVC
> ---
>
> Key: SPARK-38588
> URL: https://issues.apache.org/jira/browse/SPARK-38588
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Priority: Major
>
> LinearSVC should fail fast if the input dataset contains invalid values.
>  
> {code:java}
> import org.apache.spark.ml.feature._
> import org.apache.spark.ml.linalg._
> import org.apache.spark.ml.classification._
> import org.apache.spark.ml.clustering._
> val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, 
> Double.NaN)), LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 
> 2.0)))).toDF()
> val svc = new LinearSVC()
> val model = svc.fit(df)
> scala> model.intercept
> res0: Double = NaN
> scala> model.coefficients
> res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN] {code}






[jira] [Created] (SPARK-38588) Validate input dataset of LinearSVC

2022-03-17 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-38588:


 Summary: Validate input dataset of LinearSVC
 Key: SPARK-38588
 URL: https://issues.apache.org/jira/browse/SPARK-38588
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 3.4.0
Reporter: zhengruifeng


LinearSVC should fail fast if the input dataset contains invalid values.

 
{code:java}
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.clustering._
val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)), 
LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0)))).toDF()

val svc = new LinearSVC()
val model = svc.fit(df)

scala> model.intercept
res0: Double = NaN

scala> model.coefficients
res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN] {code}
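
For the library-side fail-fast behaviour this ticket asks for, the idea is to reject non-finite feature values before any optimization work starts. The sketch below is only illustrative and is not Spark's actual implementation; the helper name validateFeatures and the hard-coded "features" column are assumptions:

{code:java}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{DataFrame, Row}

// Fail fast if any features vector contains NaN or +/-Infinity.
// A real trainer would read the column name from its featuresCol param.
def validateFeatures(dataset: DataFrame, featuresCol: String = "features"): Unit = {
  dataset.select(featuresCol).rdd.foreach { case Row(v: Vector) =>
    require(v.toArray.forall(x => !x.isNaN && !x.isInfinity),
      s"Features must be finite, but got $v")
  }
}

// e.g. invoked at the start of LinearSVC.fit(), before any aggregation:
// validateFeatures(df)
{code}

Throwing inside the executor surfaces on the driver as a SparkException wrapping the IllegalArgumentException; an alternative is to count invalid rows with an aggregation and raise a single error on the driver.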






[jira] [Created] (SPARK-38587) Validating new location for rename command should use formatted names

2022-03-17 Thread Kent Yao (Jira)
Kent Yao created SPARK-38587:


 Summary: Validating new location for rename command should use 
formatted names
 Key: SPARK-38587
 URL: https://issues.apache.org/jira/browse/SPARK-38587
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1, 3.1.2, 3.0.3, 3.3.0
Reporter: Kent Yao


{code:java}
[info] - ALTER TABLE .. RENAME using V1 catalog V1 command: newName *** FAILED 
*** (61 milliseconds)
[info]   org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: 
Database 'CaseUpperCaseLower' not found
[info]   at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:42)
[info]   at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists$(ExternalCatalog.scala:40)
[info]   at 
org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.requireDbExists(InMemoryCatalog.scala:47)
[info]   at 
org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.getDatabase(InMemoryCatalog.scala:171)
[info]   at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getDatabase(ExternalCatalogWithListener.scala:65)
[info]   at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateNewLocationOfRename(SessionCatalog.scala:1863)
[info]   at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.renameTable(SessionCatalog.scala:739)
[info]   at 
org.apache.spark.sql.execution.command.AlterTableRenameCommand.run(tables.scala:209
[info]   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
[info]   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
[info]   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
[info]   at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
[info]   at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
[info]   at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
[info]   at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
[info]   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
[info]   at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
[info]   at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
[info]   at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:491)
[info]   at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:491)
[info]   at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
[info]   at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
[info]   at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
[info]   at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
[info]   at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:467)
[info]   at 
org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
[info]   at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
[info]   at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
[info]   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:220)
[info]   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
[info]   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
[info]   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
[info]   at 
org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:622)
[info]   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
[info]   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
[info]   at 
org.apache.spark.sql.test.SQLTestUtilsBase.$anonfun$sql$1(SQLTestUtils.scala:232)
[info]   at 
org.apache.spark.sql.execution.command.AlterTableRenameSuiteBase.$anonfun$$init$$19(AlterTableRenameSuiteBase.scala:143)
[info]   at 
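// The trace above shows the new name's database part being looked up
// verbatim ("CaseUpperCaseLower"), so the existence check misses the
// catalog's case-normalized entry. A hypothetical sketch of the idea,
// not Spark's actual internals: format the identifier the same way the
// catalog stores it before validating the new location.
import java.util.Locale

def formatName(name: String, caseSensitive: Boolean = false): String =
  if (caseSensitive) name else name.toLowerCase(Locale.ROOT)

def validateNewName(existingDbs: Set[String], newDb: String): Unit = {
  val db = formatName(newDb)
  require(existingDbs.contains(db), s"Database '$db' not found")
}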

[jira] [Commented] (SPARK-38575) Deduplicate branch specification in GitHub Actions workflow

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508096#comment-17508096
 ] 

Apache Spark commented on SPARK-38575:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35892

> Deduplicate branch specification in GitHub Actions workflow
> ---
>
> Key: SPARK-38575
> URL: https://issues.apache.org/jira/browse/SPARK-38575
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently we have to make some changes every time we cut a branch, as in 
> https://github.com/apache/spark/pull/35876. Ideally this should work 
> automatically without such manual changes.






[jira] [Resolved] (SPARK-38586) Trigger notifying workflow in branch-3.3 and other future branches

2022-03-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38586.
--
Fix Version/s: 3.3.0
 Assignee: Hyukjin Kwon
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/35891

> Trigger notifying workflow in branch-3.3 and other future branches
> --
>
> Key: SPARK-38586
> URL: https://issues.apache.org/jira/browse/SPARK-38586
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.3.0
>
>
> See https://github.com/apache/spark/pull/35891






[jira] [Assigned] (SPARK-38586) Trigger notifying workflow in branch-3.3 and other future branches

2022-03-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38586:


Assignee: (was: Apache Spark)

> Trigger notifying workflow in branch-3.3 and other future branches
> --
>
> Key: SPARK-38586
> URL: https://issues.apache.org/jira/browse/SPARK-38586
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See https://github.com/apache/spark/pull/35891






[jira] [Assigned] (SPARK-38586) Trigger notifying workflow in branch-3.3 and other future branches

2022-03-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38586:


Assignee: Apache Spark

> Trigger notifying workflow in branch-3.3 and other future branches
> --
>
> Key: SPARK-38586
> URL: https://issues.apache.org/jira/browse/SPARK-38586
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> See https://github.com/apache/spark/pull/35891






[jira] [Commented] (SPARK-38586) Trigger notifying workflow in branch-3.3 and other future branches

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508087#comment-17508087
 ] 

Apache Spark commented on SPARK-38586:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35891

> Trigger notifying workflow in branch-3.3 and other future branches
> --
>
> Key: SPARK-38586
> URL: https://issues.apache.org/jira/browse/SPARK-38586
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See https://github.com/apache/spark/pull/35891






[jira] [Created] (SPARK-38586) Trigger notifying workflow in branch-3.3 and other future branches

2022-03-17 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-38586:


 Summary: Trigger notifying workflow in branch-3.3 and other future 
branches
 Key: SPARK-38586
 URL: https://issues.apache.org/jira/browse/SPARK-38586
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


See https://github.com/apache/spark/pull/35891






[jira] [Commented] (SPARK-38575) Deduplicate branch specification in GitHub Actions workflow

2022-03-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508085#comment-17508085
 ] 

Apache Spark commented on SPARK-38575:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35891

> Deduplicate branch specification in GitHub Actions workflow
> ---
>
> Key: SPARK-38575
> URL: https://issues.apache.org/jira/browse/SPARK-38575
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently we have to make some changes every time we cut a branch, as in 
> https://github.com/apache/spark/pull/35876. Ideally this should work 
> automatically without such manual changes.






[jira] [Assigned] (SPARK-38566) Revert the parser changes for DEFAULT column support

2022-03-17 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-38566:


Assignee: Max Gekk

> Revert the parser changes for DEFAULT column support
> 
>
> Key: SPARK-38566
> URL: https://issues.apache.org/jira/browse/SPARK-38566
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Blocker
>
> Revert the commit 
> https://github.com/apache/spark/commit/e21cb62d02c85a66771822cdd49c49dbb3e44502
>  from branch-3.3.






[jira] [Resolved] (SPARK-38566) Revert the parser changes for DEFAULT column support

2022-03-17 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38566.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35885
[https://github.com/apache/spark/pull/35885]

> Revert the parser changes for DEFAULT column support
> 
>
> Key: SPARK-38566
> URL: https://issues.apache.org/jira/browse/SPARK-38566
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Blocker
> Fix For: 3.3.0
>
>
> Revert the commit 
> https://github.com/apache/spark/commit/e21cb62d02c85a66771822cdd49c49dbb3e44502
>  from branch-3.3.





