[jira] [Updated] (SPARK-48382) Add controller / reconciler module to operator
[ https://issues.apache.org/jira/browse/SPARK-48382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48382:
-----------------------------------
    Labels: pull-request-available  (was: )

> Add controller / reconciler module to operator
> ----------------------------------------------
>
> Key: SPARK-48382
> URL: https://issues.apache.org/jira/browse/SPARK-48382
> Project: Spark
> Issue Type: Sub-task
> Components: k8s
> Affects Versions: kubernetes-operator-0.1.0
> Reporter: Zhou JIANG
> Priority: Major
> Labels: pull-request-available
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-48397) Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
[ https://issues.apache.org/jira/browse/SPARK-48397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848824#comment-17848824 ]

Eric Yang edited comment on SPARK-48397 at 5/23/24 6:38 AM:
------------------------------------------------------------
The PR: https://github.com/apache/spark/pull/46714

was (Author: JIRAUSER304132):
I'm working on a PR for it.

> Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
> ----------------------------------------------------------------------------
>
> Key: SPARK-48397
> URL: https://issues.apache.org/jira/browse/SPARK-48397
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Eric Yang
> Priority: Major
> Labels: pull-request-available
>
> For FileFormatDataWriter we currently record the "task commit time" and
> "job commit time" metrics in
> `org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker#metrics`.
> We should also record the time spent on the "data write" itself (together
> with the time spent producing records from the iterator), which is usually
> one of the major parts of a write operation's total duration. This helps
> identify bottlenecks and time skew, and aids general performance tuning.
[jira] [Updated] (SPARK-48397) Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
[ https://issues.apache.org/jira/browse/SPARK-48397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48397:
-----------------------------------
    Labels: pull-request-available  (was: )

> Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
> ----------------------------------------------------------------------------
>
> Key: SPARK-48397
> URL: https://issues.apache.org/jira/browse/SPARK-48397
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Eric Yang
> Priority: Major
> Labels: pull-request-available
>
> For FileFormatDataWriter we currently record the "task commit time" and
> "job commit time" metrics in
> `org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker#metrics`.
> We should also record the time spent on the "data write" itself (together
> with the time spent producing records from the iterator), which is usually
> one of the major parts of a write operation's total duration. This helps
> identify bottlenecks and time skew, and aids general performance tuning.
[jira] [Updated] (SPARK-48396) Support configuring limit control for SQL to use maximum cores
[ https://issues.apache.org/jira/browse/SPARK-48396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48396:
-----------------------------------
    Labels: pull-request-available  (was: )

> Support configuring limit control for SQL to use maximum cores
> --------------------------------------------------------------
>
> Key: SPARK-48396
> URL: https://issues.apache.org/jira/browse/SPARK-48396
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.1
> Reporter: Mars
> Priority: Major
> Labels: pull-request-available
>
> On a long-running shared Spark SQL cluster, a single large SQL query can
> occupy all of the cluster's cores and delay the execution of other queries.
> It would therefore be useful to have a configuration that limits the
> maximum number of cores a single query can use.
>
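The issue above asks for a per-query core cap in a shared cluster. A minimal sketch of the scheduling rule such a config would impose (the name `max_cores_per_query` is hypothetical, this is not an existing Spark configuration or Spark's scheduler code):

```python
def grant_cores(free_cores: int, running_cores: int, max_cores_per_query: int) -> int:
    """Return how many additional cores a query may take.

    free_cores:          idle cores in the shared cluster
    running_cores:       cores the query already holds
    max_cores_per_query: hypothetical per-query cap (the config this
                         issue proposes); <= 0 means "no limit"
    """
    if max_cores_per_query <= 0:
        return free_cores                     # uncapped: take whatever is free
    headroom = max(0, max_cores_per_query - running_cores)
    return min(free_cores, headroom)          # never exceed the cap

# A large query cannot starve the cluster: with 100 free cores and a
# cap of 32, it is granted at most 32, leaving 68 for other queries.
print(grant_cores(100, 0, 32))   # 32
print(grant_cores(100, 30, 32))  # 2
print(grant_cores(100, 0, 0))    # 100 (no limit configured)
```

With the cap in place, the remaining cores stay available to other concurrently running queries, which is the behavior the reporter is asking for.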
[jira] [Commented] (SPARK-48397) Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
[ https://issues.apache.org/jira/browse/SPARK-48397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848824#comment-17848824 ]

Eric Yang commented on SPARK-48397:
-----------------------------------
I'm working on a PR for it.

> Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
> ----------------------------------------------------------------------------
>
> Key: SPARK-48397
> URL: https://issues.apache.org/jira/browse/SPARK-48397
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Eric Yang
> Priority: Major
>
> For FileFormatDataWriter we currently record the "task commit time" and
> "job commit time" metrics in
> `org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker#metrics`.
> We should also record the time spent on the "data write" itself (together
> with the time spent producing records from the iterator), which is usually
> one of the major parts of a write operation's total duration. This helps
> identify bottlenecks and time skew, and aids general performance tuning.
[jira] [Created] (SPARK-48398) Add Helm chart for Operator Deployment
Zhou JIANG created SPARK-48398:
-------------------------------

Summary: Add Helm chart for Operator Deployment
Key: SPARK-48398
URL: https://issues.apache.org/jira/browse/SPARK-48398
Project: Spark
Issue Type: Sub-task
Components: k8s
Affects Versions: kubernetes-operator-0.1.0
Reporter: Zhou JIANG
[jira] [Created] (SPARK-48397) Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
Eric Yang created SPARK-48397:
------------------------------

Summary: Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
Key: SPARK-48397
URL: https://issues.apache.org/jira/browse/SPARK-48397
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Eric Yang

For FileFormatDataWriter we currently record the "task commit time" and "job commit time" metrics in `org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker#metrics`. We should also record the time spent on the "data write" itself (together with the time spent producing records from the iterator), which is usually one of the major parts of a write operation's total duration. This helps identify bottlenecks and time skew, and aids general performance tuning.
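The metric proposed above is essentially a timer wrapped around each row handed to the underlying file writer. A toy sketch of that idea (this is plain Python, not Spark's `WriteTaskStatsTracker` API; the class and field names are illustrative):

```python
import time

class ToyWriteStats:
    """Illustrative stats tracker: accumulates the nanoseconds spent
    handing rows to the underlying writer, analogous to the proposed
    "data write time" metric alongside task/job commit time."""

    def __init__(self):
        self.data_write_time_ns = 0
        self.rows_written = 0

    def timed_write(self, writer, row):
        start = time.monotonic_ns()
        writer(row)                   # the actual data write
        self.data_write_time_ns += time.monotonic_ns() - start
        self.rows_written += 1

out = []
stats = ToyWriteStats()
for row in [(1, "a"), (2, "b"), (3, "c")]:
    stats.timed_write(out.append, row)

print(stats.rows_written)  # 3
```

Aggregated per task, such a counter would expose where a slow write (as opposed to a slow commit) dominates a job's duration, which is exactly the skew the issue wants to surface.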
[jira] [Updated] (SPARK-48396) Support configuring limit control for SQL to use maximum cores
[ https://issues.apache.org/jira/browse/SPARK-48396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mars updated SPARK-48396:
-------------------------
Description:

On a long-running shared Spark SQL cluster, a single large SQL query can occupy all of the cluster's cores and delay the execution of other queries. It would therefore be useful to have a configuration that limits the maximum number of cores a single query can use.

> Support configuring limit control for SQL to use maximum cores
> --------------------------------------------------------------
>
> Key: SPARK-48396
> URL: https://issues.apache.org/jira/browse/SPARK-48396
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.1
> Reporter: Mars
> Priority: Major
>
> On a long-running shared Spark SQL cluster, a single large SQL query can
> occupy all of the cluster's cores and delay the execution of other queries.
> It would therefore be useful to have a configuration that limits the
> maximum number of cores a single query can use.
>
[jira] [Created] (SPARK-48396) Support configuring limit control for SQL to use maximum cores
Mars created SPARK-48396:
-------------------------

Summary: Support configuring limit control for SQL to use maximum cores
Key: SPARK-48396
URL: https://issues.apache.org/jira/browse/SPARK-48396
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.5.1
Reporter: Mars
[jira] [Resolved] (SPARK-48370) Checkpoint and localCheckpoint in Scala Spark Connect client
[ https://issues.apache.org/jira/browse/SPARK-48370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-48370.
----------------------------------
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46683
[https://github.com/apache/spark/pull/46683]

> Checkpoint and localCheckpoint in Scala Spark Connect client
> ------------------------------------------------------------
>
> Key: SPARK-48370
> URL: https://issues.apache.org/jira/browse/SPARK-48370
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> SPARK-48258 implemented checkpoint and localCheckpoint in the Python Spark
> Connect client. We should do the same in Scala.
[jira] [Assigned] (SPARK-48370) Checkpoint and localCheckpoint in Scala Spark Connect client
[ https://issues.apache.org/jira/browse/SPARK-48370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-48370:
------------------------------------
Assignee: Hyukjin Kwon

> Checkpoint and localCheckpoint in Scala Spark Connect client
> ------------------------------------------------------------
>
> Key: SPARK-48370
> URL: https://issues.apache.org/jira/browse/SPARK-48370
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
>
> SPARK-48258 implemented checkpoint and localCheckpoint in the Python Spark
> Connect client. We should do the same in Scala.
[jira] [Updated] (SPARK-48391) use addAll instead of add function in TaskMetrics to accelerate
[ https://issues.apache.org/jira/browse/SPARK-48391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48391:
-----------------------------------
    Labels: pull-request-available  (was: )

> use addAll instead of add function in TaskMetrics to accelerate
> ---------------------------------------------------------------
>
> Key: SPARK-48391
> URL: https://issues.apache.org/jira/browse/SPARK-48391
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.5.0, 3.5.1
> Reporter: jiahong.li
> Priority: Major
> Labels: pull-request-available
>
> In the fromAccumulators method of TaskMetrics, we should use
> `tm._externalAccums.addAll` instead of `tm._externalAccums.add`, as
> `_externalAccums` is an instance of CopyOnWriteArrayList.
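The issue above is about `java.util.concurrent.CopyOnWriteArrayList`, where every mutation copies the entire backing array; N single `add` calls therefore copy O(N²) elements in total, while one `addAll` copies the snapshot once. A toy Python model of that cost difference (not Spark or JDK code; the counter just tallies elements copied):

```python
class CopyOnWriteList:
    """Toy model of java.util.concurrent.CopyOnWriteArrayList:
    every mutation replaces the backing array with a fresh copy."""

    def __init__(self):
        self._arr = ()              # immutable snapshot
        self.elements_copied = 0    # bookkeeping for the demo

    def add(self, x):
        self.elements_copied += len(self._arr)  # copies the whole snapshot
        self._arr = self._arr + (x,)

    def add_all(self, xs):
        self.elements_copied += len(self._arr)  # one copy, then bulk append
        self._arr = self._arr + tuple(xs)

    def to_list(self):
        return list(self._arr)

src = list(range(1000))

one_by_one = CopyOnWriteList()
for x in src:
    one_by_one.add(x)          # 0 + 1 + ... + 999 elements copied

bulk = CopyOnWriteList()
bulk.add_all(src)              # a single copy of the (empty) snapshot

assert one_by_one.to_list() == bulk.to_list()
print(one_by_one.elements_copied)  # 499500
print(bulk.elements_copied)        # 0
```

Both paths produce the same list, but the per-element `add` loop does roughly N²/2 element copies, which is why switching `fromAccumulators` to a single bulk `addAll` accelerates it.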
[jira] [Resolved] (SPARK-48387) Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE
[ https://issues.apache.org/jira/browse/SPARK-48387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao resolved SPARK-48387.
------------------------------
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46701
[https://github.com/apache/spark/pull/46701]

> Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE
> -------------------------------------------------------
>
> Key: SPARK-48387
> URL: https://issues.apache.org/jira/browse/SPARK-48387
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
[jira] [Resolved] (SPARK-48386) Replace JVM assert with JUnit Assert in tests
[ https://issues.apache.org/jira/browse/SPARK-48386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Jie resolved SPARK-48386.
------------------------------
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46698
[https://github.com/apache/spark/pull/46698]

> Replace JVM assert with JUnit Assert in tests
> ---------------------------------------------
>
> Key: SPARK-48386
> URL: https://issues.apache.org/jira/browse/SPARK-48386
> Project: Spark
> Issue Type: Improvement
> Components: Tests
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Assignee: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
>
[jira] [Assigned] (SPARK-48387) Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE
[ https://issues.apache.org/jira/browse/SPARK-48387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao reassigned SPARK-48387:
--------------------------------
Assignee: Kent Yao

> Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE
> -------------------------------------------------------
>
> Key: SPARK-48387
> URL: https://issues.apache.org/jira/browse/SPARK-48387
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Labels: pull-request-available
>
[jira] [Updated] (SPARK-48394) Cleanup mapIdToMapIndex on mapoutput unregister
[ https://issues.apache.org/jira/browse/SPARK-48394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48394:
-----------------------------------
    Labels: pull-request-available  (was: )

> Cleanup mapIdToMapIndex on mapoutput unregister
> -----------------------------------------------
>
> Key: SPARK-48394
> URL: https://issues.apache.org/jira/browse/SPARK-48394
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.5.0, 4.0.0, 3.5.1
> Reporter: wuyi
> Priority: Major
> Labels: pull-request-available
>
> There is only one valid map status for the same {{mapIndex}} at any given
> time in Spark. {{mapIdToMapIndex}} should follow the same rule to avoid
> inconsistencies.
[jira] [Created] (SPARK-48394) Cleanup mapIdToMapIndex on mapoutput unregister
wuyi created SPARK-48394:
-------------------------

Summary: Cleanup mapIdToMapIndex on mapoutput unregister
Key: SPARK-48394
URL: https://issues.apache.org/jira/browse/SPARK-48394
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.5.1, 3.5.0, 4.0.0
Reporter: wuyi

There is only one valid map status for the same {{mapIndex}} at any given time in Spark. {{mapIdToMapIndex}} should follow the same rule to avoid inconsistencies.
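The invariant described above can be sketched in a few lines: at most one valid status per `mapIndex`, with the `mapId -> mapIndex` lookup table kept in sync whenever a status is replaced or unregistered. This is a toy illustration of the bookkeeping, not Spark's `MapOutputTracker` code; class and field names are hypothetical:

```python
class ToyMapOutputBook:
    """Illustrative bookkeeping: at most one valid map status per
    mapIndex, mirrored by map_id_to_map_index."""

    def __init__(self, num_maps):
        self.statuses = [None] * num_maps
        self.map_id_to_map_index = {}

    def register(self, map_id, map_index, status):
        old = self.statuses[map_index]
        if old is not None:
            # A retry replaces the old status: drop the stale mapId
            # entry too (the cleanup this issue proposes).
            self.map_id_to_map_index.pop(old["map_id"], None)
        self.statuses[map_index] = {"map_id": map_id, "status": status}
        self.map_id_to_map_index[map_id] = map_index

    def unregister(self, map_index):
        old = self.statuses[map_index]
        if old is not None:
            self.map_id_to_map_index.pop(old["map_id"], None)
        self.statuses[map_index] = None

book = ToyMapOutputBook(2)
book.register(map_id=10, map_index=0, status="exec-1")
book.register(map_id=11, map_index=0, status="exec-2")  # retry of index 0

# Only the newest mapId resolves; the stale one no longer points anywhere.
print(book.map_id_to_map_index)  # {11: 0}
book.unregister(0)
print(book.map_id_to_map_index)  # {}
```

Without the `pop` calls, a stale `mapId` would keep resolving to a `mapIndex` now owned by a newer attempt, which is the kind of inconsistency the issue wants to rule out.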
[jira] [Commented] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored
[ https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848787#comment-17848787 ]

Bruce Robbins commented on SPARK-48361:
---------------------------------------
Did you mean the following?
{noformat}
val dfDropped = dfWithJagged.filter(col("__is_jagged") =!= true)
{noformat}
Either way (with `=== true` or `=!= true`), a bug of some sort is revealed. With `=== true`, the grouping produces an empty result (it shouldn't). With `=!= true`, the grouping includes `8, 9` (it shouldn't, as you mentioned).

In fact, for both cases, if you persist {{dfWithJagged}}, you get the right answer.

> Correctness: CSV corrupt record filter with aggregate ignored
> -------------------------------------------------------------
>
> Key: SPARK-48361
> URL: https://issues.apache.org/jira/browse/SPARK-48361
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.1
> Environment: Using spark shell 3.5.1 on M1 Mac
> Reporter: Ted Chester Jenks
> Priority: Major
>
> Using the corrupt record column in CSV parsing for some data-cleaning logic,
> I came across a correctness bug.
>
> The following repro can be run with spark-shell 3.5.1.
> *Create test.csv with the following content:*
> {code:java}
> test,1,2,three
> four,5,6,seven
> 8,9
> ten,11,12,thirteen
> {code}
>
> *In spark-shell:*
> {code:java}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
>
> // define a STRING, DOUBLE, DOUBLE, STRING schema for the data
> val schema = StructType(List(StructField("column1", StringType, true), StructField("column2", DoubleType, true), StructField("column3", DoubleType, true), StructField("column4", StringType, true)))
>
> // add a column for corrupt records to the schema
> val schemaWithCorrupt = StructType(schema.fields :+ StructField("_corrupt_record", StringType, true))
>
> // read the CSV with the schema, headers, permissive parsing, and the corrupt record column
> val df = spark.read.option("header", "true").option("mode", "PERMISSIVE").option("columnNameOfCorruptRecord", "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv")
>
> // define a UDF to count the commas in the corrupt record column
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else -1)
>
> // add a true/false column for whether the number of commas is 3
> val dfWithJagged = df.withColumn("__is_jagged", when(col("_corrupt_record").isNull, false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
>
> dfWithJagged.show()
> {code}
> *Returns:*
> {code:java}
> +-------+-------+-------+--------+---------------+-----------+
> |column1|column2|column3| column4|_corrupt_record|__is_jagged|
> +-------+-------+-------+--------+---------------+-----------+
> |   four|    5.0|    6.0|   seven|           NULL|      false|
> |      8|    9.0|   NULL|    NULL|            8,9|       true|
> |    ten|   11.0|   12.0|thirteen|           NULL|      false|
> +-------+-------+-------+--------+---------------+-----------+
> {code}
> So far so good...
>
> *BUT*
>
> *If we add an aggregate before we show:*
> {code:java}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
>
> val schema = StructType(List(StructField("column1", StringType, true), StructField("column2", DoubleType, true), StructField("column3", DoubleType, true), StructField("column4", StringType, true)))
> val schemaWithCorrupt = StructType(schema.fields :+ StructField("_corrupt_record", StringType, true))
> val df = spark.read.option("header", "true").option("mode", "PERMISSIVE").option("columnNameOfCorruptRecord", "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv")
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else -1)
> val dfWithJagged = df.withColumn("__is_jagged", when(col("_corrupt_record").isNull, false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
> val dfDropped = dfWithJagged.filter(col("__is_jagged") === true)
> val groupedSum = dfDropped.groupBy("column1").agg(sum("column2").alias("sum_column2"))
>
> groupedSum.show()
> {code}
> *We get:*
> {code:java}
> +-------+-----------+
> |column1|sum_column2|
> +-------+-----------+
> |      8|        9.0|
> |   four|        5.0|
> |    ten|       11.0|
> +-------+-----------+
> {code}
>
> *Which is not correct.*
>
> With the addition of the aggregate, the filter down to rows with 3 commas in
> the corrupt record column is ignored. This does not happen with any other
> operators I have tried - just aggregates so far.
>
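Independent of Spark, the correct result of the intended pipeline (drop rows whose raw line does not have exactly 3 commas, then group and sum) can be pinned down with a plain-Python rendering of the same logic. This is only a reference model of the semantics the reporter expects, not a Spark fix; per the report, the jagged `8,9` row must not survive into the aggregate:

```python
from collections import defaultdict

# Data rows from test.csv (the "test,1,2,three" line is consumed as the
# header). A row is "jagged" when its raw line does not contain exactly
# 3 commas.
raw_lines = ["four,5,6,seven", "8,9", "ten,11,12,thirteen"]

sums = defaultdict(float)
for line in raw_lines:
    if line.count(",") != 3:        # the __is_jagged filter
        continue                    # drop jagged rows before aggregating
    column1, column2, *_ = line.split(",")
    sums[column1] += float(column2)

# The aggregate must NOT include the jagged "8,9" row.
print(dict(sums))  # {'four': 5.0, 'ten': 11.0}
```

Comparing this reference output with the `8 | 9.0` row in the quoted `groupedSum.show()` makes the correctness violation concrete; the comment thread notes that persisting `dfWithJagged` before the aggregate yields this correct answer.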
[jira] [Assigned] (SPARK-48393) Move a group of constants to `pyspark.util`
[ https://issues.apache.org/jira/browse/SPARK-48393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-48393:
------------------------------------
Assignee: Ruifeng Zheng

> Move a group of constants to `pyspark.util`
> -------------------------------------------
>
> Key: SPARK-48393
> URL: https://issues.apache.org/jira/browse/SPARK-48393
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
>
[jira] [Resolved] (SPARK-48393) Move a group of constants to `pyspark.util`
[ https://issues.apache.org/jira/browse/SPARK-48393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-48393.
----------------------------------
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46710
[https://github.com/apache/spark/pull/46710]

> Move a group of constants to `pyspark.util`
> -------------------------------------------
>
> Key: SPARK-48393
> URL: https://issues.apache.org/jira/browse/SPARK-48393
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
[jira] [Updated] (SPARK-48393) Move a group of constants to `pyspark.util`
[ https://issues.apache.org/jira/browse/SPARK-48393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48393:
-----------------------------------
    Labels: pull-request-available  (was: )

> Move a group of constants to `pyspark.util`
> -------------------------------------------
>
> Key: SPARK-48393
> URL: https://issues.apache.org/jira/browse/SPARK-48393
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
>
[jira] [Created] (SPARK-48393) Move a group of constants to `pyspark.util`
Ruifeng Zheng created SPARK-48393:
----------------------------------

Summary: Move a group of constants to `pyspark.util`
Key: SPARK-48393
URL: https://issues.apache.org/jira/browse/SPARK-48393
Project: Spark
Issue Type: New Feature
Components: PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng
[jira] [Updated] (SPARK-48392) (Optionally) Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file`
[ https://issues.apache.org/jira/browse/SPARK-48392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48392:
-----------------------------------
    Labels: pull-request-available  (was: )

> (Optionally) Load `spark-defaults.conf` when passing configurations to
> `spark-submit` through `--properties-file`
> ----------------------------------------------------------------------
>
> Key: SPARK-48392
> URL: https://issues.apache.org/jira/browse/SPARK-48392
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.5.1
> Reporter: Chao Sun
> Priority: Major
> Labels: pull-request-available
>
> Currently, when users pass configurations to {{spark-submit.sh}} via
> {{--properties-file}}, {{spark-defaults.conf}} is completely ignored.
> This poses issues for some users, for instance those using [Spark on K8S
> operator from kubeflow|https://github.com/kubeflow/spark-operator].
> See related issues:
> - https://github.com/kubeflow/spark-operator/issues/1183
> - https://github.com/kubeflow/spark-operator/issues/1321
[jira] [Created] (SPARK-48392) (Optionally) Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file`
Chao Sun created SPARK-48392:
-----------------------------

Summary: (Optionally) Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file`
Key: SPARK-48392
URL: https://issues.apache.org/jira/browse/SPARK-48392
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.5.1
Reporter: Chao Sun

Currently, when users pass configurations to `spark-submit.sh` via `--properties-file`, `spark-defaults.conf` is completely ignored. This poses issues for some users, for instance those using [Spark on K8S operator from kubeflow|https://github.com/kubeflow/spark-operator]. See related issues:
- https://github.com/kubeflow/spark-operator/issues/1183
- https://github.com/kubeflow/spark-operator/issues/1321
[jira] [Updated] (SPARK-48392) (Optionally) Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file`
[ https://issues.apache.org/jira/browse/SPARK-48392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated SPARK-48392:
-----------------------------
Description:

Currently, when users pass configurations to {{spark-submit.sh}} via {{--properties-file}}, {{spark-defaults.conf}} is completely ignored. This poses issues for some users, for instance those using [Spark on K8S operator from kubeflow|https://github.com/kubeflow/spark-operator]. See related issues:
- https://github.com/kubeflow/spark-operator/issues/1183
- https://github.com/kubeflow/spark-operator/issues/1321

was:

Currently, when users pass configurations to `spark-submit.sh` via `--properties-file`, `spark-defaults.conf` is completely ignored. This poses issues for some users, for instance those using [Spark on K8S operator from kubeflow|https://github.com/kubeflow/spark-operator]. See related issues:
- https://github.com/kubeflow/spark-operator/issues/1183
- https://github.com/kubeflow/spark-operator/issues/1321

> (Optionally) Load `spark-defaults.conf` when passing configurations to
> `spark-submit` through `--properties-file`
> ----------------------------------------------------------------------
>
> Key: SPARK-48392
> URL: https://issues.apache.org/jira/browse/SPARK-48392
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.5.1
> Reporter: Chao Sun
> Priority: Major
>
> Currently, when users pass configurations to {{spark-submit.sh}} via
> {{--properties-file}}, {{spark-defaults.conf}} is completely ignored.
> This poses issues for some users, for instance those using [Spark on K8S
> operator from kubeflow|https://github.com/kubeflow/spark-operator].
> See related issues:
> - https://github.com/kubeflow/spark-operator/issues/1183
> - https://github.com/kubeflow/spark-operator/issues/1321
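The behavior the issue above proposes is a merge: load `spark-defaults.conf` first, then overlay the `--properties-file` entries so explicit properties win, instead of ignoring the defaults entirely. A hedged sketch of that merge order in plain Python (the file contents and merge rule are illustrations of the proposal, not current `spark-submit` behavior):

```python
def parse_props(text: str) -> dict:
    """Parse a Java-properties-style file ('key value' or 'key=value'
    per line), the format spark-defaults.conf uses."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                              # skip blanks and comments
        key, _, value = line.replace("=", " ", 1).partition(" ")
        props[key] = value.strip()
    return props

defaults = parse_props("""
# spark-defaults.conf (illustrative contents)
spark.master k8s://https://example:6443
spark.eventLog.enabled true
""")

user = parse_props("""
# file passed via --properties-file (illustrative contents)
spark.eventLog.enabled false
spark.executor.memory 4g
""")

# Proposed behavior: defaults loaded first, user properties override.
merged = {**defaults, **user}
print(merged["spark.eventLog.enabled"])  # false
print(merged["spark.master"])            # k8s://https://example:6443
```

Under this scheme an operator-generated properties file no longer silently discards cluster-wide settings such as `spark.master`, which is the pain point in the linked kubeflow issues.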
[jira] [Updated] (SPARK-48275) array_sort: Improve documentation for default comparator's behavior for different types
[ https://issues.apache.org/jira/browse/SPARK-48275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Braymer-Hayes updated SPARK-48275:
---------------------------------------
Description:

h1. tl;dr

It would be helpful for the documentation for array_sort() to include the default comparator behavior for different array element types, especially structs. It would also be helpful for the {{DATATYPE_MISMATCH.INVALID_ORDERING_TYPE}} error to recommend using a custom comparator instead of the default comparator, especially when sorting on a complex type (e.g., a struct containing an unorderable field, like a map).

h1. Background

The default comparator for {{array_sort()}} on struct elements sorts by every field in the struct in schema order (i.e., ORDER BY field1, field2, ..., fieldN). This requires every field to be orderable: if any isn't, an error occurs.

Here's a small example:
{code:java}
import pyspark.sql.functions as F
import pyspark.sql.types as T

schema = T.StructType([
    T.StructField(
        'value',
        T.ArrayType(
            T.StructType([
                T.StructField('orderable', T.IntegerType(), True),
                T.StructField('unorderable', T.MapType(T.StringType(), T.StringType(), True), True),  # remove this field and both commands below succeed
            ]),
            False
        ),
        False
    )
])

df = spark.createDataFrame([], schema=schema)
df.select(F.array_sort(df['value'])).collect()
{code}
Output:
{code:java}
[DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve "(namedlambdavariable() < namedlambdavariable())" due to data type mismatch: The `<` does not support ordering on type "STRUCT>". SQLSTATE: 42K09
{code}
If the default comparator doesn't work for a user (e.g., they have an unorderable field like a map in their struct), array_sort() accepts a custom comparator, so users can order array elements however they like. Building on the previous example:
{code:java}
import pyspark.sql as psql

def comparator(l: psql.Column, r: psql.Column) -> psql.Column:
    """Order structs l and r by order field.

    Rules:
    * Nulls are last
    * In ascending order
    """
    return (
        F.when(l['order'].isNull() & r['order'].isNull(), 0)
        .when(l['order'].isNull(), 1)
        .when(r['order'].isNull(), -1)
        .when(l['order'] < r['order'], -1)
        .when(l['order'] == r['order'], 0)
        .otherwise(1)
    )

df.select(F.array_sort(df['value'], comparator)).collect()
{code}
This works as intended.

h1. Ask

The documentation for {{array_sort()}} should include information on the behavior of the default comparator for various data types. For the array-of-unorderable-structs example, it would be helpful to know that the default comparator for structs compares all fields in schema order (i.e., {{ORDER BY field1, field2, ..., fieldN}}). Additionally, when a user passes an unorderable type to array_sort() and uses the default comparator, the returned error should recommend using a custom comparator instead.

was:

h1. tl;dr

It would be helpful for the documentation for array_sort() to include the default comparator behavior for different array element types, especially structs. It would also be helpful for the {{DATATYPE_MISMATCH.INVALID_ORDERING_TYPE}} error to recommend using a custom comparator instead of the default comparator, especially when sorting on a complex type (e.g., a struct containing an unorderable field, like a map).

h1. Background

The default comparator for {{array_sort()}} on struct elements sorts by every field in the struct in schema order (i.e., ORDER BY field1, field2, ..., fieldN). This requires every field to be orderable: if any isn't, an error occurs.

Here's a small example:
{code:java}
import pyspark.sql.functions as F
import pyspark.sql.types as T

schema = T.StructType([
    T.StructField(
        'value',
        T.ArrayType(
            T.StructType([
                T.StructField('orderable', T.IntegerType(), True),
                T.StructField('unorderable', T.MapType(T.StringType(), T.StringType(), True), True),  # remove this field and both commands below succeed
            ]),
            False
        ),
        False
    )
])

df = spark.createDataFrame([], schema=schema)
df.select(F.array_sort(df['value'])).collect()
{code}
Output:
{code:java}
[DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve "(namedlambdavariable() < namedlambdavariable())" due to data type mismatch: The `<` does not support ordering on type "STRUCT>". SQLSTATE: 42K09
{code}
If the default comparator doesn't work for a user (e.g., they have an unorderable field like a map in their struct), array_sort() accepts a custom comparator, so users can order array elements however they like. Building on the previous example:
{code:java}
import pyspark.sql as psql d
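The comparator rules quoted in the issue (nulls last, otherwise ascending by the `order` field) can be mirrored in plain Python with `functools.cmp_to_key`, which makes the intended ordering easy to check without a Spark session. This is an illustrative analogue, not the PySpark API:

```python
from functools import cmp_to_key

def comparator(l: dict, r: dict) -> int:
    """Same rules as the PySpark comparator in the issue:
    nulls last, otherwise ascending by the 'order' field."""
    lo, ro = l.get("order"), r.get("order")
    if lo is None and ro is None:
        return 0
    if lo is None:
        return 1          # null sorts after any value
    if ro is None:
        return -1
    return (lo > ro) - (lo < ro)  # -1 / 0 / 1, ascending

rows = [{"order": 3}, {"order": None}, {"order": 1}, {"order": 2}]
print(sorted(rows, key=cmp_to_key(comparator)))
# [{'order': 1}, {'order': 2}, {'order': 3}, {'order': None}]
```

The three-way return value (-1, 0, 1) is the same contract the PySpark comparator expresses with chained `F.when` clauses, which is why a custom comparator sidesteps the unorderable-map problem: only the fields the comparator touches need to be orderable.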
[jira] [Updated] (SPARK-48386) Replace JVM assert with JUnit Assert in tests
[ https://issues.apache.org/jira/browse/SPARK-48386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48386: --- Labels: pull-request-available (was: ) > Replace JVM assert with JUnit Assert in tests > - > > Key: SPARK-48386 > URL: https://issues.apache.org/jira/browse/SPARK-48386 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored
[ https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Chester Jenks updated SPARK-48361: -- Description: Using the corrupt record column in CSV parsing for some data cleaning logic, I came across a correctness bug. The following repro can be run with spark-shell 3.5.1.
*Create test.csv with the following content:*
{code:java}
test,1,2,three
four,5,6,seven
8,9
ten,11,12,thirteen {code}
*In spark-shell:*
{code:java}
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

// define a STRING, DOUBLE, DOUBLE, STRING schema for the data
val schema = StructType(List(StructField("column1", StringType, true), StructField("column2", DoubleType, true), StructField("column3", DoubleType, true), StructField("column4", StringType, true)))

// add a column for corrupt records to the schema
val schemaWithCorrupt = StructType(schema.fields :+ StructField("_corrupt_record", StringType, true))

// read the CSV with the schema, headers, permissive parsing, and the corrupt record column
val df = spark.read.option("header", "true").option("mode", "PERMISSIVE").option("columnNameOfCorruptRecord", "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv")

// define a UDF to count the commas in the corrupt record column
val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else -1)

// add a true/false column for whether the number of commas is 3
val dfWithJagged = df.withColumn("__is_jagged", when(col("_corrupt_record").isNull, false).otherwise(countCommas(col("_corrupt_record")) =!= 3))

dfWithJagged.show(){code}
*Returns:*
{code:java}
+-------+-------+-------+--------+---------------+-----------+
|column1|column2|column3| column4|_corrupt_record|__is_jagged|
+-------+-------+-------+--------+---------------+-----------+
|   four|    5.0|    6.0|   seven|           NULL|      false|
|      8|    9.0|   NULL|    NULL|            8,9|       true|
|    ten|   11.0|   12.0|thirteen|           NULL|      false|
+-------+-------+-------+--------+---------------+-----------+
{code}
So far so good...
*BUT*

*If we add an aggregate before we show:*
{code:java}
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val schema = StructType(List(StructField("column1", StringType, true), StructField("column2", DoubleType, true), StructField("column3", DoubleType, true), StructField("column4", StringType, true)))
val schemaWithCorrupt = StructType(schema.fields :+ StructField("_corrupt_record", StringType, true))
val df = spark.read.option("header", "true").option("mode", "PERMISSIVE").option("columnNameOfCorruptRecord", "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv")
val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else -1)
val dfWithJagged = df.withColumn("__is_jagged", when(col("_corrupt_record").isNull, false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
val dfDropped = dfWithJagged.filter(col("__is_jagged") === true)
val groupedSum = dfDropped.groupBy("column1").agg(sum("column2").alias("sum_column2"))

groupedSum.show(){code}
*We get:*
{code:java}
+-------+-----------+
|column1|sum_column2|
+-------+-----------+
|      8|        9.0|
|   four|        5.0|
|    ten|       11.0|
+-------+-----------+
{code}
*Which is not correct.*

With the addition of the aggregate, the filter down to rows with 3 commas in the corrupt record column is ignored. This does not happen with any other operators I have tried - just aggregates so far.

was:
Using the corrupt record column in CSV parsing for some data cleaning logic, I came across a correctness bug. The following repro can be run with spark-shell 3.5.1.
*Create test.csv with the following content:*
{code:java}
test,1,2,three
four,5,6,seven
8,9
ten,11,12,thirteen {code}
*In spark-shell:*
{code:java}
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

// define a STRING, DOUBLE, DOUBLE, STRING schema for the data
val schema = StructType(List(StructField("column1", StringType, true), StructField("column2", DoubleType, true), StructField("column3", DoubleType, true), StructField("column4", StringType, true)))

// add a column for corrupt records to the schema
val schemaWithCorrupt = StructType(schema.fields :+ StructField("_corrupt_record", StringType, true))

// read the CSV with the schema, headers, permissive parsing, and the corrupt record column
val df = spark.read.option("header", "true").option("mode", "PERMISSIVE").option("columnNameOfCorruptRecord", "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv")

// define a UDF to count the commas in the corrupt record column
val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else -1)

// add a true/false column for whether the number of commas is 3
val dfWithJagged = df.withColumn("__is_jagged", when(col("_corrupt_reco
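The jaggedness check in the repro above reduces to a comma count over the corrupt-record string. A minimal plain-Python sketch of the same predicate (no Spark required; `None` stands in for a NULL corrupt record):

```python
# A record is "jagged" when its corrupt-record string does not contain
# exactly 3 commas, i.e. the raw row does not have 4 fields.
def count_commas(s):
    # mirrors: udf((s: String) => if (s != null) s.count(_ == ',') else -1)
    return s.count(',') if s is not None else -1

def is_jagged(corrupt_record):
    # mirrors: when(col("_corrupt_record").isNull, false)
    #          .otherwise(countCommas(col("_corrupt_record")) =!= 3)
    if corrupt_record is None:
        return False
    return count_commas(corrupt_record) != 3

print(is_jagged(None))   # False (row parsed cleanly, no corrupt record)
print(is_jagged("8,9"))  # True  (only 2 commas, so fewer than 4 fields)
```

The correctness question in this ticket is not about this predicate itself, but about the filter on it being dropped once an aggregate follows it.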
[jira] [Commented] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored
[ https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848696#comment-17848696 ] Bruce Robbins commented on SPARK-48361: --- `8,9` is still present before the aggregate:
{noformat}
scala> dfWithJagged.show(false)
24/05/22 10:33:24 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: test, 1, 2, three
 Schema: column1, column2, column3, column4
Expected: column1 but found: test
CSV file: file:///Users/bruce/github/spark_up3.5.1/test.csv
+-------+-------+-------+--------+---------------+-----------+
|column1|column2|column3|column4 |_corrupt_record|__is_jagged|
+-------+-------+-------+--------+---------------+-----------+
|four   |5.0    |6.0    |seven   |NULL           |false      |
|8      |9.0    |NULL   |NULL    |8,9            |true       |
|ten    |11.0   |12.0   |thirteen|NULL           |false      |
+-------+-------+-------+--------+---------------+-----------+

scala> sql("select version()").collect
res6: Array[org.apache.spark.sql.Row] = Array([3.5.1 fd86f85e181fc2dc0f50a096855acf83a6cc5d9c])

scala> {noformat}
Which piece of code filters out `8,9`? I couldn't find the filter in your example, but again I may be missing something.
> Correctness: CSV corrupt record filter with aggregate ignored
> -
>
> Key: SPARK-48361
> URL: https://issues.apache.org/jira/browse/SPARK-48361
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.1
> Environment: Using spark shell 3.5.1 on M1 Mac
> Reporter: Ted Chester Jenks
> Priority: Major
>
> Using the corrupt record column in CSV parsing for some data cleaning logic, I came across a correctness bug.
>
> The following repro can be run with spark-shell 3.5.1.
[jira] [Commented] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored
[ https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848694#comment-17848694 ] Ted Chester Jenks commented on SPARK-48361: ---
{code:java}
+-------+-----------+
|column1|sum_column2|
+-------+-----------+
|   four|        5.0|
|    ten|       11.0|
+-------+-----------+
{code}
The row with `8,9` should be filtered out as it was before adding the aggregate.
> Correctness: CSV corrupt record filter with aggregate ignored
> -
>
> Key: SPARK-48361
> URL: https://issues.apache.org/jira/browse/SPARK-48361
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.1
> Environment: Using spark shell 3.5.1 on M1 Mac
> Reporter: Ted Chester Jenks
> Priority: Major
>
> Using the corrupt record column in CSV parsing for some data cleaning logic, I came across a correctness bug.
>
> The following repro can be run with spark-shell 3.5.1.
-- This message was sent by Atlassian Jira (v8.20.10#820010) --
[jira] [Commented] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored
[ https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848692#comment-17848692 ] Bruce Robbins commented on SPARK-48361: --- Sorry for being dense. What would the correct answer be?
> Correctness: CSV corrupt record filter with aggregate ignored
> -
>
> Key: SPARK-48361
> URL: https://issues.apache.org/jira/browse/SPARK-48361
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.1
> Environment: Using spark shell 3.5.1 on M1 Mac
> Reporter: Ted Chester Jenks
> Priority: Major
>
> Using the corrupt record column in CSV parsing for some data cleaning logic, I came across a correctness bug.
>
> The following repro can be run with spark-shell 3.5.1.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43043) Improve the performance of MapOutputTracker.updateMapOutput
[ https://issues.apache.org/jira/browse/SPARK-43043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-43043: --- Labels: pull-request-available (was: ) > Improve the performance of MapOutputTracker.updateMapOutput > --- > > Key: SPARK-43043 > URL: https://issues.apache.org/jira/browse/SPARK-43043 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.2 >Reporter: Xingbo Jiang >Assignee: Xingbo Jiang >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > Inside of MapOutputTracker, there is a line of code which does a linear find > through a mapStatuses collection: > https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L167 > (plus a similar search a few lines down at > https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L174) > This scan is necessary because we only know the mapId of the updated status > and not its mapPartitionId. > We perform this scan once per migrated block, so if a large proportion of all > blocks in the map are migrated then we get O(n^2) total runtime across all of > the calls. > I think we might be able to fix this by extending ShuffleStatus to have an > OpenHashMap mapping from mapId to mapPartitionId. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
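The fix proposed in the ticket above (an auxiliary map from mapId to mapPartitionId inside ShuffleStatus) can be sketched in plain Python. Class and method names here are illustrative only, not Spark's actual API:

```python
# Sketch: replace the per-update linear scan of mapStatuses with an O(1)
# dict lookup keyed by mapId, built once when statuses are registered.
class ShuffleStatusSketch:
    def __init__(self, map_statuses):
        # map_statuses: list where index == mapPartitionId and value == mapId
        self.map_statuses = map_statuses
        # auxiliary index: mapId -> mapPartitionId (the proposed OpenHashMap)
        self.map_id_to_partition = {m: i for i, m in enumerate(map_statuses)}

    def find_partition(self, map_id):
        # O(1) lookup, replacing the linear:
        #   next(i for i, m in enumerate(self.map_statuses) if m == map_id)
        return self.map_id_to_partition.get(map_id)

s = ShuffleStatusSketch([100, 101, 102])
print(s.find_partition(101))  # 1
```

With one lookup per migrated block, the total cost across all updates drops from O(n^2) to O(n), at the price of keeping the index consistent with mapStatuses on every mutation.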
[jira] [Updated] (SPARK-48391) use addAll instead of add function in TaskMetrics to accelerate
[ https://issues.apache.org/jira/browse/SPARK-48391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiahong.li updated SPARK-48391: --- Summary: use addAll instead of add function in TaskMetrics to accelerate (was: use addAll instead of add function in TaskMetrics ) > use addAll instead of add function in TaskMetrics to accelerate > - > > Key: SPARK-48391 > URL: https://issues.apache.org/jira/browse/SPARK-48391 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0, 3.5.1 >Reporter: jiahong.li >Priority: Major > > In the fromAccumulators method of TaskMetrics, we should use > `tm._externalAccums.addAll` instead of `tm._externalAccums.add`, as > _externalAccums is an instance of CopyOnWriteArrayList -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48391) use addAll instead of add function in TaskMetrics
jiahong.li created SPARK-48391: -- Summary: use addAll instead of add function in TaskMetrics Key: SPARK-48391 URL: https://issues.apache.org/jira/browse/SPARK-48391 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.5.1, 3.5.0 Reporter: jiahong.li In the fromAccumulators method of TaskMetrics, we should use `tm._externalAccums.addAll` instead of `tm._externalAccums.add`, as _externalAccums is an instance of CopyOnWriteArrayList -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
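The reason addAll helps: every mutation of a copy-on-write list copies the entire backing array, so n single adds copy 0 + 1 + ... + (n-1) elements (O(n^2)) while one addAll copies the existing contents once. A plain-Python simulation with a hypothetical CowList (not the real java.util.concurrent.CopyOnWriteArrayList) makes the difference visible:

```python
# Simulate copy-on-write semantics: each mutation rebuilds the backing
# snapshot, and we count how many existing elements get copied.
class CowList:
    def __init__(self):
        self.items = ()          # immutable snapshot, as in a COW list
        self.elements_copied = 0

    def add(self, x):
        self.elements_copied += len(self.items)  # whole array copied per add
        self.items = self.items + (x,)

    def add_all(self, xs):
        self.elements_copied += len(self.items)  # existing contents copied once
        self.items = self.items + tuple(xs)

one_by_one, bulk = CowList(), CowList()
for i in range(100):
    one_by_one.add(i)            # 0 + 1 + ... + 99 = 4950 elements copied
bulk.add_all(range(100))         # list was empty, so 0 elements copied
print(one_by_one.elements_copied, bulk.elements_copied)  # 4950 0
```

Both lists end up with identical contents; only the copying cost differs, which is the acceleration the ticket is after.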
[jira] [Created] (SPARK-48390) SparkListenerBus not sending tableName details in logical plan for spark versions 3.4.2 and above
Mayur Madnani created SPARK-48390: - Summary: SparkListenerBus not sending tableName details in logical plan for spark versions 3.4.2 and above Key: SPARK-48390 URL: https://issues.apache.org/jira/browse/SPARK-48390 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 3.4.3, 3.5.1, 3.5.0, 3.4.2, 3.5.2 Reporter: Mayur Madnani
In OpenLineage, a logical plan event is received via SparkEventListener, and by parsing it the framework deduces input/output table names to create a lineage. The issue is that in Spark versions 3.4.2 and above (tested and reproducible in 3.4.2 & 3.5.0) the logical plan event sent by Spark core is partial and is missing the tableName property that was sent in earlier versions (working in Spark 3.3.4).
+_Note: This issue is only encountered in drop table events._+
For a drop table event, see below the logical plan in different Spark versions.
*Spark 3.3.4*
{code:java}
[
  {
    "class": "org.apache.spark.sql.execution.command.DropTableCommand",
    "num-children": 0,
    "tableName": {
      "product-class": "org.apache.spark.sql.catalyst.TableIdentifier",
      "table": "drop_table_test",
      "database": "default"
    },
    "ifExists": false,
    "isView": false,
    "purge": false
  }
] {code}
*Spark 3.4.2*
{code:java}
[
  {
    "class": "org.apache.spark.sql.catalyst.plans.logical.DropTable",
    "num-children": 1,
    "child": 0,
    "ifExists": false,
    "purge": false
  },
  {
    "class": "org.apache.spark.sql.catalyst.analysis.ResolvedIdentifier",
    "num-children": 0,
    "catalog": null,
    "identifier": null
  }
] {code}
More details in referenced issue here: [https://github.com/OpenLineage/OpenLineage/issues/2716] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
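The parsing difference described above can be sketched with the two plan fragments from the report. The `dropped_table` helper below is hypothetical, not OpenLineage's actual code; it just shows why the 3.4.2 plan yields no table name:

```python
import json

# The 3.3.4 DropTableCommand node carries the table name inline; the 3.4.2
# DropTable node points at a ResolvedIdentifier whose identifier is null.
plan_3_3_4 = json.loads("""
[{"class": "org.apache.spark.sql.execution.command.DropTableCommand",
  "num-children": 0,
  "tableName": {"table": "drop_table_test", "database": "default"},
  "ifExists": false, "isView": false, "purge": false}]""")

plan_3_4_2 = json.loads("""
[{"class": "org.apache.spark.sql.catalyst.plans.logical.DropTable",
  "num-children": 1, "child": 0, "ifExists": false, "purge": false},
 {"class": "org.apache.spark.sql.catalyst.analysis.ResolvedIdentifier",
  "num-children": 0, "catalog": null, "identifier": null}]""")

def dropped_table(plan):
    # Walk the plan nodes looking for any recoverable table name.
    for node in plan:
        table = node.get("tableName")
        if table:
            return table["table"]
        ident = node.get("identifier")
        if ident:
            return ident
    return None

print(dropped_table(plan_3_3_4))  # drop_table_test
print(dropped_table(plan_3_4_2))  # None: nothing to recover from this plan
```

Since both tableName and identifier are absent or null in the newer plan, no amount of parsing on the listener side can recover the table name; the information has to be restored in the event Spark emits.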
[jira] [Assigned] (SPARK-48364) Type casting for AbstractMapType
[ https://issues.apache.org/jira/browse/SPARK-48364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-48364: --- Assignee: Uroš Bojanić > Type casting for AbstractMapType > > > Key: SPARK-48364 > URL: https://issues.apache.org/jira/browse/SPARK-48364 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Assignee: Uroš Bojanić >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48364) Type casting for AbstractMapType
[ https://issues.apache.org/jira/browse/SPARK-48364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48364. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46661 [https://github.com/apache/spark/pull/46661] > Type casting for AbstractMapType > > > Key: SPARK-48364 > URL: https://issues.apache.org/jira/browse/SPARK-48364 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Assignee: Uroš Bojanić >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48364) Type casting for AbstractMapType
[ https://issues.apache.org/jira/browse/SPARK-48364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48364: --- Labels: pull-request-available (was: ) > Type casting for AbstractMapType > > > Key: SPARK-48364 > URL: https://issues.apache.org/jira/browse/SPARK-48364 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48215) DateFormatClass (all collations)
[ https://issues.apache.org/jira/browse/SPARK-48215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48215. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46561 [https://github.com/apache/spark/pull/46561] > DateFormatClass (all collations) > > > Key: SPARK-48215 > URL: https://issues.apache.org/jira/browse/SPARK-48215 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Assignee: Nebojsa Savic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Enable collation support for the *DateFormatClass* built-in function in > Spark. First confirm what the expected behaviour is for this expression when > given collated strings, and then move on to implementation and testing. You > will find this expression in the *datetimeExpressions.scala* file, and it > should be considered a pass-through function with respect to collation > awareness. Implement the corresponding E2E SQL tests > (CollationSQLExpressionsSuite) to reflect how this function should be used > with collation in SparkSQL, and feel free to use your chosen Spark SQL Editor > to experiment with the existing functions to learn more about how they work. > In addition, look into the possible use-cases and implementation of similar > functions within other open-source DBMSs, such as > [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *DateFormatClass* > expression so that it supports all collation types currently supported in > Spark. To understand what changes were introduced in order to enable full > collation support for other existing functions in Spark, take a look at the > Spark PRs and Jira tickets for completed tasks in this parent (for example: > Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, > FormatNumber, Sentences). 
> > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class. Also, refer to the Unicode Technical > Standard for string > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48215) DateFormatClass (all collations)
[ https://issues.apache.org/jira/browse/SPARK-48215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-48215: --- Assignee: Nebojsa Savic > DateFormatClass (all collations) > > > Key: SPARK-48215 > URL: https://issues.apache.org/jira/browse/SPARK-48215 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Assignee: Nebojsa Savic >Priority: Major > Labels: pull-request-available > > Enable collation support for the *DateFormatClass* built-in function in > Spark. First confirm what the expected behaviour is for this expression when > given collated strings, and then move on to implementation and testing. You > will find this expression in the *datetimeExpressions.scala* file, and it > should be considered a pass-through function with respect to collation > awareness. Implement the corresponding E2E SQL tests > (CollationSQLExpressionsSuite) to reflect how this function should be used > with collation in SparkSQL, and feel free to use your chosen Spark SQL Editor > to experiment with the existing functions to learn more about how they work. > In addition, look into the possible use-cases and implementation of similar > functions within other open-source DBMSs, such as > [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *DateFormatClass* > expression so that it supports all collation types currently supported in > Spark. To understand what changes were introduced in order to enable full > collation support for other existing functions in Spark, take a look at the > Spark PRs and Jira tickets for completed tasks in this parent (for example: > Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, > FormatNumber, Sentences). > > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class. 
Also, refer to the Unicode Technical > Standard for string > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-48379) Cancel build during a PR when a new commit is pushed
[ https://issues.apache.org/jira/browse/SPARK-48379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-48379: -- Assignee: (was: Stefan Kandic) Reverted in https://github.com/apache/spark/commit/9fd85d9acc5acf455d0ad910ef2848695576242b > Cancel build during a PR when a new commit is pushed > > > Key: SPARK-48379 > URL: https://issues.apache.org/jira/browse/SPARK-48379 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Stefan Kandic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Creating a new commit on a branch should cancel the build of previous commits > for the same branch. > Exceptions are master and branch-* branches where we still want to have > concurrent builds. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48379) Cancel build during a PR when a new commit is pushed
[ https://issues.apache.org/jira/browse/SPARK-48379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-48379: - Fix Version/s: (was: 4.0.0) > Cancel build during a PR when a new commit is pushed > > > Key: SPARK-48379 > URL: https://issues.apache.org/jira/browse/SPARK-48379 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Stefan Kandic >Priority: Major > Labels: pull-request-available > > Creating a new commit on a branch should cancel the build of previous commits > for the same branch. > Exceptions are master and branch-* branches where we still want to have > concurrent builds. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
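The behaviour requested in SPARK-48379 (cancel superseded PR builds, keep concurrent builds on master and branch-*) maps onto GitHub Actions' concurrency feature. A minimal sketch of that idea, for illustration only; this is not necessarily the exact workflow change from the linked PR, and the group key is an assumption:

```yaml
# Sketch: cancel in-progress runs when a new commit lands on the same ref,
# except on master and release branches (branch-*), which keep running.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/master' && !startsWith(github.ref, 'refs/heads/branch-') }}
```

Because the `group` key includes the ref, each PR branch gets its own queue, so cancelling one branch's stale run never affects another branch's build.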
[jira] [Resolved] (SPARK-48389) Remove obsolete workflow cancel_duplicate_workflow_runs
[ https://issues.apache.org/jira/browse/SPARK-48389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48389. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46703 [https://github.com/apache/spark/pull/46703] > Remove obsolete workflow cancel_duplicate_workflow_runs > --- > > Key: SPARK-48389 > URL: https://issues.apache.org/jira/browse/SPARK-48389 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > After https://github.com/apache/spark/pull/46689, we don't need this anymore -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48373) Allow schema parameter of createDataFrame() to be length-1 list or tuple of StructType
[ https://issues.apache.org/jira/browse/SPARK-48373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook resolved SPARK-48373. -- Resolution: Won't Fix > Allow schema parameter of createDataFrame() to be length-1 list or tuple of > StructType > -- > > Key: SPARK-48373 > URL: https://issues.apache.org/jira/browse/SPARK-48373 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.5.1 >Reporter: Ian Cook >Priority: Major > Labels: pull-request-available > > Currently in PySpark (both Classic and Connect), if a user passes a length-1 > list or tuple of {{StructType}} as the {{schema}} argument to > {{{}createDataFrame{}}}, PySpark raises an unhelpful error message. > Unfortunately it is easy for this to happen. For example if a user leaves a > trailing comma at the end of the line that defines the {{{}StructType{}}}. > Add some simple code to {{createDataFrame}} to handle this case more > gracefully -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-48373) Allow schema parameter of createDataFrame() to be length-1 list or tuple of StructType
[ https://issues.apache.org/jira/browse/SPARK-48373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook closed SPARK-48373. > Allow schema parameter of createDataFrame() to be length-1 list or tuple of > StructType > -- > > Key: SPARK-48373 > URL: https://issues.apache.org/jira/browse/SPARK-48373 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.5.1 >Reporter: Ian Cook >Priority: Major > Labels: pull-request-available > > Currently in PySpark (both Classic and Connect), if a user passes a length-1 > list or tuple of {{StructType}} as the {{schema}} argument to > {{{}createDataFrame{}}}, PySpark raises an unhelpful error message. > Unfortunately it is easy for this to happen. For example if a user leaves a > trailing comma at the end of the line that defines the {{{}StructType{}}}. > Add some simple code to {{createDataFrame}} to handle this case more > gracefully -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
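The trailing-comma pitfall described in SPARK-48373, and the kind of guard the ticket proposes, can be sketched in plain Python. Here `unwrap_schema` is a hypothetical helper (not an actual PySpark function), and `StructType` is a bare stand-in class so the snippet runs without a Spark installation:

```python
class StructType:
    """Stand-in for pyspark.sql.types.StructType (illustration only)."""
    pass

# A trailing comma silently turns the schema into a length-1 tuple:
schema = StructType(),   # note the comma -- schema is now a tuple!
assert isinstance(schema, tuple)

def unwrap_schema(schema):
    """Hypothetical guard createDataFrame could apply: unwrap a length-1
    list or tuple whose only element is a StructType."""
    if (isinstance(schema, (list, tuple)) and len(schema) == 1
            and isinstance(schema[0], StructType)):
        return schema[0]
    return schema

fixed = unwrap_schema(schema)
assert isinstance(fixed, StructType)
```

With such a guard, the accidental tuple would be transparently unwrapped instead of producing an unhelpful error deep inside schema handling.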
[jira] [Assigned] (SPARK-48389) Remove obsolete workflow cancel_duplicate_workflow_runs
[ https://issues.apache.org/jira/browse/SPARK-48389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-48389: Assignee: Hyukjin Kwon > Remove obsolete workflow cancel_duplicate_workflow_runs > --- > > Key: SPARK-48389 > URL: https://issues.apache.org/jira/browse/SPARK-48389 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > > After https://github.com/apache/spark/pull/46689, we don't need this anymore -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48389) Remove obsolete workflow cancel_duplicate_workflow_runs
[ https://issues.apache.org/jira/browse/SPARK-48389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48389: --- Labels: pull-request-available (was: ) > Remove obsolete workflow cancel_duplicate_workflow_runs > --- > > Key: SPARK-48389 > URL: https://issues.apache.org/jira/browse/SPARK-48389 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > > After https://github.com/apache/spark/pull/46689, we don't need this anymore -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48389) Remove obsolete workflow cancel_duplicate_workflow_runs
Hyukjin Kwon created SPARK-48389: Summary: Remove obsolete workflow cancel_duplicate_workflow_runs Key: SPARK-48389 URL: https://issues.apache.org/jira/browse/SPARK-48389 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Hyukjin Kwon After https://github.com/apache/spark/pull/46689, we don't need this anymore -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48388) Fix SET behavior for scripts
David Milicevic created SPARK-48388: --- Summary: Fix SET behavior for scripts Key: SPARK-48388 URL: https://issues.apache.org/jira/browse/SPARK-48388 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: David Milicevic By the SQL standard, SET is used to set variable values in SQL scripts. On our end, SET is configured to work with some Hive configs, so the grammar is a bit messed up; for that reason it was decided to use SET VAR instead of SET to work with SQL variables. This is not standard-compliant, and we should figure out a way to use SET for SQL variables while forbidding the setting of Hive configs from SQL scripts. For more details, the design doc can be found in the parent Jira item. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
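The current vs. desired behaviour in SPARK-48388 can be sketched with a short SQL fragment. The variable syntax follows Spark's session-variable support (SET VAR); the scripting rules in the comments are the ticket's intended end state, not existing behaviour:

```sql
-- Session variable, as Spark SQL supports today:
DECLARE VARIABLE counter INT DEFAULT 0;
SET VAR counter = counter + 1;   -- today: SET VAR is required for variables

-- Desired end state inside a SQL script: plain, standard SET assigns the
-- variable, while touching configs from a script is rejected:
-- SET counter = counter + 1;              -- should target the variable
-- SET spark.sql.shuffle.partitions = 10;  -- should be forbidden in scripts
```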
[jira] [Assigned] (SPARK-48379) Cancel build during a PR when a new commit is pushed
[ https://issues.apache.org/jira/browse/SPARK-48379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-48379: Assignee: Stefan Kandic > Cancel build during a PR when a new commit is pushed > > > Key: SPARK-48379 > URL: https://issues.apache.org/jira/browse/SPARK-48379 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Stefan Kandic >Assignee: Stefan Kandic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Creating a new commit on a branch should cancel the build of previous commits > for the same branch. > Exceptions are master and branch-* branches where we still want to have > concurrent builds. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48379) Cancel build during a PR when a new commit is pushed
[ https://issues.apache.org/jira/browse/SPARK-48379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48379. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46689 [https://github.com/apache/spark/pull/46689] > Cancel build during a PR when a new commit is pushed > > > Key: SPARK-48379 > URL: https://issues.apache.org/jira/browse/SPARK-48379 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Stefan Kandic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Creating a new commit on a branch should cancel the build of previous commits > for the same branch. > Exceptions are master and branch-* branches where we still want to have > concurrent builds. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48379) Cancel build during a PR when a new commit is pushed
[ https://issues.apache.org/jira/browse/SPARK-48379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48379: -- Assignee: (was: Apache Spark) > Cancel build during a PR when a new commit is pushed > > > Key: SPARK-48379 > URL: https://issues.apache.org/jira/browse/SPARK-48379 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Stefan Kandic >Priority: Major > Labels: pull-request-available > > Creating a new commit on a branch should cancel the build of previous commits > for the same branch. > Exceptions are master and branch-* branches where we still want to have > concurrent builds. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48379) Cancel build during a PR when a new commit is pushed
[ https://issues.apache.org/jira/browse/SPARK-48379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48379: -- Assignee: Apache Spark > Cancel build during a PR when a new commit is pushed > > > Key: SPARK-48379 > URL: https://issues.apache.org/jira/browse/SPARK-48379 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Stefan Kandic >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > Creating a new commit on a branch should cancel the build of previous commits > for the same branch. > Exceptions are master and branch-* branches where we still want to have > concurrent builds. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48370) Checkpoint and localCheckpoint in Scala Spark Connect client
[ https://issues.apache.org/jira/browse/SPARK-48370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48370: -- Assignee: Apache Spark > Checkpoint and localCheckpoint in Scala Spark Connect client > > > Key: SPARK-48370 > URL: https://issues.apache.org/jira/browse/SPARK-48370 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > SPARK-48258 implemented checkpoint and localcheckpoint in Python Spark > Connect client. We should do it in Scala too. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48370) Checkpoint and localCheckpoint in Scala Spark Connect client
[ https://issues.apache.org/jira/browse/SPARK-48370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48370: -- Assignee: (was: Apache Spark) > Checkpoint and localCheckpoint in Scala Spark Connect client > > > Key: SPARK-48370 > URL: https://issues.apache.org/jira/browse/SPARK-48370 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > > SPARK-48258 implemented checkpoint and localcheckpoint in Python Spark > Connect client. We should do it in Scala too. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48311) Nested pythonUDF in groupBy and aggregate result in Binding Exception
[ https://issues.apache.org/jira/browse/SPARK-48311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848516#comment-17848516 ] Sumit Singh commented on SPARK-48311: - Looks like the issue is due to the *expr.transformUp* code in ExtractGroupingPythonUDFFromAggregate {code:java} val aggExpr = agg.aggregateExpressions.map { expr => expr.transformUp { // PythonUDF over aggregate was pull out by ExtractPythonUDFFromAggregate. // PythonUDF here should be either // 1. Argument of an aggregate function. //CheckAnalysis guarantees the arguments are deterministic. // 2. PythonUDF in grouping key. Grouping key must be deterministic. // 3. PythonUDF not in grouping key. It is either no arguments or with grouping key // in its arguments. Such PythonUDF was pull out by ExtractPythonUDFFromAggregate, too. case p: PythonUDF if p.udfDeterministic => val canonicalized = p.canonicalized.asInstanceOf[PythonUDF] attributeMap.getOrElse(canonicalized, p) } } {code} If we have udf1("a") and udf2(udf1("a")) in the grouping keys, and udf2(udf1("a")) in the aggregate, then because of *expr.transformUp*, for the expression udf2(udf1("a")): 1. udf1("a") is visited first and replaced by its canonicalized grouping value, some groupingUDF# 2. the expression then becomes udf2(groupingUDF#) 3. this rewritten form is not found in the cache, so it is kept as-is. I think this should be changed to expr.transformDown, as it is in the grouping section of the same class. Details: [https://docs.google.com/document/d/1RXOOCsaFU-E1ZmXJnSrJ2jdRgYQgGAVxDPeFNNaoB1M/edit?usp=sharing] > Nested pythonUDF in groupBy and aggregate result in Binding Exception > -- > > Key: SPARK-48311 > URL: https://issues.apache.org/jira/browse/SPARK-48311 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.3.2 >Reporter: Sumit Singh >Priority: Major > > Steps to Reproduce > 1. 
Data creation > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql.types import StructType, StructField, LongType, > TimestampType, StringType > from datetime import datetime > spark = SparkSession.builder.getOrCreate() > # Define the schema > schema = StructType([ > StructField("col1", LongType(), nullable=True), > StructField("col2", TimestampType(), nullable=True), > StructField("col3", StringType(), nullable=True) > ]) > # Define the data > data = [ > (1, datetime(2023, 5, 15, 12, 30), "Discount"), > (2, datetime(2023, 5, 16, 16, 45), "Promotion"), > (3, datetime(2023, 5, 17, 9, 15), "Coupon") > ] > # Create the DataFrame > df = spark.createDataFrame(data, schema) > df.createOrReplaceTempView("temp_offers") > # Query the temporary table using SQL > # DISTINCT required to reproduce the issue. > testDf = spark.sql(""" > SELECT DISTINCT > col1, > col2, > col3 FROM temp_offers > """) {code} > 2. UDF registration > {code:java} > import pyspark.sql.functions as F > import pyspark.sql.types as T > # Creating udf functions > def udf1(d): > return d > def udf2(d): > if d.isoweekday() in (1, 2, 3, 4): > return 'WEEKDAY' > else: > return 'WEEKEND' > udf1_name = F.udf(udf1, T.TimestampType()) > udf2_name = F.udf(udf2, T.StringType()) {code} > 3. Adding UDF in grouping and agg > {code:java} > groupBy_cols = ['col1', 'col4', 'col5', 'col3'] > temp = testDf \ > .select('*', udf1_name(F.col('col2')).alias('col4')).select('*', > udf2_name('col4').alias('col5')) > result = > (temp.groupBy(*groupBy_cols).agg(F.countDistinct('col5').alias('col6'))){code} > 4. Result > {code:java} > result.show(5, False) {code} > *We get below error* > {code:java} > An error was encountered: > An error occurred while calling o1079.showString. 
> : java.lang.IllegalStateException: Couldn't find pythonUDF0#1108 in > [col1#978L,groupingPythonUDF#1104,groupingPythonUDF#1105,col3#980,count(pythonUDF0#1108)#1080L] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
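The rewrite-order problem the comment above describes can be shown with a tiny toy model (plain Python, not Spark code): expressions are nested tuples, and the map holds grouping expressions that were already pulled out, keyed by their canonical form. Bottom-up rewriting mutates the child before the parent is looked up, so the parent's map entry can never match:

```python
# Toy model of transformUp vs transformDown with a replacement map.
udf1_a = ("udf1", "a")               # udf1("a"), a grouping key
udf2_udf1_a = ("udf2", udf1_a)       # udf2(udf1("a")), also a grouping key
attribute_map = {udf1_a: "groupingUDF#1", udf2_udf1_a: "groupingUDF#2"}

def transform_up(expr, rule):
    # Rewrite children first, then apply the rule to the rebuilt node.
    if isinstance(expr, tuple):
        expr = (expr[0],) + tuple(transform_up(c, rule) for c in expr[1:])
    return rule(expr)

def transform_down(expr, rule):
    # Apply the rule to the node first, then recurse into what remains.
    expr = rule(expr)
    if isinstance(expr, tuple):
        expr = (expr[0],) + tuple(transform_down(c, rule) for c in expr[1:])
    return expr

rule = lambda e: attribute_map.get(e, e)

# Bottom-up: udf1("a") is replaced first, so the outer node becomes
# ("udf2", "groupingUDF#1"), which is NOT in the map -> left dangling.
assert transform_up(udf2_udf1_a, rule) == ("udf2", "groupingUDF#1")

# Top-down: the whole expression matches the map before its child changes.
assert transform_down(udf2_udf1_a, rule) == "groupingUDF#2"
```

The dangling `("udf2", "groupingUDF#1")` node in the bottom-up case corresponds to the unresolved `pythonUDF0#...` reference in the `IllegalStateException` above.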
[jira] [Updated] (SPARK-48385) Migrate the jdbc driver of mariadb from 2.x to 3.x
[ https://issues.apache.org/jira/browse/SPARK-48385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48385: --- Labels: pull-request-available (was: ) > Migrate the jdbc driver of mariadb from 2.x to 3.x > -- > > Key: SPARK-48385 > URL: https://issues.apache.org/jira/browse/SPARK-48385 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48387) Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE
[ https://issues.apache.org/jira/browse/SPARK-48387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48387: --- Labels: pull-request-available (was: ) > Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE > --- > > Key: SPARK-48387 > URL: https://issues.apache.org/jira/browse/SPARK-48387 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48311) Nested pythonUDF in groupBy and aggregate result in Binding Exception
[ https://issues.apache.org/jira/browse/SPARK-48311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848511#comment-17848511 ] kalyan s commented on SPARK-48311: -- It seems something changed in ExtractGroupingPythonUDFFromAggregate in this commit: fdccd88c2a6dd18c9d446b63fccd5c6188ea125c [~cloud_fan] Can you help with this change? > Nested pythonUDF in groupBy and aggregate result in Binding Exception > -- > > Key: SPARK-48311 > URL: https://issues.apache.org/jira/browse/SPARK-48311 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.3.2 >Reporter: Sumit Singh >Priority: Major > > Steps to Reproduce > 1. Data creation > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql.types import StructType, StructField, LongType, > TimestampType, StringType > from datetime import datetime > spark = SparkSession.builder.getOrCreate() > # Define the schema > schema = StructType([ > StructField("col1", LongType(), nullable=True), > StructField("col2", TimestampType(), nullable=True), > StructField("col3", StringType(), nullable=True) > ]) > # Define the data > data = [ > (1, datetime(2023, 5, 15, 12, 30), "Discount"), > (2, datetime(2023, 5, 16, 16, 45), "Promotion"), > (3, datetime(2023, 5, 17, 9, 15), "Coupon") > ] > # Create the DataFrame > df = spark.createDataFrame(data, schema) > df.createOrReplaceTempView("temp_offers") > # Query the temporary table using SQL > # DISTINCT required to reproduce the issue. > testDf = spark.sql(""" > SELECT DISTINCT > col1, > col2, > col3 FROM temp_offers > """) {code} > 2. UDF registration > {code:java} > import pyspark.sql.functions as F > import pyspark.sql.types as T > # Creating udf functions > def udf1(d): > return d > def udf2(d): > if d.isoweekday() in (1, 2, 3, 4): > return 'WEEKDAY' > else: > return 'WEEKEND' > udf1_name = F.udf(udf1, T.TimestampType()) > udf2_name = F.udf(udf2, T.StringType()) {code} > 3. 
Adding UDF in grouping and agg > {code:java} > groupBy_cols = ['col1', 'col4', 'col5', 'col3'] > temp = testDf \ > .select('*', udf1_name(F.col('col2')).alias('col4')).select('*', > udf2_name('col4').alias('col5')) > result = > (temp.groupBy(*groupBy_cols).agg(F.countDistinct('col5').alias('col6'))){code} > 4. Result > {code:java} > result.show(5, False) {code} > *We get below error* > {code:java} > An error was encountered: > An error occurred while calling o1079.showString. > : java.lang.IllegalStateException: Couldn't find pythonUDF0#1108 in > [col1#978L,groupingPythonUDF#1104,groupingPythonUDF#1105,col3#980,count(pythonUDF0#1108)#1080L] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48387) Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE
Kent Yao created SPARK-48387: Summary: Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE Key: SPARK-48387 URL: https://issues.apache.org/jira/browse/SPARK-48387 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48386) Replace JVM assert with JUnit Assert in tests
BingKun Pan created SPARK-48386: --- Summary: Replace JVM assert with JUnit Assert in tests Key: SPARK-48386 URL: https://issues.apache.org/jira/browse/SPARK-48386 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47920) Add documentation for python streaming data source
[ https://issues.apache.org/jira/browse/SPARK-47920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-47920. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46139 [https://github.com/apache/spark/pull/46139] > Add documentation for python streaming data source > -- > > Key: SPARK-47920 > URL: https://issues.apache.org/jira/browse/SPARK-47920 > Project: Spark > Issue Type: New Feature > Components: PySpark, SS >Affects Versions: 4.0.0 >Reporter: Chaoqin Li >Assignee: Chaoqin Li >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Add documentation (user guide) for Python data source API. > The DOC should explain how to develop and use DataSourceStreamReader and > DataSourceStreamWriter -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45716) Python parity method StructType.treeString
[ https://issues.apache.org/jira/browse/SPARK-45716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-45716. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46685 [https://github.com/apache/spark/pull/46685] > Python parity method StructType.treeString > -- > > Key: SPARK-45716 > URL: https://issues.apache.org/jira/browse/SPARK-45716 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Khalid Mammadov >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > Add missing parity method from Scala to Python -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48372) Implement `StructType.treeString`
[ https://issues.apache.org/jira/browse/SPARK-48372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-48372. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46685 [https://github.com/apache/spark/pull/46685] > Implement `StructType.treeString` > - > > Key: SPARK-48372 > URL: https://issues.apache.org/jira/browse/SPARK-48372 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48385) Migrate the jdbc driver of mariadb from 2.x to 3.x
[ https://issues.apache.org/jira/browse/SPARK-48385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-48385: Summary: Migrate the jdbc driver of mariadb from 2.x to 3.x (was: Migrate the driver of mariadb from 2.x to 3.x) > Migrate the jdbc driver of mariadb from 2.x to 3.x > -- > > Key: SPARK-48385 > URL: https://issues.apache.org/jira/browse/SPARK-48385 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org