[jira] [Commented] (SPARK-42555) Add JDBC to DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-42555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696410#comment-17696410 ] Apache Spark commented on SPARK-42555: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/40277 > Add JDBC to DataFrameReader > --- > > Key: SPARK-42555 > URL: https://issues.apache.org/jira/browse/SPARK-42555 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: jiaan.geng >Priority: Major > Fix For: 3.4.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42638) current_user() is blocked from VALUES, but current_timestamp() is not
[ https://issues.apache.org/jira/browse/SPARK-42638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696409#comment-17696409 ] ming95 commented on SPARK-42638: Maybe we can use `insert as select` to achieve the same effect? > current_user() is blocked from VALUES, but current_timestamp() is not > - > > Key: SPARK-42638 > URL: https://issues.apache.org/jira/browse/SPARK-42638 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Serge Rielau >Priority: Major > > VALUES(current_user()); > returns: > cannot evaluate expression current_user() in inline table definition.; line 1 > pos 8 > > The same statement with current_timestamp() works. > It appears current_user() is recognized as non-deterministic. But it is > constant within the statement, just like current_timestamp(). > PS: It's not clear why we block non-deterministic functions to begin with
[jira] [Resolved] (SPARK-40885) Spark will filter out data field sorting when dynamic partitions and data fields are sorted at the same time
[ https://issues.apache.org/jira/browse/SPARK-40885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ming95 resolved SPARK-40885. Resolution: Fixed > Spark will filter out data field sorting when dynamic partitions and data > fields are sorted at the same time > > > Key: SPARK-40885 > URL: https://issues.apache.org/jira/browse/SPARK-40885 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.3.0, 3.2.2, 3.4.0 >Reporter: ming95 >Priority: Major > Attachments: 1666494504884.jpg > > > When using dynamic partitions to write data and sorting by both partition and data > fields, Spark filters out the sort on the data fields. > > reproduce sql: > {code:java} > CREATE TABLE `sort_table`( > `id` int, > `name` string > ) > PARTITIONED BY ( > `dt` string) > stored as textfile > LOCATION 'sort_table'; > CREATE TABLE `test_table`( > `id` int, > `name` string) > PARTITIONED BY ( > `dt` string) > stored as textfile > LOCATION 'test_table'; > // gen test data > insert into test_table partition(dt=20221011) select 10,"15" union all select > 1,"10" union all select 5,"50" union all select 20,"2" union all select > 30,"14" ; > set spark.hadoop.hive.exec.dynamic.partition=true; > set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict; > // this sql sorts by the partition field (`dt`) and the data field (`name`), but > the sort on `name` does not take effect > insert overwrite table sort_table partition(dt) select id,name,dt from > test_table order by name,dt; > {code} > > The Sort operator in the DAG has only one sort field, but there are actually two > in the SQL. (See the attached drawing.) > > It relates to this issue: https://issues.apache.org/jira/browse/SPARK-40588
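The effect described above can be illustrated with plain Python, using the rows from the reproduce SQL. This only models the ordering semantics (note `name` is a string, so "2" sorts after "14" lexicographically), not Spark's planner: sorting by (`name`, `dt`) yields a different row order than sorting by the dynamic partition column `dt` alone, which is what the filtered-out sort key effectively produces.

```python
# Rows from the reproduce SQL above: (id, name, dt).
rows = [(10, "15", "20221011"), (1, "10", "20221011"),
        (5, "50", "20221011"), (20, "2", "20221011"),
        (30, "14", "20221011")]

# What the query asks for: ORDER BY name, dt.
requested = sorted(rows, key=lambda r: (r[1], r[2]))

# What the reported plan effectively does: keep only the dynamic
# partition column `dt` in the Sort operator. All dt values tie,
# so a stable sort leaves the input order untouched.
actual = sorted(rows, key=lambda r: r[2])

print([r[0] for r in requested])  # [1, 30, 10, 20, 5]
print([r[0] for r in actual])     # [10, 1, 5, 20, 30]
```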
[jira] [Resolved] (SPARK-42556) Dataset.colregex should link a plan_id when it only matches a single column.
[ https://issues.apache.org/jira/browse/SPARK-42556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-42556. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40265 [https://github.com/apache/spark/pull/40265] > Dataset.colregex should link a plan_id when it only matches a single column. > > > Key: SPARK-42556 > URL: https://issues.apache.org/jira/browse/SPARK-42556 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > Fix For: 3.4.0 > > > When colregex returns a single column it should link the plan's plan_id. For > reference, here is the non-connect Dataset code that does this: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1512] > This also needs to be fixed for the Python client.
[jira] [Commented] (SPARK-42562) UnresolvedLambdaVariables in python do not need unique names
[ https://issues.apache.org/jira/browse/SPARK-42562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696397#comment-17696397 ] Herman van Hövell commented on SPARK-42562: --- It currently generates unique names, but it doesn't need to. I think we should remove that. > UnresolvedLambdaVariables in python do not need unique names > > > Key: SPARK-42562 > URL: https://issues.apache.org/jira/browse/SPARK-42562 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > UnresolvedLambdaVariables do not need unique names in python. We already did > this for the scala client, and it is good to have parity between the two > implementations.
[jira] [Created] (SPARK-42669) Short circuit local relation rpcs
Herman van Hövell created SPARK-42669: - Summary: Short circuit local relation rpcs Key: SPARK-42669 URL: https://issues.apache.org/jira/browse/SPARK-42669 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 3.4.0 Reporter: Herman van Hövell Operations on LocalRelation can mostly be done locally (without sending rpcs). We should leverage this.
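The idea in this ticket (answer operations on a LocalRelation from data already held client-side, instead of sending an RPC) can be sketched roughly as a dispatch on the relation type. All names below are hypothetical stand-ins; Spark Connect's real client is considerably more involved.

```python
from dataclasses import dataclass

@dataclass
class LocalRelation:
    rows: list  # data the client already holds

@dataclass
class RemoteRelation:
    plan_id: int  # must be resolved by the server

def count(relation, send_rpc):
    # Short-circuit: a LocalRelation's rows live in the client,
    # so count() can be answered without any round trip.
    if isinstance(relation, LocalRelation):
        return len(relation.rows)
    # Anything else still goes over the wire.
    return send_rpc("count", relation.plan_id)

rpc_calls = []
def fake_rpc(op, plan_id):
    rpc_calls.append((op, plan_id))
    return 42

print(count(LocalRelation(rows=[1, 2, 3]), fake_rpc))  # 3, no RPC recorded
print(count(RemoteRelation(plan_id=7), fake_rpc))      # 42, one RPC recorded
```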
[jira] [Commented] (SPARK-42552) Get ParseException when run sql: "SELECT 1 UNION SELECT 1;"
[ https://issues.apache.org/jira/browse/SPARK-42552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696395#comment-17696395 ] jiang13021 commented on SPARK-42552: The problem may be in this location: [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala#L126] When the `PredictionMode` is `SLL`, `AstBuilder` will throw `ParseException` instead of `ParseCancellationException`, so the parser doesn't try `LL` mode. In fact, if we use `LL` mode, we can parse the SQL correctly. > Get ParseException when run sql: "SELECT 1 UNION SELECT 1;" > --- > > Key: SPARK-42552 > URL: https://issues.apache.org/jira/browse/SPARK-42552 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3 > Environment: Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java > 1.8.0_345) > Spark version 3.2.3-SNAPSHOT >Reporter: jiang13021 >Priority: Major > Fix For: 3.2.3 > > > When I run sql > {code:java} > scala> spark.sql("SELECT 1 UNION SELECT 1;") {code} > I get ParseException: > {code:java} > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'SELECT' expecting {<EOF>, ';'}(line 1, pos 15)== SQL == > SELECT 1 UNION SELECT 1; > ---^^^ at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:266) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:127) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:77) > at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:616) > at > org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:616) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613) > ... 47 elided > {code} > If I run with parentheses, it works well > {code:java} > scala> spark.sql("(SELECT 1) UNION (SELECT 1);") > res4: org.apache.spark.sql.DataFrame = [1: int]{code} > This should be a bug
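The two-stage strategy the comment above refers to (try the fast SLL prediction mode first, retry in full LL mode only when the first pass is cancelled) can be sketched in Python. The parser names here are hypothetical stand-ins, not Spark's or ANTLR's actual API:

```python
class ParseCancellationError(Exception):
    """Stands in for ANTLR's ParseCancellationException: the fast
    SLL pass bailed out and the caller should retry in LL mode."""

def parse_with_fallback(sql, parse_sll, parse_ll):
    # Try the cheap SLL prediction mode first. Only a cancellation
    # should trigger the LL retry; if the SLL pass raises a final
    # ParseException directly (the bug described above), LL mode is
    # never attempted and valid SQL is rejected.
    try:
        return parse_sll(sql)
    except ParseCancellationError:
        return parse_ll(sql)

# Stub parsers: SLL cannot handle this query, LL can.
def sll(sql):
    raise ParseCancellationError()

def ll(sql):
    return ("plan", sql)

print(parse_with_fallback("SELECT 1 UNION SELECT 1;", sll, ll))
```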
[jira] [Commented] (SPARK-42630) Make `parse_data_type` use new proto message `DDLParse`
[ https://issues.apache.org/jira/browse/SPARK-42630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696392#comment-17696392 ] Apache Spark commented on SPARK-42630: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40276 > Make `parse_data_type` use new proto message `DDLParse` > --- > > Key: SPARK-42630 > URL: https://issues.apache.org/jira/browse/SPARK-42630 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major
[jira] [Commented] (SPARK-42630) Make `parse_data_type` use new proto message `DDLParse`
[ https://issues.apache.org/jira/browse/SPARK-42630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696391#comment-17696391 ] Apache Spark commented on SPARK-42630: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40276 > Make `parse_data_type` use new proto message `DDLParse` > --- > > Key: SPARK-42630 > URL: https://issues.apache.org/jira/browse/SPARK-42630 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major
[jira] [Resolved] (SPARK-42555) Add JDBC to DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-42555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42555. --- Fix Version/s: 3.4.1 Assignee: jiaan.geng Resolution: Fixed > Add JDBC to DataFrameReader > --- > > Key: SPARK-42555 > URL: https://issues.apache.org/jira/browse/SPARK-42555 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: jiaan.geng >Priority: Major > Fix For: 3.4.1
[jira] [Commented] (SPARK-42562) UnresolvedLambdaVariables in python do not need unique names
[ https://issues.apache.org/jira/browse/SPARK-42562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696390#comment-17696390 ] jiaan.geng commented on SPARK-42562: [~hvanhovell] I don't understand this issue. Could you tell me where UnresolvedLambdaVariables need unique names in python? > UnresolvedLambdaVariables in python do not need unique names > > > Key: SPARK-42562 > URL: https://issues.apache.org/jira/browse/SPARK-42562 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > UnresolvedLambdaVariables do not need unique names in python. We already did > this for the scala client, and it is good to have parity between the two > implementations.
[jira] [Comment Edited] (SPARK-42562) UnresolvedLambdaVariables in python do not need unique names
[ https://issues.apache.org/jira/browse/SPARK-42562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696390#comment-17696390 ] jiaan.geng edited comment on SPARK-42562 at 3/4/23 2:59 AM: [~hvanhovell] I don't understand this issue. Could you tell me where UnresolvedLambdaVariables need unique names in python ? was (Author: beliefer): [~hvanhovell]I don't understand this issue. Could you tell me where UnresolvedLambdaVariables need unique names in python ? > UnresolvedLambdaVariables in python do not need unique names > > > Key: SPARK-42562 > URL: https://issues.apache.org/jira/browse/SPARK-42562 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > UnresolvedLambdaVariables do not need unique names in python. We already did > this for the scala client, and it is good to have parity between the two > implementations.
[jira] [Updated] (SPARK-42552) Get ParseException when run sql: "SELECT 1 UNION SELECT 1;"
[ https://issues.apache.org/jira/browse/SPARK-42552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiang13021 updated SPARK-42552: --- Priority: Major (was: Minor) > Get ParseException when run sql: "SELECT 1 UNION SELECT 1;" > --- > > Key: SPARK-42552 > URL: https://issues.apache.org/jira/browse/SPARK-42552 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3 > Environment: Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java > 1.8.0_345) > Spark version 3.2.3-SNAPSHOT >Reporter: jiang13021 >Priority: Major > Fix For: 3.2.3 > > > When I run sql > {code:java} > scala> spark.sql("SELECT 1 UNION SELECT 1;") {code} > I get ParseException: > {code:java} > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'SELECT' expecting {<EOF>, ';'}(line 1, pos 15)== SQL == > SELECT 1 UNION SELECT 1; > ---^^^ at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:266) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:127) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:77) > at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:616) > at > org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:616) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613) > ... 47 elided > {code} > If I run with parentheses, it works well > {code:java} > scala> spark.sql("(SELECT 1) UNION (SELECT 1);") > res4: org.apache.spark.sql.DataFrame = [1: int]{code} > This should be a bug
[jira] (SPARK-42557) Add Broadcast to functions
[ https://issues.apache.org/jira/browse/SPARK-42557 ] jiaan.geng deleted comment on SPARK-42557: was (Author: beliefer): I will take a look! > Add Broadcast to functions > -- > > Key: SPARK-42557 > URL: https://issues.apache.org/jira/browse/SPARK-42557 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > Add the {{broadcast}} function to functions.scala. Please check if we can get > the same semantics as the current implementation using unresolved hints. > https://github.com/apache/spark/blame/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1246-L1261
[jira] [Assigned] (SPARK-42557) Add Broadcast to functions
[ https://issues.apache.org/jira/browse/SPARK-42557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42557: Assignee: (was: Apache Spark) > Add Broadcast to functions > -- > > Key: SPARK-42557 > URL: https://issues.apache.org/jira/browse/SPARK-42557 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > Add the {{broadcast}} function to functions.scala. Please check if we can get > the same semantics as the current implementation using unresolved hints. > https://github.com/apache/spark/blame/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1246-L1261
[jira] [Assigned] (SPARK-42557) Add Broadcast to functions
[ https://issues.apache.org/jira/browse/SPARK-42557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42557: Assignee: Apache Spark > Add Broadcast to functions > -- > > Key: SPARK-42557 > URL: https://issues.apache.org/jira/browse/SPARK-42557 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Apache Spark >Priority: Major > > Add the {{broadcast}} function to functions.scala. Please check if we can get > the same semantics as the current implementation using unresolved hints. > https://github.com/apache/spark/blame/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1246-L1261
[jira] [Commented] (SPARK-42557) Add Broadcast to functions
[ https://issues.apache.org/jira/browse/SPARK-42557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696387#comment-17696387 ] Apache Spark commented on SPARK-42557: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/40275 > Add Broadcast to functions > -- > > Key: SPARK-42557 > URL: https://issues.apache.org/jira/browse/SPARK-42557 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > Add the {{broadcast}} function to functions.scala. Please check if we can get > the same semantics as the current implementation using unresolved hints. > https://github.com/apache/spark/blame/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1246-L1261
[jira] [Commented] (SPARK-42562) UnresolvedLambdaVariables in python do not need unique names
[ https://issues.apache.org/jira/browse/SPARK-42562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696386#comment-17696386 ] jiaan.geng commented on SPARK-42562: I will take a look! > UnresolvedLambdaVariables in python do not need unique names > > > Key: SPARK-42562 > URL: https://issues.apache.org/jira/browse/SPARK-42562 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > UnresolvedLambdaVariables do not need unique names in python. We already did > this for the scala client, and it is good to have parity between the two > implementations.
[jira] [Resolved] (SPARK-42563) Implement SparkSession.newSession
[ https://issues.apache.org/jira/browse/SPARK-42563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42563. --- Resolution: Duplicate > Implement SparkSession.newSession > - > > Key: SPARK-42563 > URL: https://issues.apache.org/jira/browse/SPARK-42563 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > Implement SparkSession.newSession for Connect. > {code:java} > /** > * Start a new session with isolated SQL configurations, temporary tables, > registered > * functions are isolated, but sharing the underlying `SparkContext` and > cached data. > * > * @note Other than the `SparkContext`, all shared state is initialized > lazily. > * This method will force the initialization of the shared state to ensure > that parent > * and child sessions are set up with the same shared state. If the > underlying catalog > * implementation is Hive, this will initialize the metastore, which may take > some time. > * > * @since 2.0.0 > */ > def newSession(): SparkSession {code}
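The contract in the quoted scaladoc (per-session state such as SQL configurations and temp views is isolated, while the underlying `SparkContext` and cached data are shared) can be modeled with a small sketch. All class names here are hypothetical illustrations, not Spark's implementation:

```python
class ContextStub:
    """Stands in for the shared SparkContext and its cached data."""
    def __init__(self):
        self.cached = {}

class SessionStub:
    """Models newSession() semantics: fresh conf/temp views, shared context."""
    def __init__(self, context=None):
        self.context = context or ContextStub()
        self.conf = {}
        self.temp_views = {}

    def new_session(self):
        # Per-session state starts empty; the backing context is reused.
        return SessionStub(context=self.context)

parent = SessionStub()
parent.conf["spark.sql.shuffle.partitions"] = "64"
child = parent.new_session()

print(child.conf)                       # {}   -> isolated configuration
print(child.context is parent.context)  # True -> shared context
```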
[jira] [Resolved] (SPARK-42467) Spark Connect Scala Client: GroupBy and Aggregation
[ https://issues.apache.org/jira/browse/SPARK-42467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42467. --- Fix Version/s: 3.4.1 Resolution: Fixed > Spark Connect Scala Client: GroupBy and Aggregation > --- > > Key: SPARK-42467 > URL: https://issues.apache.org/jira/browse/SPARK-42467 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.1
[jira] [Updated] (SPARK-42667) Spark Connect: newSession API
[ https://issues.apache.org/jira/browse/SPARK-42667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell updated SPARK-42667: -- Epic Link: SPARK-42554 > Spark Connect: newSession API > - > > Key: SPARK-42667 > URL: https://issues.apache.org/jira/browse/SPARK-42667 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.1 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major
[jira] [Resolved] (SPARK-42667) Spark Connect: newSession API
[ https://issues.apache.org/jira/browse/SPARK-42667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42667. --- Fix Version/s: 3.4.1 Resolution: Fixed > Spark Connect: newSession API > - > > Key: SPARK-42667 > URL: https://issues.apache.org/jira/browse/SPARK-42667 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.1 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.1
[jira] [Resolved] (SPARK-42175) Implement more methods in the Scala Client Dataset API
[ https://issues.apache.org/jira/browse/SPARK-42175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhen Li resolved SPARK-42175. - Resolution: Duplicate > Implement more methods in the Scala Client Dataset API > -- > > Key: SPARK-42175 > URL: https://issues.apache.org/jira/browse/SPARK-42175 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Zhen Li >Priority: Major > > Also fix the TODOs in the MiMa compatibility test. > https://github.com/apache/spark/pull/39712
[jira] [Assigned] (SPARK-42215) Better Scala Client Integration test
[ https://issues.apache.org/jira/browse/SPARK-42215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42215: Assignee: (was: Apache Spark) > Better Scala Client Integration test > > > Key: SPARK-42215 > URL: https://issues.apache.org/jira/browse/SPARK-42215 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Zhen Li >Priority: Major > > The current Scala client has a few integration tests that require a build > before the client tests can run. This is not ideal for maven developers, > who cannot simply do a `mvn clean install` to run all tests. > > Look into marking these tests as ITs, and into other ways for maven to run > tests after the packages are built. > > Make sure the tests also run in SBT.
[jira] [Commented] (SPARK-42215) Better Scala Client Integration test
[ https://issues.apache.org/jira/browse/SPARK-42215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696378#comment-17696378 ] Apache Spark commented on SPARK-42215: -- User 'zhenlineo' has created a pull request for this issue: https://github.com/apache/spark/pull/40274 > Better Scala Client Integration test > > > Key: SPARK-42215 > URL: https://issues.apache.org/jira/browse/SPARK-42215 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Zhen Li >Priority: Major > > The current Scala client has a few integration tests that require a build > before the client tests can run. This is not ideal for maven developers, > who cannot simply do a `mvn clean install` to run all tests. > > Look into marking these tests as ITs, and into other ways for maven to run > tests after the packages are built. > > Make sure the tests also run in SBT.
[jira] [Assigned] (SPARK-42215) Better Scala Client Integration test
[ https://issues.apache.org/jira/browse/SPARK-42215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42215: Assignee: Apache Spark > Better Scala Client Integration test > > > Key: SPARK-42215 > URL: https://issues.apache.org/jira/browse/SPARK-42215 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Zhen Li >Assignee: Apache Spark >Priority: Major > > The current Scala client has a few integration tests that require a build > before the client tests can run. This is not ideal for maven developers, > who cannot simply do a `mvn clean install` to run all tests. > > Look into marking these tests as ITs, and into other ways for maven to run > tests after the packages are built. > > Make sure the tests also run in SBT.
[jira] [Commented] (SPARK-42668) Catch exception while trying to close compressed stream in HDFSStateStoreProvider abort
[ https://issues.apache.org/jira/browse/SPARK-42668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696374#comment-17696374 ] Apache Spark commented on SPARK-42668: -- User 'anishshri-db' has created a pull request for this issue: https://github.com/apache/spark/pull/40273 > Catch exception while trying to close compressed stream in > HDFSStateStoreProvider abort > --- > > Key: SPARK-42668 > URL: https://issues.apache.org/jira/browse/SPARK-42668 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Priority: Major > > Catch exception while trying to close compressed stream in > HDFSStateStoreProvider abort > We have seen some cases where the task exits as cancelled/failed which > triggers the abort in the task completion listener for > HDFSStateStoreProvider. As part of this, we cancel the backing stream and > close the compressed stream. However, different stores such as Azure blob > store could throw exceptions which are not caught in the current path, > leading to job failures. This change proposes to fix this issue.
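The defensive-cleanup pattern this ticket describes (cancel the backing stream, then close the compressed stream, but never let a close failure escalate into a job failure) can be sketched in Python. The stream and function names are hypothetical stand-ins, not the HDFSStateStoreProvider API:

```python
import logging

log = logging.getLogger("state-store")

def abort_store(backing_stream, compressed_stream):
    """Best-effort cleanup when a task exits as cancelled/failed."""
    try:
        backing_stream.cancel()
    finally:
        try:
            compressed_stream.close()
        except Exception as e:  # e.g. a blob store throwing on close
            # Swallow and log: abort is already a failure path, and an
            # error while closing must not turn into a job failure.
            log.warning("ignoring error while closing stream: %s", e)

class FlakyStream:
    def close(self):
        raise OSError("remote store rejected close")

class Backing:
    cancelled = False
    def cancel(self):
        self.cancelled = True

b = Backing()
abort_store(b, FlakyStream())   # does not raise
print(b.cancelled)              # True
```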
[jira] [Assigned] (SPARK-42668) Catch exception while trying to close compressed stream in HDFSStateStoreProvider abort
[ https://issues.apache.org/jira/browse/SPARK-42668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42668: Assignee: (was: Apache Spark) > Catch exception while trying to close compressed stream in > HDFSStateStoreProvider abort > --- > > Key: SPARK-42668 > URL: https://issues.apache.org/jira/browse/SPARK-42668 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Priority: Major > > Catch exception while trying to close compressed stream in > HDFSStateStoreProvider abort > We have seen some cases where the task exits as cancelled/failed which > triggers the abort in the task completion listener for > HDFSStateStoreProvider. As part of this, we cancel the backing stream and > close the compressed stream. However, different stores such as Azure blob > store could throw exceptions which are not caught in the current path, > leading to job failures. This change proposes to fix this issue.
[jira] [Assigned] (SPARK-42668) Catch exception while trying to close compressed stream in HDFSStateStoreProvider abort
[ https://issues.apache.org/jira/browse/SPARK-42668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42668: Assignee: Apache Spark > Catch exception while trying to close compressed stream in > HDFSStateStoreProvider abort > --- > > Key: SPARK-42668 > URL: https://issues.apache.org/jira/browse/SPARK-42668 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Assignee: Apache Spark >Priority: Major > > Catch exception while trying to close compressed stream in > HDFSStateStoreProvider abort > We have seen some cases where the task exits as cancelled/failed which > triggers the abort in the task completion listener for > HDFSStateStoreProvider. As part of this, we cancel the backing stream and > close the compressed stream. However, different stores such as Azure blob > store could throw exceptions which are not caught in the current path, > leading to job failures. This change proposes to fix this issue.
[jira] [Created] (SPARK-42668) Catch exception while trying to close compressed stream in HDFSStateStoreProvider abort
Anish Shrigondekar created SPARK-42668: -- Summary: Catch exception while trying to close compressed stream in HDFSStateStoreProvider abort Key: SPARK-42668 URL: https://issues.apache.org/jira/browse/SPARK-42668 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 3.4.0 Reporter: Anish Shrigondekar Catch exception while trying to close compressed stream in HDFSStateStoreProvider abort We have seen some cases where the task exits as cancelled/failed which triggers the abort in the task completion listener for HDFSStateStoreProvider. As part of this, we cancel the backing stream and close the compressed stream. However, different stores such as Azure blob store could throw exceptions which are not caught in the current path, leading to job failures. This change proposes to fix this issue. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42668) Catch exception while trying to close compressed stream in HDFSStateStoreProvider abort
[ https://issues.apache.org/jira/browse/SPARK-42668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696368#comment-17696368 ] Anish Shrigondekar commented on SPARK-42668: Will send out the fix soon cc - [~kabhwan] > Catch exception while trying to close compressed stream in > HDFSStateStoreProvider abort > --- > > Key: SPARK-42668 > URL: https://issues.apache.org/jira/browse/SPARK-42668 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Priority: Major > > Catch exception while trying to close compressed stream in > HDFSStateStoreProvider abort > We have seen some cases where the task exits as cancelled/failed which > triggers the abort in the task completion listener for > HDFSStateStoreProvider. As part of this, we cancel the backing stream and > close the compressed stream. However, different stores such as Azure blob > store could throw exceptions which are not caught in the current path, > leading to job failures. This change proposes to fix this issue. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
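[Editor's note] The fix SPARK-42668 describes amounts to a best-effort close in the abort path: an exception thrown while closing the compressed stream (for example by an Azure blob store client) is caught and logged rather than allowed to escape the task completion listener and fail the job. A minimal Python sketch of the pattern — `FlakyStream` and `best_effort_close` are hypothetical names for illustration, not Spark APIs; the real change lives in HDFSStateStoreProvider's Scala code:

```python
import logging

logger = logging.getLogger(__name__)

class FlakyStream:
    """Stand-in for a compressed output stream whose close() may fail,
    e.g. when the backing blob store raises on a flush-at-close."""
    def __init__(self, fail_on_close):
        self.fail_on_close = fail_on_close
        self.closed = False

    def close(self):
        if self.fail_on_close:
            raise IOError("connection reset by backing store")
        self.closed = True

def best_effort_close(stream):
    """Close the stream, logging (not propagating) any exception, so an
    abort running in a task-completion listener cannot fail the job."""
    try:
        stream.close()
        return True
    except Exception as exc:
        logger.warning("ignoring exception while closing stream: %s", exc)
        return False

# abort path: neither call raises, even when close() itself throws
ok = best_effort_close(FlakyStream(fail_on_close=False))
bad = best_effort_close(FlakyStream(fail_on_close=True))
```

The point is only the catch-and-log contract; whether to log, count a metric, or re-raise on fatal errors is a policy choice left to the actual implementation.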
[jira] [Commented] (SPARK-42667) Spark Connect: newSession API
[ https://issues.apache.org/jira/browse/SPARK-42667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696350#comment-17696350 ] Apache Spark commented on SPARK-42667: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/40272 > Spark Connect: newSession API > - > > Key: SPARK-42667 > URL: https://issues.apache.org/jira/browse/SPARK-42667 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.1 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42667) Spark Connect: newSession API
[ https://issues.apache.org/jira/browse/SPARK-42667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42667: Assignee: Rui Wang (was: Apache Spark) > Spark Connect: newSession API > - > > Key: SPARK-42667 > URL: https://issues.apache.org/jira/browse/SPARK-42667 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.1 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42667) Spark Connect: newSession API
[ https://issues.apache.org/jira/browse/SPARK-42667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42667: Assignee: Apache Spark (was: Rui Wang) > Spark Connect: newSession API > - > > Key: SPARK-42667 > URL: https://issues.apache.org/jira/browse/SPARK-42667 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.1 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42667) Spark Connect: newSession API
Rui Wang created SPARK-42667: Summary: Spark Connect: newSession API Key: SPARK-42667 URL: https://issues.apache.org/jira/browse/SPARK-42667 Project: Spark Issue Type: Task Components: Connect Affects Versions: 3.4.1 Reporter: Rui Wang Assignee: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42666) Fix `createDataFrame` to work properly with rows and schema
[ https://issues.apache.org/jira/browse/SPARK-42666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-42666: Summary: Fix `createDataFrame` to work properly with rows and schema (was: Fix `createDataFrame` to work properly) > Fix `createDataFrame` to work properly with rows and schema > --- > > Key: SPARK-42666 > URL: https://issues.apache.org/jira/browse/SPARK-42666 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > The code below is not working properly in Spark Connect: > {code:java} > >>> sdf = spark.range(10) > >>> spark.createDataFrame(sdf.tail(5), sdf.schema) > Traceback (most recent call last): > File "", line 1, in > File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 94, in > __repr__ > return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes)) > File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 162, in > dtypes > return [(str(f.name), f.dataType.simpleString()) for f in > self.schema.fields] > File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1346, in > schema > self._schema = self._session.client.schema(query) > File "/.../spark/python/pyspark/sql/connect/client.py", line 614, in schema > proto_schema = self._analyze(method="schema", plan=plan).schema > File "/.../spark/python/pyspark/sql/connect/client.py", line 755, in > _analyze > self._handle_error(rpc_error) > File "/.../spark/python/pyspark/sql/connect/client.py", line 894, in > _handle_error > raise convert_exception(info, status.message) from None > pyspark.errors.exceptions.connect.AnalysisException: > [NULLABLE_COLUMN_OR_FIELD] Column or field `id` is nullable while it's > required to be non-nullable.{code} > whereas working properly in regular PySpark: > {code:java} > >>> sdf = spark.range(10) > >>> spark.createDataFrame(sdf.tail(5), sdf.schema).show() > +---+ > | id| > +---+ > | 5| > | 6| > | 7| > | 8| > | 9| > +---+ 
{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42666) Fix `createDataFrame` to work properly
[ https://issues.apache.org/jira/browse/SPARK-42666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-42666: Summary: Fix `createDataFrame` to work properly (was: Fix `tail` to work properly) > Fix `createDataFrame` to work properly > -- > > Key: SPARK-42666 > URL: https://issues.apache.org/jira/browse/SPARK-42666 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > The code below is not working properly in Spark Connect: > {code:java} > >>> sdf = spark.range(10) > >>> spark.createDataFrame(sdf.tail(5), sdf.schema) > Traceback (most recent call last): > File "", line 1, in > File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 94, in > __repr__ > return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes)) > File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 162, in > dtypes > return [(str(f.name), f.dataType.simpleString()) for f in > self.schema.fields] > File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1346, in > schema > self._schema = self._session.client.schema(query) > File "/.../spark/python/pyspark/sql/connect/client.py", line 614, in schema > proto_schema = self._analyze(method="schema", plan=plan).schema > File "/.../spark/python/pyspark/sql/connect/client.py", line 755, in > _analyze > self._handle_error(rpc_error) > File "/.../spark/python/pyspark/sql/connect/client.py", line 894, in > _handle_error > raise convert_exception(info, status.message) from None > pyspark.errors.exceptions.connect.AnalysisException: > [NULLABLE_COLUMN_OR_FIELD] Column or field `id` is nullable while it's > required to be non-nullable.{code} > whereas working properly in regular PySpark: > {code:java} > >>> sdf = spark.range(10) > >>> spark.createDataFrame(sdf.tail(5), sdf.schema).show() > +---+ > | id| > +---+ > | 5| > | 6| > | 7| > | 8| > | 9| > +---+ {code} -- This message was sent by Atlassian Jira 
(v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42662) Support `withSequenceColumn` as PySpark DataFrame internal function.
[ https://issues.apache.org/jira/browse/SPARK-42662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696317#comment-17696317 ] Apache Spark commented on SPARK-42662: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/40270 > Support `withSequenceColumn` as PySpark DataFrame internal function. > > > Key: SPARK-42662 > URL: https://issues.apache.org/jira/browse/SPARK-42662 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark, PySpark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Turn `withSequenceColumn` into PySpark internal API to support the > distributed-sequence index of the pandas API on Spark in Spark Connect as > well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42666) Fix `tail` to work properly
Haejoon Lee created SPARK-42666: --- Summary: Fix `tail` to work properly Key: SPARK-42666 URL: https://issues.apache.org/jira/browse/SPARK-42666 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.5.0 Reporter: Haejoon Lee The code below is not working properly in Spark Connect: {code:java} >>> sdf = spark.range(10) >>> spark.createDataFrame(sdf.tail(5), sdf.schema) Traceback (most recent call last): File "", line 1, in File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 94, in __repr__ return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes)) File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 162, in dtypes return [(str(f.name), f.dataType.simpleString()) for f in self.schema.fields] File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1346, in schema self._schema = self._session.client.schema(query) File "/.../spark/python/pyspark/sql/connect/client.py", line 614, in schema proto_schema = self._analyze(method="schema", plan=plan).schema File "/.../spark/python/pyspark/sql/connect/client.py", line 755, in _analyze self._handle_error(rpc_error) File "/.../spark/python/pyspark/sql/connect/client.py", line 894, in _handle_error raise convert_exception(info, status.message) from None pyspark.errors.exceptions.connect.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or field `id` is nullable while it's required to be non-nullable.{code} whereas working properly in regular PySpark: {code:java} >>> sdf = spark.range(10) >>> spark.createDataFrame(sdf.tail(5), sdf.schema).show() +---+ | id| +---+ | 5| | 6| | 7| | 8| | 9| +---+ {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
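[Editor's note] The `[NULLABLE_COLUMN_OR_FIELD]` error above indicates the Connect server validates the supplied rows against the declared schema and rejects a nullability mismatch on `id`. As a rough illustration of the class of check that produces this error — not Spark's actual implementation; `Field` and `check_rows_against_schema` are hypothetical names — the rule looks like:

```python
from dataclasses import dataclass

@dataclass
class Field:
    """Minimal stand-in for a struct field: just a name and nullability."""
    name: str
    nullable: bool

def check_rows_against_schema(rows, fields):
    """Reject a None in any column whose declared field is non-nullable,
    roughly the validation behind [NULLABLE_COLUMN_OR_FIELD]."""
    for row in rows:
        for value, field in zip(row, fields):
            if value is None and not field.nullable:
                raise ValueError(
                    f"Column or field `{field.name}` is nullable while "
                    "it's required to be non-nullable.")
    return True

schema = [Field("id", nullable=False)]
check_rows_against_schema([(5,), (6,), (7,)], schema)  # passes: no NULLs
try:
    check_rows_against_schema([(None,)], schema)
except ValueError as e:
    print(e)
```

In the bug report the rows from `sdf.tail(5)` contain no NULLs at all, so the check should pass as it does in regular PySpark; the Connect failure suggests the client/server path compares declared against inferred nullability rather than against the actual values.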
[jira] [Assigned] (SPARK-42662) Support `withSequenceColumn` as PySpark DataFrame internal function.
[ https://issues.apache.org/jira/browse/SPARK-42662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42662: Assignee: (was: Apache Spark) > Support `withSequenceColumn` as PySpark DataFrame internal function. > > > Key: SPARK-42662 > URL: https://issues.apache.org/jira/browse/SPARK-42662 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark, PySpark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Turn `withSequenceColumn` into PySpark internal API to support the > distributed-sequence index of the pandas API on Spark in Spark Connect as > well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42662) Support `withSequenceColumn` as PySpark DataFrame internal function.
[ https://issues.apache.org/jira/browse/SPARK-42662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696314#comment-17696314 ] Apache Spark commented on SPARK-42662: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/40270 > Support `withSequenceColumn` as PySpark DataFrame internal function. > > > Key: SPARK-42662 > URL: https://issues.apache.org/jira/browse/SPARK-42662 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark, PySpark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Turn `withSequenceColumn` into PySpark internal API to support the > distributed-sequence index of the pandas API on Spark in Spark Connect as > well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42662) Support `withSequenceColumn` as PySpark DataFrame internal function.
[ https://issues.apache.org/jira/browse/SPARK-42662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42662: Assignee: Apache Spark > Support `withSequenceColumn` as PySpark DataFrame internal function. > > > Key: SPARK-42662 > URL: https://issues.apache.org/jira/browse/SPARK-42662 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark, PySpark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > Turn `withSequenceColumn` into PySpark internal API to support the > distributed-sequence index of the pandas API on Spark in Spark Connect as > well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42665) `simple udf` test failed using Maven
[ https://issues.apache.org/jira/browse/SPARK-42665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42665: - Description: {code:java} simple udf *** FAILED *** io.grpc.StatusRuntimeException: INTERNAL: org.apache.spark.sql.ClientE2ETestSuite at io.grpc.Status.asRuntimeException(Status.java:535) at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) at org.apache.spark.sql.connect.client.SparkResult.org$apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:61) at org.apache.spark.sql.connect.client.SparkResult.length(SparkResult.scala:106) at org.apache.spark.sql.connect.client.SparkResult.toArray(SparkResult.scala:123) at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2426) at org.apache.spark.sql.Dataset.withResult(Dataset.scala:2747) at org.apache.spark.sql.Dataset.collect(Dataset.scala:2425) at org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$8(ClientE2ETestSuite.scala:85) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) {code} > `simple udf` test failed using Maven > - > > Key: SPARK-42665 > URL: https://issues.apache.org/jira/browse/SPARK-42665 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > {code:java} > simple udf *** FAILED *** > io.grpc.StatusRuntimeException: INTERNAL: > org.apache.spark.sql.ClientE2ETestSuite > at io.grpc.Status.asRuntimeException(Status.java:535) > at > io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) > at > org.apache.spark.sql.connect.client.SparkResult.org$apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:61) > at > org.apache.spark.sql.connect.client.SparkResult.length(SparkResult.scala:106) > at > org.apache.spark.sql.connect.client.SparkResult.toArray(SparkResult.scala:123) > at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2426) > at 
org.apache.spark.sql.Dataset.withResult(Dataset.scala:2747) > at org.apache.spark.sql.Dataset.collect(Dataset.scala:2425) > at > org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$8(ClientE2ETestSuite.scala:85) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42665) `simple udf` test failed using Maven
[ https://issues.apache.org/jira/browse/SPARK-42665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42665: - Attachment: (was: image-2023-03-04-01-41-51-522.png) > `simple udf` test failed using Maven > - > > Key: SPARK-42665 > URL: https://issues.apache.org/jira/browse/SPARK-42665 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > {code:java} > simple udf *** FAILED *** > io.grpc.StatusRuntimeException: INTERNAL: > org.apache.spark.sql.ClientE2ETestSuite > at io.grpc.Status.asRuntimeException(Status.java:535) > at > io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) > at > org.apache.spark.sql.connect.client.SparkResult.org$apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:61) > at > org.apache.spark.sql.connect.client.SparkResult.length(SparkResult.scala:106) > at > org.apache.spark.sql.connect.client.SparkResult.toArray(SparkResult.scala:123) > at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2426) > at org.apache.spark.sql.Dataset.withResult(Dataset.scala:2747) > at org.apache.spark.sql.Dataset.collect(Dataset.scala:2425) > at > org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$8(ClientE2ETestSuite.scala:85) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42665) `simple udf` test failed using Maven
[ https://issues.apache.org/jira/browse/SPARK-42665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42665: - Attachment: image-2023-03-04-01-41-51-522.png > `simple udf` test failed using Maven > - > > Key: SPARK-42665 > URL: https://issues.apache.org/jira/browse/SPARK-42665 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > Attachments: image-2023-03-04-01-41-51-522.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42665) `simple udf` test failed using Maven
Yang Jie created SPARK-42665: Summary: `simple udf` test failed using Maven Key: SPARK-42665 URL: https://issues.apache.org/jira/browse/SPARK-42665 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 3.4.0 Reporter: Yang Jie Attachments: image-2023-03-04-01-41-51-522.png -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42258) pyspark.sql.functions should not expose typing.cast
[ https://issues.apache.org/jira/browse/SPARK-42258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42258: Assignee: Apache Spark > pyspark.sql.functions should not expose typing.cast > --- > > Key: SPARK-42258 > URL: https://issues.apache.org/jira/browse/SPARK-42258 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.1 >Reporter: Furcy Pin >Assignee: Apache Spark >Priority: Minor > > In pyspark, the `pyspark.sql.functions` module imports and exposes the > method `typing.cast`. > This may lead to errors from users that can be hard to spot. > *Example* > It took me a few minutes to understand why the following code: > > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql import functions as f > spark = SparkSession.builder.getOrCreate() > df = spark.sql("""SELECT 1 as a""") > df.withColumn("a", f.cast("STRING", f.col("a"))).printSchema() {code} > which executes without any problem, gives the following result: > > > {code:java} > root > |-- a: integer (nullable = false){code} > This is because `f.cast` here calls `typing.cast`, and the correct syntax is: > {code:java} > df.withColumn("a", f.col("a").cast("STRING")).printSchema(){code} > > which indeed gives: > {code:java} > root > |-- a: string (nullable = false) {code} > *Suggestion of solution* > Option 1: The methods imported in the module `pyspark.sql.functions` could be > obfuscated to prevent this. For instance: > {code:java} > from typing import cast as _cast{code} > Option 2: only import `typing` and replace all occurrences of `cast` with > `typing.cast` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42258) pyspark.sql.functions should not expose typing.cast
[ https://issues.apache.org/jira/browse/SPARK-42258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42258: Assignee: (was: Apache Spark) > pyspark.sql.functions should not expose typing.cast > --- > > Key: SPARK-42258 > URL: https://issues.apache.org/jira/browse/SPARK-42258 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.1 >Reporter: Furcy Pin >Priority: Minor > > In pyspark, the `pyspark.sql.functions` module imports and exposes the > method `typing.cast`. > This may lead to errors from users that can be hard to spot. > *Example* > It took me a few minutes to understand why the following code: > > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql import functions as f > spark = SparkSession.builder.getOrCreate() > df = spark.sql("""SELECT 1 as a""") > df.withColumn("a", f.cast("STRING", f.col("a"))).printSchema() {code} > which executes without any problem, gives the following result: > > > {code:java} > root > |-- a: integer (nullable = false){code} > This is because `f.cast` here calls `typing.cast`, and the correct syntax is: > {code:java} > df.withColumn("a", f.col("a").cast("STRING")).printSchema(){code} > > which indeed gives: > {code:java} > root > |-- a: string (nullable = false) {code} > *Suggestion of solution* > Option 1: The methods imported in the module `pyspark.sql.functions` could be > obfuscated to prevent this. For instance: > {code:java} > from typing import cast as _cast{code} > Option 2: only import `typing` and replace all occurrences of `cast` with > `typing.cast` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42258) pyspark.sql.functions should not expose typing.cast
[ https://issues.apache.org/jira/browse/SPARK-42258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696237#comment-17696237 ] Apache Spark commented on SPARK-42258: -- User 'FurcyPin' has created a pull request for this issue: https://github.com/apache/spark/pull/40271 > pyspark.sql.functions should not expose typing.cast > --- > > Key: SPARK-42258 > URL: https://issues.apache.org/jira/browse/SPARK-42258 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.1 >Reporter: Furcy Pin >Priority: Minor > > In pyspark, the `pyspark.sql.functions` module imports and exposes the > method `typing.cast`. > This may lead to errors from users that can be hard to spot. > *Example* > It took me a few minutes to understand why the following code: > > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql import functions as f > spark = SparkSession.builder.getOrCreate() > df = spark.sql("""SELECT 1 as a""") > df.withColumn("a", f.cast("STRING", f.col("a"))).printSchema() {code} > which executes without any problem, gives the following result: > > > {code:java} > root > |-- a: integer (nullable = false){code} > This is because `f.cast` here calls `typing.cast`, and the correct syntax is: > {code:java} > df.withColumn("a", f.col("a").cast("STRING")).printSchema(){code} > > which indeed gives: > {code:java} > root > |-- a: string (nullable = false) {code} > *Suggestion of solution* > Option 1: The methods imported in the module `pyspark.sql.functions` could be > obfuscated to prevent this. For instance: > {code:java} > from typing import cast as _cast{code} > Option 2: only import `typing` and replace all occurrences of `cast` with > `typing.cast` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42663) Fix `default_session` to work properly
[ https://issues.apache.org/jira/browse/SPARK-42663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-42663: Description: Currently, default_session is not working properly in Spark Connect as below: {code:java} >>> spark = default_session() >>> spark.conf.set("default_index_type", "sequence") >>> spark.conf.get("default_index_type") 'sequence' >>> >>> spark = default_session() >>> spark.conf.get("default_index_type") Traceback (most recent call last): ... pyspark.errors.exceptions.connect.SparkConnectGrpcException: (java.util.NoSuchElementException) default_index_type {code} It should work as expected in regular PySpark as below: {code:java} >>> spark = default_session() >>> spark.conf.set("default_index_type", "sequence") >>> spark.conf.get("default_index_type") 'sequence' >>> >>> spark = default_session() >>> spark.conf.get("default_index_type") 'sequence'{code} was: Currently, default_session is not working properly in Spark Connect as below since `SparkSession.conf.get` is nor working as expected: {code:java} >>> spark = default_session() >>> spark.conf.set("default_index_type", "sequence") >>> spark.conf.get("default_index_type") 'sequence' >>> >>> spark = default_session() >>> spark.conf.get("default_index_type") Traceback (most recent call last): ... 
pyspark.errors.exceptions.connect.SparkConnectGrpcException: (java.util.NoSuchElementException) default_index_type {code} It should work as expected in regular PySpark as below: {code:java} >>> spark = default_session() >>> spark.conf.set("default_index_type", "sequence") >>> spark.conf.get("default_index_type") 'sequence' >>> >>> spark = default_session() >>> spark.conf.get("default_index_type") 'sequence'{code} > Fix `default_session` to work properly > -- > > Key: SPARK-42663 > URL: https://issues.apache.org/jira/browse/SPARK-42663 > Project: Spark > Issue Type: Sub-task > Components: Connect, ps >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Currently, default_session is not working properly in Spark Connect as below: > {code:java} > >>> spark = default_session() > >>> spark.conf.set("default_index_type", "sequence") > >>> spark.conf.get("default_index_type") > 'sequence' > >>> > >>> spark = default_session() > >>> spark.conf.get("default_index_type") > Traceback (most recent call last): > ... > pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (java.util.NoSuchElementException) default_index_type > {code} > It should work as expected in regular PySpark as below: > {code:java} > >>> spark = default_session() > >>> spark.conf.set("default_index_type", "sequence") > >>> spark.conf.get("default_index_type") > 'sequence' > >>> > >>> spark = default_session() > >>> spark.conf.get("default_index_type") > 'sequence'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42663) Fix `default_session` to work properly
[ https://issues.apache.org/jira/browse/SPARK-42663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-42663: Summary: Fix `default_session` to work properly (was: Fix `default_session ` to work properly) > Fix `default_session` to work properly > -- > > Key: SPARK-42663 > URL: https://issues.apache.org/jira/browse/SPARK-42663 > Project: Spark > Issue Type: Sub-task > Components: Connect, ps >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Currently, default_session is not working properly in Spark Connect as below > since `SparkSession.conf.get` is not working as expected: > {code:java} > >>> spark = default_session() > >>> spark.conf.set("default_index_type", "sequence") > >>> spark.conf.get("default_index_type") > 'sequence' > >>> > >>> spark = default_session() > >>> spark.conf.get("default_index_type") > Traceback (most recent call last): > ... > pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (java.util.NoSuchElementException) default_index_type > {code} > It should work as expected in regular PySpark as below: > {code:java} > >>> spark = default_session() > >>> spark.conf.set("default_index_type", "sequence") > >>> spark.conf.get("default_index_type") > 'sequence' > >>> > >>> spark = default_session() > >>> spark.conf.get("default_index_type") > 'sequence'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42663) Fix `default_session ` to work properly
[ https://issues.apache.org/jira/browse/SPARK-42663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-42663: Summary: Fix `default_session ` to work properly (was: Fix `SparkSession.conf.get` to work properly) > Fix `default_session ` to work properly > --- > > Key: SPARK-42663 > URL: https://issues.apache.org/jira/browse/SPARK-42663 > Project: Spark > Issue Type: Sub-task > Components: Connect, ps >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Currently, default_session is not working properly in Spark Connect as below > since `SparkSession.conf.get` is not working as expected: > {code:java} > >>> spark = default_session() > >>> spark.conf.set("default_index_type", "sequence") > >>> spark.conf.get("default_index_type") > 'sequence' > >>> > >>> spark = default_session() > >>> spark.conf.get("default_index_type") > Traceback (most recent call last): > ... > pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (java.util.NoSuchElementException) default_index_type > {code} > It should work as expected in regular PySpark as below: > {code:java} > >>> spark = default_session() > >>> spark.conf.set("default_index_type", "sequence") > >>> spark.conf.get("default_index_type") > 'sequence' > >>> > >>> spark = default_session() > >>> spark.conf.get("default_index_type") > 'sequence'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
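[Editor's note] The expected behaviour in SPARK-42663 — a second `default_session()` call returning the same session, with runtime conf set through the first handle still visible — boils down to caching the session in a module-level variable. A pure-Python sketch of that contract (`FakeSession` and `FakeConf` are stand-ins for illustration, not pandas-on-Spark internals):

```python
_default_session = None  # module-level cache, as in the expected behaviour

class FakeConf:
    """Stand-in for a session's runtime conf."""
    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key):
        if key not in self._store:
            # analogue of java.util.NoSuchElementException in the report
            raise KeyError(key)
        return self._store[key]

class FakeSession:
    def __init__(self):
        self.conf = FakeConf()

def default_session():
    """Return the cached session, creating it on first use, so conf set
    through one handle is visible through every later handle."""
    global _default_session
    if _default_session is None:
        _default_session = FakeSession()
    return _default_session

spark = default_session()
spark.conf.set("default_index_type", "sequence")
assert default_session() is spark
assert default_session().conf.get("default_index_type") == "sequence"
```

The Connect failure in the report is consistent with either the cache returning a fresh session or `conf.get` not reading back what `conf.set` wrote; this sketch only pins down what "working properly" means.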
[jira] [Created] (SPARK-42664) Support bloomFilter for DataFrameStatFunctions
Yang Jie created SPARK-42664: Summary: Support bloomFilter for DataFrameStatFunctions Key: SPARK-42664 URL: https://issues.apache.org/jira/browse/SPARK-42664 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.4.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
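The `bloomFilter` stat function builds a Bloom filter over a column's values. The core data structure can be sketched in pure Python as below (a simplified illustration of how a Bloom filter works, not Spark's actual implementation):

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: k salted hash probes into a bit array."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k positions from salted SHA-256 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def put(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

Note that `might_contain` can return false positives but never false negatives, which is what makes the structure useful as a pre-filter.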
[jira] [Updated] (SPARK-42663) Fix `SparkSession.conf.get` to work properly
[ https://issues.apache.org/jira/browse/SPARK-42663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-42663: Description: Currently, default_session is not working properly in Spark Connect as below since `SparkSession.conf.get` is not working as expected: {code:java} >>> spark = default_session() >>> spark.conf.set("default_index_type", "sequence") >>> spark.conf.get("default_index_type") 'sequence' >>> >>> spark = default_session() >>> spark.conf.get("default_index_type") Traceback (most recent call last): ... pyspark.errors.exceptions.connect.SparkConnectGrpcException: (java.util.NoSuchElementException) default_index_type {code} It should work as expected in regular PySpark as below: {code:java} >>> spark = default_session() >>> spark.conf.set("default_index_type", "sequence") >>> spark.conf.get("default_index_type") 'sequence' >>> >>> spark = default_session() >>> spark.conf.get("default_index_type") 'sequence'{code} was: Currently, default_session is not working properly in Spark Connect: {code:java} >>> spark = default_session() >>> spark.conf.set("default_index_type", "sequence") >>> spark.conf.get("default_index_type") 'sequence' >>> >>> spark = default_session() >>> spark.conf.get("default_index_type") Traceback (most recent call last): ... 
pyspark.errors.exceptions.connect.SparkConnectGrpcException: (java.util.NoSuchElementException) default_index_type {code} It should work as expected in regular PySpark as below: {code:java} >>> spark = default_session() >>> spark.conf.set("default_index_type", "sequence") >>> spark.conf.get("default_index_type") 'sequence' >>> >>> spark = default_session() >>> spark.conf.get("default_index_type") 'sequence'{code} > Fix `SparkSession.conf.get` to work properly > > > Key: SPARK-42663 > URL: https://issues.apache.org/jira/browse/SPARK-42663 > Project: Spark > Issue Type: Sub-task > Components: Connect, ps >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Currently, default_session is not working properly in Spark Connect as below > since `SparkSession.conf.get` is not working as expected: > {code:java} > >>> spark = default_session() > >>> spark.conf.set("default_index_type", "sequence") > >>> spark.conf.get("default_index_type") > 'sequence' > >>> > >>> spark = default_session() > >>> spark.conf.get("default_index_type") > Traceback (most recent call last): > ... > pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (java.util.NoSuchElementException) default_index_type > {code} > It should work as expected in regular PySpark as below: > {code:java} > >>> spark = default_session() > >>> spark.conf.set("default_index_type", "sequence") > >>> spark.conf.get("default_index_type") > 'sequence' > >>> > >>> spark = default_session() > >>> spark.conf.get("default_index_type") > 'sequence'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42663) Fix `SparkSession.conf.get` to work properly
[ https://issues.apache.org/jira/browse/SPARK-42663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-42663: Summary: Fix `SparkSession.conf.get` to work properly (was: Fix default_session to work properly in Spark Connect) > Fix `SparkSession.conf.get` to work properly > > > Key: SPARK-42663 > URL: https://issues.apache.org/jira/browse/SPARK-42663 > Project: Spark > Issue Type: Sub-task > Components: Connect, ps >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Currently, default_session is not working properly in Spark Connect: > {code:java} > >>> spark = default_session() > >>> spark.conf.set("default_index_type", "sequence") > >>> spark.conf.get("default_index_type") > 'sequence' > >>> > >>> spark = default_session() > >>> spark.conf.get("default_index_type") > Traceback (most recent call last): > ... > pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (java.util.NoSuchElementException) default_index_type > {code} > It should work as expected in regular PySpark as below: > {code:java} > >>> spark = default_session() > >>> spark.conf.set("default_index_type", "sequence") > >>> spark.conf.get("default_index_type") > 'sequence' > >>> > >>> spark = default_session() > >>> spark.conf.get("default_index_type") > 'sequence'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42663) Fix default_session to work properly in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-42663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-42663: Description: Currently, default_session is not working properly in Spark Connect: {code:java} >>> spark = default_session() >>> spark.conf.set("default_index_type", "sequence") >>> spark.conf.get("default_index_type") 'sequence' >>> >>> spark = default_session() >>> spark.conf.get("default_index_type") Traceback (most recent call last): ... pyspark.errors.exceptions.connect.SparkConnectGrpcException: (java.util.NoSuchElementException) default_index_type {code} It should work as expected in regular PySpark as below: {code:java} >>> spark = default_session() >>> spark.conf.set("default_index_type", "sequence") >>> spark.conf.get("default_index_type") 'sequence' >>> >>> spark = default_session() >>> spark.conf.get("default_index_type") 'sequence'{code} was: Currently, default_session is not working properly in Spark Connect: {code:java} >>> spark = default_session() >>> spark.conf.set("default_index_type", "sequence") >>> spark.conf.get("default_index_type") 'sequence' >>> >>> spark = default_session() >>> spark.conf.get("default_index_type") Traceback (most recent call last): ... 
pyspark.errors.exceptions.connect.SparkConnectGrpcException: (java.util.NoSuchElementException) default_index_type {code} It should work as expected in regular PySpark as below: {code:java} >>> spark = default_session() >>> spark.conf.set("default_index_type", "sequence") >>> spark.conf.get("default_index_type") 'sequence' >>> >>> spark = default_session() >>> spark.conf.get("default_index_type") 'sequence'{code} > Fix default_session to work properly in Spark Connect > - > > Key: SPARK-42663 > URL: https://issues.apache.org/jira/browse/SPARK-42663 > Project: Spark > Issue Type: Sub-task > Components: Connect, ps >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Currently, default_session is not working properly in Spark Connect: > {code:java} > >>> spark = default_session() > >>> spark.conf.set("default_index_type", "sequence") > >>> spark.conf.get("default_index_type") > 'sequence' > >>> > >>> spark = default_session() > >>> spark.conf.get("default_index_type") > Traceback (most recent call last): > ... > pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (java.util.NoSuchElementException) default_index_type > {code} > It should work as expected in regular PySpark as below: > {code:java} > >>> spark = default_session() > >>> spark.conf.set("default_index_type", "sequence") > >>> spark.conf.get("default_index_type") > 'sequence' > >>> > >>> spark = default_session() > >>> spark.conf.get("default_index_type") > 'sequence'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42663) Fix default_session to work properly in Spark Connect
Haejoon Lee created SPARK-42663: --- Summary: Fix default_session to work properly in Spark Connect Key: SPARK-42663 URL: https://issues.apache.org/jira/browse/SPARK-42663 Project: Spark Issue Type: Sub-task Components: Connect, ps Affects Versions: 3.5.0 Reporter: Haejoon Lee Currently, default_session is not working properly in Spark Connect: >>> spark = default_session() >>> spark.conf.set("default_index_type", "sequence") >>> spark.conf.get("default_index_type") 'sequence' >>> >>> spark = default_session() >>> spark.conf.get("default_index_type") Traceback (most recent call last): ... pyspark.errors.exceptions.connect.SparkConnectGrpcException: (java.util.NoSuchElementException) default_index_type It should work as expected in regular PySpark as below: >>> spark = default_session() >>> spark.conf.set("default_index_type", "sequence") >>> spark.conf.get("default_index_type") 'sequence' >>> >>> spark = default_session() >>> spark.conf.get("default_index_type") 'sequence' -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
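To illustrate the expected behavior, the sketch below models a config store shared across `default_session()` handles, so a value set through one handle stays visible through a later one (all names here are hypothetical stand-ins for illustration, not PySpark's implementation):

```python
class SessionConf:
    """Toy runtime conf: a store shared across session handles."""

    _shared: dict = {}  # survives across default_session() calls

    def set(self, key: str, value: str) -> None:
        SessionConf._shared[key] = value

    def get(self, key: str, default=None):
        if key in SessionConf._shared:
            return SessionConf._shared[key]
        if default is not None:
            return default
        # Mirrors the NoSuchElementException seen in the report.
        raise KeyError(key)


def default_session():
    class Session:
        conf = SessionConf()

    return Session()
```

With this shape, the second `default_session()` call in the reproduction would return `'sequence'` instead of raising.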
[jira] [Updated] (SPARK-42663) Fix default_session to work properly in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-42663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-42663: Description: Currently, default_session is not working properly in Spark Connect: {code:java} >>> spark = default_session() >>> spark.conf.set("default_index_type", "sequence") >>> spark.conf.get("default_index_type") 'sequence' >>> >>> spark = default_session() >>> spark.conf.get("default_index_type") Traceback (most recent call last): ... pyspark.errors.exceptions.connect.SparkConnectGrpcException: (java.util.NoSuchElementException) default_index_type {code} It should work as expected in regular PySpark as below: {code:java} >>> spark = default_session() >>> spark.conf.set("default_index_type", "sequence") >>> spark.conf.get("default_index_type") 'sequence' >>> >>> spark = default_session() >>> spark.conf.get("default_index_type") 'sequence'{code} was: Currently, default_session is not working properly in Spark Connect: >>> spark = default_session() >>> spark.conf.set("default_index_type", "sequence") >>> spark.conf.get("default_index_type") 'sequence' >>> >>> spark = default_session() >>> spark.conf.get("default_index_type") Traceback (most recent call last): ... 
pyspark.errors.exceptions.connect.SparkConnectGrpcException: (java.util.NoSuchElementException) default_index_type It should work as expected in regular PySpark as below: >>> spark = default_session() >>> spark.conf.set("default_index_type", "sequence") >>> spark.conf.get("default_index_type") 'sequence' >>> >>> spark = default_session() >>> spark.conf.get("default_index_type") 'sequence' > Fix default_session to work properly in Spark Connect > - > > Key: SPARK-42663 > URL: https://issues.apache.org/jira/browse/SPARK-42663 > Project: Spark > Issue Type: Sub-task > Components: Connect, ps >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Currently, default_session is not working properly in Spark Connect: > > {code:java} > >>> spark = default_session() > >>> spark.conf.set("default_index_type", "sequence") > >>> spark.conf.get("default_index_type") > 'sequence' > >>> > >>> spark = default_session() > >>> spark.conf.get("default_index_type") > Traceback (most recent call last): > ... > pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (java.util.NoSuchElementException) default_index_type > {code} > It should work as expected in regular PySpark as below: > {code:java} > >>> spark = default_session() > >>> spark.conf.set("default_index_type", "sequence") > >>> spark.conf.get("default_index_type") > 'sequence' > >>> > >>> spark = default_session() > >>> spark.conf.get("default_index_type") > 'sequence'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42497) Support of pandas API on Spark for Spark Connect.
[ https://issues.apache.org/jira/browse/SPARK-42497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42497: Assignee: (was: Apache Spark) > Support of pandas API on Spark for Spark Connect. > - > > Key: SPARK-42497 > URL: https://issues.apache.org/jira/browse/SPARK-42497 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > We should enable `pandas API on Spark` on Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42497) Support of pandas API on Spark for Spark Connect.
[ https://issues.apache.org/jira/browse/SPARK-42497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696206#comment-17696206 ] Apache Spark commented on SPARK-42497: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/40270 > Support of pandas API on Spark for Spark Connect. > - > > Key: SPARK-42497 > URL: https://issues.apache.org/jira/browse/SPARK-42497 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > We should enable `pandas API on Spark` on Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42497) Support of pandas API on Spark for Spark Connect.
[ https://issues.apache.org/jira/browse/SPARK-42497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42497: Assignee: Apache Spark > Support of pandas API on Spark for Spark Connect. > - > > Key: SPARK-42497 > URL: https://issues.apache.org/jira/browse/SPARK-42497 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > We should enable `pandas API on Spark` on Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42662) Support `withSequenceColumn` as PySpark DataFrame internal function.
Haejoon Lee created SPARK-42662: --- Summary: Support `withSequenceColumn` as PySpark DataFrame internal function. Key: SPARK-42662 URL: https://issues.apache.org/jira/browse/SPARK-42662 Project: Spark Issue Type: Sub-task Components: Connect, Pandas API on Spark, PySpark Affects Versions: 3.5.0 Reporter: Haejoon Lee Turn `withSequenceColumn` into PySpark internal API to support the distributed-sequence index of the pandas API on Spark in Spark Connect as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
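The distributed-sequence index that `withSequenceColumn` supports can be sketched as: count rows per partition, prefix-sum the counts into per-partition offsets, then tag each row with offset plus local position (a simplification for illustration; the real implementation works on Spark internals):

```python
from itertools import accumulate


def with_sequence_column(partitions):
    """Assign globally contiguous ids across partitions via prefix sums."""
    counts = [len(p) for p in partitions]
    # Offset of each partition = total rows in all earlier partitions.
    offsets = [0] + list(accumulate(counts))[:-1]
    return [
        [(offset + i, row) for i, row in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]
```

The point of the prefix-sum step is that ids stay contiguous without any cross-partition coordination beyond the per-partition counts.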
[jira] [Commented] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696177#comment-17696177 ] Apache Spark commented on SPARK-42500: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/40268 > ConstantPropagation support more cases > -- > > Key: SPARK-42500 > URL: https://issues.apache.org/jira/browse/SPARK-42500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696176#comment-17696176 ] Apache Spark commented on SPARK-42500: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/40268 > ConstantPropagation support more cases > -- > > Key: SPARK-42500 > URL: https://issues.apache.org/jira/browse/SPARK-42500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42648) Upgrade versions-maven-plugin to 2.15.0
[ https://issues.apache.org/jira/browse/SPARK-42648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-42648: - Priority: Trivial (was: Major) > Upgrade versions-maven-plugin to 2.15.0 > --- > > Key: SPARK-42648 > URL: https://issues.apache.org/jira/browse/SPARK-42648 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Trivial > Fix For: 3.5.0 > > > https://github.com/mojohaus/versions/releases/tag/2.15.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42648) Upgrade versions-maven-plugin to 2.15.0
[ https://issues.apache.org/jira/browse/SPARK-42648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-42648: Assignee: Yang Jie > Upgrade versions-maven-plugin to 2.15.0 > --- > > Key: SPARK-42648 > URL: https://issues.apache.org/jira/browse/SPARK-42648 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > https://github.com/mojohaus/versions/releases/tag/2.15.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42648) Upgrade versions-maven-plugin to 2.15.0
[ https://issues.apache.org/jira/browse/SPARK-42648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42648. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40248 [https://github.com/apache/spark/pull/40248] > Upgrade versions-maven-plugin to 2.15.0 > --- > > Key: SPARK-42648 > URL: https://issues.apache.org/jira/browse/SPARK-42648 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > > https://github.com/mojohaus/versions/releases/tag/2.15.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42653) Artifact transfer from Scala/JVM client to Server
[ https://issues.apache.org/jira/browse/SPARK-42653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696144#comment-17696144 ] Apache Spark commented on SPARK-42653: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/40267 > Artifact transfer from Scala/JVM client to Server > - > > Key: SPARK-42653 > URL: https://issues.apache.org/jira/browse/SPARK-42653 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Assignee: Venkata Sai Akhil Gudesa >Priority: Major > Fix For: 3.4.1 > > > In the decoupled client-server architecture of Spark Connect, a remote client > may use a local JAR or a new class in their UDF that may not be present on > the server. To handle these cases of missing "artifacts", we need to > implement a mechanism to transfer artifacts from the client side over to the > server side as per the protocol defined in > https://github.com/apache/spark/pull/40147 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42653) Artifact transfer from Scala/JVM client to Server
[ https://issues.apache.org/jira/browse/SPARK-42653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42653. --- Fix Version/s: 3.4.1 Assignee: Venkata Sai Akhil Gudesa Resolution: Fixed > Artifact transfer from Scala/JVM client to Server > - > > Key: SPARK-42653 > URL: https://issues.apache.org/jira/browse/SPARK-42653 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Assignee: Venkata Sai Akhil Gudesa >Priority: Major > Fix For: 3.4.1 > > > In the decoupled client-server architecture of Spark Connect, a remote client > may use a local JAR or a new class in their UDF that may not be present on > the server. To handle these cases of missing "artifacts", we need to > implement a mechanism to transfer artifacts from the client side over to the > server side as per the protocol defined in > https://github.com/apache/spark/pull/40147 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42558) Implement DataFrameStatFunctions
[ https://issues.apache.org/jira/browse/SPARK-42558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696123#comment-17696123 ] Herman van Hövell commented on SPARK-42558: --- [~LuciferYang] that is fine for now. We can add support for BloomFilters and CMS later. > Implement DataFrameStatFunctions > > > Key: SPARK-42558 > URL: https://issues.apache.org/jira/browse/SPARK-42558 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > Implement DataFrameStatFunctions for connect, and hook it up to Dataset. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42661) CSV Reader - multiline without quoted fields
[ https://issues.apache.org/jira/browse/SPARK-42661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian FERREIRA updated SPARK-42661: - Attachment: Capture d’écran 2023-03-03 à 12.18.07.png > CSV Reader - multiline without quoted fields > > > Key: SPARK-42661 > URL: https://issues.apache.org/jira/browse/SPARK-42661 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.1 > Environment: unquoted data > {code} > NAME,Address,CITY > Atlassian,Level 6 341 George Street > Sydney NSW 2000 Australia,Sydney > Github,88 Colin P Kelly Junior Street > San Francisco CA 94107 USA,San Francisco > {code} > quoted data: > {code} > "NAME","Address","CITY" > "Atlassian","Level 6 341 George Street > Sydney NSW 2000 Australia","Sydney" > "Github","88 Colin P Kelly Junior Street > San Francisco CA 94107 USA","San Francisco" > {code} >Reporter: Florian FERREIRA >Priority: Minor > Attachments: Capture d’écran 2023-03-03 à 12.18.07.png > > > Hello, > We are facing an issue with the CSV format. > When we try to read a "multiline file without quoted fields" the result is > not as expected. > With quoted fields, all is OK (cf. the screenshot). > You can reproduce it easily with this code (just replace the file path): > {code:java} > spark.read.options(Map( > "multiline" -> "true", > "quote" -> "", > "header" -> "true", > )).csv("/Users/fferreira/correct_multiline.csv").show(false) > spark.read.options(Map( > "multiline" -> "true", > "header" -> "true", > )).csv("/Users/fferreira/correct_multiline_with_quote.csv").show(false) > {code} > We continue to investigate on our side. > Thank you. > !image-2023-03-03-12-11-21-258.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42661) CSV Reader - multiline without quoted fields
Florian FERREIRA created SPARK-42661: Summary: CSV Reader - multiline without quoted fields Key: SPARK-42661 URL: https://issues.apache.org/jira/browse/SPARK-42661 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.1 Environment: unquoted data {code} NAME,Address,CITY Atlassian,Level 6 341 George Street Sydney NSW 2000 Australia,Sydney Github,88 Colin P Kelly Junior Street San Francisco CA 94107 USA,San Francisco {code} quoted data: {code} "NAME","Address","CITY" "Atlassian","Level 6 341 George Street Sydney NSW 2000 Australia","Sydney" "Github","88 Colin P Kelly Junior Street San Francisco CA 94107 USA","San Francisco" {code} Reporter: Florian FERREIRA Hello, We are facing an issue with the CSV format. When we try to read a "multiline file without quoted fields" the result is not as expected. With quoted fields, all is OK (cf. the screenshot). You can reproduce it easily with this code (just replace the file path): {code:java} spark.read.options(Map( "multiline" -> "true", "quote" -> "", "header" -> "true", )).csv("/Users/fferreira/correct_multiline.csv").show(false) spark.read.options(Map( "multiline" -> "true", "header" -> "true", )).csv("/Users/fferreira/correct_multiline_with_quote.csv").show(false) {code} We continue to investigate on our side. Thank you. !image-2023-03-03-12-11-21-258.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42661) CSV Reader - multiline without quoted fields
[ https://issues.apache.org/jira/browse/SPARK-42661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian FERREIRA updated SPARK-42661: - Priority: Minor (was: Major) > CSV Reader - multiline without quoted fields > > > Key: SPARK-42661 > URL: https://issues.apache.org/jira/browse/SPARK-42661 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.1 > Environment: unquoted data > {code} > NAME,Address,CITY > Atlassian,Level 6 341 George Street > Sydney NSW 2000 Australia,Sydney > Github,88 Colin P Kelly Junior Street > San Francisco CA 94107 USA,San Francisco > {code} > quoted data: > {code} > "NAME","Address","CITY" > "Atlassian","Level 6 341 George Street > Sydney NSW 2000 Australia","Sydney" > "Github","88 Colin P Kelly Junior Street > San Francisco CA 94107 USA","San Francisco" > {code} >Reporter: Florian FERREIRA >Priority: Minor > > Hello, > We are facing an issue with the CSV format. > When we try to read a "multiline file without quoted fields" the result is > not as expected. > With quoted fields, all is OK (cf. the screenshot). > You can reproduce it easily with this code (just replace the file path): > {code:java} > spark.read.options(Map( > "multiline" -> "true", > "quote" -> "", > "header" -> "true", > )).csv("/Users/fferreira/correct_multiline.csv").show(false) > spark.read.options(Map( > "multiline" -> "true", > "header" -> "true", > )).csv("/Users/fferreira/correct_multiline_with_quote.csv").show(false) > {code} > We continue to investigate on our side. > Thank you. > !image-2023-03-03-12-11-21-258.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42557) Add Broadcast to functions
[ https://issues.apache.org/jira/browse/SPARK-42557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696097#comment-17696097 ] jiaan.geng commented on SPARK-42557: I will take a look! > Add Broadcast to functions > -- > > Key: SPARK-42557 > URL: https://issues.apache.org/jira/browse/SPARK-42557 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > Add the {{broadcast}} function to functions.scala. Please check if we can get > the same semantics as the current implementation using unresolved hints. > https://github.com/apache/spark/blame/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1246-L1261 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-42555) Add JDBC to DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-42555 ] jiaan.geng deleted comment on SPARK-42555: was (Author: beliefer): I will take a look! > Add JDBC to DataFrameReader > --- > > Key: SPARK-42555 > URL: https://issues.apache.org/jira/browse/SPARK-42555 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-42556) Dataset.colregex should link a plan_id when it only matches a single column.
[ https://issues.apache.org/jira/browse/SPARK-42556 ] jiaan.geng deleted comment on SPARK-42556: was (Author: beliefer): I'm working on. > Dataset.colregex should link a plan_id when it only matches a single column. > > > Key: SPARK-42556 > URL: https://issues.apache.org/jira/browse/SPARK-42556 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > When colregex returns a single column it should link the plans plan_id. For > reference here is the non-connect Dataset code that does this: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1512] > This also needs to be fixed for the Python client. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org