[jira] [Updated] (SPARK-44233) Support an outer outer context in subquery resolution
[ https://issues.apache.org/jira/browse/SPARK-44233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin updated SPARK-44233:
----------------------------------
    Description:
{code:java}
>>> sql("select * from range(8) t, lateral (select * from t) s")
Traceback (most recent call last):
...
pyspark.errors.exceptions.captured.AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `t` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.; line 1 pos 49;
'Project [*]
+- 'LateralJoin lateral-subquery#0 [], Inner
   :  +- 'SubqueryAlias s
   :     +- 'Project [*]
   :        +- 'UnresolvedRelation [t], [], false
   +- SubqueryAlias t
      +- Range (0, 8, step=1, splits=None)
{code}

> Support an outer outer context in subquery resolution
> -----------------------------------------------------
>
>                 Key: SPARK-44233
>                 URL: https://issues.apache.org/jira/browse/SPARK-44233
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Takuya Ueshin
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
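The failure above is a name-resolution problem: the lateral subquery's scope does not chain back to the enclosing query's scope, so `t` is unresolvable. As a toy illustration only (this is not Spark's analyzer, and `Scope` is a hypothetical class invented here), resolution with an outer context can be sketched as a walk up a chain of scopes:

```python
# Toy sketch of name resolution with an "outer" scope chain.
# Not Spark internals; Scope and its fields are hypothetical.

class Scope:
    def __init__(self, relations, outer=None):
        self.relations = relations  # relation names visible at this level
        self.outer = outer          # enclosing ("outer") scope, if any

    def resolve(self, name):
        scope = self
        while scope is not None:
            if name in scope.relations:
                return scope.relations[name]
            scope = scope.outer  # fall back to the enclosing context
        raise LookupError(f"TABLE_OR_VIEW_NOT_FOUND: {name}")

# The main query exposes relation `t`; the lateral subquery gets its own
# (empty) scope whose outer pointer leads back to the main query's scope.
main = Scope({"t": "Range(0, 8)"})
lateral = Scope({}, outer=main)
print(lateral.resolve("t"))  # found via the outer context
```

Without the `outer` link, the lookup in the lateral scope fails exactly like the AnalysisException in the description.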
[jira] [Created] (SPARK-44233) Support an outer outer context in subquery resolution
Takuya Ueshin created SPARK-44233:
----------------------------------

             Summary: Support an outer outer context in subquery resolution
                 Key: SPARK-44233
                 URL: https://issues.apache.org/jira/browse/SPARK-44233
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.5.0
            Reporter: Takuya Ueshin
[jira] [Created] (SPARK-44200) Support TABLE argument parser rule for TableValuedFunction
Takuya Ueshin created SPARK-44200:
----------------------------------

             Summary: Support TABLE argument parser rule for TableValuedFunction
                 Key: SPARK-44200
                 URL: https://issues.apache.org/jira/browse/SPARK-44200
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.5.0
            Reporter: Takuya Ueshin
[jira] [Resolved] (SPARK-43804) Test on nested structs support in Pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-43804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin resolved SPARK-43804.
-----------------------------------
    Assignee: Xinrong Meng
  Resolution: Fixed

Issue resolved by pull request 41320
https://github.com/apache/spark/pull/41320

> Test on nested structs support in Pandas UDF
> --------------------------------------------
>
>                 Key: SPARK-43804
>                 URL: https://issues.apache.org/jira/browse/SPARK-43804
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.5.0
>            Reporter: Xinrong Meng
>            Assignee: Xinrong Meng
>            Priority: Major
>
> Test on nested structs support in Pandas UDF. That support is newly enabled (compared to Spark 3.4).
[jira] [Created] (SPARK-43817) Support UserDefinedType in createDataFrame from pandas DataFrame and toPandas
Takuya Ueshin created SPARK-43817:
----------------------------------

             Summary: Support UserDefinedType in createDataFrame from pandas DataFrame and toPandas
                 Key: SPARK-43817
                 URL: https://issues.apache.org/jira/browse/SPARK-43817
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark
    Affects Versions: 3.5.0
            Reporter: Takuya Ueshin
[jira] [Created] (SPARK-43759) Expose TimestampNTZType in pyspark.sql.types
Takuya Ueshin created SPARK-43759:
----------------------------------

             Summary: Expose TimestampNTZType in pyspark.sql.types
                 Key: SPARK-43759
                 URL: https://issues.apache.org/jira/browse/SPARK-43759
             Project: Spark
          Issue Type: Improvement
          Components: python
    Affects Versions: 3.5.0
            Reporter: Takuya Ueshin

{{TimestampNTZType}} is missing from the {{__all__}} list in {{pyspark.sql.types}}.
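Why a missing `__all__` entry matters: a name left out of `__all__` is invisible to `from module import *`, even though the attribute itself exists. A minimal stand-alone demonstration (using a throwaway module named `toy_types`, not pyspark itself):

```python
# Demonstrates that __all__ controls what a star-import exposes.
# toy_types is a synthetic module built here for illustration.
import sys
import types

mod = types.ModuleType("toy_types")
mod.TimestampType = "TimestampType"
mod.TimestampNTZType = "TimestampNTZType"
mod.__all__ = ["TimestampType"]  # TimestampNTZType omitted, as in the issue
sys.modules["toy_types"] = mod

ns = {}
exec("from toy_types import *", ns)
print("TimestampType" in ns)     # True
print("TimestampNTZType" in ns)  # False until it is added to __all__
```

The fix described in the issue is simply to add the missing name to the list.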
[jira] [Created] (SPARK-43531) Enable more parity tests for Pandas UDFs.
Takuya Ueshin created SPARK-43531:
----------------------------------

             Summary: Enable more parity tests for Pandas UDFs.
                 Key: SPARK-43531
                 URL: https://issues.apache.org/jira/browse/SPARK-43531
             Project: Spark
          Issue Type: Test
          Components: Connect, PySpark
    Affects Versions: 3.5.0
            Reporter: Takuya Ueshin
[jira] [Created] (SPARK-43528) Support duplicated field names in createDataFrame with pandas DataFrame.
Takuya Ueshin created SPARK-43528:
----------------------------------

             Summary: Support duplicated field names in createDataFrame with pandas DataFrame.
                 Key: SPARK-43528
                 URL: https://issues.apache.org/jira/browse/SPARK-43528
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 3.5.0
            Reporter: Takuya Ueshin
[jira] [Created] (SPARK-43473) Support struct type in createDataFrame from pandas DataFrame
Takuya Ueshin created SPARK-43473:
----------------------------------

             Summary: Support struct type in createDataFrame from pandas DataFrame
                 Key: SPARK-43473
                 URL: https://issues.apache.org/jira/browse/SPARK-43473
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 3.5.0
            Reporter: Takuya Ueshin

Support struct type in createDataFrame from pandas DataFrame with {{Row}} object or {{dict}}.
[jira] [Created] (SPARK-43363) Remove a workaround for pandas categorical type for pyarrow
Takuya Ueshin created SPARK-43363:
----------------------------------

             Summary: Remove a workaround for pandas categorical type for pyarrow
                 Key: SPARK-43363
                 URL: https://issues.apache.org/jira/browse/SPARK-43363
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin

Now that the minimum version of pyarrow is {{1.0.0}}, the workaround for pandas' categorical type for pyarrow can be removed.
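For context, the kind of workaround being removed looks roughly like the sketch below: before pyarrow handled pandas' categorical dtype natively, categorical columns were manually rewritten to plain values of the underlying category dtype before conversion. This is an assumed shape for illustration, not the exact pyspark code, and `decategorize` is a hypothetical helper:

```python
# Sketch of a pre-pyarrow-1.0-style workaround: unwrap a pandas
# categorical series into plain values before Arrow conversion.
import pandas as pd

def decategorize(series: pd.Series) -> pd.Series:
    """Replace a categorical series with plain values of the category dtype."""
    if isinstance(series.dtype, pd.CategoricalDtype):
        return series.astype(series.cat.categories.dtype)
    return series

s = pd.Series(pd.Categorical(["a", "b", "a"]))
print(decategorize(s).dtype)  # object: the categorical wrapper is gone
```

With pyarrow >= 1.0.0 as the floor, `pyarrow.Array.from_pandas` understands categoricals directly, so this extra pass can be dropped.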
[jira] [Created] (SPARK-43323) DataFrame.toPandas with Arrow enabled should handle exceptions properly
Takuya Ueshin created SPARK-43323:
----------------------------------

             Summary: DataFrame.toPandas with Arrow enabled should handle exceptions properly
                 Key: SPARK-43323
                 URL: https://issues.apache.org/jira/browse/SPARK-43323
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin

Currently {{DataFrame.toPandas}} doesn't properly capture exceptions that happen in Spark.

{code:python}
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> spark.sql("select 1/0").toPandas()
...
An error occurred while calling o53.getResult.
: org.apache.spark.SparkException: Exception thrown in awaitResult:
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:322)
...
{code}
[jira] [Created] (SPARK-43153) Skip Spark execution when the dataframe is local.
Takuya Ueshin created SPARK-43153:
----------------------------------

             Summary: Skip Spark execution when the dataframe is local.
                 Key: SPARK-43153
                 URL: https://issues.apache.org/jira/browse/SPARK-43153
             Project: Spark
          Issue Type: Improvement
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin
[jira] [Created] (SPARK-43146) Implement eager evaluation.
Takuya Ueshin created SPARK-43146:
----------------------------------

             Summary: Implement eager evaluation.
                 Key: SPARK-43146
                 URL: https://issues.apache.org/jira/browse/SPARK-43146
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin
[jira] [Created] (SPARK-43115) Split pyspark-pandas-connect from pyspark-connect module.
Takuya Ueshin created SPARK-43115:
----------------------------------

             Summary: Split pyspark-pandas-connect from pyspark-connect module.
                 Key: SPARK-43115
                 URL: https://issues.apache.org/jira/browse/SPARK-43115
             Project: Spark
          Issue Type: Test
          Components: PySpark, Tests
    Affects Versions: 3.5.0
            Reporter: Takuya Ueshin
[jira] [Resolved] (SPARK-42437) Pyspark catalog.cacheTable allow to specify storage level Connect add support Storagelevel
[ https://issues.apache.org/jira/browse/SPARK-42437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin resolved SPARK-42437.
-----------------------------------
Fix Version/s: 3.5.0
     Assignee: Khalid Mammadov
   Resolution: Fixed

Issue resolved by pull request 40015
https://github.com/apache/spark/pull/40015

> Pyspark catalog.cacheTable allow to specify storage level Connect add support Storagelevel
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-42437
>                 URL: https://issues.apache.org/jira/browse/SPARK-42437
>             Project: Spark
>          Issue Type: Improvement
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Khalid Mammadov
>            Assignee: Khalid Mammadov
>            Priority: Major
>             Fix For: 3.5.0
>
> Currently the PySpark version of the catalog.cacheTable function does not support specifying a storage level. This is to add that.
[jira] [Updated] (SPARK-43062) Add options to lint-python to run each test separately
[ https://issues.apache.org/jira/browse/SPARK-43062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin updated SPARK-43062:
----------------------------------
    Priority: Minor  (was: Major)

> Add options to lint-python to run each test separately
> ------------------------------------------------------
>
>                 Key: SPARK-43062
>                 URL: https://issues.apache.org/jira/browse/SPARK-43062
>             Project: Spark
>          Issue Type: Test
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Takuya Ueshin
>            Priority: Minor
>
[jira] [Created] (SPARK-43062) Add options to lint-python to run each test separately
Takuya Ueshin created SPARK-43062:
----------------------------------

             Summary: Add options to lint-python to run each test separately
                 Key: SPARK-43062
                 URL: https://issues.apache.org/jira/browse/SPARK-43062
             Project: Spark
          Issue Type: Test
          Components: Project Infra
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin
[jira] [Created] (SPARK-43055) createDataFrame should support duplicated nested field names
Takuya Ueshin created SPARK-43055:
----------------------------------

             Summary: createDataFrame should support duplicated nested field names
                 Key: SPARK-43055
                 URL: https://issues.apache.org/jira/browse/SPARK-43055
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin
[jira] [Created] (SPARK-42998) Fix DataFrame.collect with null struct.
Takuya Ueshin created SPARK-42998:
----------------------------------

             Summary: Fix DataFrame.collect with null struct.
                 Key: SPARK-42998
                 URL: https://issues.apache.org/jira/browse/SPARK-42998
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin

In Spark Connect:

{code:python}
>>> df = spark.sql("values (1, struct('a' as x)), (null, null) as t(a, b)")
>>> df.show()
+----+----+
|   a|   b|
+----+----+
|   1| {a}|
|null|null|
+----+----+

>>> df.collect()
[Row(a=1, b=Row(x='a')), Row(a=None, b=)]
{code}

whereas PySpark:

{code:python}
>>> df.collect()
[Row(a=1, b=Row(x='a')), Row(a=None, b=None)]
{code}
[jira] [Created] (SPARK-42985) Fix createDataFrame from pandas to respect session timezone.
Takuya Ueshin created SPARK-42985:
----------------------------------

             Summary: Fix createDataFrame from pandas to respect session timezone.
                 Key: SPARK-42985
                 URL: https://issues.apache.org/jira/browse/SPARK-42985
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin
[jira] [Created] (SPARK-42984) Fix test_createDataFrame_with_single_data_type.
Takuya Ueshin created SPARK-42984:
----------------------------------

             Summary: Fix test_createDataFrame_with_single_data_type.
                 Key: SPARK-42984
                 URL: https://issues.apache.org/jira/browse/SPARK-42984
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin

PySpark raises an exception when:

{code:python}
>>> spark.createDataFrame(pd.DataFrame({"a": [1]}), schema="int").collect()
Traceback (most recent call last):
...
TypeError: field value: IntegerType() can not accept object (1,) in type <class 'tuple'>
{code}

whereas Spark Connect doesn't.
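The PySpark behavior above comes from per-field type verification: each row arrives as a tuple, and when the schema is a single atomic type the verifier checks the scalar value, so handing it the whole tuple fails. A toy version of such a check (an assumption for illustration, not pyspark's actual verifier; `verify_integer` is a hypothetical name):

```python
# Toy per-value type check mimicking the error in the issue:
# an integer slot must reject a row tuple passed in its place.
def verify_integer(value):
    if value is not None and not isinstance(value, int):
        raise TypeError(
            f"field value: IntegerType() can not accept object {value!r} "
            f"in type {type(value)}"
        )
    return value

print(verify_integer(1))  # 1: a plain scalar passes
try:
    verify_integer((1,))  # a row tuple where a scalar is required
except TypeError as exc:
    print("rejected:", exc)
```

The issue is that Spark Connect skipped an equivalent check, so the two frontends disagreed on this input.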
[jira] [Created] (SPARK-42983) Fix the error message of createDataFrame from np.array(0)
Takuya Ueshin created SPARK-42983:
----------------------------------

             Summary: Fix the error message of createDataFrame from np.array(0)
                 Key: SPARK-42983
                 URL: https://issues.apache.org/jira/browse/SPARK-42983
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin

{code:python}
>>> import numpy as np
>>> spark.createDataFrame(np.array(0))
Traceback (most recent call last):
...
TypeError: len() of unsized object
{code}
[jira] [Created] (SPARK-42982) Fix createDataFrame from pandas with map type
Takuya Ueshin created SPARK-42982:
----------------------------------

             Summary: Fix createDataFrame from pandas with map type
                 Key: SPARK-42982
                 URL: https://issues.apache.org/jira/browse/SPARK-42982
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin

{code:python}
>>> import pandas as pd
>>>
>>> map_data = [{"a": 1}, {"b": 2, "c": 3}, {}, None, {"d": None}]
>>> pdf = pd.DataFrame({"id": [0, 1, 2, 3, 4], "m": map_data})
>>> schema = "id long, m map<string, long>"
>>> spark.createDataFrame(pdf, schema=schema)
Traceback (most recent call last):
...
pyspark.errors.exceptions.connect.AnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `col_1` is of type "STRUCT" while it's required to be "MAP".
{code}
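The error arises because the dict column is being converted as a struct (one fixed set of field names) when the declared schema wants a map (arbitrary keys per row). A rough intuition for the difference, as a pure-Python sketch (an assumption for illustration, not Spark's actual inference; `looks_like_struct` is a hypothetical helper):

```python
# Rough heuristic: a column of dicts reads as STRUCT-like only if every
# non-null row has the same key set; varying key sets suggest MAP.
def looks_like_struct(rows):
    keysets = {frozenset(d) for d in rows if d is not None}
    return len(keysets) <= 1  # one consistent key set -> struct-like

map_data = [{"a": 1}, {"b": 2, "c": 3}, {}, None, {"d": None}]
print(looks_like_struct(map_data))  # False: key sets differ, so MAP fits
```

The fix is for the pandas-to-Arrow conversion to honor the declared `map` type rather than defaulting to a struct.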
[jira] [Created] (SPARK-42970) Reuse pyspark.sql.tests.test_arrow test cases
Takuya Ueshin created SPARK-42970:
----------------------------------

             Summary: Reuse pyspark.sql.tests.test_arrow test cases
                 Key: SPARK-42970
                 URL: https://issues.apache.org/jira/browse/SPARK-42970
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin
[jira] [Created] (SPARK-42969) Fix the comparison of the results with Arrow optimization enabled/disabled.
Takuya Ueshin created SPARK-42969:
----------------------------------

             Summary: Fix the comparison of the results with Arrow optimization enabled/disabled.
                 Key: SPARK-42969
                 URL: https://issues.apache.org/jira/browse/SPARK-42969
             Project: Spark
          Issue Type: Sub-task
          Components: Tests
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin

In {{test_arrow}}, there are a number of comparisons between DataFrames computed with Arrow optimization enabled and disabled. These should be changed to compare against expected values instead, so that the tests can be reused for the Spark Connect parity tests.
[jira] [Created] (SPARK-42920) Python UDF with UDT
Takuya Ueshin created SPARK-42920:
----------------------------------

             Summary: Python UDF with UDT
                 Key: SPARK-42920
                 URL: https://issues.apache.org/jira/browse/SPARK-42920
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin
[jira] [Created] (SPARK-42911) Introduce more basic exceptions.
Takuya Ueshin created SPARK-42911:
----------------------------------

             Summary: Introduce more basic exceptions.
                 Key: SPARK-42911
                 URL: https://issues.apache.org/jira/browse/SPARK-42911
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin
[jira] [Created] (SPARK-42900) Fix createDataFrame to respect both type inference and column names.
Takuya Ueshin created SPARK-42900:
----------------------------------

             Summary: Fix createDataFrame to respect both type inference and column names.
                 Key: SPARK-42900
                 URL: https://issues.apache.org/jira/browse/SPARK-42900
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin
[jira] [Updated] (SPARK-42899) DataFrame.to(schema) fails when it contains non-nullable nested field in nullable field
[ https://issues.apache.org/jira/browse/SPARK-42899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin updated SPARK-42899:
----------------------------------
    Summary: DataFrame.to(schema) fails when it contains non-nullable nested field in nullable field  (was: DataFrame.to(schema) fails with the schema of itself.)

> DataFrame.to(schema) fails when it contains non-nullable nested field in nullable field
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-42899
>                 URL: https://issues.apache.org/jira/browse/SPARK-42899
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Takuya Ueshin
>            Priority: Major
>
> {{DataFrame.to(schema)}} fails when it contains non-nullable nested field in nullable field:
> {code:scala}
> scala> val df = spark.sql("VALUES (1, STRUCT(1 as i)), (NULL, NULL) as t(a, b)")
> df: org.apache.spark.sql.DataFrame = [a: int, b: struct<i: int>]
>
> scala> df.printSchema()
> root
>  |-- a: integer (nullable = true)
>  |-- b: struct (nullable = true)
>  |    |-- i: integer (nullable = false)
>
> scala> df.to(df.schema)
> org.apache.spark.sql.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or field `b`.`i` is nullable while it's required to be non-nullable.
> {code}
[jira] [Updated] (SPARK-42899) DataFrame.to(schema) fails with the schema of itself.
[ https://issues.apache.org/jira/browse/SPARK-42899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin updated SPARK-42899:
----------------------------------
    Description:
{{DataFrame.to(schema)}} fails when it contains non-nullable nested field in nullable field:

{code:scala}
scala> val df = spark.sql("VALUES (1, STRUCT(1 as i)), (NULL, NULL) as t(a, b)")
df: org.apache.spark.sql.DataFrame = [a: int, b: struct<i: int>]

scala> df.printSchema()
root
 |-- a: integer (nullable = true)
 |-- b: struct (nullable = true)
 |    |-- i: integer (nullable = false)

scala> df.to(df.schema)
org.apache.spark.sql.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or field `b`.`i` is nullable while it's required to be non-nullable.
{code}

    was:
{{DataFrame.to(schema)}} fails with the schema of itself, when it contains non-nullable nested field in nullable field, with the same example as in the new description.

> DataFrame.to(schema) fails with the schema of itself.
> -----------------------------------------------------
>
>                 Key: SPARK-42899
>                 URL: https://issues.apache.org/jira/browse/SPARK-42899
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Takuya Ueshin
>            Priority: Major
>
[jira] [Created] (SPARK-42899) DataFrame.to(schema) fails with the schema of itself.
Takuya Ueshin created SPARK-42899:
----------------------------------

             Summary: DataFrame.to(schema) fails with the schema of itself.
                 Key: SPARK-42899
                 URL: https://issues.apache.org/jira/browse/SPARK-42899
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin

{{DataFrame.to(schema)}} fails with the schema of itself, when it contains non-nullable nested field in nullable field:

{code:scala}
scala> val df = spark.sql("VALUES (1, STRUCT(1 as i)), (NULL, NULL) as t(a, b)")
df: org.apache.spark.sql.DataFrame = [a: int, b: struct<i: int>]

scala> df.printSchema()
root
 |-- a: integer (nullable = true)
 |-- b: struct (nullable = true)
 |    |-- i: integer (nullable = false)

scala> df.to(df.schema)
org.apache.spark.sql.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or field `b`.`i` is nullable while it's required to be non-nullable.
{code}
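One way to see why the self-roundtrip fails: because the parent struct `b` is nullable, the nested field `b.i` is effectively nullable at runtime even though it is declared non-nullable, and a check that rejects any nullable source flowing into a non-nullable target then rejects the DataFrame's own schema. A simplified version of such a compatibility rule (an illustration invented here, not Spark's actual implementation; `compatible` is a hypothetical name):

```python
# Simplified nullability-compatibility rule: writing a source field into a
# target slot is unsafe only when a nullable source meets a non-nullable
# target.
def compatible(source_nullable: bool, target_nullable: bool) -> bool:
    return not (source_nullable and not target_nullable)

# b.i seen through its nullable parent b behaves as nullable at runtime,
# while the declared target field i is non-nullable -- hence the error.
effective_source_nullable = True   # b is nullable, so b.i can be absent
declared_target_nullable = False   # i: integer (nullable = false)
print(compatible(effective_source_nullable, declared_target_nullable))  # False
```

Under this rule, `df.to(df.schema)` trips over its own nested field; resolving the bug means treating a non-nullable field inside a nullable parent as an acceptable source for the identical target.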
[jira] [Created] (SPARK-42889) Implement cache, persist, unpersist, and storageLevel
Takuya Ueshin created SPARK-42889:
----------------------------------

             Summary: Implement cache, persist, unpersist, and storageLevel
                 Key: SPARK-42889
                 URL: https://issues.apache.org/jira/browse/SPARK-42889
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin
[jira] [Created] (SPARK-42875) Fix toPandas to handle timezone and map types properly.
Takuya Ueshin created SPARK-42875:
----------------------------------

             Summary: Fix toPandas to handle timezone and map types properly.
                 Key: SPARK-42875
                 URL: https://issues.apache.org/jira/browse/SPARK-42875
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin
[jira] [Created] (SPARK-42848) Implement DataFrame.registerTempTable
Takuya Ueshin created SPARK-42848:
----------------------------------

             Summary: Implement DataFrame.registerTempTable
                 Key: SPARK-42848
                 URL: https://issues.apache.org/jira/browse/SPARK-42848
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin
[jira] [Resolved] (SPARK-41922) Implement DataFrame `semanticHash`
[ https://issues.apache.org/jira/browse/SPARK-41922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin resolved SPARK-41922.
-----------------------------------
    Resolution: Duplicate

> Implement DataFrame `semanticHash`
> ----------------------------------
>
>                 Key: SPARK-41922
>                 URL: https://issues.apache.org/jira/browse/SPARK-41922
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
>
[jira] [Created] (SPARK-42818) Implement DataFrameReader/Writer.jdbc
Takuya Ueshin created SPARK-42818:
----------------------------------

             Summary: Implement DataFrameReader/Writer.jdbc
                 Key: SPARK-42818
                 URL: https://issues.apache.org/jira/browse/SPARK-42818
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin
[jira] [Updated] (SPARK-42733) df.write.format().save() should support calling with no path or table name
[ https://issues.apache.org/jira/browse/SPARK-42733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin updated SPARK-42733:
----------------------------------
        Parent: SPARK-41284
    Issue Type: Sub-task  (was: Bug)

> df.write.format().save() should support calling with no path or table name
> --------------------------------------------------------------------------
>
>                 Key: SPARK-42733
>                 URL: https://issues.apache.org/jira/browse/SPARK-42733
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Martin Grund
>            Priority: Major
>
> When calling `session.range(5).write.format("xxx").options().save()` Spark Connect currently throws an assertion error because it expects that either path or tableName are present. According to our current PySpark implementation that is not necessary though.
>
> {code:python}
> if format is not None:
>     self.format(format)
> if path is None:
>     self._jwrite.save()
> else:
>     self._jwrite.save(path)
> {code}
[jira] [Created] (SPARK-42705) SparkSession.sql doesn't return values from commands.
Takuya Ueshin created SPARK-42705:
----------------------------------

             Summary: SparkSession.sql doesn't return values from commands.
                 Key: SPARK-42705
                 URL: https://issues.apache.org/jira/browse/SPARK-42705
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Takuya Ueshin

{code:python}
>>> spark.sql("show functions").show()
+--------+
|function|
+--------+
+--------+
{code}
[jira] [Resolved] (SPARK-41843) Implement SparkSession.udf
[ https://issues.apache.org/jira/browse/SPARK-41843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin resolved SPARK-41843.
-----------------------------------
Fix Version/s: 3.4.0
   Resolution: Fixed

> Implement SparkSession.udf
> --------------------------
>
>                 Key: SPARK-41843
>                 URL: https://issues.apache.org/jira/browse/SPARK-41843
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
>             Fix For: 3.4.0
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 2331, in pyspark.sql.connect.functions.call_udf
> Failed example:
>     _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType())
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in
>         _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType())
>     AttributeError: 'SparkSession' object has no attribute 'udf'
> {code}
[jira] [Updated] (SPARK-42624) Reorganize imports in test_functions
[ https://issues.apache.org/jira/browse/SPARK-42624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin updated SPARK-42624:
----------------------------------
    Component/s: PySpark  (was: SQL)

> Reorganize imports in test_functions
> ------------------------------------
>
>                 Key: SPARK-42624
>                 URL: https://issues.apache.org/jira/browse/SPARK-42624
>             Project: Spark
>          Issue Type: Task
>          Components: PySpark, Tests
>    Affects Versions: 3.4.0
>            Reporter: Takuya Ueshin
>            Priority: Major
>
[jira] [Created] (SPARK-42624) Reorganize imports in test_functions
Takuya Ueshin created SPARK-42624: - Summary: Reorganize imports in test_functions Key: SPARK-42624 URL: https://issues.apache.org/jira/browse/SPARK-42624 Project: Spark Issue Type: Task Components: SQL, Tests Affects Versions: 3.4.0 Reporter: Takuya Ueshin
[jira] [Created] (SPARK-42612) Enable more parity tests related to functions
Takuya Ueshin created SPARK-42612: - Summary: Enable more parity tests related to functions Key: SPARK-42612 URL: https://issues.apache.org/jira/browse/SPARK-42612 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Takuya Ueshin
[jira] [Resolved] (SPARK-42510) Implement `DataFrame.mapInPandas`
[ https://issues.apache.org/jira/browse/SPARK-42510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-42510. --- Fix Version/s: 3.4.0 Assignee: Xinrong Meng Resolution: Fixed Issue resolved by pull request 40104 https://github.com/apache/spark/pull/40104 > Implement `DataFrame.mapInPandas` > - > > Key: SPARK-42510 > URL: https://issues.apache.org/jira/browse/SPARK-42510 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `DataFrame.mapInPandas` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42574) DataFrame.toPandas should handle duplicated column names
Takuya Ueshin created SPARK-42574: - Summary: DataFrame.toPandas should handle duplicated column names Key: SPARK-42574 URL: https://issues.apache.org/jira/browse/SPARK-42574 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Takuya Ueshin {code:python} spark.sql("select 1 v, 1 v").toPandas() {code} should work.
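Pandas conversion trips over duplicated labels because name-based indexing becomes ambiguous; one common way to tolerate them is positional disambiguation of the labels. A minimal pure-Python sketch of that idea is below; the helper name and suffix scheme are illustrative, not Spark's actual implementation.

```python
def dedup_labels(names):
    """Make repeated labels unique by appending a positional suffix to repeats."""
    seen = {}
    out = []
    for name in names:
        n = seen.get(name, 0)
        # First occurrence keeps its name; later ones get a numbered suffix.
        out.append(name if n == 0 else f"{name}_{n}")
        seen[name] = n + 1
    return out

print(dedup_labels(["v", "v", "x"]))  # ['v', 'v_1', 'x']
```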
[jira] [Created] (SPARK-42570) Fix DataFrameReader to use the default source
Takuya Ueshin created SPARK-42570: - Summary: Fix DataFrameReader to use the default source Key: SPARK-42570 URL: https://issues.apache.org/jira/browse/SPARK-42570 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Takuya Ueshin {code:python} spark.read.load(path) {code} should work without specifying the format.
[jira] [Created] (SPARK-42568) SparkConnectStreamHandler should manage configs properly while creating plans.
Takuya Ueshin created SPARK-42568: - Summary: SparkConnectStreamHandler should manage configs properly while creating plans. Key: SPARK-42568 URL: https://issues.apache.org/jira/browse/SPARK-42568 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Takuya Ueshin Some components for planning need to check configs in {{SQLConf.get}} while building the plan, but currently it's unavailable.
[jira] [Created] (SPARK-42522) Fix DataFrameWriterV2 to find the default source
Takuya Ueshin created SPARK-42522: - Summary: Fix DataFrameWriterV2 to find the default source Key: SPARK-42522 URL: https://issues.apache.org/jira/browse/SPARK-42522 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Takuya Ueshin {code:python} df.writeTo("test_table").create() {code} throws: {noformat} pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkClassNotFoundException) [DATA_SOURCE_NOT_FOUND] Failed to find the data source: . Please find packages at `https://spark.apache.org/third-party-projects.html`. {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41901) Parity in String representation of Column
[ https://issues.apache.org/jira/browse/SPARK-41901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690652#comment-17690652 ] Takuya Ueshin commented on SPARK-41901: --- For the first case, {{ACOSH}}, {{ASINH}}, and {{ATANH}} return upper-case names in PySpark because their {{prettyName}}s use upper-case names, whereas Spark Connect uses lower case. [~gurwls223] [~podongfeng] Is it OK to compare them in a case-insensitive way to enable the still-skipped test {{FunctionsParityTests.test_inverse_trig_functions}}? > Parity in String representation of Column > - > > Key: SPARK-41901 > URL: https://issues.apache.org/jira/browse/SPARK-41901 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > {code:java} > from pyspark.sql import functions > funs = [ > (functions.acosh, "ACOSH"), > (functions.asinh, "ASINH"), > (functions.atanh, "ATANH"), > ] > cols = ["a", functions.col("a")] > for f, alias in funs: > for c in cols: > self.assertIn(f"{alias}(a)", repr(f(c))){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 271, in test_inverse_trig_functions > self.assertIn(f"{alias}(a)", repr(f(c))) > AssertionError: 'ACOSH(a)' not found in > "Column<'acosh(ColumnReference(a))'>"{code} > > > {code:java} > from pyspark.sql.functions import col, lit, overlay > from itertools import chain > import re > actual = list( > chain.from_iterable( > [ > re.findall("(overlay\\(.*\\))", str(x)) > for x in [ > overlay(col("foo"), col("bar"), 1), > overlay("x", "y", 3), > overlay(col("x"), col("y"), 1, 3), > overlay("x", "y", 2, 5), > overlay("x", "y", lit(11)), > overlay("x", "y", lit(2), lit(5)), > ] > ] > ) > ) > expected = [ > "overlay(foo, bar, 1, -1)", > "overlay(x, y, 3, -1)", > "overlay(x, y, 1, 3)", > "overlay(x, y, 2, 5)", > 
"overlay(x, y, 11, -1)", > "overlay(x, y, 2, 5)", > ] > self.assertListEqual(actual, expected) > df = self.spark.createDataFrame([("SPARK_SQL", "CORE", 7, 0)], ("x", "y", > "pos", "len")) > exp = [Row(ol="SPARK_CORESQL")] > self.assertTrue( > all( > [ > df.select(overlay(df.x, df.y, 7, 0).alias("ol")).collect() == exp, > df.select(overlay(df.x, df.y, lit(7), > lit(0)).alias("ol")).collect() == exp, > df.select(overlay("x", "y", "pos", "len").alias("ol")).collect() > == exp, > ] > ) > ) {code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 675, in test_overlay > self.assertListEqual(actual, expected) > AssertionError: Lists differ: ['overlay(ColumnReference(foo), > ColumnReference(bar[402 chars]5))'] != ['overlay(foo, bar, 1, -1)', > 'overlay(x, y, 3, -1)'[90 chars] 5)'] > First differing element 0: > 'overlay(ColumnReference(foo), ColumnReference(bar), Literal(1), Literal(-1))' > 'overlay(foo, bar, 1, -1)' > - ['overlay(ColumnReference(foo), ColumnReference(bar), Literal(1), > Literal(-1))', > - 'overlay(ColumnReference(x), ColumnReference(y), Literal(3), Literal(-1))', > - 'overlay(ColumnReference(x), ColumnReference(y), Literal(1), Literal(3))', > - 'overlay(ColumnReference(x), ColumnReference(y), Literal(2), Literal(5))', > - 'overlay(ColumnReference(x), ColumnReference(y), Literal(11), > Literal(-1))', > - 'overlay(ColumnReference(x), ColumnReference(y), Literal(2), Literal(5))'] > + ['overlay(foo, bar, 1, -1)', > + 'overlay(x, y, 3, -1)', > + 'overlay(x, y, 1, 3)', > + 'overlay(x, y, 2, 5)', > + 'overlay(x, y, 11, -1)', > + 'overlay(x, y, 2, 5)'] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
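The case-insensitive comparison proposed in the comment above can be sketched in plain Python. `same_function_name` is a hypothetical helper, not part of the test suite; it compares only the function name, since the argument rendering (`a` vs `ColumnReference(a)`) also differs between the two implementations.

```python
import re

def same_function_name(expected, actual_repr):
    """Compare the function name in an expected string like 'ACOSH(a)'
    against a Column repr, ignoring case, so 'ACOSH' matches 'acosh'."""
    want = re.match(r"\w+", expected).group(0)
    # First identifier followed by '(' in the repr is the function name.
    got = re.search(r"(\w+)\(", actual_repr)
    return got is not None and want.lower() == got.group(1).lower()

assert same_function_name("ACOSH(a)", "Column<'acosh(ColumnReference(a))'>")
assert not same_function_name("ATANH(a)", "Column<'acosh(ColumnReference(a))'>")
```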
[jira] [Created] (SPARK-42458) createDataFrame should support DDL string as schema
Takuya Ueshin created SPARK-42458: - Summary: createDataFrame should support DDL string as schema Key: SPARK-42458 URL: https://issues.apache.org/jira/browse/SPARK-42458 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Takuya Ueshin {code:python} File "/.../python/pyspark/sql/connect/readwriter.py", line 393, in pyspark.sql.connect.readwriter.DataFrameWriter.option Failed example: with tempfile.TemporaryDirectory() as d: # Write a DataFrame into a CSV file with 'nullValue' option set to 'Hyukjin Kwon'. df = spark.createDataFrame([(100, None)], "age INT, name STRING") df.write.option("nullValue", "Hyukjin Kwon").mode("overwrite").format("csv").save(d) # Read the CSV file as a DataFrame. spark.read.schema(df.schema).format('csv').load(d).show() Exception raised: Traceback (most recent call last): File "/.../lib/python3.9/doctest.py", line 1334, in __run exec(compile(example.source, filename, "single", File "", line 3, in df = spark.createDataFrame([(100, None)], "age INT, name STRING") File "/.../python/pyspark/sql/connect/session.py", line 312, in createDataFrame raise ValueError( ValueError: Some of types cannot be determined after inferring, a StructType Schema is required in this case {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
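For the simple cases, a DDL-formatted schema string is a comma-separated list of `name TYPE` pairs. The sketch below shows the shape of the parsing involved; it is illustrative only and far simpler than Spark's real parser, which also handles nested types, backticked names, and comments.

```python
def parse_ddl_schema(ddl):
    """Split a simple DDL schema string like 'age INT, name STRING'
    into (name, type) pairs. Illustrative sketch, not Spark's parser."""
    fields = []
    for part in ddl.split(","):
        # Field name is the first token; the rest is the type.
        name, _, dtype = part.strip().partition(" ")
        fields.append((name, dtype.strip().upper()))
    return fields

print(parse_ddl_schema("age INT, name STRING"))  # [('age', 'INT'), ('name', 'STRING')]
```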
[jira] [Updated] (SPARK-42426) insertInto fails when the column names are different from the table columns
[ https://issues.apache.org/jira/browse/SPARK-42426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-42426: -- Summary: insertInto fails when the column names are different from the table columns (was: insertInto doesn't insert when the column names are different from the table columns) > insertInto fails when the column names are different from the table columns > --- > > Key: SPARK-42426 > URL: https://issues.apache.org/jira/browse/SPARK-42426 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > > {noformat} > File "/.../python/pyspark/sql/connect/readwriter.py", line 518, in > pyspark.sql.connect.readwriter.DataFrameWriter.insertInto > Failed example: > df.selectExpr("age AS col1", "name AS col2").write.insertInto("tblA") > Exception raised: > Traceback (most recent call last): > File "/.../lib/python3.9/doctest.py", line 1334, in __run > exec(compile(example.source, filename, "single", > File " pyspark.sql.connect.readwriter.DataFrameWriter.insertInto[3]>", line 1, in > > df.selectExpr("age AS col1", "name AS col2").write.insertInto("tblA") > File "/.../python/pyspark/sql/connect/readwriter.py", line 477, in > insertInto > self.saveAsTable(tableName) > File "/.../python/pyspark/sql/connect/readwriter.py", line 495, in > saveAsTable > > self._spark.client.execute_command(self._write.command(self._spark.client)) > File "/.../python/pyspark/sql/connect/client.py", line 553, in > execute_command > self._execute(req) > File "/.../python/pyspark/sql/connect/client.py", line 648, in _execute > self._handle_error(rpc_error) > File "/.../python/pyspark/sql/connect/client.py", line 718, in > _handle_error > raise convert_exception(info, status.message) from None > pyspark.errors.exceptions.connect.AnalysisException: Cannot resolve 'age' > given input columns: [col1, col2]. 
> {noformat}
[jira] [Created] (SPARK-42426) insertInto doesn't insert when the column names are different from the table columns
Takuya Ueshin created SPARK-42426: - Summary: insertInto doesn't insert when the column names are different from the table columns Key: SPARK-42426 URL: https://issues.apache.org/jira/browse/SPARK-42426 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Takuya Ueshin File "/.../python/pyspark/sql/connect/readwriter.py", line 518, in pyspark.sql.connect.readwriter.DataFrameWriter.insertInto Failed example: df.selectExpr("age AS col1", "name AS col2").write.insertInto("tblA") Exception raised: Traceback (most recent call last): File "/.../lib/python3.9/doctest.py", line 1334, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.selectExpr("age AS col1", "name AS col2").write.insertInto("tblA") File "/.../python/pyspark/sql/connect/readwriter.py", line 477, in insertInto self.saveAsTable(tableName) File "/.../python/pyspark/sql/connect/readwriter.py", line 495, in saveAsTable self._spark.client.execute_command(self._write.command(self._spark.client)) File "/.../python/pyspark/sql/connect/client.py", line 553, in execute_command self._execute(req) File "/.../python/pyspark/sql/connect/client.py", line 648, in _execute self._handle_error(rpc_error) File "/.../python/pyspark/sql/connect/client.py", line 718, in _handle_error raise convert_exception(info, status.message) from None pyspark.errors.exceptions.connect.AnalysisException: Cannot resolve 'age' given input columns: [col1, col2]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42426) insertInto doesn't insert when the column names are different from the table columns
[ https://issues.apache.org/jira/browse/SPARK-42426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-42426: -- Description: {noformat} File "/.../python/pyspark/sql/connect/readwriter.py", line 518, in pyspark.sql.connect.readwriter.DataFrameWriter.insertInto Failed example: df.selectExpr("age AS col1", "name AS col2").write.insertInto("tblA") Exception raised: Traceback (most recent call last): File "/.../lib/python3.9/doctest.py", line 1334, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.selectExpr("age AS col1", "name AS col2").write.insertInto("tblA") File "/.../python/pyspark/sql/connect/readwriter.py", line 477, in insertInto self.saveAsTable(tableName) File "/.../python/pyspark/sql/connect/readwriter.py", line 495, in saveAsTable self._spark.client.execute_command(self._write.command(self._spark.client)) File "/.../python/pyspark/sql/connect/client.py", line 553, in execute_command self._execute(req) File "/.../python/pyspark/sql/connect/client.py", line 648, in _execute self._handle_error(rpc_error) File "/.../python/pyspark/sql/connect/client.py", line 718, in _handle_error raise convert_exception(info, status.message) from None pyspark.errors.exceptions.connect.AnalysisException: Cannot resolve 'age' given input columns: [col1, col2]. 
{noformat} was: File "/.../python/pyspark/sql/connect/readwriter.py", line 518, in pyspark.sql.connect.readwriter.DataFrameWriter.insertInto Failed example: df.selectExpr("age AS col1", "name AS col2").write.insertInto("tblA") Exception raised: Traceback (most recent call last): File "/.../lib/python3.9/doctest.py", line 1334, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.selectExpr("age AS col1", "name AS col2").write.insertInto("tblA") File "/.../python/pyspark/sql/connect/readwriter.py", line 477, in insertInto self.saveAsTable(tableName) File "/.../python/pyspark/sql/connect/readwriter.py", line 495, in saveAsTable self._spark.client.execute_command(self._write.command(self._spark.client)) File "/.../python/pyspark/sql/connect/client.py", line 553, in execute_command self._execute(req) File "/.../python/pyspark/sql/connect/client.py", line 648, in _execute self._handle_error(rpc_error) File "/.../python/pyspark/sql/connect/client.py", line 718, in _handle_error raise convert_exception(info, status.message) from None pyspark.errors.exceptions.connect.AnalysisException: Cannot resolve 'age' given input columns: [col1, col2]. 
> insertInto doesn't insert when the column names are different from the table > columns > > > Key: SPARK-42426 > URL: https://issues.apache.org/jira/browse/SPARK-42426 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > > {noformat} > File "/.../python/pyspark/sql/connect/readwriter.py", line 518, in > pyspark.sql.connect.readwriter.DataFrameWriter.insertInto > Failed example: > df.selectExpr("age AS col1", "name AS col2").write.insertInto("tblA") > Exception raised: > Traceback (most recent call last): > File "/.../lib/python3.9/doctest.py", line 1334, in __run > exec(compile(example.source, filename, "single", > File " pyspark.sql.connect.readwriter.DataFrameWriter.insertInto[3]>", line 1, in > > df.selectExpr("age AS col1", "name AS col2").write.insertInto("tblA") > File "/.../python/pyspark/sql/connect/readwriter.py", line 477, in > insertInto > self.saveAsTable(tableName) > File "/.../python/pyspark/sql/connect/readwriter.py", line 495, in > saveAsTable > > self._spark.client.execute_command(self._write.command(self._spark.client)) > File "/.../python/pyspark/sql/connect/client.py", line 553, in > execute_command > self._execute(req) > File "/.../python/pyspark/sql/connect/client.py", line 648, in _execute > self._handle_error(rpc_error) > File "/.../python/pyspark/sql/connect/client.py", line 718, in > _handle_error > raise convert_exception(info, status.message) from None > pyspark.errors.exceptions.connect.AnalysisException: Cannot resolve 'age' > given input columns: [col1, col2]. > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
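The failure above comes from resolving columns by name; `insertInto` is documented to resolve by position, so the incoming column names (`col1`, `col2`) should simply be ignored. A minimal pure-Python sketch of position-based semantics follows; the function and its structure are illustrative, not Spark's implementation.

```python
def insert_by_position(table_columns, source_columns, rows):
    """insertInto-style semantics: values map to table columns by position;
    the source column names are ignored entirely. Illustrative sketch."""
    if len(source_columns) != len(table_columns):
        raise ValueError("column count mismatch")
    # Zip each row against the *table's* column order, not the source's names.
    return [dict(zip(table_columns, row)) for row in rows]

print(insert_by_position(["age", "name"], ["col1", "col2"], [(5, "Bob")]))
# [{'age': 5, 'name': 'Bob'}]
```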
[jira] [Updated] (SPARK-41870) Handle duplicate columns in `createDataFrame`
[ https://issues.apache.org/jira/browse/SPARK-41870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-41870: -- Attachment: (was: session.py) > Handle duplicate columns in `createDataFrame` > - > > Key: SPARK-41870 > URL: https://issues.apache.org/jira/browse/SPARK-41870 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame([(1, 2)], ["c", "c"]){code} > Error: > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 65, in test_duplicated_column_names > df = self.spark.createDataFrame([(1, 2)], ["c", "c"]) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", > line 277, in createDataFrame > raise ValueError( > ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 > elements{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41870) Handle duplicate columns in `createDataFrame`
[ https://issues.apache.org/jira/browse/SPARK-41870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-41870: -- Attachment: session.py > Handle duplicate columns in `createDataFrame` > - > > Key: SPARK-41870 > URL: https://issues.apache.org/jira/browse/SPARK-41870 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame([(1, 2)], ["c", "c"]){code} > Error: > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 65, in test_duplicated_column_names > df = self.spark.createDataFrame([(1, 2)], ["c", "c"]) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", > line 277, in createDataFrame > raise ValueError( > ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 > elements{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42265) DataFrame.createTempView - SparkConnectGrpcException: requirement failed
[ https://issues.apache.org/jira/browse/SPARK-42265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-42265. --- Assignee: Takuya Ueshin Resolution: Fixed Issue resolved by pull request 39968 https://github.com/apache/spark/pull/39968 > DataFrame.createTempView - SparkConnectGrpcException: requirement failed > > > Key: SPARK-42265 > URL: https://issues.apache.org/jira/browse/SPARK-42265 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Takuya Ueshin >Priority: Major > > To reproduce, > ``` > spark.range(1).filter(udf(lambda x: x)("id") >= 0).createTempView("v") > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41820) DataFrame.createOrReplaceGlobalTempView - SparkConnectException: requirement failed
[ https://issues.apache.org/jira/browse/SPARK-41820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-41820. --- Assignee: Takuya Ueshin Resolution: Fixed Issue resolved by pull request 39968 https://github.com/apache/spark/pull/39968 > DataFrame.createOrReplaceGlobalTempView - SparkConnectException: requirement > failed > --- > > Key: SPARK-41820 > URL: https://issues.apache.org/jira/browse/SPARK-41820 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Takuya Ueshin >Priority: Major > > {code:java} > >>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", > >>> "name"]) > >>> df2 = df.filter(df.age > 3) > >>> df2.createOrReplaceGlobalTempView("people") {code} > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1292, in > pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView > Failed example: > df2.createOrReplaceGlobalTempView("people") > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File " pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView[3]>", > line 1, in > df2.createOrReplaceGlobalTempView("people") > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1192, in createOrReplaceGlobalTempView > self._session.client.execute_command(command) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 459, in execute_command > self._execute(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 547, in _execute > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 625, 
in _handle_error > raise SparkConnectException(status.message) from None > pyspark.sql.connect.client.SparkConnectException: requirement failed > {code}
[jira] [Created] (SPARK-42402) Support parameterized SQL by sql()
Takuya Ueshin created SPARK-42402: - Summary: Support parameterized SQL by sql() Key: SPARK-42402 URL: https://issues.apache.org/jira/browse/SPARK-42402 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Takuya Ueshin
[jira] [Updated] (SPARK-41820) DataFrame.createOrReplaceGlobalTempView - SparkConnectException: requirement failed
[ https://issues.apache.org/jira/browse/SPARK-41820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-41820: -- Description: {code:java} >>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", >>> "name"]) >>> df2 = df.filter(df.age > 3) >>> df2.createOrReplaceGlobalTempView("people") {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1292, in pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView Failed example: df2.createOrReplaceGlobalTempView("people") Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df2.createOrReplaceGlobalTempView("people") File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1192, in createOrReplaceGlobalTempView self._session.client.execute_command(command) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 459, in execute_command self._execute(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 547, in _execute self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 625, in _handle_error raise SparkConnectException(status.message) from None pyspark.sql.connect.client.SparkConnectException: requirement failed {code} was: {code:java} >>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", >>> "name"]) >>> df.createOrReplaceGlobalTempView("people") {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1292, in pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView Failed example: df2.createOrReplaceGlobalTempView("people") Exception raised: Traceback (most recent 
call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df2.createOrReplaceGlobalTempView("people") File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1192, in createOrReplaceGlobalTempView self._session.client.execute_command(command) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 459, in execute_command self._execute(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 547, in _execute self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 625, in _handle_error raise SparkConnectException(status.message) from None pyspark.sql.connect.client.SparkConnectException: requirement failed {code} > DataFrame.createOrReplaceGlobalTempView - SparkConnectException: requirement > failed > --- > > Key: SPARK-41820 > URL: https://issues.apache.org/jira/browse/SPARK-41820 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > >>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", > >>> "name"]) > >>> df2 = df.filter(df.age > 3) > >>> df2.createOrReplaceGlobalTempView("people") {code} > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1292, in > pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView > Failed example: > df2.createOrReplaceGlobalTempView("people") > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File " 
pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView[3]>", > line 1, in > df2.createOrReplaceGlobalTempView("people") > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1192, in createOrReplaceGlobalTempView > self._session.client.execute_command(command) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 459, in execute_command > self._execute(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", >
[jira] [Updated] (SPARK-42017) df["bad_key"] does not raise AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-42017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-42017: -- Parent Issue: SPARK-41282 (was: SPARK-42006) > df["bad_key"] does not raise AnalysisException > -- > > Key: SPARK-42017 > URL: https://issues.apache.org/jira/browse/SPARK-42017 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > e.g.) > {code} > 23/01/12 14:33:43 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > FAILED [ 8%] > pyspark/sql/tests/test_column.py:105 (ColumnParityTests.test_access_column) > self = testMethod=test_access_column> > def test_access_column(self): > df = self.df > self.assertTrue(isinstance(df.key, Column)) > self.assertTrue(isinstance(df["key"], Column)) > self.assertTrue(isinstance(df[0], Column)) > self.assertRaises(IndexError, lambda: df[2]) > > self.assertRaises(AnalysisException, lambda: df["bad_key"]) > E AssertionError: AnalysisException not raised by > ../test_column.py:112: AssertionError > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42017) df["bad_key"] does not raise AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-42017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-42017: -- Summary: df["bad_key"] does not raise AnalysisException (was: Different error type AnalysisException vs SparkConnectAnalysisException) > df["bad_key"] does not raise AnalysisException > -- > > Key: SPARK-42017 > URL: https://issues.apache.org/jira/browse/SPARK-42017 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > e.g.) > {code} > 23/01/12 14:33:43 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > FAILED [ 8%] > pyspark/sql/tests/test_column.py:105 (ColumnParityTests.test_access_column) > self = testMethod=test_access_column> > def test_access_column(self): > df = self.df > self.assertTrue(isinstance(df.key, Column)) > self.assertTrue(isinstance(df["key"], Column)) > self.assertTrue(isinstance(df[0], Column)) > self.assertRaises(IndexError, lambda: df[2]) > > self.assertRaises(AnalysisException, lambda: df["bad_key"]) > E AssertionError: AnalysisException not raised by > ../test_column.py:112: AssertionError > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42338) Different exception in DataFrame.sample
[ https://issues.apache.org/jira/browse/SPARK-42338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-42338: -- Environment: (was: It raises {{SparkConnectGrpcException}} instead of {{IllegalArgumentException}}.) > Different exception in DataFrame.sample > --- > > Key: SPARK-42338 > URL: https://issues.apache.org/jira/browse/SPARK-42338 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42338) Different exception in DataFrame.sample
[ https://issues.apache.org/jira/browse/SPARK-42338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-42338: -- Description: It raises {{SparkConnectGrpcException}} instead of {{IllegalArgumentException}}. > Different exception in DataFrame.sample > --- > > Key: SPARK-42338 > URL: https://issues.apache.org/jira/browse/SPARK-42338 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > > It raises {{SparkConnectGrpcException}} instead of > {{IllegalArgumentException}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42342) Introduce base hierarchy to exceptions.
Takuya Ueshin created SPARK-42342: - Summary: Introduce base hierarchy to exceptions. Key: SPARK-42342 URL: https://issues.apache.org/jira/browse/SPARK-42342 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42340) Implement GroupedData.applyInPandas
Takuya Ueshin created SPARK-42340: - Summary: Implement GroupedData.applyInPandas Key: SPARK-42340 URL: https://issues.apache.org/jira/browse/SPARK-42340 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42338) Different exception in DataFrame.sample
Takuya Ueshin created SPARK-42338: - Summary: Different exception in DataFrame.sample Key: SPARK-42338 URL: https://issues.apache.org/jira/browse/SPARK-42338 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Environment: It raises {{SparkConnectGrpcException}} instead of {{IllegalArgumentException}}. Reporter: Takuya Ueshin -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42017) Different error type AnalysisException vs SparkConnectAnalysisException
[ https://issues.apache.org/jira/browse/SPARK-42017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683594#comment-17683594 ] Takuya Ueshin commented on SPARK-42017: --- The error class hierarchy is one of the issues, but the test in the description has a different issue: {code:python} df["bad_key"] {code} does not raise any error at that point, because Spark Connect doesn't yet analyze whether the column name is valid. > Different error type AnalysisException vs SparkConnectAnalysisException > --- > > Key: SPARK-42017 > URL: https://issues.apache.org/jira/browse/SPARK-42017 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > e.g.) > {code} > 23/01/12 14:33:43 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > FAILED [ 8%] > pyspark/sql/tests/test_column.py:105 (ColumnParityTests.test_access_column) > self = testMethod=test_access_column> > def test_access_column(self): > df = self.df > self.assertTrue(isinstance(df.key, Column)) > self.assertTrue(isinstance(df["key"], Column)) > self.assertTrue(isinstance(df[0], Column)) > self.assertRaises(IndexError, lambda: df[2]) > > self.assertRaises(AnalysisException, lambda: df["bad_key"]) > E AssertionError: AnalysisException not raised by > ../test_column.py:112: AssertionError > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
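The behavior described in the comment above — eager name checking in classic PySpark versus deferred checking in Spark Connect — can be sketched in plain Python. The classes below are hypothetical illustrations, not Spark Connect's actual internals:

```python
class EagerFrame:
    """Validates column names at access time, like classic PySpark."""

    def __init__(self, columns):
        self.columns = list(columns)

    def __getitem__(self, name):
        if name not in self.columns:
            # Fails immediately, so assertRaises(..., lambda: df["bad_key"]) passes.
            raise KeyError(f"column `{name}` cannot be found")
        return ("col", name)


class LazyFrame:
    """Only records the reference; validation happens later, at "analysis" time."""

    def __init__(self, columns):
        self.columns = list(columns)

    def __getitem__(self, name):
        # No check here: this mirrors Spark Connect building an unresolved plan.
        return ("unresolved_col", name)

    def analyze(self, ref):
        kind, name = ref
        if name not in self.columns:
            # The failure only surfaces once the plan is analyzed.
            raise KeyError(f"column `{name}` cannot be found")
        return ("col", name)
```

With the lazy variant, `assertRaises(..., lambda: df["bad_key"])` passes only if the test forces analysis, which is why the parity test in the description fails.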
[jira] [Created] (SPARK-42295) Tear down the test cleanly
Takuya Ueshin created SPARK-42295: - Summary: Tear down the test cleanly Key: SPARK-42295 URL: https://issues.apache.org/jira/browse/SPARK-42295 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41778) Add an alias "reduce" to ArrayAggregate
Takuya Ueshin created SPARK-41778: - Summary: Add an alias "reduce" to ArrayAggregate Key: SPARK-41778 URL: https://issues.apache.org/jira/browse/SPARK-41778 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Takuya Ueshin Adds an alias "reduce" to {{ArrayAggregate}}. Presto provides the function: https://prestodb.io/docs/current/functions/array.html#reduce. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41753) Add tests for ArrayZip to check the result size and nullability.
Takuya Ueshin created SPARK-41753: - Summary: Add tests for ArrayZip to check the result size and nullability. Key: SPARK-41753 URL: https://issues.apache.org/jira/browse/SPARK-41753 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.4.0 Reporter: Takuya Ueshin Add tests for {{ArrayZip}} to check the result size and nullability. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39419) When the comparator of ArraySort returns null, it should fail.
Takuya Ueshin created SPARK-39419: - Summary: When the comparator of ArraySort returns null, it should fail. Key: SPARK-39419 URL: https://issues.apache.org/jira/browse/SPARK-39419 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Takuya Ueshin When the comparator of {{ArraySort}} returns {{null}}, currently it handles it as {{0}} (equal). According to the doc, {quote} It returns -1, 0, or 1 as the first element is less than, equal to, or greater than the second element. If the comparator function returns other values (including null), the function will fail and raise an error. {quote} It's fine to return non -1, 0, 1 integers to follow the Java convention (still need to update the doc, though), but it should throw an exception for {{null}} result. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
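The documented contract quoted above can be illustrated with a small Python sketch: a three-way comparator whose null (`None`) result is rejected instead of being silently treated as "equal". This is an illustration of the intended behavior, not Spark's implementation:

```python
import functools


def sort_with_strict_comparator(items, comparator):
    """Sort with a three-way comparator; a None result is an error, not 'equal'."""

    def checked(a, b):
        result = comparator(a, b)
        if result is None:
            # Mirrors the documented ArraySort contract: a null comparator
            # result must fail rather than be handled as 0.
            raise ValueError("The comparison result of the comparator must not be null.")
        return result

    return sorted(items, key=functools.cmp_to_key(checked))
```

Non-(-1, 0, 1) integers still work, matching the Java convention the comment mentions; only `None` is rejected.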
[jira] [Created] (SPARK-39293) The accumulator of ArrayAggregate should copy the intermediate result if string, struct, array, or map
Takuya Ueshin created SPARK-39293: - Summary: The accumulator of ArrayAggregate should copy the intermediate result if string, struct, array, or map Key: SPARK-39293 URL: https://issues.apache.org/jira/browse/SPARK-39293 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1, 3.1.2, 3.0.3, 3.3.0 Reporter: Takuya Ueshin The accumulator of ArrayAggregate should copy the intermediate result if string, struct, array, or map.
{code:scala}
import org.apache.spark.sql.functions._
val reverse = udf((s: String) => s.reverse)
val df = Seq(Array("abc", "def")).toDF("array")
val testArray = df.withColumn(
  "agg",
  aggregate(
    col("array"),
    array().cast("array<string>"),
    (acc, s) => concat(acc, array(reverse(s)))))
testArray.show(truncate = false)
{code}
should be:
{code}
+----------+----------+
|array     |agg       |
+----------+----------+
|[abc, def]|[cba, fed]|
+----------+----------+
{code}
but:
{code}
+----------+----------+
|array     |agg       |
+----------+----------+
|[abc, def]|[fed, fed]|
+----------+----------+
{code}
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
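The symptom above (`[fed, fed]` instead of `[cba, fed]`) is classic buffer aliasing: the accumulator keeps a reference into a reused buffer instead of copying it, so the last write clobbers every earlier result. A plain-Python analogy of the bug and the fix (not Spark internals):

```python
def aggregate_no_copy(values, func):
    """Buggy: stores references to a single reused buffer, like an uncopied
    unsafe row/array slot, so later writes clobber earlier results."""
    buf = [None]          # one buffer, reused for every element
    out = []
    for v in values:
        buf[0] = func(v)  # overwrite in place
        out.append(buf)   # appends the *same* list object every time
    return [b[0] for b in out]


def aggregate_with_copy(values, func):
    """Fixed: copies the intermediate result before accumulating it."""
    out = []
    for v in values:
        buf = [func(v)]   # fresh copy per element
        out.append(buf)
    return [b[0] for b in out]
```

Running the buggy version over `["abc", "def"]` with a string-reverse function reproduces the `[fed, fed]` pattern from the report.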
[jira] [Resolved] (SPARK-39048) Refactor `GroupBy._reduce_for_stat_function` on accepted data types
[ https://issues.apache.org/jira/browse/SPARK-39048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-39048. --- Assignee: Xinrong Meng Resolution: Fixed Issue resolved by pull request 36382 https://github.com/apache/spark/pull/36382 > Refactor `GroupBy._reduce_for_stat_function` on accepted data types > > > Key: SPARK-39048 > URL: https://issues.apache.org/jira/browse/SPARK-39048 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > `Groupby._reduce_for_stat_function` is a common helper function leveraged by > multiple statistical functions of GroupBy objects. > It defines parameters `only_numeric` and `bool_as_numeric` to control > accepted Spark types. > To be consistent with pandas API, we may also have to introduce > `str_as_numeric` for `sum` for example. > Instead of introducing parameters designated for each Spark type, the PR is > proposed to introduce a parameter `accepted_spark_types` to specify accepted > types of Spark columns to be aggregated. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38882) The usage logger attachment logic should handle static methods properly.
Takuya Ueshin created SPARK-38882: - Summary: The usage logger attachment logic should handle static methods properly. Key: SPARK-38882 URL: https://issues.apache.org/jira/browse/SPARK-38882 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.1, 3.3.0 Reporter: Takuya Ueshin The usage logger attachment logic has an issue when handling static methods. For example, {code} $ PYSPARK_PANDAS_USAGE_LOGGER=pyspark.pandas.usage_logging.usage_logger ./bin/pyspark {code} {code:python} >>> import pyspark.pandas as ps >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]}) >>> psdf.from_records([(1, 2), (3, 4)]) A function `DataFrame.from_records(data, index, exclude, columns, coerce_float, nrows)` was failed after 2007.430 ms: 0 Traceback (most recent call last): ... {code} without usage logger: {code:python} >>> import pyspark.pandas as ps >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]}) >>> psdf.from_records([(1, 2), (3, 4)]) 0 1 0 1 2 1 3 4 {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
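The static-method pitfall can be reproduced outside Spark: pulling a staticmethod off a class and wrapping its underlying function yields a plain function, which Python then binds as an instance method, so instance calls receive the instance as the first argument. A small sketch — the `Frame` class and `_wrap` logger are hypothetical stand-ins for the usage logger's attachment logic:

```python
import functools


class Frame:
    @staticmethod
    def from_records(data):
        return list(data)


def _wrap(func):
    """Stand-in for a usage-logging wrapper."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)  # a real usage logger would log here
    return wrapper


# In the class __dict__, the attribute is a staticmethod object, not a function.
raw = Frame.__dict__["from_records"]
assert isinstance(raw, staticmethod)

# Naive attachment, `Frame.from_records = _wrap(raw.__func__)`, would install a
# plain function; instance access then binds it, passing the instance as `data`.
# Correct: re-wrap as a staticmethod after wrapping the underlying function.
Frame.from_records = staticmethod(_wrap(raw.__func__))
```

After the correct attachment, both class and instance access behave as before wrapping.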
[jira] [Created] (SPARK-38628) Complete the copy method in subclasses of InternalRow, ArrayData, and MapData to safely copy their instances.
Takuya Ueshin created SPARK-38628: - Summary: Complete the copy method in subclasses of InternalRow, ArrayData, and MapData to safely copy their instances. Key: SPARK-38628 URL: https://issues.apache.org/jira/browse/SPARK-38628 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Takuya Ueshin Some subclasses of {{InternalRow}}, {{ArrayData}}, and {{MapData}} missing support for {{StructType}}, {{ArrayType}}, and {{MapType}} in their {{copy}} method. We should complete them to safely copy their instances. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38484) Move usage logging instrumentation util functions from pandas module to pyspark.util module
[ https://issues.apache.org/jira/browse/SPARK-38484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-38484. --- Fix Version/s: 3.3.0 Assignee: Yihong He Resolution: Fixed Issue resolved by pull request 35790 https://github.com/apache/spark/pull/35790 > Move usage logging instrumentation util functions from pandas module to > pyspark.util module > --- > > Key: SPARK-38484 > URL: https://issues.apache.org/jira/browse/SPARK-38484 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Yihong He >Assignee: Yihong He >Priority: Minor > Fix For: 3.3.0 > > > It will be helpful to attach the usage logger to other modules (e.g. sql) > besides Pandas but other modules should not depend on Pandas modules to use > the instrumentation utils (e.g. _wrap_function, _wrap_property ...). So we > need to move usage logging instrumentation util functions from Pandas module > to pyspark.util module. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37491) Fix Series.asof when values of the series is not sorted
[ https://issues.apache.org/jira/browse/SPARK-37491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-37491. --- Assignee: pralabhkumar Resolution: Fixed Issue resolved by pull request 35191 https://github.com/apache/spark/pull/35191 > Fix Series.asof when values of the series is not sorted > --- > > Key: SPARK-37491 > URL: https://issues.apache.org/jira/browse/SPARK-37491 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: pralabhkumar >Priority: Major > > https://github.com/apache/spark/pull/34737#discussion_r758223279 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38387) Support `na_action` and Series input correspondence in `Series.map`
[ https://issues.apache.org/jira/browse/SPARK-38387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-38387. --- Fix Version/s: 3.3.0 Assignee: Xinrong Meng Resolution: Fixed Issue resolved by pull request 35706 https://github.com/apache/spark/pull/35706 > Support `na_action` and Series input correspondence in `Series.map` > --- > > Key: SPARK-38387 > URL: https://issues.apache.org/jira/browse/SPARK-38387 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Support `na_action` and Series input correspondence in `Series.map`, in order > to reach parity to pandas API. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37903) Replace string_typehints with get_type_hints.
Takuya Ueshin created SPARK-37903: - Summary: Replace string_typehints with get_type_hints. Key: SPARK-37903 URL: https://issues.apache.org/jira/browse/SPARK-37903 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Takuya Ueshin Currently we have a hacky way to resolve type hints written as strings, but we can use {{get_type_hints}} instead. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
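The standard-library replacement the ticket proposes can be shown directly: `__annotations__` keeps string type hints as raw strings, while `typing.get_type_hints` evaluates them to real types. A minimal illustration:

```python
from typing import get_type_hints


def plus_one(v: "int") -> "int":
    """Type hints written as strings, the case the hacky resolver handled."""
    return v + 1


# Raw annotations are still the literal strings...
raw = plus_one.__annotations__
# ...but get_type_hints evaluates them to actual type objects.
hints = get_type_hints(plus_one)
```

This is why `get_type_hints` can replace a hand-rolled string-resolution mechanism.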
[jira] [Created] (SPARK-37885) Allow pandas_udf to take type annotations with future annotations enabled
Takuya Ueshin created SPARK-37885: - Summary: Allow pandas_udf to take type annotations with future annotations enabled Key: SPARK-37885 URL: https://issues.apache.org/jira/browse/SPARK-37885 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Takuya Ueshin When using {{{}from __future__ import annotations{}}}, the type hints will be all strings, then pandas UDF type inference won't work as follows: {code:python} >>> from __future__ import annotations >>> from typing import Union >>> import pandas as pd >>> from pyspark.sql.functions import pandas_udf >>> @pandas_udf("long") ... def plus_one(v: Union[pd.Series, pd.DataFrame]) -> pd.Series: ... return v + 1 Traceback (most recent call last): ... NotImplementedError: Unsupported signature: (v: 'Union[pd.Series, pd.DataFrame]') -> 'pd.Series'. {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
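The failure mode above comes from PEP 563: with postponed evaluation, every annotation is stored as a string, so naive inspection of `__annotations__` sees `'pd.Series'` rather than a type. The sketch below reproduces the stringification using the `__future__` compiler flag (equivalent to `from __future__ import annotations` at module top) and shows that `typing.get_type_hints` can still resolve the hints:

```python
import __future__
import typing

src = "def plus_one(v: int) -> int:\n    return v + 1\n"
ns = {}
# Compile with postponed evaluation of annotations (PEP 563), as if the module
# started with `from __future__ import annotations`.
code = compile(src, "<sketch>", "exec", flags=__future__.annotations.compiler_flag)
exec(code, ns)
f = ns["plus_one"]

# The hints are now plain strings, which breaks naive annotation inspection...
assert f.__annotations__["v"] == "int"
# ...but get_type_hints still evaluates them against the function's globals.
assert typing.get_type_hints(f)["v"] is int
```

A type-inference path that only looks at the raw annotation objects, as in the traceback above, has to resolve the strings first.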
[jira] [Created] (SPARK-37782) Make DataFrame.transform take the parameters for the function.
Takuya Ueshin created SPARK-37782: - Summary: Make DataFrame.transform take the parameters for the function. Key: SPARK-37782 URL: https://issues.apache.org/jira/browse/SPARK-37782 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Takuya Ueshin Currently when a function which takes parameters besides DataFrame is passed to {{DataFrame.transform}}, {{lambda}} needs to be used. Making {{DataFrame.transform}} take the parameters would be more convenient. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
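The proposed convenience can be sketched with a standalone function (hypothetical; the real API is a `DataFrame` method): forwarding `*args`/`**kwargs` to the transform function removes the need for a `lambda` wrapper.

```python
def transform(df, func, *args, **kwargs):
    """Sketch of the proposed signature: extra arguments are passed to func."""
    return func(df, *args, **kwargs)


def add_constant(df, n):
    """Example transform taking a parameter besides the DataFrame (here a list)."""
    return [row + n for row in df]


# Before: transform(df, lambda df: add_constant(df, 3))
# After:  transform(df, add_constant, 3)
```

Both call styles produce the same result; the second simply drops the `lambda`.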
[jira] [Resolved] (SPARK-37678) Incorrect annotations in SeriesGroupBy._cleanup_and_return
[ https://issues.apache.org/jira/browse/SPARK-37678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-37678. --- Fix Version/s: 3.2.1 3.3.0 Assignee: Maciej Szymkiewicz Resolution: Fixed Issue resolved by pull request 34950 https://github.com/apache/spark/pull/34950 > Incorrect annotations in SeriesGroupBy._cleanup_and_return > --- > > Key: SPARK-37678 > URL: https://issues.apache.org/jira/browse/SPARK-37678 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.2.1, 3.3.0 > > > [{{SeriesGroupBy._cleanup_and_return}}|https://github.com/apache/spark/blob/02ee1ae10b938eaa1621c3e878d07c39b9887c2e/python/pyspark/pandas/groupby.py#L2997-L2998] > annotations > {code:python} > def _cleanup_and_return(self, pdf: pd.DataFrame) -> Series: > return first_series(pdf).rename().rename(self._psser.name) > {code} > are inconsistent: > - If {{pdf}} is {{pd.DataFrame}} then output should be {{pd.Series}}. > - If output is {{ps.Series}} then {{pdf}} should be {{ps.DataFrame}}. > Doesn't seem like the method is used (it is possible that my search skills > and PyCharm inspection failed), so I am not sure which of these options was > intended. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37678) Incorrect annotations in SeriesGroupBy._cleanup_and_return
[ https://issues.apache.org/jira/browse/SPARK-37678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461681#comment-17461681 ] Takuya Ueshin commented on SPARK-37678: --- Yes! > Incorrect annotations in SeriesGroupBy._cleanup_and_return > --- > > Key: SPARK-37678 > URL: https://issues.apache.org/jira/browse/SPARK-37678 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > [{{SeriesGroupBy._cleanup_and_return}}|https://github.com/apache/spark/blob/02ee1ae10b938eaa1621c3e878d07c39b9887c2e/python/pyspark/pandas/groupby.py#L2997-L2998] > annotations > {code:python} > def _cleanup_and_return(self, pdf: pd.DataFrame) -> Series: > return first_series(pdf).rename().rename(self._psser.name) > {code} > are inconsistent: > - If {{pdf}} is {{pd.DataFrame}} then output should be {{pd.Series}}. > - If output is {{ps.Series}} then {{pdf}} should be {{ps.DataFrame}}. > Doesn't seem like the method is used (it is possible that my search skills > and PyCharm inspection failed), so I am not sure which of these options was > intended. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37678) Incorrect annotations in SeriesGroupBy._cleanup_and_return
[ https://issues.apache.org/jira/browse/SPARK-37678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461677#comment-17461677 ] Takuya Ueshin edited comment on SPARK-37678 at 12/17/21, 9:30 PM: -- Good catch! It must be {{{}_cleanup_and_return(self, psdf: DataFrame) -> Series{}}}. (not {{pd.}}) ??Doesn't seem like the method is used?? It's an actual implementation of an abstract method {{GroupBy._cleanup_and_return}} for {{{}SeriesGroupBy{}}}. {{GroupBy._cleanup_and_return}} is called at many places in {{{}GroupBy{}}}. was (Author: ueshin): Good catch! It must be {{{}_cleanup_and_return(self, psdf: DataFrame) -> Series{}}}. ??Doesn't seem like the method is used?? It's an actual implementation of an abstract method {{GroupBy._cleanup_and_return}} for {{{}SeriesGroupBy{}}}. {{GroupBy._cleanup_and_return}} is called at many places in {{{}GroupBy{}}}. > Incorrect annotations in SeriesGroupBy._cleanup_and_return > --- > > Key: SPARK-37678 > URL: https://issues.apache.org/jira/browse/SPARK-37678 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > [{{SeriesGroupBy._cleanup_and_return}}|https://github.com/apache/spark/blob/02ee1ae10b938eaa1621c3e878d07c39b9887c2e/python/pyspark/pandas/groupby.py#L2997-L2998] > annotations > {code:python} > def _cleanup_and_return(self, pdf: pd.DataFrame) -> Series: > return first_series(pdf).rename().rename(self._psser.name) > {code} > are inconsistent: > - If {{pdf}} is {{pd.DataFrame}} then output should be {{pd.Series}}. > - If output is {{ps.Series}} then {{pdf}} should be {{ps.DataFrame}}. > Doesn't seem like the method is used (it is possible that my search skills > and PyCharm inspection failed), so I am not sure which of these options was > intended. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37678) Incorrect annotations in SeriesGroupBy._cleanup_and_return
[ https://issues.apache.org/jira/browse/SPARK-37678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461677#comment-17461677 ] Takuya Ueshin commented on SPARK-37678: --- Good catch! It must be {{{}_cleanup_and_return(self, psdf: DataFrame) -> Series{}}}. ??Doesn't seem like the method is used?? It's an actual implementation of an abstract method {{GroupBy._cleanup_and_return}} for {{{}SeriesGroupBy{}}}. {{GroupBy._cleanup_and_return}} is called at many places in {{{}GroupBy{}}}. > Incorrect annotations in SeriesGroupBy._cleanup_and_return > --- > > Key: SPARK-37678 > URL: https://issues.apache.org/jira/browse/SPARK-37678 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > [{{SeriesGroupBy._cleanup_and_return}}|https://github.com/apache/spark/blob/02ee1ae10b938eaa1621c3e878d07c39b9887c2e/python/pyspark/pandas/groupby.py#L2997-L2998] > annotations > {code:python} > def _cleanup_and_return(self, pdf: pd.DataFrame) -> Series: > return first_series(pdf).rename().rename(self._psser.name) > {code} > are inconsistent: > - If {{pdf}} is {{pd.DataFrame}} then output should be {{pd.Series}}. > - If output is {{ps.Series}} then {{pdf}} should be {{ps.DataFrame}}. > Doesn't seem like the method is used (it is possible that my search skills > and PyCharm inspection failed), so I am not sure which of these options was > intended. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37669) Remove unnecessary usages of OrderedDict
[ https://issues.apache.org/jira/browse/SPARK-37669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461057#comment-17461057 ] Takuya Ueshin commented on SPARK-37669: --- I'm working on this. > Remove unnecessary usages of OrderedDict > > > Key: SPARK-37669 > URL: https://issues.apache.org/jira/browse/SPARK-37669 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > Now that supported Python is 3.7 and above, we can remove unnecessary usages > of {{OrderedDict}} because built-in dict guarantees the insertion order. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37669) Remove unnecessary usages of OrderedDict
Takuya Ueshin created SPARK-37669: - Summary: Remove unnecessary usages of OrderedDict Key: SPARK-37669 URL: https://issues.apache.org/jira/browse/SPARK-37669 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: Takuya Ueshin Now that supported Python is 3.7 and above, we can remove unnecessary usages of {{OrderedDict}} because built-in dict guarantees the insertion order. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
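The guarantee relied on here is easy to check, along with the one caveat worth remembering when swapping `OrderedDict` for `dict`: equality is order-sensitive only for `OrderedDict`.

```python
from collections import OrderedDict

# Since Python 3.7 the language guarantees insertion order for built-in dict,
# so an OrderedDict used purely for ordering can become a plain dict.
legacy = OrderedDict([("a", 1), ("b", 2), ("c", 3)])
modern = {"a": 1, "b": 2, "c": 3}
assert list(legacy) == list(modern) == ["a", "b", "c"]

# Caveat: OrderedDict comparison is order-sensitive, plain dict's is not, so
# removals are only safe where OrderedDict was used for ordering alone.
assert OrderedDict([("a", 1), ("b", 2)]) != OrderedDict([("b", 2), ("a", 1)])
assert {"a": 1, "b": 2} == {"b": 2, "a": 1}
```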
[jira] [Created] (SPARK-37514) Remove workarounds due to older pandas
Takuya Ueshin created SPARK-37514: - Summary: Remove workarounds due to older pandas Key: SPARK-37514 URL: https://issues.apache.org/jira/browse/SPARK-37514 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Takuya Ueshin Now that we upgraded the minimum version of pandas to {{1.0.5}}. We can remove workarounds for pandas API on Spark to run with older pandas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37443) Provide a profiler for Python/Pandas UDFs
Takuya Ueshin created SPARK-37443: - Summary: Provide a profiler for Python/Pandas UDFs Key: SPARK-37443 URL: https://issues.apache.org/jira/browse/SPARK-37443 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Takuya Ueshin Currently a profiler is provided for only {{RDD}} operations, but providing a profiler for Python/Pandas UDFs would be great. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37374) StatCounter should use mergeStats when merging with self.
Takuya Ueshin created SPARK-37374: - Summary: StatCounter should use mergeStats when merging with self. Key: SPARK-37374 URL: https://issues.apache.org/jira/browse/SPARK-37374 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 3.3.0 Reporter: Takuya Ueshin {{StatCounter}} should use {{mergeStats}} instead of {{merge}} when merging with {{self}}. This is a long standing bug but usually this bug won't be hit unless users explicitly use {{mergeStats}} with {{self}}. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
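The hazard is that merging an accumulator with itself mutates the very fields being read. A minimal sketch of the safe pattern (a simplified stand-in, not PySpark's `StatCounter`) is to snapshot `self` before merging:

```python
class Stats:
    """Minimal running-stats accumulator (count and sum) with a safe self-merge."""

    def __init__(self, values=()):
        self.n = 0
        self.total = 0.0
        for v in values:
            self.n += 1
            self.total += v

    def merge_stats(self, other):
        if other is self:
            # Merging with self must work on a snapshot; otherwise the updates
            # below would read fields they are simultaneously mutating.
            other = self.copy()
        self.n += other.n
        self.total += other.total
        return self

    def copy(self):
        out = Stats()
        out.n, out.total = self.n, self.total
        return out
```

Without the `other is self` snapshot, a self-merge of a more complex accumulator (e.g. one tracking variance) would combine partially updated state, which is the class of bug the ticket describes.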
[jira] [Resolved] (SPARK-37298) Use unique exprId in RewriteAsOfJoin
[ https://issues.apache.org/jira/browse/SPARK-37298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-37298. --- Fix Version/s: 3.3.0 Assignee: Allison Wang Resolution: Fixed Issue resolved by pull request 34567 https://github.com/apache/spark/pull/34567 > Use unique exprId in RewriteAsOfJoin > > > Key: SPARK-37298 > URL: https://issues.apache.org/jira/browse/SPARK-37298 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.3.0 > > > Use a new exprId instead of reusing an old exprId in RewriteAsOfJoin to help > guarantee plan integrity and eliminate potential issues with exprId reuse. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37296) Add missing type hints in python/pyspark/util.py
Takuya Ueshin created SPARK-37296: - Summary: Add missing type hints in python/pyspark/util.py Key: SPARK-37296 URL: https://issues.apache.org/jira/browse/SPARK-37296 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36845) Inline type hint files for files in python/pyspark/sql
[ https://issues.apache.org/jira/browse/SPARK-36845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-36845: -- Summary: Inline type hint files for files in python/pyspark/sql (was: Inline type hint files) > Inline type hint files for files in python/pyspark/sql > -- > > Key: SPARK-36845 > URL: https://issues.apache.org/jira/browse/SPARK-36845 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Xinrong Meng >Priority: Major > > Currently there are type hint stub files ({{*.pyi}}) to show the expected > types for functions, but we can also take advantage of static type checking > within the functions by inlining the type hints. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36845) Inline type hint files
[ https://issues.apache.org/jira/browse/SPARK-36845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432684#comment-17432684 ] Takuya Ueshin commented on SPARK-36845: --- Hi [~dchvn], shall we file separate umbrella tickets for each module and resolve this? The number of sub-tasks is already growing for one umbrella ticket. Managing tasks based on each module should be clearer. Thanks! > Inline type hint files > -- > > Key: SPARK-36845 > URL: https://issues.apache.org/jira/browse/SPARK-36845 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Xinrong Meng >Priority: Major > > Currently there are type hint stub files ({{*.pyi}}) to show the expected > types for functions, but we can also take advantage of static type checking > within the functions by inlining the type hints. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37079) Fix DataFrameWriterV2.partitionedBy to send the arguments to JVM properly
Takuya Ueshin created SPARK-37079: - Summary: Fix DataFrameWriterV2.partitionedBy to send the arguments to JVM properly Key: SPARK-37079 URL: https://issues.apache.org/jira/browse/SPARK-37079 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 3.2.0, 3.1.2, 3.3.0 Reporter: Takuya Ueshin In PySpark, {{DataFrameWriterV2.partitionedBy}} doesn't send the arguments to the JVM properly.
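The ticket doesn't show the failing code path, but bugs of this shape in PySpark typically come from forwarding a raw Python tuple across the py4j bridge instead of converting each element first. A standalone, hedged sketch of the pattern (all names below are mock stand-ins for illustration, not actual PySpark internals):

```python
# Mock of the Python-to-Java conversion step; in real PySpark this
# would produce a py4j JavaObject. Purely illustrative.
def to_java_column(col: str) -> str:
    return f"JavaColumn({col})"

def partitioned_by(col: str, *cols: str) -> list:
    # A buggy variant would hand the raw Python tuple `cols` to the JVM,
    # which the Java varargs method cannot interpret. The fix is to
    # convert every argument before crossing the py4j bridge.
    return [to_java_column(c) for c in (col, *cols)]

print(partitioned_by("year", "month"))  # → ['JavaColumn(year)', 'JavaColumn(month)']
```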
[jira] [Resolved] (SPARK-37048) Clean up inlining type hints under SQL module
[ https://issues.apache.org/jira/browse/SPARK-37048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-37048. --- Fix Version/s: 3.3.0 Assignee: Takuya Ueshin Resolution: Fixed Issue resolved by pull request 34318 https://github.com/apache/spark/pull/34318 > Clean up inlining type hints under SQL module > - > > Key: SPARK-37048 > URL: https://issues.apache.org/jira/browse/SPARK-37048 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.3.0 > > > Now that most of the type hints under the SQL module are inlined, > we should clean up the module now.
[jira] [Resolved] (SPARK-36945) Inline type hints for python/pyspark/sql/udf.py
[ https://issues.apache.org/jira/browse/SPARK-36945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-36945. --- Fix Version/s: 3.3.0 Assignee: dch nguyen Resolution: Fixed Issue resolved by pull request 34289 https://github.com/apache/spark/pull/34289 > Inline type hints for python/pyspark/sql/udf.py > --- > > Key: SPARK-36945 > URL: https://issues.apache.org/jira/browse/SPARK-36945 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: dch nguyen >Priority: Major > Fix For: 3.3.0 > >
[jira] [Commented] (SPARK-37048) Clean up inlining type hints under SQL module
[ https://issues.apache.org/jira/browse/SPARK-37048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430245#comment-17430245 ] Takuya Ueshin commented on SPARK-37048: --- I'm working on this. > Clean up inlining type hints under SQL module > - > > Key: SPARK-37048 > URL: https://issues.apache.org/jira/browse/SPARK-37048 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > Now that most of the type hints under the SQL module are inlined, > we should clean up the module now.
[jira] [Created] (SPARK-37048) Clean up inlining type hints under SQL module
Takuya Ueshin created SPARK-37048: - Summary: Clean up inlining type hints under SQL module Key: SPARK-37048 URL: https://issues.apache.org/jira/browse/SPARK-37048 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: Takuya Ueshin Now that most of the type hints under the SQL module are inlined, we should clean up the module now.
[jira] [Resolved] (SPARK-36886) Inline type hints for python/pyspark/sql/context.py
[ https://issues.apache.org/jira/browse/SPARK-36886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-36886. --- Fix Version/s: 3.3.0 Assignee: dch nguyen Resolution: Fixed Issue resolved by pull request 34185 https://github.com/apache/spark/pull/34185 > Inline type hints for python/pyspark/sql/context.py > --- > > Key: SPARK-36886 > URL: https://issues.apache.org/jira/browse/SPARK-36886 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dgd_contributor >Assignee: dch nguyen >Priority: Major > Fix For: 3.3.0 > > > Inline the type hints for python/pyspark/sql/context.py, migrating them > from python/pyspark/sql/context.pyi.
[jira] [Resolved] (SPARK-36910) Inline type hints for python/pyspark/sql/types.py
[ https://issues.apache.org/jira/browse/SPARK-36910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-36910. --- Fix Version/s: 3.3.0 Assignee: Xinrong Meng Resolution: Fixed Issue resolved by pull request 34174 https://github.com/apache/spark/pull/34174 > Inline type hints for python/pyspark/sql/types.py > - > > Key: SPARK-36910 > URL: https://issues.apache.org/jira/browse/SPARK-36910 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Inline type hints for python/pyspark/sql/types.py