[jira] [Assigned] (SPARK-28281) Convert and port 'having.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28281: Assignee: Apache Spark > Convert and port 'having.sql' into UDF test base > > > Key: SPARK-28281 > URL: https://issues.apache.org/jira/browse/SPARK-28281 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28281) Convert and port 'having.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28281: Assignee: (was: Apache Spark) > Convert and port 'having.sql' into UDF test base > > > Key: SPARK-28281 > URL: https://issues.apache.org/jira/browse/SPARK-28281 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major >
[jira] [Commented] (SPARK-28252) local/global temp view should not accept duplicate column names
[ https://issues.apache.org/jira/browse/SPARK-28252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881766#comment-16881766 ] Yuming Wang commented on SPARK-28252: - PostgreSQL also does not support it. {code:sql} postgres=# CREATE TEMPORARY VIEW spark_28252 as select 1 as c1, 2 as c1; ERROR: column "c1" specified more than once {code} > local/global temp view should not accept duplicate column names > --- > > Key: SPARK-28252 > URL: https://issues.apache.org/jira/browse/SPARK-28252 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > scala> spark.sql("create temp view v1 as select 1 as col1, 2 as col1") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("select col1 from v1").show > 19/07/04 22:27:19 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > org.apache.spark.sql.AnalysisException: Reference 'col1' is ambiguous, could > be: v1.col1, v1.col1.; line 1 pos 7 > at > org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:101) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$40.apply(Analyzer.scala:890) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$40.apply(Analyzer.scala:892) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53) > {noformat}
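The check this ticket asks for can be sketched in plain Python (a hypothetical helper, not Spark's actual analyzer code): reject a view's column list up front when a name appears more than once, the way PostgreSQL does.

```python
def check_duplicate_columns(columns):
    """Raise if a column name appears more than once, mimicking
    PostgreSQL's 'column "..." specified more than once' error.

    Hypothetical sketch -- not the actual Spark analyzer rule.
    Comparison is case-insensitive, like Spark's default resolution.
    """
    seen = set()
    for name in columns:
        key = name.lower()
        if key in seen:
            raise ValueError('column "%s" specified more than once' % name)
        seen.add(key)

# The view from the report ('select 1 as col1, 2 as col1') would then be
# rejected at creation time instead of failing later at reference resolution:
# check_duplicate_columns(["col1", "col1"])  # raises ValueError
```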
[jira] [Commented] (SPARK-28289) Convert and port 'union.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881763#comment-16881763 ] Hyukjin Kwon commented on SPARK-28289: -- Please go ahead. > Convert and port 'union.sql' into UDF test base > --- > > Key: SPARK-28289 > URL: https://issues.apache.org/jira/browse/SPARK-28289 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major >
[jira] [Commented] (SPARK-28289) Convert and port 'union.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881750#comment-16881750 ] Yiheng Wang commented on SPARK-28289: - Hi [~hyukjin.kwon], I'll be working on this. Thanks. > Convert and port 'union.sql' into UDF test base > --- > > Key: SPARK-28289 > URL: https://issues.apache.org/jira/browse/SPARK-28289 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major >
[jira] [Commented] (SPARK-28324) The LOG function using 10 as the base, but Spark using E
[ https://issues.apache.org/jira/browse/SPARK-28324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881748#comment-16881748 ] Yuming Wang commented on SPARK-28324: - PostgreSQL, Vertica and Teradata use 10 as the base. DB2, SQL Server, Hive and MySQL use E as the base. > The LOG function using 10 as the base, but Spark using E > > > Key: SPARK-28324 > URL: https://issues.apache.org/jira/browse/SPARK-28324 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Spark SQL: > {code:sql} > spark-sql> select log(10); > 2.302585092994046 > {code} > PostgreSQL: > {code:sql} > postgres=# select log(10); > log > - >1 > (1 row) > {code}
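The discrepancy is easy to reproduce outside SQL with Python's math module (an illustration only, not Spark or PostgreSQL code): Spark's {{log}} is the natural logarithm, while PostgreSQL's {{log}} is base 10.

```python
import math

# Spark SQL's log(10): natural logarithm, base e
spark_result = math.log(10)       # 2.302585092994046

# PostgreSQL's log(10): base-10 logarithm
postgres_result = math.log10(10)  # 1.0
```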
[jira] [Assigned] (SPARK-28312) Add numeric.sql
[ https://issues.apache.org/jira/browse/SPARK-28312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28312: Assignee: Apache Spark > Add numeric.sql > --- > > Key: SPARK-28312 > URL: https://issues.apache.org/jira/browse/SPARK-28312 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > In this ticket, we plan to add the regression test cases of > https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/numeric.sql.
[jira] [Assigned] (SPARK-28312) Add numeric.sql
[ https://issues.apache.org/jira/browse/SPARK-28312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28312: Assignee: (was: Apache Spark) > Add numeric.sql > --- > > Key: SPARK-28312 > URL: https://issues.apache.org/jira/browse/SPARK-28312 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > In this ticket, we plan to add the regression test cases of > https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/numeric.sql.
[jira] [Commented] (SPARK-28283) Convert and port 'intersect-all.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881739#comment-16881739 ] Hyukjin Kwon commented on SPARK-28283: -- Thanks. [~imback82] > Convert and port 'intersect-all.sql' into UDF test base > --- > > Key: SPARK-28283 > URL: https://issues.apache.org/jira/browse/SPARK-28283 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major >
[jira] [Commented] (SPARK-28284) Convert and port 'join-empty-relation.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881738#comment-16881738 ] Hyukjin Kwon commented on SPARK-28284: -- Yea, or we can add some conditions on {{ON}} that return {{true}}. > Convert and port 'join-empty-relation.sql' into UDF test base > - > > Key: SPARK-28284 > URL: https://issues.apache.org/jira/browse/SPARK-28284 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major >
[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base
[ https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27921: - Description: This JIRA aims to improve Python test coverage, in particular around {{ExtractPythonUDFs}}. This rule has caused many regressions and issues, such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. We should convert *.sql test cases that can be affected by the {{ExtractPythonUDFs}} rule, like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely, most plan-related test cases might have to be converted. *Here is the rough contribution guide to follow:* Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if you're able to do this: {code:java} >>> import pandas >>> pandas.__version__ '0.23.4' >>> import pyarrow >>> pyarrow.__version__ '0.13.0' >>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]})) pyarrow.Table a: int64 metadata OrderedDict([(b'pandas', b'{"index_columns": [{"kind": "range", "name": null, "start": ' b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,' b' "field_name": null, "pandas_type": "unicode", "numpy_type":' b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": [' b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu' b'mpy_type": "int64", "metadata": null}], "creator": {"library' b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')]) {code} 1. Copy and paste the {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}} file into {{sql/core/src/test/resources/sql-tests/inputs/udf/udf-xxx.sql}} 2. Keep the comments and state that this file was copied from {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}}, for now. For instance, let's add a comment like the one below at the top: {code:java} -- This test file was converted from xxx.sql. {code} 3. 
Run the command below: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git add . {code} 4. Insert one or more {{udf(...)}} calls into each statement. It is not required to add more combinations, and placement is not strict; ideally, we should place the udf differently in each statement. 5. Run it again: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git diff # or git diff --no-index sql/core/src/test/resources/sql-tests/results/xxx.sql.out sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out {code} 6. Compare the results with the original file, {{sql/core/src/test/resources/sql-tests/results/xxx.sql.out}} 7. If there is a diff, analyze it, file or find the relevant JIRA, and skip the affected tests with comments. Please see [this comment|https://github.com/apache/spark/pull/25090#discussion_r301880585] when you file a JIRA. It's even better if you are able to fix an issue you find, but this can be done separately. There is a great example to check and follow at SPARK-28323, done by [~viirya] 8. Run without generating golden files and check: {code:java} build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" {code} 9. When you open a PR, please attach {{git diff --no-index sql/core/src/test/resources/sql-tests/results/xxx.sql.out sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out}} in the PR description with the template below: {code:java} Diff comparing to 'xxx.sql' ```diff ... # here you put 'git diff' results ``` {code} 10. You're ready. Please go for a PR! If the PR contains other minor fixes, use the {{[SPARK-X][SQL][PYTHON]}} prefix in the PR title. If the PR is purely about tests, use {{[SPARK-X][SQL][PYTHON][TESTS]}}. See [https://github.com/apache/spark/pull/25069] as an example. Note that registered UDFs all return strings, so some differences are expected. 
Note that this JIRA targets plan-specific cases in general. Note that one {{output.sql.out}} file is shared for three UDF test cases (Scala UDF, Python UDF, and Pandas UDF). Beware of this when you fix the tests. Note that this guide is supposed to be updated continuously given how it goes. Note that this test case uses the integrated UDF test base. See [https://github.com/apache/spark/pull/24752] if you're interested in it or find an issue.
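The warning that registered UDFs all return strings explains many of the expected diffs. A plain-Python illustration (not the actual test harness) of how string coercion changes results:

```python
# Simplified stand-in for a registered test UDF, whose result comes
# back as a string in the integrated UDF test base.
udf = lambda x: str(x)

# A numeric value appears as its string form in the golden file...
assert udf(1) == '1'

# ...and operations such as ordering can differ between the two forms:
assert sorted([2, 10]) == [2, 10]
assert sorted([udf(2), udf(10)]) == ['10', '2']  # lexicographic order
```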
[jira] [Commented] (SPARK-28283) Convert and port 'intersect-all.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881732#comment-16881732 ] Terry Kim commented on SPARK-28283: --- I will work on this. > Convert and port 'intersect-all.sql' into UDF test base > --- > > Key: SPARK-28283 > URL: https://issues.apache.org/jira/browse/SPARK-28283 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major >
[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base
[ https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27921: - Description: This JIRA aims to improve Python test coverage, in particular around {{ExtractPythonUDFs}}. This rule has caused many regressions and issues, such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. We should convert *.sql test cases that can be affected by the {{ExtractPythonUDFs}} rule, like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely, most plan-related test cases might have to be converted. *Here is the rough contribution guide to follow:* Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if you're able to do this: {code:java} >>> import pandas >>> pandas.__version__ '0.23.4' >>> import pyarrow >>> pyarrow.__version__ '0.13.0' >>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]})) pyarrow.Table a: int64 metadata OrderedDict([(b'pandas', b'{"index_columns": [{"kind": "range", "name": null, "start": ' b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,' b' "field_name": null, "pandas_type": "unicode", "numpy_type":' b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": [' b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu' b'mpy_type": "int64", "metadata": null}], "creator": {"library' b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')]) {code} 1. Copy and paste the {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}} file into {{sql/core/src/test/resources/sql-tests/inputs/udf/udf-xxx.sql}} 2. Keep the comments and state that this file was copied from {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}}, for now. For instance, let's add a comment like the one below at the top: {code:java} -- This test file was converted from xxx.sql. {code} 3. 
Run the command below: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git add . {code} 4. Insert one or more {{udf(...)}} calls into each statement. It is not required to add more combinations, and placement is not strict; ideally, we should place the udf differently in each statement. 5. Run it again: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git diff # or git diff --no-index sql/core/src/test/resources/sql-tests/results/xxx.sql.out sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out {code} 6. Compare the results with the original file, {{sql/core/src/test/resources/sql-tests/results/xxx.sql.out}} 7. If there is a diff, analyze it, file or find the relevant JIRA, and skip the affected tests with comments. Please see [this comment|https://github.com/apache/spark/pull/25090#discussion_r301880585] when you file a JIRA. It's even better if you are able to fix it, but this can be done separately. There is a great example to check and follow at SPARK-28323, done by [~viirya] 8. Run without generating golden files and check: {code:java} build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" {code} 9. When you open a PR, please attach {{git diff --no-index sql/core/src/test/resources/sql-tests/results/xxx.sql.out sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out}} in the PR description with the template below: {code:java} Diff comparing to 'xxx.sql' ```diff ... # here you put 'git diff' results ``` {code} 10. You're ready. Please go for a PR! If the PR contains other minor fixes, use the {{[SPARK-X][SQL][PYTHON]}} prefix in the PR title. If the PR is purely about tests, use {{[SPARK-X][SQL][PYTHON][TESTS]}}. See [https://github.com/apache/spark/pull/25069] as an example. Note that registered UDFs all return strings, so some differences are expected. Note that this JIRA targets plan-specific cases in general. 
Note that one {{output.sql.out}} file is shared for three UDF test cases (Scala UDF, Python UDF, and Pandas UDF). Beware of this when you fix the tests. Note that this guide is supposed to be updated continuously given how it goes. Note that this test case uses the integrated UDF test base. See [https://github.com/apache/spark/pull/24752] if you're interested in it or find an issue.
[jira] [Commented] (SPARK-28284) Convert and port 'join-empty-relation.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881731#comment-16881731 ] Terry Kim commented on SPARK-28284: --- join-empty-relation.sql has the following: {code:java} SELECT * FROM t1 INNER JOIN empty_table; SELECT * FROM t1 CROSS JOIN empty_table; SELECT * FROM t1 LEFT OUTER JOIN empty_table; SELECT * FROM t1 RIGHT OUTER JOIN empty_table; SELECT * FROM t1 FULL OUTER JOIN empty_table; SELECT * FROM t1 LEFT SEMI JOIN empty_table; SELECT * FROM t1 LEFT ANTI JOIN empty_table; SELECT * FROM empty_table INNER JOIN t1; SELECT * FROM empty_table CROSS JOIN t1; SELECT * FROM empty_table LEFT OUTER JOIN t1; SELECT * FROM empty_table RIGHT OUTER JOIN t1; SELECT * FROM empty_table FULL OUTER JOIN t1; SELECT * FROM empty_table LEFT SEMI JOIN t1; SELECT * FROM empty_table LEFT ANTI JOIN t1; SELECT * FROM empty_table INNER JOIN empty_table; SELECT * FROM empty_table CROSS JOIN empty_table; SELECT * FROM empty_table LEFT OUTER JOIN empty_table; SELECT * FROM empty_table RIGHT OUTER JOIN empty_table; SELECT * FROM empty_table FULL OUTER JOIN empty_table; SELECT * FROM empty_table LEFT SEMI JOIN empty_table; SELECT * FROM empty_table LEFT ANTI JOIN empty_table; {code} Where can I put `udf`? Do you want me to modify the SELECT clause? > Convert and port 'join-empty-relation.sql' into UDF test base > - > > Key: SPARK-28284 > URL: https://issues.apache.org/jira/browse/SPARK-28284 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major >
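One possible shape of step 4 of the conversion for these join statements, sketched as a tiny string rewrite in Python (the column name `t1.a` is hypothetical; the real test files use columns that actually exist in each table):

```python
def wrap_projection_in_udf(sql, column):
    """Replace the bare '*' projection with a udf(...) call over a
    named column -- a sketch of where udf() could go in these joins.
    The column argument is a placeholder chosen by the converter."""
    return sql.replace("SELECT *", "SELECT udf(%s)" % column, 1)

converted = wrap_projection_in_udf(
    "SELECT * FROM t1 INNER JOIN empty_table", "t1.a")
# converted == "SELECT udf(t1.a) FROM t1 INNER JOIN empty_table"
```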
[jira] [Updated] (SPARK-27923) List all cases that PostgreSQL throws an exception but Spark SQL is NULL
[ https://issues.apache.org/jira/browse/SPARK-27923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27923: Description: In this ticket, we plan to list all cases that PostgreSQL throws an exception but Spark SQL is NULL. When porting the [boolean.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/boolean.sql], we found a case: # Casting an unaccepted value to boolean type throws [invalid input syntax|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/boolean.out#L45-L47]. When porting the [case.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/case.sql], we found a case: # Division by zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/case.out#L96-L99]. When porting the [date.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/date.sql], we found a case: # Invalid date [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/date.out#L13-L14]. When porting the [int2.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/int2.sql], we found a case: # Invalid short [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/int2.out#L9-L10]. When porting the [float4.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/float4.sql], we found three cases: # Bad input [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float4.out#L43-L74]. # Bad special inputs [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float4.out#L107-L118]. # Divide by zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float4.out#L239-L241]. 
When porting the [float8.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/float8.sql], we found five cases: # Bad input [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float8.out#L34-L65]. # Bad special inputs [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float8.out#L38-L41]. # Cannot take logarithm of zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float8.out#L439-L440]. # Cannot take logarithm of a negative number [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float8.out#L441-L442]. # Divide by zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float4.out#L445-L446]. When porting the [numeric.sql|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/numeric.sql], we found six cases: # Invalid decimal [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L689-L696]. # The decimal type cannot accept [Infinity and -Infinity|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L718-L731]. # Invalid inputs [throw an exception|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L1429-L1460]. # Invalid inputs [throw an exception|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L1883-L1887]. # Invalid inputs [throw an exception|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L1940-L1945]. # Invalid inputs [throw an exception|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L1987-L1998]. 
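The general pattern behind the list above can be sketched in plain Python (an analogy, not Spark internals): for the same bad input, PostgreSQL raises an error where Spark SQL yields NULL.

```python
def spark_style_cast_int(s):
    """Spark SQL's CAST semantics: invalid input yields NULL (None here)."""
    try:
        return int(s)
    except ValueError:
        return None

def postgres_style_cast_int(s):
    """PostgreSQL semantics: invalid input raises an error, like
    'invalid input syntax for type integer'."""
    return int(s)  # ValueError propagates to the caller

assert spark_style_cast_int("abc") is None   # Spark: NULL
assert postgres_style_cast_int("42") == 42   # both succeed on valid input
# postgres_style_cast_int("abc")             # would raise ValueError
```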
[jira] [Created] (SPARK-28324) The LOG function using 10 as the base, but Spark using E
Yuming Wang created SPARK-28324: --- Summary: The LOG function using 10 as the base, but Spark using E Key: SPARK-28324 URL: https://issues.apache.org/jira/browse/SPARK-28324 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang Spark SQL: {code:sql} spark-sql> select log(10); 2.302585092994046 {code} PostgreSQL: {code:sql} postgres=# select log(10); log - 1 (1 row) {code}
[jira] [Updated] (SPARK-27923) List all cases that PostgreSQL throws an exception but Spark SQL is NULL
[ https://issues.apache.org/jira/browse/SPARK-27923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27923: Description: In this ticket, we plan to list all cases that PostgreSQL throws an exception but Spark SQL is NULL. When porting the [boolean.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/boolean.sql], we found a case: # Casting an unaccepted value to boolean type throws [invalid input syntax|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/boolean.out#L45-L47]. When porting the [case.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/case.sql], we found a case: # Division by zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/case.out#L96-L99]. When porting the [date.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/date.sql], we found a case: # Invalid date [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/date.out#L13-L14]. When porting the [int2.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/int2.sql], we found a case: # Invalid short [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/int2.out#L9-L10]. When porting the [float4.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/float4.sql], we found three cases: # Bad input [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float4.out#L43-L74]. # Bad special inputs [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float4.out#L107-L118]. # Divide by zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float4.out#L239-L241]. 
When porting the [float8.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/float8.sql], we found five cases: # Bad input [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float8.out#L34-L65]. # Bad special inputs [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float8.out#L38-L41]. # Cannot take logarithm of zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float8.out#L439-L440]. # Cannot take logarithm of a negative number [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float8.out#L441-L442]. # Divide by zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float4.out#L445-L446]. When porting the [numeric.sql|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/numeric.sql], we found five cases: # Invalid decimal [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L689-L696]. # The decimal type cannot accept [Infinity and -Infinity|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L718-L731]. # Invalid inputs [throw an exception|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L1429-L1460]. # Invalid inputs [throw an exception|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L1883-L1887]. # Invalid inputs [throw an exception|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L1940-L1945]. 
[jira] [Updated] (SPARK-27923) List all cases that PostgreSQL throws an exception but Spark SQL is NULL
[ https://issues.apache.org/jira/browse/SPARK-27923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27923: Description: In this ticket, we plan to list all cases that PostgreSQL throws an exception but Spark SQL is NULL. When porting the [boolean.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/boolean.sql] found a case: # Cast unaccepted value to boolean type throws [invalid input syntax|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/boolean.out#L45-L47]. When porting the [case.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/case.sql] found a case: # Division by zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/case.out#L96-L99]. When porting the [date.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/date.sql] found a case: # Invalid date [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/date.out#L13-L14]. When porting the [int2.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/int2.sql] found a case: # Invalid short [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/int2.out#L9-L10]. When porting the [float4.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/float4.sql] found three cases: # Bad input [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float4.out#L43-L74]. # Bad special inputs [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float4.out#L107-L118]. # Divide by zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float4.out#L239-L241]. 
When porting the [float8.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/float8.sql] found five cases: # Bad input [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float8.out#L34-L65]. # Bad special inputs [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float8.out#L38-L41]. # Cannot take logarithm of zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float8.out#L439-L440]. # Cannot take logarithm of a negative number [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float8.out#L441-L442]. # Divide by zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float4.out#L445-L446]. When porting the [numeric.sql|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/numeric.sql] found four cases: # Invalid decimal [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L689-L696]. # The decimal type cannot accept [Infinity and -Infinity|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L718-L731]. # Invalid inputs [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L1429-L1460]. # Invalid inputs [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L1883-L1887]. was: In this ticket, we plan to list all cases that PostgreSQL throws an exception but Spark SQL is NULL. 
When porting the [boolean.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/boolean.sql] found a case: # Cast unaccepted value to boolean type throws [invalid input syntax|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/boolean.out#L45-L47]. When porting the [case.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/case.sql] found a case: # Division by zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/case.out#L96-L99]. When porting the [date.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/date.sql] found a case: # Invalid date [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/date.out#L13-L14]. When porting the [int2.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/int2.sql] found a case: # Invalid short [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/int2.out#L9-L10]. When porting the [float4.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/float4.sql] found three
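The failure classes catalogued above (bad input, logarithm of zero or of a negative number, division by zero) are exactly the points where PostgreSQL raises an error while Spark SQL returns NULL. As a rough analogue of the strict behavior, illustrative only and not Spark or PostgreSQL code, plain Python raises exceptions in the same situations:

```python
import math

# Bad input: a strict engine rejects unparseable numerics instead of yielding NULL.
try:
    float("xyz")
except ValueError as err:
    print("bad input:", err)

# Logarithm of zero and of a negative number are both domain errors.
for x in (0.0, -1.0):
    try:
        math.log(x)
    except ValueError as err:
        print("log domain error:", err)

# Division by zero.
try:
    1 / 0
except ZeroDivisionError as err:
    print("divide by zero:", err)
```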
[jira] [Updated] (SPARK-27923) List all cases that PostgreSQL throws an exception but Spark SQL is NULL
[ https://issues.apache.org/jira/browse/SPARK-27923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27923: Description: In this ticket, we plan to list all cases that PostgreSQL throws an exception but Spark SQL is NULL. When porting the [boolean.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/boolean.sql] found a case: # Cast unaccepted value to boolean type throws [invalid input syntax|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/boolean.out#L45-L47]. When porting the [case.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/case.sql] found a case: # Division by zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/case.out#L96-L99]. When porting the [date.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/date.sql] found a case: # Invalid date [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/date.out#L13-L14]. When porting the [int2.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/int2.sql] found a case: # Invalid short [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/int2.out#L9-L10]. When porting the [float4.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/float4.sql] found three cases: # Bad input [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float4.out#L43-L74]. # Bad special inputs [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float4.out#L107-L118]. # Divide by zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float4.out#L239-L241]. 
When porting the [float8.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/float8.sql] found five cases: # Bad input [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float8.out#L34-L65]. # Bad special inputs [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float8.out#L38-L41]. # Cannot take logarithm of zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float8.out#L439-L440]. # Cannot take logarithm of a negative number [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float8.out#L441-L442]. # Divide by zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float4.out#L445-L446]. When porting the [numeric.sql|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/numeric.sql] found four cases: # Invalid decimal [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L689-L696]. # The decimal type cannot accept [Infinity and -Infinity|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L718-L731]. # Invalid inputs [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L1429-L1460]. # Invalid inputs [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L1883-L1887]. was: In this ticket, we plan to list all cases that PostgreSQL throws an exception but Spark SQL is NULL. 
When porting the [boolean.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/boolean.sql] found a case: # Cast unaccepted value to boolean type throws [invalid input syntax|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/boolean.out#L45-L47]. When porting the [case.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/case.sql] found a case: # Division by zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/case.out#L96-L99]. When porting the [date.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/date.sql] found a case: # Invalid date [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/date.out#L13-L14]. When porting the [int2.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/int2.sql] found a case: # Invalid short [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/int2.out#L9-L10]. When porting the [float4.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/float4.sql] found three c
[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base
[ https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27921: - Description: This JIRA aims to improve Python test coverage, in particular around {{ExtractPythonUDFs}}. This rule has caused many regressions and issues, such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. We should convert *.sql test cases that can be affected by the {{ExtractPythonUDFs}} rule, like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely, most plan-related test cases might have to be converted. *Here is the rough contribution guide to follow:* Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check that you're able to do this: {code:java} >>> import pandas >>> pandas.__version__ '0.23.4' >>> import pyarrow >>> pyarrow.__version__ '0.13.0' >>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]})) pyarrow.Table a: int64 metadata OrderedDict([(b'pandas', b'{"index_columns": [{"kind": "range", "name": null, "start": ' b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,' b' "field_name": null, "pandas_type": "unicode", "numpy_type":' b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": [' b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu' b'mpy_type": "int64", "metadata": null}], "creator": {"library' b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')]) {code} 1. Copy {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}} to {{sql/core/src/test/resources/sql-tests/inputs/udf/udf-xxx.sql}}. 2. Keep the comments and state that this file was copied from {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}}, for now. For instance, add a comment like the one below at the top: {code:java} -- This test file was converted from xxx.sql. {code} 3. 
Run the command below: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git add . {code} 4. Insert one or more {{udf(...)}} calls into each statement. It is not required to add more combinations, and there is no strict rule about where to insert them. Ideally, we should try to place the udf differently in each statement. 5. Run the command below again: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git diff # or git diff --no-index sql/core/src/test/resources/sql-tests/results/xxx.sql.out sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out {code} 6. Compare the results with the original file, {{sql/core/src/test/resources/sql-tests/results/xxx.sql.out}}. 7. If there is a diff, analyze it, file or find the corresponding JIRA, and skip the tests with comments. Please see [this comment|https://github.com/apache/spark/pull/25090#discussion_r301880585] when you file a JIRA. 8. Run it without generating golden files and check: {code:java} build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" {code} 9. When you open a PR, please attach {{git diff --no-index sql/core/src/test/resources/sql-tests/results/xxx.sql.out sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out}} in the PR description with the template below: {code:java} Diff comparing to 'xxx.sql' ```diff ... # here you put 'git diff' results ``` {code} 10. You're ready. Please go for a PR! If the PR contains other minor fixes, use the {{[SPARK-X][SQL][PYTHON]}} prefix in the PR title. If the PR is purely about tests, use {{[SPARK-X][SQL][PYTHON][TESTS]}}. See [https://github.com/apache/spark/pull/25069] as an example. Note that the registered UDFs all return strings, so some differences are expected. Note that this JIRA targets plan-specific cases in general. Note that one {{output.sql.out}} file is shared for three UDF test cases (Scala UDF, Python UDF, and Pandas UDF). Beware of this when you fix the tests. 
Note that this guide is expected to be updated continuously as the work progresses. Note that this test case uses the integrated UDF test base. See [https://github.com/apache/spark/pull/24752] if you're interested in it or find an issue. was: This JIRA targets to improve Python test coverage in particular about {{ExtractPythonUDFs}}. This rule has caused many regressions or issues such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. We should convert *.sql test cases that can be affected by this rule {{ExtractPythonUDFs}} like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely most of plan related test cases might have to be converted. *Here is the rough contribution guide to follow:* Make sure y
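The guide's prerequisite check (Pandas 0.23.2+ and PyArrow 0.12.1+) can also be done programmatically. The helper below is hypothetical and not part of the guide; it compares dotted version strings using only the standard library:

```python
def meets_minimum(installed: str, required: str) -> bool:
    """Return True when a dotted version string meets a minimum, e.g. '0.23.4' >= '0.23.2'."""
    def parts(version):
        numbers = []
        for piece in version.split("."):
            if not piece.isdigit():
                break  # stop at the first non-numeric component, e.g. 'rc1'
            numbers.append(int(piece))
        return tuple(numbers)
    # Tuple comparison handles differing lengths: (0, 23) < (0, 23, 2).
    return parts(installed) >= parts(required)

# The guide requires Pandas 0.23.2+ and PyArrow 0.12.1+.
print(meets_minimum("0.23.4", "0.23.2"))  # True
print(meets_minimum("0.12.0", "0.12.1"))  # False
```

In practice one would feed it `pandas.__version__` and `pyarrow.__version__` from the interpreter session shown in the guide.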
[jira] [Commented] (SPARK-28323) PythonUDF should be able to use in join condition
[ https://issues.apache.org/jira/browse/SPARK-28323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881717#comment-16881717 ] Liang-Chi Hsieh commented on SPARK-28323: - I found this bug when doing SPARK-28276. > PythonUDF should be able to use in join condition > - > > Key: SPARK-28323 > URL: https://issues.apache.org/jira/browse/SPARK-28323 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > There is a bug in {{ExtractPythonUDFs}} that produces wrong result > attributes. It causes a failure when using PythonUDFs among multiple child > plans, e.g., join. An example is using PythonUDFs in join condition. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28323) PythonUDF should be able to use in join condition
[ https://issues.apache.org/jira/browse/SPARK-28323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28323: Assignee: (was: Apache Spark) > PythonUDF should be able to use in join condition > - > > Key: SPARK-28323 > URL: https://issues.apache.org/jira/browse/SPARK-28323 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > There is a bug in {{ExtractPythonUDFs}} that produces wrong result > attributes. It causes a failure when using PythonUDFs among multiple child > plans, e.g., join. An example is using PythonUDFs in join condition. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28323) PythonUDF should be able to use in join condition
[ https://issues.apache.org/jira/browse/SPARK-28323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28323: Assignee: Apache Spark > PythonUDF should be able to use in join condition > - > > Key: SPARK-28323 > URL: https://issues.apache.org/jira/browse/SPARK-28323 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark >Priority: Major > > There is a bug in {{ExtractPythonUDFs}} that produces wrong result > attributes. It causes a failure when using PythonUDFs among multiple child > plans, e.g., join. An example is using PythonUDFs in join condition. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28323) PythonUDF should be able to use in join condition
Liang-Chi Hsieh created SPARK-28323: --- Summary: PythonUDF should be able to use in join condition Key: SPARK-28323 URL: https://issues.apache.org/jira/browse/SPARK-28323 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh There is a bug in {{ExtractPythonUDFs}} that produces wrong result attributes. It causes a failure when using PythonUDFs among multiple child plans, e.g., join. An example is using PythonUDFs in join condition. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-28073) ANSI SQL: Character literals
[ https://issues.apache.org/jira/browse/SPARK-28073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-28073: --- Comment: was deleted (was: I'm working on.) > ANSI SQL: Character literals > > > Key: SPARK-28073 > URL: https://issues.apache.org/jira/browse/SPARK-28073 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > ||Feature ID||Feature Name||Feature Description|| > |E021-03|Character literals|— Subclause 5.3, “”: [ > ... ] | > Example: > {code:sql} > SELECT 'first line' > ' - next line' > ' - third line' > AS "Three lines to one"; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
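The E021-03 example above relies on SQL's rule that adjacent character literals separated only by whitespace are concatenated at parse time. Python happens to have the same behavior, which gives a quick illustration (a sketch for intuition, not Spark code):

```python
# Adjacent string literals are joined at parse time, mirroring the
# "Three lines to one" SQL example above.
three_lines = ('first line'
               ' - next line'
               ' - third line')
print(three_lines)  # first line - next line - third line
```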
[jira] [Assigned] (SPARK-28136) Add int8.sql
[ https://issues.apache.org/jira/browse/SPARK-28136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28136: - Assignee: Yuming Wang > Add int8.sql > > > Key: SPARK-28136 > URL: https://issues.apache.org/jira/browse/SPARK-28136 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > In this ticket, we plan to add the regression test cases of > https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/int8.sql. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28136) Port int8.sql
[ https://issues.apache.org/jira/browse/SPARK-28136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28136: -- Summary: Port int8.sql (was: Add int8.sql) > Port int8.sql > - > > Key: SPARK-28136 > URL: https://issues.apache.org/jira/browse/SPARK-28136 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > In this ticket, we plan to add the regression test cases of > https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/int8.sql. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28136) Add int8.sql
[ https://issues.apache.org/jira/browse/SPARK-28136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28136. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24933 [https://github.com/apache/spark/pull/24933] > Add int8.sql > > > Key: SPARK-28136 > URL: https://issues.apache.org/jira/browse/SPARK-28136 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > In this ticket, we plan to add the regression test cases of > https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/int8.sql. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28278) Convert and port 'except-all.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28278: Assignee: Apache Spark > Convert and port 'except-all.sql' into UDF test base > > > Key: SPARK-28278 > URL: https://issues.apache.org/jira/browse/SPARK-28278 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28278) Convert and port 'except-all.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28278: Assignee: (was: Apache Spark) > Convert and port 'except-all.sql' into UDF test base > > > Key: SPARK-28278 > URL: https://issues.apache.org/jira/browse/SPARK-28278 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28322) DIV support decimal type
[ https://issues.apache.org/jira/browse/SPARK-28322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881694#comment-16881694 ] Yuming Wang commented on SPARK-28322: - {{DIV}} and {{/}} are a little different: {code:sql} select 12345678901234567890 / 123; ?column? 100371373180768845 (1 row) select div(12345678901234567890, 123); div 100371373180768844 (1 row) {code} [https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out#L1564-L1574] > DIV support decimal type > > > Key: SPARK-28322 > URL: https://issues.apache.org/jira/browse/SPARK-28322 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Spark SQL: > {code:sql} > spark-sql> SELECT DIV(CAST(10 AS DECIMAL), CAST(3 AS DECIMAL)); > Error in query: cannot resolve '(CAST(10 AS DECIMAL(10,0)) div CAST(3 AS > DECIMAL(10,0)))' due to data type mismatch: '(CAST(10 AS DECIMAL(10,0)) div > CAST(3 AS DECIMAL(10,0)))' requires integral type, not decimal(10,0); line 1 > pos 7; > 'Project [unresolvedalias((cast(10 as decimal(10,0)) div cast(3 as > decimal(10,0))), None)] > +- OneRowRelation > {code} > PostgreSQL: > {code:sql} > postgres=# SELECT DIV(CAST(10 AS DECIMAL), CAST(3 AS DECIMAL)); > div > - >3 > (1 row) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
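The difference Yuming highlights, `div` truncating the quotient while numeric `/` rounds it, can be reproduced for intuition in plain Python using integer floor division versus `Decimal` division. This is illustrative only, not Spark or PostgreSQL code:

```python
from decimal import Decimal, ROUND_HALF_UP

n, d = 12345678901234567890, 123

# Analogue of PostgreSQL div(): the exact truncated integer quotient.
print(n // d)    # 100371373180768844

# Analogue of numeric '/': divide exactly, then round to an integer.
quotient = (Decimal(n) / Decimal(d)).quantize(Decimal(1), rounding=ROUND_HALF_UP)
print(quotient)  # 100371373180768845
```

The two results differ by one because the exact quotient ends in .634..., which truncation drops and rounding carries up, matching the PostgreSQL output quoted above.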
[jira] [Assigned] (SPARK-28275) Convert and port 'count.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28275: Assignee: (was: Apache Spark) > Convert and port 'count.sql' into UDF test base > --- > > Key: SPARK-28275 > URL: https://issues.apache.org/jira/browse/SPARK-28275 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28275) Convert and port 'count.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28275: Assignee: Apache Spark > Convert and port 'count.sql' into UDF test base > --- > > Key: SPARK-28275 > URL: https://issues.apache.org/jira/browse/SPARK-28275 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28322) DIV support decimal type
[ https://issues.apache.org/jira/browse/SPARK-28322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881691#comment-16881691 ] Yuming Wang commented on SPARK-28322: - cc [~mgaido] > DIV support decimal type > > > Key: SPARK-28322 > URL: https://issues.apache.org/jira/browse/SPARK-28322 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Spark SQL: > {code:sql} > spark-sql> SELECT DIV(CAST(10 AS DECIMAL), CAST(3 AS DECIMAL)); > Error in query: cannot resolve '(CAST(10 AS DECIMAL(10,0)) div CAST(3 AS > DECIMAL(10,0)))' due to data type mismatch: '(CAST(10 AS DECIMAL(10,0)) div > CAST(3 AS DECIMAL(10,0)))' requires integral type, not decimal(10,0); line 1 > pos 7; > 'Project [unresolvedalias((cast(10 as decimal(10,0)) div cast(3 as > decimal(10,0))), None)] > +- OneRowRelation > {code} > PostgreSQL: > {code:sql} > postgres=# SELECT DIV(CAST(10 AS DECIMAL), CAST(3 AS DECIMAL)); > div > - >3 > (1 row) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28322) DIV support decimal type
Yuming Wang created SPARK-28322: --- Summary: DIV support decimal type Key: SPARK-28322 URL: https://issues.apache.org/jira/browse/SPARK-28322 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang Spark SQL: {code:sql} spark-sql> SELECT DIV(CAST(10 AS DECIMAL), CAST(3 AS DECIMAL)); Error in query: cannot resolve '(CAST(10 AS DECIMAL(10,0)) div CAST(3 AS DECIMAL(10,0)))' due to data type mismatch: '(CAST(10 AS DECIMAL(10,0)) div CAST(3 AS DECIMAL(10,0)))' requires integral type, not decimal(10,0); line 1 pos 7; 'Project [unresolvedalias((cast(10 as decimal(10,0)) div cast(3 as decimal(10,0))), None)] +- OneRowRelation {code} PostgreSQL: {code:sql} postgres=# SELECT DIV(CAST(10 AS DECIMAL), CAST(3 AS DECIMAL)); div - 3 (1 row) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28288) Convert and port 'window.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881689#comment-16881689 ] Hyukjin Kwon commented on SPARK-28288: -- Please go ahead. > Convert and port 'window.sql' into UDF test base > > > Key: SPARK-28288 > URL: https://issues.apache.org/jira/browse/SPARK-28288 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28288) Convert and port 'window.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881675#comment-16881675 ] YoungGyu Chun commented on SPARK-28288: --- Hello [~hyukjin.kwon], I'll be working on this. Thank you. > Convert and port 'window.sql' into UDF test base > > > Key: SPARK-28288 > URL: https://issues.apache.org/jira/browse/SPARK-28288 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28274) Convert and port 'pgSQL/window.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881671#comment-16881671 ] Hyukjin Kwon commented on SPARK-28274: -- We should wait for SPARK-23160 > Convert and port 'pgSQL/window.sql' into UDF test base > -- > > Key: SPARK-28274 > URL: https://issues.apache.org/jira/browse/SPARK-28274 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > See SPARK-23160 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28274) Convert and port 'pgSQL/window.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881670#comment-16881670 ] Hyukjin Kwon commented on SPARK-28274: -- Oops, seems like it was my mistake. > Convert and port 'pgSQL/window.sql' into UDF test base > -- > > Key: SPARK-28274 > URL: https://issues.apache.org/jira/browse/SPARK-28274 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > See SPARK-23160 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28274) Convert and port 'pgSQL/window.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881664#comment-16881664 ] Terry Kim commented on SPARK-28274: --- [~hyukjin.kwon] I don't see the window.sql file under sql/core/src/test/resources/sql-tests/inputs/pgSQL/. > Convert and port 'pgSQL/window.sql' into UDF test base > -- > > Key: SPARK-28274 > URL: https://issues.apache.org/jira/browse/SPARK-28274 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > See SPARK-23160 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27889) Make development scripts under dev/ support Python 3
[ https://issues.apache.org/jira/browse/SPARK-27889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881663#comment-16881663 ] Weichen Xu commented on SPARK-27889: Discussed with [~mengxr] offline. I will work on this. > Make development scripts under dev/ support Python 3 > > > Key: SPARK-27889 > URL: https://issues.apache.org/jira/browse/SPARK-27889 > Project: Spark > Issue Type: Sub-task > Components: Build, Deploy >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiao Li >Priority: Major > > Some of our internal python scripts under dev/ only support Python 2. With > deprecation of Python 2, we should make those scripts support Python 3. So > developers have a way to avoid seeing the deprecation warning. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25382) Remove ImageSchema.readImages in 3.0
[ https://issues.apache.org/jira/browse/SPARK-25382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881662#comment-16881662 ] Weichen Xu commented on SPARK-25382: I will work on this. Thanks! > Remove ImageSchema.readImages in 3.0 > > > Key: SPARK-25382 > URL: https://issues.apache.org/jira/browse/SPARK-25382 > Project: Spark > Issue Type: Task > Components: ML >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > A follow-up task from SPARK-25345. We might need to support sampling > (SPARK-25383) in order to remove readImages. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28316) Decimal precision issue
[ https://issues.apache.org/jira/browse/SPARK-28316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881648#comment-16881648 ] Yuming Wang commented on SPARK-28316: - cc [~joshrosen] [~cloud_fan] [~Gengliang.Wang]
> Decimal precision issue
> ---
>
> Key: SPARK-28316
> URL: https://issues.apache.org/jira/browse/SPARK-28316
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Yuming Wang
> Priority: Major
>
> Multiply check:
> {code:sql}
> -- Spark SQL
> spark-sql> select cast(-34338492.215397047 as decimal(38, 10)) * cast(-34338492.215397047 as decimal(38, 10));
> 1179132047626883.596862
> -- PostgreSQL
> postgres=# select cast(-34338492.215397047 as numeric(38, 10)) * cast(-34338492.215397047 as numeric(38, 10));
> ?column?
> ---
> 1179132047626883.59686213585632020900
> (1 row)
> {code}
> Division check:
> {code:sql}
> -- Spark SQL
> spark-sql> select cast(93901.57763026 as decimal(38, 10)) / cast(4.31 as decimal(38, 10));
> 21786.908963
> -- PostgreSQL
> postgres=# select cast(93901.57763026 as numeric(38, 10)) / cast(4.31 as numeric(38, 10));
> ?column?
> ---
> 21786.908962937355
> (1 row)
> {code}
> POWER(10, LN(value)) check:
> {code:sql}
> -- Spark SQL
> spark-sql> SELECT CAST(POWER(cast('10' as decimal(38, 18)), LN(ABS(round(cast(-24926804.04504742 as decimal(38, 10)), 2)))) AS decimal(38, 10));
> 107511333880051856
> -- PostgreSQL
> postgres=# SELECT CAST(POWER(cast('10' as numeric(38, 18)), LN(ABS(round(cast(-24926804.04504742 as numeric(38, 10)), 2)))) AS numeric(38, 10));
> power
> ---
> 107511333880052007.0414112467
> (1 row)
> {code}
> AVG, STDDEV and VARIANCE return double type:
> {code:sql}
> -- Spark SQL
> spark-sql> create temporary view t1 as select * from values
>          > (cast(-24926804.04504742 as decimal(38, 10))),
>          > (cast(16397.038491 as decimal(38, 10))),
>          > (cast(7799461.4119 as decimal(38, 10)))
>          > as t1(t);
> spark-sql> SELECT AVG(t), STDDEV(t), VARIANCE(t) FROM t1;
> -5703648.53155214	1.7096528995154984E7	2.922913036821751E14
> -- PostgreSQL
> postgres=# SELECT AVG(t), STDDEV(t), VARIANCE(t) from (values (cast(-24926804.04504742 as decimal(38, 10))), (cast(16397.038491 as decimal(38, 10))), (cast(7799461.4119 as decimal(38, 10)))) t1(t);
> avg | stddev | variance
> ---+---+--
> -5703648.53155214 | 17096528.99515498420743029415 | 292291303682175.094017569588
> (1 row)
> {code}
> EXP returns double type:
> {code:sql}
> -- Spark SQL
> spark-sql> select exp(cast(1.0 as decimal(31,30)));
> 2.718281828459045
> -- PostgreSQL
> postgres=# select exp(cast(1.0 as decimal(31,30)));
> exp
> --
> 2.718281828459045235360287471353
> (1 row)
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
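[Editor's note, not part of the JIRA report: the multiply check above can be reproduced independently with Python's arbitrary-precision decimal module; computed at full precision, the product agrees with the PostgreSQL result, which only differs by zero-padding out to scale 20. This is an illustrative sketch.]

```python
from decimal import Decimal, getcontext

# 50 significant digits is more than the 34 the exact product needs,
# so no rounding occurs and the result is exact.
getcontext().prec = 50

v = Decimal("-34338492.215397047")
product = v * v

# PostgreSQL reports 1179132047626883.59686213585632020900 (scale 20);
# the exact product has scale 18, so the trailing "00" is padding.
print(product)  # 1179132047626883.596862135856320209
```

Spark's `1179132047626883.596862` is this exact value rounded to the result scale its decimal arithmetic chose, which is what the JIRA is about.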
[jira] [Commented] (SPARK-28278) Convert and port 'except-all.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881645#comment-16881645 ] Terry Kim commented on SPARK-28278: --- I will work on this. > Convert and port 'except-all.sql' into UDF test base > > > Key: SPARK-28278 > URL: https://issues.apache.org/jira/browse/SPARK-28278 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base
[ https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27921: - Description: This JIRA targets to improve Python test coverage in particular about {{ExtractPythonUDFs}}. This rule has caused many regressions or issues such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. We should convert *.sql test cases that can be affected by this rule {{ExtractPythonUDFs}} like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely most of plan related test cases might have to be converted. *Here is the rough contribution guide to follow:* Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if you're able to do this: {code:java} >>> import pandas >>> pandas.__version__ '0.23.4' >>> import pyarrow >>> pyarrow.__version__ '0.13.0' >>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]})) pyarrow.Table a: int64 metadata OrderedDict([(b'pandas', b'{"index_columns": [{"kind": "range", "name": null, "start": ' b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,' b' "field_name": null, "pandas_type": "unicode", "numpy_type":' b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": [' b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu' b'mpy_type": "int64", "metadata": null}], "creator": {"library' b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')]) {code} 1. Copy and paste {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}} file into {{sql/core/src/test/resources/sql-tests/inputs/udf/udf-xxx.sql}} 2. Keep the comments and state that this file was copied from {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}}, for now. For instance, let's add a comment as below on the top: {code:java} -- This test file was converted from xxx.sql. {code} 3. 
Run the command below: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git add . {code} 4. Insert one or multiple {{udf(...)}}s into each statement. It is not required to add more combinations, and there is no strict rule about where to insert them. Ideally, we should try to place the udf differently in each statement. 5. Run the command below again: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git diff # or git diff --no-index sql/core/src/test/resources/sql-tests/results/xxx.sql.out sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out {code} 6. Compare the results with the original file, {{sql/core/src/test/resources/sql-tests/results/xxx.sql.out}} 7. If there is a diff, analyze it, file or find the corresponding JIRA, and skip those tests with comments. 8. Run without generating golden files and check: {code:java} build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" {code} 9. When you open a PR, please attach {{git diff --no-index sql/core/src/test/resources/sql-tests/results/xxx.sql.out sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out}} in the PR description with the template below: {code:java} Diff comparing to 'xxx.sql' ```diff ... # here you put 'git diff' results ``` {code} 10. You're ready. Please go for a PR! If the PR contains other minor fixes, use the {{[SPARK-X][SQL][PYTHON]}} prefix in the PR title. If the PR is purely about tests, use {{[SPARK-X][SQL][PYTHON][TESTS]}}. See [https://github.com/apache/spark/pull/25069] as an example. Note that registered UDFs all return strings - so some differences are expected. Note that this JIRA targets plan-specific cases in general. Note that one {{output.sql.out}} file is shared for three UDF test cases (Scala UDF, Python UDF, and Pandas UDF). Beware of it when you fix the tests. Note that this guide is supposed to be updated continuously given how it goes. 
Note that this test case uses the integrated UDF test base. See [https://github.com/apache/spark/pull/24752] if you're interested in it or find an issue. was: This JIRA targets to improve Python test coverage in particular about {{ExtractPythonUDFs}}. This rule has caused many regressions or issues such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. We should convert *.sql test cases that can be affected by this rule {{ExtractPythonUDFs}} like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely most of plan related test cases might have to be converted. *Here is the rough contribution guide to follow:* Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if you're able to do this: {code:java} >>> import p
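[Editor's note, not part of the JIRA: the contribution guide above starts by requiring Pandas 0.23.2+ and PyArrow 0.12.1+. A quick sanity check can compare dotted version strings numerically rather than lexicographically; `meets_minimum` is an illustrative helper name, not something the guide or Spark provides.]

```python
def meets_minimum(version, minimum):
    """Compare dotted version strings numerically, e.g. '0.13.0' >= '0.12.1'.

    Lexicographic string comparison would wrongly rank '0.9.0' above
    '0.23.4', so each dot-separated component is compared as an integer.
    """
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(version) >= as_tuple(minimum)

# The versions shown in the guide's interpreter session both pass:
print(meets_minimum("0.23.4", "0.23.2"))  # True  (pandas)
print(meets_minimum("0.13.0", "0.12.1"))  # True  (pyarrow)
print(meets_minimum("0.9.0", "0.23.2"))   # False (numeric, not lexicographic)
```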
[jira] [Updated] (SPARK-28281) Convert and port 'having.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28281: - Component/s: Tests PySpark > Convert and port 'having.sql' into UDF test base > > > Key: SPARK-28281 > URL: https://issues.apache.org/jira/browse/SPARK-28281 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28286) Convert and port 'pivot.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28286: - Component/s: Tests PySpark > Convert and port 'pivot.sql' into UDF test base > --- > > Key: SPARK-28286 > URL: https://issues.apache.org/jira/browse/SPARK-28286 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28280) Convert and port 'group-by.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28280: - Component/s: Tests PySpark > Convert and port 'group-by.sql' into UDF test base > -- > > Key: SPARK-28280 > URL: https://issues.apache.org/jira/browse/SPARK-28280 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28278) Convert and port 'except-all.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28278: - Component/s: Tests PySpark > Convert and port 'except-all.sql' into UDF test base > > > Key: SPARK-28278 > URL: https://issues.apache.org/jira/browse/SPARK-28278 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28282) Convert and port 'inline-table.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28282: - Component/s: Tests PySpark > Convert and port 'inline-table.sql' into UDF test base > -- > > Key: SPARK-28282 > URL: https://issues.apache.org/jira/browse/SPARK-28282 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28287) Convert and port 'udaf.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28287: - Component/s: Tests PySpark > Convert and port 'udaf.sql' into UDF test base > -- > > Key: SPARK-28287 > URL: https://issues.apache.org/jira/browse/SPARK-28287 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28275) Convert and port 'count.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28275: - Component/s: Tests PySpark > Convert and port 'count.sql' into UDF test base > --- > > Key: SPARK-28275 > URL: https://issues.apache.org/jira/browse/SPARK-28275 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28279) Convert and port 'group-analysis.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28279: - Component/s: Tests PySpark > Convert and port 'group-analysis.sql' into UDF test base > > > Key: SPARK-28279 > URL: https://issues.apache.org/jira/browse/SPARK-28279 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28289) Convert and port 'union.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28289: - Component/s: Tests PySpark > Convert and port 'union.sql' into UDF test base > --- > > Key: SPARK-28289 > URL: https://issues.apache.org/jira/browse/SPARK-28289 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28288) Convert and port 'window.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28288: - Component/s: Tests PySpark > Convert and port 'window.sql' into UDF test base > > > Key: SPARK-28288 > URL: https://issues.apache.org/jira/browse/SPARK-28288 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28277) Convert and port 'except.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28277: - Component/s: Tests PySpark > Convert and port 'except.sql' into UDF test base > > > Key: SPARK-28277 > URL: https://issues.apache.org/jira/browse/SPARK-28277 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28273) Convert and port 'pgSQL/case.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28273: - Component/s: Tests PySpark > Convert and port 'pgSQL/case.sql' into UDF test base > > > Key: SPARK-28273 > URL: https://issues.apache.org/jira/browse/SPARK-28273 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > See SPARK-27934 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28277) Convert and port 'except.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881629#comment-16881629 ] Hyukjin Kwon commented on SPARK-28277: -- Please go ahead! > Convert and port 'except.sql' into UDF test base > > > Key: SPARK-28277 > URL: https://issues.apache.org/jira/browse/SPARK-28277 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28283) Convert and port 'intersect-all.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28283: - Component/s: Tests PySpark > Convert and port 'intersect-all.sql' into UDF test base > --- > > Key: SPARK-28283 > URL: https://issues.apache.org/jira/browse/SPARK-28283 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28285) Convert and port 'outer-join.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28285: - Component/s: Tests PySpark > Convert and port 'outer-join.sql' into UDF test base > > > Key: SPARK-28285 > URL: https://issues.apache.org/jira/browse/SPARK-28285 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27922) Convert and port 'natural-join.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-27922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27922: - Component/s: Tests > Convert and port 'natural-join.sql' into UDF test base > -- > > Key: SPARK-27922 > URL: https://issues.apache.org/jira/browse/SPARK-27922 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28276) Convert and port 'cross-join.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28276: - Component/s: Tests PySpark > Convert and port 'cross-join.sql' into UDF test base > > > Key: SPARK-28276 > URL: https://issues.apache.org/jira/browse/SPARK-28276 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28285) Convert and port 'outer-join.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881630#comment-16881630 ] Hyukjin Kwon commented on SPARK-28285: -- Thanks. > Convert and port 'outer-join.sql' into UDF test base > > > Key: SPARK-28285 > URL: https://issues.apache.org/jira/browse/SPARK-28285 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28272) Convert and port 'pgSQL/aggregates_part3.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28272: - Component/s: Tests PySpark > Convert and port 'pgSQL/aggregates_part3.sql' into UDF test base > > > Key: SPARK-28272 > URL: https://issues.apache.org/jira/browse/SPARK-28272 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > see SPARK-27988 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28274) Convert and port 'pgSQL/window.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28274: - Component/s: Tests PySpark > Convert and port 'pgSQL/window.sql' into UDF test base > -- > > Key: SPARK-28274 > URL: https://issues.apache.org/jira/browse/SPARK-28274 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > See SPARK-23160 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28284) Convert and port 'join-empty-relation.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28284: - Component/s: Tests PySpark > Convert and port 'join-empty-relation.sql' into UDF test base > - > > Key: SPARK-28284 > URL: https://issues.apache.org/jira/browse/SPARK-28284 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28271) Convert and port 'pgSQL/aggregates_part2.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28271: - Component/s: PySpark > Convert and port 'pgSQL/aggregates_part2.sql' into UDF test base > > > Key: SPARK-28271 > URL: https://issues.apache.org/jira/browse/SPARK-28271 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > see SPARK-27883 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28270) Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28270: - Component/s: Tests PySpark > Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base > > > Key: SPARK-28270 > URL: https://issues.apache.org/jira/browse/SPARK-28270 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > see SPARK-27770 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base
[ https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27921: - Component/s: Tests > Convert applicable *.sql tests into UDF integrated test base > > > Key: SPARK-27921 > URL: https://issues.apache.org/jira/browse/SPARK-27921 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > This JIRA aims to improve Python test coverage, in particular around > {{ExtractPythonUDFs}}. > This rule has caused many regressions or issues such as SPARK-27803, > SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. > We should convert *.sql test cases that can be affected by this rule > {{ExtractPythonUDFs}} like > [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] > Namely, most plan-related test cases might have to be converted. > *Here is the rough contribution guide to follow:* > Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if > you're able to do this: > {code:java} > >>> import pandas > >>> pandas.__version__ > '0.23.4' > >>> import pyarrow > >>> pyarrow.__version__ > '0.13.0' > >>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]})) > pyarrow.Table > a: int64 > metadata > > OrderedDict([(b'pandas', > b'{"index_columns": [{"kind": "range", "name": null, "start": ' > b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,' > b' "field_name": null, "pandas_type": "unicode", "numpy_type":' > b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": [' > b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu' > b'mpy_type": "int64", "metadata": null}], "creator": {"library' > b'": "pyarrow", "version": "0.13.0"}, "pandas_version": > null}')]) > {code} > > 1. 
Copy and paste {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}} > file into {{sql/core/src/test/resources/sql-tests/inputs/udf/udf-xxx.sql}} > 2. Keep the comments and state that this file was copied from > {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}}, for now. > For instance, let's add a comment as below on the top: > {code} > -- This test file was converted from xxx.sql. > {code} > 3. Run the command below: > {code:java} > SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- > -z udf/udf-xxx.sql" > git add . > {code} > 4. Insert one or multiple {{udf(...)}}s into each statement. It is not > required to add more combinations, and the exact placement is not strict. > Ideally, we should try to place the udf differently in each statement. > 5. Run the command below again: > {code:java} > SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- > -z udf/udf-xxx.sql" > git diff > # or git diff --no-index > sql/core/src/test/resources/sql-tests/results/xxx.sql.out > sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out > {code} > 6. Compare the results with the original file, > {{sql/core/src/test/resources/sql-tests/results/xxx.sql.out}} > 7. If there are diffs, analyze them; file or find the relevant JIRA, and > skip the affected tests with comments. > 8. Run without generating golden files and check: > {code:java} > build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" > {code} > 9. When you open a PR, please attach {{git diff --no-index > sql/core/src/test/resources/sql-tests/results/xxx.sql.out > sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out}} in the PR > description with the template below: > {code:java} > Diff comparing to 'xxx.sql' > > ```diff > ... # here you put 'git diff' results > ``` > > > {code} > 10. You're ready. Please go for a PR! If the PR contains other minor fixes, > use {{[SPARK-X][SQL][PYTHON]}} prefix in the PR title. If the PR is > purely about tests, use {{[SPARK-X][SQL][PYTHON][TESTS]}}. 
> See https://github.com/apache/spark/pull/25069 as an example. > Note that registered UDFs all return strings - so some differences are > expected. > Note that this JIRA targets plan-specific cases in general. > Note that one {{output.sql.out}} file is shared for three UDF test cases > (Scala UDF, Python UDF, and Pandas UDF). Beware of it when you fix the tests. > Note that this guide is supposed to be updated continuously given how it goes. > Note that this test case uses the integrated UDF test base. See > https://github.com/apache/spark/pull/24752 if you're interested in it or find > an issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --
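Step 4 of the guide above can be sketched in plain Python (no Spark needed) to show the shape of the rewrite. The helper `wrap_in_udf` and the example query are hypothetical, purely illustrative; in the real test files the `udf(...)` wrapping is done by hand, statement by statement:

```python
# Minimal sketch of step 4: wrap column expressions in udf(...).
# wrap_in_udf is a hypothetical helper for illustration only; it does a
# naive textual substitution, so it is not safe for expressions that
# appear as substrings of other tokens.
def wrap_in_udf(sql, exprs):
    """Naively wrap each listed expression in udf(...)."""
    for e in exprs:
        sql = sql.replace(e, "udf({0})".format(e))
    return sql

original = "SELECT a, b FROM t1 JOIN t2 ON t1.k = t2.k"
print(wrap_in_udf(original, ["a", "b"]))
# SELECT udf(a), udf(b) FROM t1 JOIN t2 ON t1.k = t2.k
```

As the guide notes, it does not matter exactly where the `udf(...)` goes; varying the placement across statements gives broader coverage of the `ExtractPythonUDFs` rule.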
[jira] [Comment Edited] (SPARK-27570) java.io.EOFException Reached the end of stream - Reading Parquet from Swift
[ https://issues.apache.org/jira/browse/SPARK-27570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881623#comment-16881623 ] Josh Rosen edited comment on SPARK-27570 at 7/10/19 12:22 AM: -- I ran into a very similar issue, except I was reading from S3 instead of OpenStack Swift. In my reproduction, the addition or removal of filters or projections affected whether I hit the error. In my case, I think the problem was https://issues.apache.org/jira/browse/HADOOP-16109, an issue where Parquet could sometimes use access patterns that hit a bug in seek() in S3AInputStream (/cc [~ste...@apache.org]). I confirmed this by re-running my failing job against an exact copy of the data stored on HDFS (which succeeded). was (Author: joshrosen): I ran into a very similar issue, except I was reading from S3 instead of OpenStack Swift. In my reproduction, the addition or removal of filters or projections affected whether I hit the error. In my case, I think the problem was https://issues.apache.org/jira/browse/HADOOP-16109, an issue where Parquet could sometimes use access patterns that hit a bug in seek() in S3AInputStream (/cc [~ste...@apache.org]). > java.io.EOFException Reached the end of stream - Reading Parquet from Swift > --- > > Key: SPARK-27570 > URL: https://issues.apache.org/jira/browse/SPARK-27570 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Harry Hough >Priority: Major > > I did see issue SPARK-25966 but it seems there are some differences as his > problem was resolved after rebuilding the parquet files on write. This is > 100% reproducible for me across many different days of data. > I get exceptions such as "Reached the end of stream with 750477 bytes left to > read" during some read operations of parquet files. I am reading these files > from Openstack swift using openstack-hadoop 2.7.7 on Spark 2.4. > The issues seem to happen with the where statement. 
I have also tried filter > and combining the statements into one as well as the dataset method with > column without any luck. Which column or what the actual filter is on the > where also doesn't seem to make a difference to the error occurring or not. > > {code:java} > val engagementDS = spark > .read > .parquet(createSwiftAddr("engagements", folder)) > .where("engtype != 0") > .where("engtype != 1000") > .groupBy($"accid", $"sessionkey") > .agg(collect_list(struct($"time", $"pid", $"engtype", $"pageid", > $"testid")).as("engagements")) > // Exiting paste mode, now interpreting. > [Stage 53:> (0 + 32) / 32]2019-04-25 19:02:12 ERROR Executor:91 - Exception > in task 24.0 in stage 53.0 (TID 688) > java.io.EOFException: Reached the end of stream with 1323959 bytes left to > read > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) > at > org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619) > at > org.apache.spark.sql.execution.aggregate.ObjectHas
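If HADOOP-16109 is indeed the cause, the real fix is upgrading the hadoop-aws/S3A libraries to a release containing that patch. As a diagnostic only, the S3A input policy can be pinned to sequential reads so the random-seek code path implicated in that bug is avoided. This is a hedged config sketch, not a recommended production setting: the `fs.s3a.experimental.input.fadvise` key applies to the `s3a://` connector only, not to the Swift connector used in the report above, and the app name is hypothetical.

```python
# Diagnostic sketch, assuming the HADOOP-16109 random-IO seek bug is in
# play. Pinning fadvise to "sequential" avoids the random-seek path at
# the cost of column-pruned read performance. s3a:// only, not swift://.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parquet-eof-diagnosis")  # hypothetical app name
         .config("spark.hadoop.fs.s3a.experimental.input.fadvise",
                 "sequential")
         .getOrCreate())
```

Re-running the failing read with this setting (or against an HDFS copy of the data, as described above) helps separate a connector-level seek bug from genuinely corrupted Parquet files.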
[jira] [Commented] (SPARK-28277) Convert and port 'except.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881624#comment-16881624 ] Huaxin Gao commented on SPARK-28277: I will work on this. Thanks. > Convert and port 'except.sql' into UDF test base > > > Key: SPARK-28277 > URL: https://issues.apache.org/jira/browse/SPARK-28277 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets
[ https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881625#comment-16881625 ] Josh Rosen commented on SPARK-25966: Cross-post: there's discussion of a similar issue at https://issues.apache.org/jira/browse/SPARK-27570. Based on that, I suspect that https://issues.apache.org/jira/browse/HADOOP-16109 may fix this problem. > "EOF Reached the end of stream with bytes left to read" while reading/writing > to Parquets > - > > Key: SPARK-25966 > URL: https://issues.apache.org/jira/browse/SPARK-25966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on > top of a Mesos cluster. Both input and output Parquet files are on S3. >Reporter: Alessandro Andrioni >Priority: Major > > I was persistently getting the following exception while trying to run one > Spark job we have using Spark 2.4.0. It went away after I regenerated from > scratch all the input Parquet files (generated by another Spark job also > using Spark 2.4.0). > Is there a chance that Spark is writing (quite rarely) corrupted Parquet > files? > {code:java} > org.apache.spark.SparkException: Job aborted. 
> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276) > at 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557) > (...) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 > in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: > Reached the end of stream with 996 bytes left to read > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) > at > org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.jav
[jira] [Commented] (SPARK-28285) Convert and port 'outer-join.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881622#comment-16881622 ] Huaxin Gao commented on SPARK-28285: I will work on this. Thanks. > Convert and port 'outer-join.sql' into UDF test base > > > Key: SPARK-28285 > URL: https://issues.apache.org/jira/browse/SPARK-28285 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27922) Convert and port 'natural-join.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-27922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27922: Assignee: Apache Spark > Convert and port 'natural-join.sql' into UDF test base > -- > > Key: SPARK-27922 > URL: https://issues.apache.org/jira/browse/SPARK-27922 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27922) Convert and port 'natural-join.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-27922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27922: Assignee: (was: Apache Spark) > Convert and port 'natural-join.sql' into UDF test base > -- > > Key: SPARK-27922 > URL: https://issues.apache.org/jira/browse/SPARK-27922 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22158) convertMetastore should not ignore storage properties
[ https://issues.apache.org/jira/browse/SPARK-22158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881602#comment-16881602 ] Ruslan Dautkhanov commented on SPARK-22158: --- [~dongjoon] I may have misreported it, sorry. [~waleedfateem] ran some tests; I thought 2.2.0 was affected as well, but you're probably right that 2.2.1 is the first one affected. Cloudera has pointed to this Jira. Thank you. > convertMetastore should not ignore storage properties > - > > Key: SPARK-22158 > URL: https://issues.apache.org/jira/browse/SPARK-22158 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.2.1, 2.3.0 > > > From the beginning, convertMetastoreOrc ignores table properties and uses an > empty map instead. It's the same with convertMetastoreParquet. > {code} > val options = Map[String, String]() > {code} > - SPARK-14070: > https://github.com/apache/spark/pull/11891/files#diff-ee66e11b56c21364760a5ed2b783f863R650 > - master: > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L197 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22158) convertMetastore should not ignore storage properties
[ https://issues.apache.org/jira/browse/SPARK-22158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881592#comment-16881592 ] Dongjoon Hyun commented on SPARK-22158: --- [~Tagar], this is not related to that one, because that issue is reported against 2.2.0 and this fix is merged into 2.2.1. :) > convertMetastore should not ignore storage properties > - > > Key: SPARK-22158 > URL: https://issues.apache.org/jira/browse/SPARK-22158 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.2.1, 2.3.0 > > > From the beginning, convertMetastoreOrc ignores table properties and uses an > empty map instead. It's the same with convertMetastoreParquet. > {code} > val options = Map[String, String]() > {code} > - SPARK-14070: > https://github.com/apache/spark/pull/11891/files#diff-ee66e11b56c21364760a5ed2b783f863R650 > - master: > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L197 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
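The quoted snippet is the whole bug: the conversion path constructs the reader's options from a literal empty map, so the table's storage properties never reach the file-format reader. A minimal Python sketch of the mechanism (function and parameter names here are hypothetical, not Spark's actual API):

```python
# Hypothetical sketch of the convertMetastore flaw: building reader options
# from an empty map silently drops the table's storage properties.
def build_reader_options(storage_properties, propagate):
    if propagate:
        # fixed behavior: pass the table's storage properties through
        return dict(storage_properties)
    # buggy behavior, equivalent to: val options = Map[String, String]()
    return {}

props = {"orc.compress": "ZLIB"}
print(build_reader_options(props, propagate=False))  # {} - properties lost
print(build_reader_options(props, propagate=True))   # {'orc.compress': 'ZLIB'}
```

The fix shipped in 2.2.1/2.3.0 roughly corresponds to the `propagate=True` branch: the options map is derived from the table's metadata instead of being hard-coded empty.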
[jira] [Resolved] (SPARK-28140) Pyspark API to create spark.mllib RowMatrix from DataFrame
[ https://issues.apache.org/jira/browse/SPARK-28140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-28140. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24953 [https://github.com/apache/spark/pull/24953] > Pyspark API to create spark.mllib RowMatrix from DataFrame > -- > > Key: SPARK-28140 > URL: https://issues.apache.org/jira/browse/SPARK-28140 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Henry Davidge >Assignee: Henry Davidge >Priority: Minor > Fix For: 3.0.0 > > > Since many functions are only implemented in spark.mllib, it is often > necessary to convert DataFrames of spark.ml vectors to spark.mllib > distributed matrix formats. The first step, converting the spark.ml vectors > to the spark.mllib equivalent, is straightforward. However, to the best of my > knowledge it's not possible to convert the resulting DataFrame to a RowMatrix > without using a python lambda function, which can have a significant > performance hit. In my recent use case, SVD took 3.5m using the Scala API, > but 12m using Python. > To get around this performance hit, I propose adding a constructor to the > Pyspark RowMatrix class that accepts a DataFrame with a single column of > spark.mllib vectors. I'd be happy to add an equivalent API for > IndexedRowMatrix if there is demand. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28140) Pyspark API to create spark.mllib RowMatrix from DataFrame
[ https://issues.apache.org/jira/browse/SPARK-28140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-28140: -- Priority: Minor (was: Major) > Pyspark API to create spark.mllib RowMatrix from DataFrame > -- > > Key: SPARK-28140 > URL: https://issues.apache.org/jira/browse/SPARK-28140 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Henry Davidge >Priority: Minor > > Since many functions are only implemented in spark.mllib, it is often > necessary to convert DataFrames of spark.ml vectors to spark.mllib > distributed matrix formats. The first step, converting the spark.ml vectors > to the spark.mllib equivalent, is straightforward. However, to the best of my > knowledge it's not possible to convert the resulting DataFrame to a RowMatrix > without using a python lambda function, which can have a significant > performance hit. In my recent use case, SVD took 3.5m using the Scala API, > but 12m using Python. > To get around this performance hit, I propose adding a constructor to the > Pyspark RowMatrix class that accepts a DataFrame with a single column of > spark.mllib vectors. I'd be happy to add an equivalent API for > IndexedRowMatrix if there is demand. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28140) Pyspark API to create spark.mllib RowMatrix from DataFrame
[ https://issues.apache.org/jira/browse/SPARK-28140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-28140: - Assignee: Henry Davidge > Pyspark API to create spark.mllib RowMatrix from DataFrame > -- > > Key: SPARK-28140 > URL: https://issues.apache.org/jira/browse/SPARK-28140 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Henry Davidge >Assignee: Henry Davidge >Priority: Minor > > Since many functions are only implemented in spark.mllib, it is often > necessary to convert DataFrames of spark.ml vectors to spark.mllib > distributed matrix formats. The first step, converting the spark.ml vectors > to the spark.mllib equivalent, is straightforward. However, to the best of my > knowledge it's not possible to convert the resulting DataFrame to a RowMatrix > without using a python lambda function, which can have a significant > performance hit. In my recent use case, SVD took 3.5m using the Scala API, > but 12m using Python. > To get around this performance hit, I propose adding a constructor to the > Pyspark RowMatrix class that accepts a DataFrame with a single column of > spark.mllib vectors. I'd be happy to add an equivalent API for > IndexedRowMatrix if there is demand. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28321) functions.udf(UDF0, DataType) produces unexpected results
[ https://issues.apache.org/jira/browse/SPARK-28321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Matveev updated SPARK-28321: - Description: It looks like that the `f.udf(UDF0, DataType)` variant of the UDF Column-creating methods is wrong ([https://github.com/apache/spark/blob/c3e32bf06c35ba2580d46150923abfa795b4446a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L4061|https://github.com/apache/spark/blob/c3e32bf06c35ba2580d46150923abfa795b4446a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L4061):]): {code:java} def udf(f: UDF0[_], returnType: DataType): UserDefinedFunction = { val func = f.asInstanceOf[UDF0[Any]].call() SparkUserDefinedFunction.create(() => func, returnType, inputSchemas = Seq.fill(0)(None)) } {code} Here the UDF passed as the first argument will be called *right inside the `udf` method* on the driver, rather than at the dataframe computation time on executors. One of the major issues here is that non-deterministic UDFs (e.g. 
generating a random value) will produce unexpected results: {code:java} val scalaudf = f.udf { () => scala.util.Random.nextInt() }.asNondeterministic() val javaudf = f.udf(new UDF0[Int] { override def call(): Int = scala.util.Random.nextInt() }, IntegerType).asNondeterministic() (1 to 100).toDF().select(scalaudf().as("scala"), javaudf().as("java")).show() // prints +---+-+ | scala| java| +---+-+ | 934190385|478543809| |-1082102515|478543809| | 774466710|478543809| | 1883582103|478543809| |-1959743031|478543809| | 1534685218|478543809| | 1158899264|478543809| |-1572590653|478543809| | -309451364|478543809| | -906574467|478543809| | -436584308|478543809| | 1598340674|478543809| |-1331343156|478543809| |-1804177830|478543809| |-1682906106|478543809| | -197444289|478543809| | 260603049|478543809| |-1993515667|478543809| |-1304685845|478543809| | 481017016|478543809| +---+-{code} Note that the version which relies on a different overload of the `functions.udf` method works correctly. was: It looks like that the `f.udf(UDF0, DataType)` variant of the UDF Column-creating methods is wrong ([https://github.com/apache/spark/blob/c3e32bf06c35ba2580d46150923abfa795b4446a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L4061):] {code:java} def udf(f: UDF0[_], returnType: DataType): UserDefinedFunction = { val func = f.asInstanceOf[UDF0[Any]].call() SparkUserDefinedFunction.create(() => func, returnType, inputSchemas = Seq.fill(0)(None)) } {code} Here the UDF passed as the first argument will be called *right inside the `udf` method* on the driver, rather than at the dataframe computation time on executors. One of the major issues here is that non-deterministic UDFs (e.g. 
generating a random value) will produce unexpected results: {code:java} val scalaudf = f.udf { () => scala.util.Random.nextInt() }.asNondeterministic() val javaudf = f.udf(new UDF0[Int] { override def call(): Int = scala.util.Random.nextInt() }, IntegerType).asNondeterministic() (1 to 100).toDF().select(scalaudf().as("scala"), javaudf().as("java")).show() // prints +---+-+ | scala| java| +---+-+ | 934190385|478543809| |-1082102515|478543809| | 774466710|478543809| | 1883582103|478543809| |-1959743031|478543809| | 1534685218|478543809| | 1158899264|478543809| |-1572590653|478543809| | -309451364|478543809| | -906574467|478543809| | -436584308|478543809| | 1598340674|478543809| |-1331343156|478543809| |-1804177830|478543809| |-1682906106|478543809| | -197444289|478543809| | 260603049|478543809| |-1993515667|478543809| |-1304685845|478543809| | 481017016|478543809| +---+-{code} Note that the version which relies on a different overload of the `functions.udf` method works correctly. > functions.udf(UDF0, DataType) produces unexpected results > - > > Key: SPARK-28321 > URL: https://issues.apache.org/jira/browse/SPARK-28321 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.3 >Reporter: Vladimir Matveev >Priority: Major > > It looks like that the `f.udf(UDF0, DataType)` variant of the UDF > Column-creating methods is wrong > ([https://github.com/apache/spark/blob/c3e32bf06c35ba2580d46150923abfa795b4446a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L4061|https://github.com/apache/spark/blob/c3e32bf06c35ba2580d46150923abfa795b4446a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L4061):]): > > {code:java} > def udf(f: UDF0[_], returnType: DataType): UserDefinedFunction = { > val func = f.asInstanceOf[UDF0[Any]].call() > SparkUserDefinedFunction.create(() => func, returnType, inputSchemas = > Seq.fill(0)(Non
[jira] [Created] (SPARK-28321) functions.udf(UDF0, DataType) produces unexpected results
Vladimir Matveev created SPARK-28321: Summary: functions.udf(UDF0, DataType) produces unexpected results Key: SPARK-28321 URL: https://issues.apache.org/jira/browse/SPARK-28321 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3, 2.3.2 Reporter: Vladimir Matveev It looks like the `f.udf(UDF0, DataType)` variant of the UDF Column-creating methods is wrong ([https://github.com/apache/spark/blob/c3e32bf06c35ba2580d46150923abfa795b4446a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L4061):] {code:java} def udf(f: UDF0[_], returnType: DataType): UserDefinedFunction = { val func = f.asInstanceOf[UDF0[Any]].call() SparkUserDefinedFunction.create(() => func, returnType, inputSchemas = Seq.fill(0)(None)) } {code} Here the UDF passed as the first argument will be called *right inside the `udf` method* on the driver, rather than at DataFrame computation time on executors. One of the major issues here is that non-deterministic UDFs (e.g. generating a random value) will produce unexpected results: {code:java} val scalaudf = f.udf { () => scala.util.Random.nextInt() }.asNondeterministic() val javaudf = f.udf(new UDF0[Int] { override def call(): Int = scala.util.Random.nextInt() }, IntegerType).asNondeterministic() (1 to 100).toDF().select(scalaudf().as("scala"), javaudf().as("java")).show() // prints
+-----------+---------+
|      scala|     java|
+-----------+---------+
|  934190385|478543809|
|-1082102515|478543809|
|  774466710|478543809|
| 1883582103|478543809|
|-1959743031|478543809|
| 1534685218|478543809|
| 1158899264|478543809|
|-1572590653|478543809|
| -309451364|478543809|
| -906574467|478543809|
| -436584308|478543809|
| 1598340674|478543809|
|-1331343156|478543809|
|-1804177830|478543809|
|-1682906106|478543809|
| -197444289|478543809|
|  260603049|478543809|
|-1993515667|478543809|
|-1304685845|478543809|
|  481017016|478543809|
+-----------+---------+{code} Note that the version which relies on a different overload of the `functions.udf` method works correctly. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
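The defect is visible in the quoted overload itself: `f.asInstanceOf[UDF0[Any]].call()` runs once while the UserDefinedFunction is being built on the driver, and the closure `() => func` then returns that captured constant for every row. The eager-versus-deferred distinction can be reproduced without Spark (the helper names below are hypothetical, for illustration only):

```python
import random

def make_udf_eager(f):
    # mirrors the buggy overload: the function is invoked once, up front,
    # and every later invocation returns that captured constant
    value = f()
    return lambda: value

def make_udf_lazy(f):
    # mirrors the correct overloads: the call is deferred to each invocation
    return lambda: f()

eager = make_udf_eager(lambda: random.randrange(2**31))
lazy = make_udf_lazy(lambda: random.randrange(2**31))

eager_results = {eager() for _ in range(100)}
lazy_results = {lazy() for _ in range(100)}
print(len(eager_results))  # 1 - one repeated value, like the 'java' column
print(len(lazy_results))   # almost surely > 1, like the 'scala' column
```

This matches the reported output: the Scala-closure overload defers the call, so the `scala` column varies per row, while the `UDF0` overload bakes a single draw of `Random.nextInt()` into the plan.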
[jira] [Updated] (SPARK-28271) Convert and port 'pgSQL/aggregates_part2.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28271: -- Component/s: Tests > Convert and port 'pgSQL/aggregates_part2.sql' into UDF test base > > > Key: SPARK-28271 > URL: https://issues.apache.org/jira/browse/SPARK-28271 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > see SPARK-27883 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22158) convertMetastore should not ignore storage properties
[ https://issues.apache.org/jira/browse/SPARK-22158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881437#comment-16881437 ] Ruslan Dautkhanov edited comment on SPARK-22158 at 7/9/19 6:57 PM: --- [~dongjoon] can you please check if PR-20522 causes SPARK-28266 data correctness regression? Thank you. was (Author: tagar): [~dongjoon] can you please check if this causes SPARK-28266 data correctness regression? Thank you. > convertMetastore should not ignore storage properties > - > > Key: SPARK-22158 > URL: https://issues.apache.org/jira/browse/SPARK-22158 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.2.1, 2.3.0 > > > From the beginning, convertMetastoreOrc ignores table properties and uses an > empty map instead. It's the same with convertMetastoreParquet. > {code} > val options = Map[String, String]() > {code} > - SPARK-14070: > https://github.com/apache/spark/pull/11891/files#diff-ee66e11b56c21364760a5ed2b783f863R650 > - master: > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L197 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28310) ANSI SQL grammar support: first_value/last_value(expression, [RESPECT NULLS | IGNORE NULLS])
[ https://issues.apache.org/jira/browse/SPARK-28310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881462#comment-16881462 ] Dongjoon Hyun commented on SPARK-28310: --- I marked this to `Minor` because this is just a syntax acceptance issue. > ANSI SQL grammar support: first_value/last_value(expression, [RESPECT NULLS | > IGNORE NULLS]) > > > Key: SPARK-28310 > URL: https://issues.apache.org/jira/browse/SPARK-28310 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Priority: Minor > > According to the ANSI SQL 2011: > {code:sql} > ::= > ::= RESPECT NULLS | IGNORE NULLS > ::= > [ treatment> > ] > ::= > FIRST_VALUE | LAST_VALUE > {code} > Teradata - > [https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/SUwCpTupqmlBJvi2mipOaA] > > Oracle - > [https://docs.oracle.com/en/database/oracle/oracle-database/18/sqlrf/FIRST_VALUE.html#GUID-D454EC3F-370C-4C64-9B11-33FCB10D95EC] > Redshift – > [https://docs.aws.amazon.com/redshift/latest/dg/r_WF_first_value.html] > > Postgresql didn't implement the Ignore/respect nulls. > [https://www.postgresql.org/docs/devel/functions-window.html] > h3. Note > The SQL standard defines a {{RESPECT NULLS}} or {{IGNORE NULLS}} option for > {{lead}}, {{lag}}, {{first_value}}, {{last_value}}, and {{nth_value}}. This > is not implemented in PostgreSQL: the behavior is always the same as the > standard's default, namely {{RESPECT NULLS}}. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28310) ANSI SQL grammar support: first_value/last_value(expression, [RESPECT NULLS | IGNORE NULLS])
[ https://issues.apache.org/jira/browse/SPARK-28310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28310: -- Priority: Minor (was: Major) > ANSI SQL grammar support: first_value/last_value(expression, [RESPECT NULLS | > IGNORE NULLS]) > > > Key: SPARK-28310 > URL: https://issues.apache.org/jira/browse/SPARK-28310 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Priority: Minor > > According to the ANSI SQL 2011: > {code:sql} > ::= > ::= RESPECT NULLS | IGNORE NULLS > ::= > [ treatment> > ] > ::= > FIRST_VALUE | LAST_VALUE > {code} > Teradata - > [https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/SUwCpTupqmlBJvi2mipOaA] > > Oracle - > [https://docs.oracle.com/en/database/oracle/oracle-database/18/sqlrf/FIRST_VALUE.html#GUID-D454EC3F-370C-4C64-9B11-33FCB10D95EC] > Redshift – > [https://docs.aws.amazon.com/redshift/latest/dg/r_WF_first_value.html] > > Postgresql didn't implement the Ignore/respect nulls. > [https://www.postgresql.org/docs/devel/functions-window.html] > h3. Note > The SQL standard defines a {{RESPECT NULLS}} or {{IGNORE NULLS}} option for > {{lead}}, {{lag}}, {{first_value}}, {{last_value}}, and {{nth_value}}. This > is not implemented in PostgreSQL: the behavior is always the same as the > standard's default, namely {{RESPECT NULLS}}. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
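For readers unfamiliar with the requested grammar: IGNORE NULLS skips null rows when picking the first or last value of a window frame, while the default RESPECT NULLS takes the boundary row as-is. A behavioral sketch in plain Python, with a list standing in for a window frame (the semantics follow the ANSI definition quoted above; this is not Spark's implementation):

```python
def first_value(frame, ignore_nulls=False):
    # RESPECT NULLS (the default) returns the first row as-is;
    # IGNORE NULLS scans forward past nulls to the first non-null value.
    for v in frame:
        if v is not None or not ignore_nulls:
            return v
    return None  # all-null frame yields null under IGNORE NULLS

def last_value(frame, ignore_nulls=False):
    # last_value is first_value over the reversed frame
    return first_value(list(reversed(frame)), ignore_nulls)

frame = [None, 10, 20, None]
print(first_value(frame))                     # None (RESPECT NULLS)
print(first_value(frame, ignore_nulls=True))  # 10
print(last_value(frame, ignore_nulls=True))   # 20
```

This is why the ticket is only a syntax-acceptance issue for RESPECT NULLS (it is the existing behavior), while IGNORE NULLS adds the skip-nulls scan.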
[jira] [Assigned] (SPARK-28234) Spark Resources - add python support to get resources
[ https://issues.apache.org/jira/browse/SPARK-28234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28234: Assignee: (was: Apache Spark) > Spark Resources - add python support to get resources > - > > Key: SPARK-28234 > URL: https://issues.apache.org/jira/browse/SPARK-28234 > Project: Spark > Issue Type: Story > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > Add the equivalent python api for sc.resources and TaskContext.resources -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28234) Spark Resources - add python support to get resources
[ https://issues.apache.org/jira/browse/SPARK-28234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28234: Assignee: Apache Spark > Spark Resources - add python support to get resources > - > > Key: SPARK-28234 > URL: https://issues.apache.org/jira/browse/SPARK-28234 > Project: Spark > Issue Type: Story > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Apache Spark >Priority: Major > > Add the equivalent python api for sc.resources and TaskContext.resources -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28234) Spark Resources - add python support to get resources
[ https://issues.apache.org/jira/browse/SPARK-28234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-28234: - Assignee: Thomas Graves > Spark Resources - add python support to get resources > - > > Key: SPARK-28234 > URL: https://issues.apache.org/jira/browse/SPARK-28234 > Project: Spark > Issue Type: Story > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > Add the equivalent python api for sc.resources and TaskContext.resources -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28271) Convert and port 'pgSQL/aggregates_part2.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28271: Assignee: Apache Spark > Convert and port 'pgSQL/aggregates_part2.sql' into UDF test base > > > Key: SPARK-28271 > URL: https://issues.apache.org/jira/browse/SPARK-28271 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > see SPARK-27883 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28271) Convert and port 'pgSQL/aggregates_part2.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28271: Assignee: (was: Apache Spark) > Convert and port 'pgSQL/aggregates_part2.sql' into UDF test base > > > Key: SPARK-28271 > URL: https://issues.apache.org/jira/browse/SPARK-28271 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > see SPARK-27883 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28320) Spark job eventually fails after several "attempted to access non-existent accumulator" in DAGScheduler
Martin Studer created SPARK-28320: - Summary: Spark job eventually fails after several "attempted to access non-existent accumulator" in DAGScheduler Key: SPARK-28320 URL: https://issues.apache.org/jira/browse/SPARK-28320 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: Martin Studer I'm running into an issue where a Spark 2.3.0 (Hortonworks HDP 2.6.5) job eventually fails with {noformat} ERROR ApplicationMaster: User application exited with status 1 INFO ApplicationMaster: Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1) INFO SparkContext: Invoking stop() from shutdown hook {noformat} after receiving several exception of the form {noformat} ERROR DAGScheduler: Failed to update accumulators for task 0 org.apache.spark.SparkException: attempted to access non-existent accumulator 39052 at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1130) at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1124) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1124) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1207) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1817) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758) {noformat} In addition to "attempted to access non-existent accumulator" I have also noticed some (but much less) instances of "Attempted to access garbage collected accumulator": {noformat} ERROR DAGScheduler: Failed to update accumulators for task 0 java.lang.IllegalStateException: Attempted to access 
garbage collected accumulator 38352 at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265) at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261) at scala.Option.map(Option.scala:146) at org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261) at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1127) at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1124) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1124) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1207) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1817) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) {noformat} To provide some more context: This happens in a recursive algorithm implemented in pyspark where I leverage data frame checkpointing to truncate the lineage graph. Checkpointing is done asynchronously by invoking the count action on a different thread when recursing (using Python thread pools). While "attempted to access garbage collected accumulator" seems to be an unexpected (illegal state) exception, it's unclear to me whether "attempted to access non-existent accumulator" is an expected exception in some circumstances, specifically related to checkpointing. The issue looks somewhat related to https://issues.apache.org/jira/browse/SPARK-22371 but that issue does not mention "attempted to access non-existent accumulator". 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
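The "Attempted to access garbage collected accumulator" error is consistent with the driver-side registry holding only weak references to accumulators: once the last strong reference is dropped (plausible here, where a checkpointing thread lets a DataFrame and its accumulators go out of scope), the accumulator can disappear between task completion and the accumulator-update step. A sketch of that weak-reference mechanism without Spark (the registry shape is illustrative, not Spark's actual AccumulatorContext):

```python
import gc
import weakref

class Accumulator:
    def __init__(self, acc_id):
        self.id = acc_id

# registry of weak references, standing in for the driver-side context
registry = {}

def register(acc):
    registry[acc.id] = weakref.ref(acc)

def lookup(acc_id):
    ref = registry.get(acc_id)
    if ref is None:
        # id was never registered (or its entry was removed)
        raise KeyError(f"attempted to access non-existent accumulator {acc_id}")
    acc = ref()
    if acc is None:
        # registered, but the object itself has been collected
        raise RuntimeError(
            f"Attempted to access garbage collected accumulator {acc_id}")
    return acc

acc = Accumulator(38352)
register(acc)
assert lookup(38352) is acc

del acc       # drop the last strong reference...
gc.collect()  # ...and let the object be reclaimed
try:
    lookup(38352)
except RuntimeError as e:
    print(e)  # Attempted to access garbage collected accumulator 38352
```

The two error messages in the report map onto the two failure branches above: "non-existent" when the registry entry is gone, "garbage collected" when the entry survives but the weak reference is dead.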
[jira] [Commented] (SPARK-22158) convertMetastore should not ignore storage properties
[ https://issues.apache.org/jira/browse/SPARK-22158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881437#comment-16881437 ] Ruslan Dautkhanov commented on SPARK-22158: --- [~dongjoon] can you please check if this causes SPARK-28266 data correctness regression? Thank you. > convertMetastore should not ignore storage properties > - > > Key: SPARK-22158 > URL: https://issues.apache.org/jira/browse/SPARK-22158 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.2.1, 2.3.0 > > > From the beginning, convertMetastoreOrc ignores table properties and uses an > empty map instead. It's the same with convertMetastoreParquet. > {code} > val options = Map[String, String]() > {code} > - SPARK-14070: > https://github.com/apache/spark/pull/11891/files#diff-ee66e11b56c21364760a5ed2b783f863R650 > - master: > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L197 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28319) DataSourceV2: Support SHOW TABLES
Ryan Blue created SPARK-28319: - Summary: DataSourceV2: Support SHOW TABLES Key: SPARK-28319 URL: https://issues.apache.org/jira/browse/SPARK-28319 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue SHOW TABLES needs to support v2 catalogs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28157) Make SHS clear KVStore LogInfo for the blacklisted entries
[ https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881352#comment-16881352 ] Dongjoon Hyun commented on SPARK-28157: --- I raised it as a blocker because it causes missing information in the event log listing, which is a core Spark History Server feature. > Make SHS clear KVStore LogInfo for the blacklisted entries > -- > > Key: SPARK-28157 > URL: https://issues.apache.org/jira/browse/SPARK-28157 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Blocker > Fix For: 2.3.4, 2.4.4, 3.0.0 > > > At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to > the file system, and maintains a blacklist of all event log files that failed > once at reading. The blacklisted log files are released back after > CLEAN_INTERVAL_S. > However, files whose size doesn't change are ignored forever because > shouldReloadLog always returns false when the size is the same as the value > in the KVStore. This is recovered only via an SHS restart. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28157) Make SHS clear KVStore LogInfo for the blacklisted entries
[ https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28157: -- Priority: Blocker (was: Major) > Make SHS clear KVStore LogInfo for the blacklisted entries > -- > > Key: SPARK-28157 > URL: https://issues.apache.org/jira/browse/SPARK-28157 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Blocker > Fix For: 2.3.4, 2.4.4, 3.0.0 > > > At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to > the file system, and maintains a blacklist of all event log files that failed > once at reading. The blacklisted log files are released back after > CLEAN_INTERVAL_S. > However, files whose size doesn't change are ignored forever because > shouldReloadLog always returns false when the size is the same as the value > in the KVStore. This is recovered only via an SHS restart. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
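The stale-entry condition in the description reduces to a size comparison: a blacklisted log whose file stops growing will never satisfy the reload check again. A sketch of that check (the method name is taken from the description; the surrounding SHS logic is simplified away):

```python
def should_reload_log(known_size, current_size):
    # simplified form of the check described above: reload only when the
    # on-disk size differs from the size recorded in the KVStore
    return current_size != known_size

# a blacklisted entry whose file stopped growing is never retried:
print(should_reload_log(known_size=1024, current_size=1024))  # False, forever
print(should_reload_log(known_size=1024, current_size=2048))  # True
```

Hence the fix in the ticket title: clear the KVStore LogInfo for blacklisted entries so the next scan sees no recorded size and re-reads the file, rather than waiting for a size change that may never come.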