[jira] [Commented] (SPARK-36476) cloudpickle: ValueError: Cell is empty
[ https://issues.apache.org/jira/browse/SPARK-36476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482308#comment-17482308 ] Pedro Larroy commented on SPARK-36476:
--------------------------------------

This seems to happen as an interaction with the package "dill", and only on Python 3.7. This was explained here, and I verified the reproduction in my codebase: https://stackoverflow.com/questions/69360462/conflict-between-dill-and-pickle-while-using-pyspark

> cloudpickle: ValueError: Cell is empty
> --------------------------------------
>
>                 Key: SPARK-36476
>                 URL: https://issues.apache.org/jira/browse/SPARK-36476
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.2
>            Reporter: Oliver Mannion
>            Priority: Major
>
> {code:java}
>   File "/Users/tekumara/code/awesome-spark-app/.venv/lib/python3.7/site-packages/pyspark/serializers.py", line 437, in dumps
>     return cloudpickle.dumps(obj, pickle_protocol)
>   File "/Users/tekumara/code/awesome-spark-app/.venv/lib/python3.7/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 101, in dumps
>     cp.dump(obj)
>   File "/Users/tekumara/code/awesome-spark-app/.venv/lib/python3.7/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
>     return Pickler.dump(self, obj)
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 437, in dump
>     self.save(obj)
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 504, in save
>     f(self, obj) # Call unbound method with explicit self
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 789, in save_tuple
>     save(element)
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 504, in save
>     f(self, obj) # Call unbound method with explicit self
>   File "/Users/tekumara/code/awesome-spark-app/.venv/lib/python3.7/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 722, in save_function
>     *self._dynamic_function_reduce(obj), obj=obj
>   File "/Users/tekumara/code/awesome-spark-app/.venv/lib/python3.7/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 659, in _save_reduce_pickle5
>     dictitems=dictitems, obj=obj
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 638, in save_reduce
>     save(args)
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 504, in save
>     f(self, obj) # Call unbound method with explicit self
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 789, in save_tuple
>     save(element)
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 504, in save
>     f(self, obj) # Call unbound method with explicit self
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 774, in save_tuple
>     save(element)
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 504, in save
>     f(self, obj) # Call unbound method with explicit self
>   File "/Users/tekumara/code/awesome-spark-app/.venv/lib/python3.7/site-packages/dill/_dill.py", line 1226, in save_cell
>     f = obj.cell_contents
> ValueError: Cell is empty
> {code}
> Doesn't occur in Spark 3.0.0, so possibly introduced when cloudpickle was upgraded to 1.5.0 (see https://issues.apache.org/jira/browse/SPARK-32094).
> Also doesn't occur in Spark 3.1.2 with Python 3.8.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
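The "Cell is empty" condition that dill's save_cell trips over in the traceback above can be reproduced in plain Python, with no Spark or dill involved: reading the contents of a closure cell whose variable was never assigned raises exactly this ValueError. A minimal sketch (illustrative only, not the Spark code path):

```python
# A closure cell is created for "x" because inner() references it, but the
# assignment never executes, so the cell stays empty. Reading cell_contents
# on an empty cell is what dill's save_cell does (f = obj.cell_contents).
def outer():
    if False:
        x = 1          # dead branch: the cell for x is never filled
    def inner():
        return x       # closes over x, creating the (empty) cell
    return inner

fn = outer()
cell = fn.__closure__[0]
try:
    cell.cell_contents
except ValueError as err:
    print(err)         # -> Cell is empty
```

cloudpickle 1.5.0+ special-cases empty cells when serializing dynamic functions, which is presumably why the conflict only surfaces when dill's registered save_cell runs instead.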
[jira] [Commented] (SPARK-33326) Partition Parameters are not updated even after ANALYZE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-33326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482294#comment-17482294 ] Apache Spark commented on SPARK-33326:
--------------------------------------

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/35329

> Partition Parameters are not updated even after ANALYZE TABLE command
> ---------------------------------------------------------------------
>
>                 Key: SPARK-33326
>                 URL: https://issues.apache.org/jira/browse/SPARK-33326
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.1
>            Reporter: Daniel Bondor
>            Priority: Major
>
> Here are the reproduction steps:
> {code:java}
> scala> spark.sql("CREATE TABLE t (a string, b string) PARTITIONED BY (p string) STORED AS PARQUET")
> Hive Session ID = d44e21ee-2d5c-48ab-91bf-26cb25775486
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('aaa', 'bbb')")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('ccc', 'ddd')")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("ANALYZE TABLE t PARTITION(p='p1') COMPUTE STATISTICS")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESCRIBE FORMATTED t PARTITION (p='p1')").show(50, false)
> ...
> |Partition Parameters |{rawDataSize=0, numFiles=1, numFilesErasureCoded=0, transient_lastDdlTime=1604404640, totalSize=532, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true","COLUMN_STATS":{"a":"true","b":"true"}}, numRows=0}|
> ...
> |Partition Statistics |1064 bytes, 2 rows |
> ...
> {code}
> My expectation would be that the Partition Parameters should be updated after ANALYZE TABLE.
[jira] [Assigned] (SPARK-33326) Partition Parameters are not updated even after ANALYZE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-33326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33326:
------------------------------------

Assignee: Apache Spark

> Key: SPARK-33326
> URL: https://issues.apache.org/jira/browse/SPARK-33326
[jira] [Assigned] (SPARK-33326) Partition Parameters are not updated even after ANALYZE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-33326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33326:
------------------------------------

Assignee: (was: Apache Spark)

> Key: SPARK-33326
> URL: https://issues.apache.org/jira/browse/SPARK-33326
[jira] [Resolved] (SPARK-38032) Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL
[ https://issues.apache.org/jira/browse/SPARK-38032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38032.
----------------------------------

Fix Version/s: 3.3.0
Resolution: Fixed

Issue resolved by pull request 35331
[https://github.com/apache/spark/pull/35331]

> Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL
> ---------------------------------------------------------
>
>                 Key: SPARK-38032
>                 URL: https://issues.apache.org/jira/browse/SPARK-38032
>             Project: Spark
>          Issue Type: Test
>          Components: PySpark, SQL
>    Affects Versions: 3.3
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Minor
>             Fix For: 3.3.0
>
> We should test the latest PyArrow version. PyArrow 6.0.1 is now released, but we are still pinning < 5.0.0 in https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala
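In effect, the change widens the accepted PyArrow range from "below 5.0.0" to "below 7.0.0", which admits the 6.0.1 release. A toy sketch of that constraint check in plain Python (the helper names here are illustrative, not the code in IntegratedUDFTestUtils.scala):

```python
# Illustrative version-pin check: parse "X.Y.Z" strings into tuples so they
# compare numerically, then test the installed version against an upper bound.
def version_tuple(v: str) -> tuple:
    """Parse a dotted version string into a comparable tuple of ints."""
    return tuple(int(part) for part in v.split("."))

def satisfies_pin(installed: str, upper: str = "7.0.0") -> bool:
    """True if the installed version is strictly below the upper bound."""
    return version_tuple(installed) < version_tuple(upper)

print(satisfies_pin("6.0.1"))               # -> True (accepted by the new pin)
print(satisfies_pin("6.0.1", upper="5.0.0"))  # -> False (rejected by the old pin)
```

Tuple comparison also avoids the classic string-comparison bug where "10.0.0" < "9.0.0".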
[jira] [Assigned] (SPARK-38032) Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL
[ https://issues.apache.org/jira/browse/SPARK-38032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38032:
------------------------------------

Assignee: Hyukjin Kwon

> Key: SPARK-38032
> URL: https://issues.apache.org/jira/browse/SPARK-38032
[jira] [Resolved] (SPARK-38031) Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, Python 3.9)
[ https://issues.apache.org/jira/browse/SPARK-38031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38031.
----------------------------------

Fix Version/s: 3.3.0
Resolution: Fixed

Issue resolved by pull request 35330
[https://github.com/apache/spark/pull/35330]

> Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, Python 3.9)
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-38031
>                 URL: https://issues.apache.org/jira/browse/SPARK-38031
>             Project: Spark
>          Issue Type: Test
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Minor
>             Fix For: 3.3.0
>
> Update the chart generated by SPARK-32722.
[jira] [Assigned] (SPARK-38031) Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, Python 3.9)
[ https://issues.apache.org/jira/browse/SPARK-38031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38031:
------------------------------------

Assignee: Hyukjin Kwon

> Key: SPARK-38031
> URL: https://issues.apache.org/jira/browse/SPARK-38031
[jira] [Commented] (SPARK-37946) Use error classes in the execution errors related to partitions
[ https://issues.apache.org/jira/browse/SPARK-37946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482281#comment-17482281 ] Yuto Akutsu commented on SPARK-37946:
-------------------------------------

[~maxgekk] I will work on this.

> Use error classes in the execution errors related to partitions
> ---------------------------------------------------------------
>
>                 Key: SPARK-37946
>                 URL: https://issues.apache.org/jira/browse/SPARK-37946
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Max Gekk
>            Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * unableToDeletePartitionPathError
> * unableToCreatePartitionPathError
> * unableToRenamePartitionPathError
> * notADatasourceRDDPartitionError
> * cannotClearPartitionDirectoryError
> * failedToCastValueToDataTypeForPartitionColumnError
> * unsupportedPartitionTransformError
> * cannotCreateJDBCTableWithPartitionsError
> * requestedPartitionsMismatchTablePartitionsError
> * dynamicPartitionKeyNotAmongWrittenPartitionPathsError
> * cannotRemovePartitionDirError
> * alterTableWithDropPartitionAndPurgeUnsupportedError
> * invalidPartitionFilterError
> * getPartitionMetadataByFilterError
> * illegalLocationClauseForViewPartitionError
> * partitionColumnNotFoundInSchemaError
> * cannotAddMultiPartitionsOnNonatomicPartitionTableError
> * cannotDropMultiPartitionsOnNonatomicPartitionTableError
> * truncateMultiPartitionUnsupportedError
> * dynamicPartitionOverwriteUnsupportedByTableError
> * writePartitionExceedConfigSizeWhenDynamicPartitionError
> to use error classes. Throw an implementation of SparkThrowable. Also write a test for every error in QueryExecutionErrorsSuite.
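For readers unfamiliar with the error-class migration: the pattern moves message text out of ad-hoc exception constructors into a registry keyed by a stable error class, and the thrown exception carries the class name plus message parameters, so tests can assert on the class rather than on brittle message strings. A rough Python model of the idea (the real mechanism is Scala's SparkThrowable backed by Spark's error-classes.json; every name and registry entry below is illustrative):

```python
# Toy model of the error-class pattern. The registry entries are made up for
# illustration and are NOT Spark's real error-classes.json contents.
ERROR_CLASSES = {
    "UNABLE_TO_CREATE_PARTITION_PATH": "Unable to create partition path {path}",
    "INVALID_PARTITION_FILTER": "Invalid partition filter: {filter}",
}

class SparkThrowableSketch(Exception):
    """Carries a machine-readable error class alongside the formatted message."""
    def __init__(self, error_class: str, **params):
        self.error_class = error_class
        self.message_parameters = params
        super().__init__(ERROR_CLASSES[error_class].format(**params))

err = SparkThrowableSketch("UNABLE_TO_CREATE_PARTITION_PATH", path="/tmp/p=1")
print(err.error_class)  # -> UNABLE_TO_CREATE_PARTITION_PATH
print(err)              # -> Unable to create partition path /tmp/p=1
```

A per-error test then only needs to trigger the condition and assert on `error_class` and `message_parameters`, which is what the suites mentioned in these tickets do on the Scala side.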
[jira] [Commented] (SPARK-37937) Use error classes in the parsing errors of lateral join
[ https://issues.apache.org/jira/browse/SPARK-37937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482278#comment-17482278 ] Apache Spark commented on SPARK-37937:
--------------------------------------

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/35328

> Use error classes in the parsing errors of lateral join
> -------------------------------------------------------
>
>                 Key: SPARK-37937
>                 URL: https://issues.apache.org/jira/browse/SPARK-37937
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Max Gekk
>            Priority: Major
>
> Migrate the following errors in QueryParsingErrors:
> * lateralJoinWithNaturalJoinUnsupportedError
> * lateralJoinWithUsingJoinUnsupportedError
> * unsupportedLateralJoinTypeError
> * invalidLateralJoinRelationError
> to use error classes. Throw an implementation of SparkThrowable. Also write a test for every error in QueryParsingErrorsSuite.
[jira] [Assigned] (SPARK-37937) Use error classes in the parsing errors of lateral join
[ https://issues.apache.org/jira/browse/SPARK-37937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37937:
------------------------------------

Assignee: Apache Spark

> Key: SPARK-37937
> URL: https://issues.apache.org/jira/browse/SPARK-37937
[jira] [Assigned] (SPARK-37937) Use error classes in the parsing errors of lateral join
[ https://issues.apache.org/jira/browse/SPARK-37937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37937:
------------------------------------

Assignee: (was: Apache Spark)

> Key: SPARK-37937
> URL: https://issues.apache.org/jira/browse/SPARK-37937
[jira] [Commented] (SPARK-37937) Use error classes in the parsing errors of lateral join
[ https://issues.apache.org/jira/browse/SPARK-37937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482277#comment-17482277 ] Apache Spark commented on SPARK-37937:
--------------------------------------

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/35328

> Key: SPARK-37937
> URL: https://issues.apache.org/jira/browse/SPARK-37937
[jira] [Assigned] (SPARK-38030) Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
[ https://issues.apache.org/jira/browse/SPARK-38030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38030:
------------------------------------

Assignee: (was: Apache Spark)

> Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-38030
>                 URL: https://issues.apache.org/jira/browse/SPARK-38030
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: Shardul Mahadik
>            Priority: Major
>
> One of our user queries failed in Spark 3.1.1 when using AQE, with the stacktrace mentioned below (some parts of the plan have been redacted, but the structure is preserved).
> Debugging this issue, we found that the failure was within AQE calling [QueryPlan.canonicalized|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L402]. The query contains a cast over a column with non-nullable struct fields. Canonicalization [removes nullability information|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L45] from the child {{AttributeReference}} of the Cast; however, it does not remove nullability information from the Cast's target dataType. This causes [checkInputDataTypes|https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L290] to return false, because the child is now nullable while the cast target data type is not, leading to {{resolved=false}} and hence the {{UnresolvedException}}.
> {code:java}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
> Exchange RoundRobinPartitioning(1), REPARTITION_BY_NUM, [id=#232]
> +- Union
>    :- Project [cast(columnA#30) as struct<...>]
>    :  +- BatchScan[columnA#30] hive.tbl
>    +- Project [cast(columnA#35) as struct<...>]
>       +- BatchScan[columnA#35] hive.tbl
>   at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:475)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:464)
>   at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:87)
>   at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:58)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:301)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:405)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:373)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:372)
>   at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:404)
>   at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447)
>   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447)
>   at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447)
>   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447)
>   at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:184)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
>   at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:179)
>   at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:279)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3696)
>   at ...
> {code}
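The nullability mismatch described in the report can be modeled in a few lines outside Catalyst. In this toy sketch (all types and helpers are simplified stand-ins, not Spark code), canonicalization forces the child's struct fields to nullable while the cast's target type keeps nullable = false, so a strict resolution check flips from true to false:

```python
# Toy model of the SPARK-38030 failure mode: canonicalizing only the child's
# type (forcing nullable = True) while leaving the cast target untouched makes
# a strict "nullable child cannot cast to non-nullable target" check fail.
from dataclasses import dataclass

@dataclass(frozen=True)
class StructField:
    name: str
    nullable: bool

def canonicalize(fields):
    """Mimics canonicalization stripping nullability info: everything nullable."""
    return tuple(StructField(f.name, True) for f in fields)

def cast_resolves(child_fields, target_fields):
    """Strict check: a nullable child field cannot cast to a non-nullable target."""
    return all(t.nullable or not c.nullable
               for c, t in zip(child_fields, target_fields))

child = (StructField("a", False),)
target = (StructField("a", False),)   # cast target keeps nullable = False

print(cast_resolves(child, target))                 # -> True (plan resolves)
print(cast_resolves(canonicalize(child), target))   # -> False (resolved=false)
```

The asymmetry is the whole bug: either both sides or neither side should have nullability stripped before the check runs.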
[jira] [Commented] (SPARK-38030) Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
[ https://issues.apache.org/jira/browse/SPARK-38030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482271#comment-17482271 ] Apache Spark commented on SPARK-38030:
--------------------------------------

User 'shardulm94' has created a pull request for this issue:
https://github.com/apache/spark/pull/35332

> Key: SPARK-38030
> URL: https://issues.apache.org/jira/browse/SPARK-38030
[jira] [Assigned] (SPARK-38030) Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
[ https://issues.apache.org/jira/browse/SPARK-38030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38030:
------------------------------------

Assignee: Apache Spark

> Key: SPARK-38030
> URL: https://issues.apache.org/jira/browse/SPARK-38030
[jira] [Assigned] (SPARK-38032) Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL
[ https://issues.apache.org/jira/browse/SPARK-38032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38032: Assignee: (was: Apache Spark) > Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL > - > > Key: SPARK-38032 > URL: https://issues.apache.org/jira/browse/SPARK-38032 > Project: Spark > Issue Type: Test > Components: PySpark, SQL >Affects Versions: 3.3 >Reporter: Hyukjin Kwon >Priority: Minor > > We should test the latest PyArrow version. 6.0.1 is now released, but we're > still using < 5.0.0 for > https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38032) Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL
[ https://issues.apache.org/jira/browse/SPARK-38032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482242#comment-17482242 ] Apache Spark commented on SPARK-38032: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/35331 > Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL > - > > Key: SPARK-38032 > URL: https://issues.apache.org/jira/browse/SPARK-38032 > Project: Spark > Issue Type: Test > Components: PySpark, SQL >Affects Versions: 3.3 >Reporter: Hyukjin Kwon >Priority: Minor > > We should test the latest PyArrow version. 6.0.1 is now released, but we're > still using < 5.0.0 for > https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala
[jira] [Assigned] (SPARK-38032) Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL
[ https://issues.apache.org/jira/browse/SPARK-38032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38032: Assignee: Apache Spark > Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL > - > > Key: SPARK-38032 > URL: https://issues.apache.org/jira/browse/SPARK-38032 > Project: Spark > Issue Type: Test > Components: PySpark, SQL >Affects Versions: 3.3 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > We should test the latest PyArrow version. 6.0.1 is now released, but we're > still using < 5.0.0 for > https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala
[jira] [Created] (SPARK-38032) Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL
Hyukjin Kwon created SPARK-38032: Summary: Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL Key: SPARK-38032 URL: https://issues.apache.org/jira/browse/SPARK-38032 Project: Spark Issue Type: Test Components: PySpark, SQL Affects Versions: 3.3 Reporter: Hyukjin Kwon We should test the latest PyArrow version. 6.0.1 is now released, but we're still using < 5.0.0 for https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala
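A version guard of the kind SPARK-38032 asks for can be sketched in plain Python. The bounds `1.0.0 <= version < 7.0.0` and the parsing helper below are illustrative assumptions, not Spark's actual test code:

```python
def parse_version(v: str) -> tuple:
    # Convert "6.0.1" -> (6, 0, 1); pre-release suffixes are ignored for simplicity.
    parts = []
    for piece in v.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def arrow_version_supported(version: str,
                            minimum: str = "1.0.0",
                            exclusive_max: str = "7.0.0") -> bool:
    # True when minimum <= version < exclusive_max; the upper bound is
    # exclusive, matching the "< 7.0.0" phrasing in the issue title.
    return parse_version(minimum) <= parse_version(version) < parse_version(exclusive_max)
```

With these hypothetical bounds, `arrow_version_supported("6.0.1")` is true while `arrow_version_supported("7.0.0")` is false, so the newly released 6.0.1 would be exercised by the tests.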
[jira] [Commented] (SPARK-38031) Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, Python 3.9)
[ https://issues.apache.org/jira/browse/SPARK-38031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482239#comment-17482239 ] Apache Spark commented on SPARK-38031: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/35330 > Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, > Python 3.9) > - > > Key: SPARK-38031 > URL: https://issues.apache.org/jira/browse/SPARK-38031 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Update the chart generated by SPARK-32722.
[jira] [Assigned] (SPARK-38031) Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, Python 3.9)
[ https://issues.apache.org/jira/browse/SPARK-38031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38031: Assignee: (was: Apache Spark) > Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, > Python 3.9) > - > > Key: SPARK-38031 > URL: https://issues.apache.org/jira/browse/SPARK-38031 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Update the chart generated by SPARK-32722.
[jira] [Assigned] (SPARK-38031) Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, Python 3.9)
[ https://issues.apache.org/jira/browse/SPARK-38031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38031: Assignee: Apache Spark > Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, > Python 3.9) > - > > Key: SPARK-38031 > URL: https://issues.apache.org/jira/browse/SPARK-38031 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > Update the chart generated by SPARK-32722.
[jira] [Commented] (SPARK-38031) Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, Python 3.9)
[ https://issues.apache.org/jira/browse/SPARK-38031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482237#comment-17482237 ] Apache Spark commented on SPARK-38031: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/35330 > Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, > Python 3.9) > - > > Key: SPARK-38031 > URL: https://issues.apache.org/jira/browse/SPARK-38031 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Update the chart generated by SPARK-32722.
[jira] [Assigned] (SPARK-38003) Differentiate scalar and table function lookup in LookupFunctions
[ https://issues.apache.org/jira/browse/SPARK-38003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-38003: --- Assignee: Allison Wang > Differentiate scalar and table function lookup in LookupFunctions > - > > Key: SPARK-38003 > URL: https://issues.apache.org/jira/browse/SPARK-38003 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > > Currently, the LookupFunctions rule looks up unresolved scalar functions > using both the scalar function registry and the table function registry. We > should differentiate scalar and table function lookup in the Analyzer rule > LookupFunctions.
[jira] [Resolved] (SPARK-38003) Differentiate scalar and table function lookup in LookupFunctions
[ https://issues.apache.org/jira/browse/SPARK-38003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-38003. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35304 [https://github.com/apache/spark/pull/35304] > Differentiate scalar and table function lookup in LookupFunctions > - > > Key: SPARK-38003 > URL: https://issues.apache.org/jira/browse/SPARK-38003 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.3.0 > > > Currently, the LookupFunctions rule looks up unresolved scalar functions > using both the scalar function registry and the table function registry. We > should differentiate scalar and table function lookup in the Analyzer rule > LookupFunctions.
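The separation the issue describes can be sketched with two independent registries that are consulted separately. The class and function names below are hypothetical illustrations of the idea, not the actual Analyzer code:

```python
class FunctionRegistry:
    """Toy registry mapping a lower-cased function name to an implementation."""
    def __init__(self):
        self._functions = {}

    def register(self, name, fn):
        self._functions[name.lower()] = fn

    def lookup(self, name):
        return self._functions.get(name.lower())

# Two separate registries, mirroring the scalar/table split.
scalar_registry = FunctionRegistry()
table_registry = FunctionRegistry()

scalar_registry.register("upper", lambda s: s.upper())
table_registry.register("range", lambda n: list(range(n)))

def lookup_function(name, is_table_function):
    # Differentiated lookup: only the matching registry is consulted, so a
    # scalar name never accidentally resolves against the table registry.
    registry = table_registry if is_table_function else scalar_registry
    fn = registry.lookup(name)
    if fn is None:
        kind = "table" if is_table_function else "scalar"
        raise LookupError(f"Undefined {kind} function: {name}")
    return fn
```

With this shape, an unresolved scalar function that only exists as a table function (or vice versa) fails fast with a kind-specific error instead of silently resolving.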
[jira] [Created] (SPARK-38031) Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, Python 3.9)
Hyukjin Kwon created SPARK-38031: Summary: Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, Python 3.9) Key: SPARK-38031 URL: https://issues.apache.org/jira/browse/SPARK-38031 Project: Spark Issue Type: Test Components: PySpark Affects Versions: 3.3.0 Reporter: Hyukjin Kwon Update the chart generated by SPARK-32722.
[jira] [Commented] (SPARK-38030) Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
[ https://issues.apache.org/jira/browse/SPARK-38030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482233#comment-17482233 ] Shardul Mahadik commented on SPARK-38030: - I plan to create a PR to change the canonicalization behavior of {{Cast}} so that nullability information is removed from the target data type of {{Cast}} during canonicalization. However, the canonicalization implementation has changed drastically between Spark 3.1.1 and master, so I will probably create two PRs, 1 for master, 1 for branch-3.1. > Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1 > - > > Key: SPARK-38030 > URL: https://issues.apache.org/jira/browse/SPARK-38030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: Shardul Mahadik >Priority: Major > > One of our user queries failed in Spark 3.1.1 when using AQE with the > following stacktrace mentioned below (some parts of the plan have been > redacted, but the structure is preserved). > Debugging this issue, we found that the failure was within AQE calling > [QueryPlan.canonicalized|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L402]. > The query contains a cast over a column with non-nullable struct fields. > Canonicalization [removes nullability > information|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L45] > from the child {{AttributeReference}} of the Cast, however it does not > remove nullability information from the Cast's target dataType. 
This causes > the > [checkInputDataTypes|https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L290] > to return false because the child is now nullable and cast target data type > is not, leading to {{resolved=false}} and hence the {{UnresolvedException}}. > {code:java} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, > tree: > Exchange RoundRobinPartitioning(1), REPARTITION_BY_NUM, [id=#232] > +- Union >:- Project [cast(columnA#30) as struct<...>] >: +- BatchScan[columnA#30] hive.tbl >+- Project [cast(columnA#35) as struct<...>] > +- BatchScan[columnA#35] hive.tbl > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:475) > at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:464) > at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:87) > at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:58) > at > org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:405) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:373) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:372) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:404) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at 
scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:184) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772) > at
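The proposed fix, stripping nullability from the Cast's target type during canonicalization just as it is already stripped from the child attribute, can be modeled on a toy struct type. This is a Python illustration of the idea only, not Spark's Catalyst code:

```python
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class Field:
    name: str
    nullable: bool

@dataclass(frozen=True)
class StructType:
    fields: Tuple[Field, ...]

def as_nullable(dt: StructType) -> StructType:
    # Canonicalization discards nullability so that plans differing only
    # in nullability compare equal.
    return StructType(tuple(replace(f, nullable=True) for f in dt.fields))

# The bug in miniature: the Cast's child type is canonicalized (made
# nullable) but the Cast's target type is not, so the two no longer match.
child = as_nullable(StructType((Field("a", False),)))
target = StructType((Field("a", False),))
assert child != target               # mismatch -> resolved=false in Spark terms
assert child == as_nullable(target)  # canonicalizing both sides restores equality
```

Applying `as_nullable` to both sides is the analogue of the two PRs described in the comment: make canonicalization treat the target data type the same way it treats the child.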
[jira] [Resolved] (SPARK-37948) Disable mapreduce.fileoutputcommitter.algorithm.version=2 by default
[ https://issues.apache.org/jira/browse/SPARK-37948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37948. -- Resolution: Won't Fix > Disable mapreduce.fileoutputcommitter.algorithm.version=2 by default > > > Key: SPARK-37948 > URL: https://issues.apache.org/jira/browse/SPARK-37948 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: hujiahua >Priority: Major > > The Hadoop MR v2 commit algorithm has a correctness issue, described in > SPARK-33019, which changed the default to > spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1. > But some Spark users like me were unaware of this correctness issue > and had used the v2 commit algorithm in Spark 2.x for performance reasons. > After upgrading to Spark 3.x, we encountered this correctness issue in our > production environment, and it caused a very serious failure. The trigger > probability of this issue was higher in Spark 3.x, and I didn't delve into > the specific reasons. So I propose we disable > spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 by default: if > users use the v2 commit algorithm, fail the job and warn them about this > correctness issue. Alternatively, users could force v2 usage through a new > configuration.
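Since the issue was resolved as Won't Fix, users who prioritize correctness can pin the v1 committer themselves. The property name is the real Hadoop setting discussed in the issue; placing it in spark-defaults.conf is just one illustrative option (it can equally be passed via --conf):

```
# spark-defaults.conf (sketch)
# v1 performs the final output renames in the job-commit phase; it is slower
# than v2 but avoids the partial-result correctness issue from SPARK-33019.
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  1
```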
[jira] [Created] (SPARK-38030) Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
Shardul Mahadik created SPARK-38030: --- Summary: Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1 Key: SPARK-38030 URL: https://issues.apache.org/jira/browse/SPARK-38030 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1 Reporter: Shardul Mahadik One of our user queries failed in Spark 3.1.1 when using AQE with the following stacktrace mentioned below (some parts of the plan have been redacted, but the structure is preserved). Debugging this issue, we found that the failure was within AQE calling [QueryPlan.canonicalized|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L402]. The query contains a cast over a column with non-nullable struct fields. Canonicalization [removes nullability information|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L45] from the child {{AttributeReference}} of the Cast, however it does not remove nullability information from the Cast's target dataType. This causes the [checkInputDataTypes|https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L290] to return false because the child is now nullable and cast target data type is not, leading to {{resolved=false}} and hence the {{UnresolvedException}}. 
{code:java} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: Exchange RoundRobinPartitioning(1), REPARTITION_BY_NUM, [id=#232] +- Union :- Project [cast(columnA#30) as struct<...>] : +- BatchScan[columnA#30] hive.tbl +- Project [cast(columnA#35) as struct<...>] +- BatchScan[columnA#35] hive.tbl at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:475) at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:464) at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:87) at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:58) at org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:301) at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:405) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:373) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:372) at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:404) at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.immutable.List.foreach(List.scala:392) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.immutable.List.map(List.scala:298) at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447) at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at 
scala.collection.immutable.List.foreach(List.scala:392) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.immutable.List.map(List.scala:298) at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447) at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:184) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772) at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:179) at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:279) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3696) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2722) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at
[jira] [Assigned] (SPARK-33328) Fix Flaky HiveThriftHttpServerSuite
[ https://issues.apache.org/jira/browse/SPARK-33328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33328: Assignee: Apache Spark > Fix Flaky HiveThriftHttpServerSuite > --- > > Key: SPARK-33328 > URL: https://issues.apache.org/jira/browse/SPARK-33328 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > Attachments: failure_rate.png > > > After launching successfully `HiveThriftServer2 started successfully`, the > test fails due to 500 error. > The failure rate is over 50%. (This is an example of the test case `JDBC > query execution` in that suite) > !failure_rate.png|width=508,height=321! > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/1516/testReport/] > {code:java} > 09:58:03.853 pool-1-thread-1 INFO HiveThriftHttpServerSuite: Trying to start > HiveThriftServer2: port=14541, mode=http, attempt=0 > 09:58:06.492 pool-1-thread-1 INFO HiveThriftHttpServerSuite: COMMAND: > WrappedArray(../../sbin/start-thriftserver.sh, --master, local, --hiveconf, > javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-6f4abc35-f09c-46e6-b6eb-8a310d557e28;create=true, > --hiveconf, > hive.metastore.warehouse.dir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-2329343a-3ad4-4bfd-943f-6b46984848b8, > --hiveconf, hive.server2.thrift.bind.host=localhost, --hiveconf, > hive.server2.transport.mode=http, --hiveconf, > hive.server2.logging.operation.log.location=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-77f3b359-1553-40e3-9d75-35c46d2d4d46, > --hiveconf, > hive.exec.local.scratchdir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-8923e61f-36da-4930-b035-6eb3712d41ab, > 
--hiveconf, hive.server2.thrift.http.port=14541, --driver-class-path, > /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-d54a7073-2f02-4331-a84d-bbb3b50a47ac, > --driver-java-options, -Dlog4j.debug, --conf, spark.ui.enabled=false) > 09:58:06.492 pool-1-thread-1 INFO HiveThriftHttpServerSuite: OUTPUT: starting > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to > /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/logs/spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-06.out > 09:58:38.688 pool-1-thread-1 INFO HiveThriftHttpServerSuite: > HiveThriftServer2 started successfully > 09:58:38.689 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite INFO > HiveThriftHttpServerSuite: > = TEST OUTPUT FOR o.a.s.sql.hive.thriftserver.HiveThriftHttpServerSuite: > 'JDBC query execution' = > 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite INFO > Utils: Supplied authorities: localhost:14541 > 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN > Utils: * JDBC param deprecation * > 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN > Utils: The use of hive.server2.transport.mode is deprecated. > 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN > Utils: Please use transportMode like so: > jdbc:hive2://:/dbName;transportMode= > 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN > Utils: * JDBC param deprecation * > 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN > Utils: The use of hive.server2.thrift.http.path is deprecated. 
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN > Utils: Please use httpPath like so: > jdbc:hive2://:/dbName;httpPath= > 09:58:38.692 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite INFO > Utils: Resolved authority: localhost:14541 > 09:58:38.818 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite > DEBUG RequestAddCookies: CookieSpec selected: default > 09:58:38.830 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite > DEBUG RequestAuthCache: Auth cache not set in the context > 09:58:38.832 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite > DEBUG PoolingHttpClientConnectionManager: Connection request: [route: > {}->http://localhost:14541][total available: 0; route allocated: 0 of 2; > total allocated: 0 of 20] > 09:58:38.846 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite > DEBUG PoolingHttpClientConnectionManager: Connection leased: [id: 0][route: > {}->http://localhost:14541][total available: 0; route allocated: 1 of 2; > total
[jira] [Commented] (SPARK-33328) Fix Flaky HiveThriftHttpServerSuite
[ https://issues.apache.org/jira/browse/SPARK-33328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482215#comment-17482215 ] Apache Spark commented on SPARK-33328: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/35329 > Fix Flaky HiveThriftHttpServerSuite > --- > > Key: SPARK-33328 > URL: https://issues.apache.org/jira/browse/SPARK-33328 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > Attachments: failure_rate.png > > > After launching successfully `HiveThriftServer2 started successfully`, the > test fails due to 500 error. > The failure rate is over 50%. (This is an example of the test case `JDBC > query execution` in that suite) > !failure_rate.png|width=508,height=321! > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/1516/testReport/] > {code:java} > 09:58:03.853 pool-1-thread-1 INFO HiveThriftHttpServerSuite: Trying to start > HiveThriftServer2: port=14541, mode=http, attempt=0 > 09:58:06.492 pool-1-thread-1 INFO HiveThriftHttpServerSuite: COMMAND: > WrappedArray(../../sbin/start-thriftserver.sh, --master, local, --hiveconf, > javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-6f4abc35-f09c-46e6-b6eb-8a310d557e28;create=true, > --hiveconf, > hive.metastore.warehouse.dir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-2329343a-3ad4-4bfd-943f-6b46984848b8, > --hiveconf, hive.server2.thrift.bind.host=localhost, --hiveconf, > hive.server2.transport.mode=http, --hiveconf, > hive.server2.logging.operation.log.location=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-77f3b359-1553-40e3-9d75-35c46d2d4d46, > --hiveconf, > 
hive.exec.local.scratchdir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-8923e61f-36da-4930-b035-6eb3712d41ab, > --hiveconf, hive.server2.thrift.http.port=14541, --driver-class-path, > /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-d54a7073-2f02-4331-a84d-bbb3b50a47ac, > --driver-java-options, -Dlog4j.debug, --conf, spark.ui.enabled=false) > 09:58:06.492 pool-1-thread-1 INFO HiveThriftHttpServerSuite: OUTPUT: starting > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to > /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/logs/spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-06.out > 09:58:38.688 pool-1-thread-1 INFO HiveThriftHttpServerSuite: > HiveThriftServer2 started successfully > 09:58:38.689 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite INFO > HiveThriftHttpServerSuite: > = TEST OUTPUT FOR o.a.s.sql.hive.thriftserver.HiveThriftHttpServerSuite: > 'JDBC query execution' = > 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite INFO > Utils: Supplied authorities: localhost:14541 > 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN > Utils: * JDBC param deprecation * > 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN > Utils: The use of hive.server2.transport.mode is deprecated. > 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN > Utils: Please use transportMode like so: > jdbc:hive2://:/dbName;transportMode= > 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN > Utils: * JDBC param deprecation * > 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN > Utils: The use of hive.server2.thrift.http.path is deprecated. 
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN > Utils: Please use httpPath like so: > jdbc:hive2://:/dbName;httpPath= > 09:58:38.692 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite INFO > Utils: Resolved authority: localhost:14541 > 09:58:38.818 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite > DEBUG RequestAddCookies: CookieSpec selected: default > 09:58:38.830 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite > DEBUG RequestAuthCache: Auth cache not set in the context > 09:58:38.832 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite > DEBUG PoolingHttpClientConnectionManager: Connection request: [route: > {}->http://localhost:14541][total available: 0; route allocated: 0 of 2; > total allocated: 0 of 20] > 09:58:38.846 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite > DEBUG PoolingHttpClientConnectionManager: Connection leased: [id: 0][route: >
[jira] [Assigned] (SPARK-33328) Fix Flaky HiveThriftHttpServerSuite
[ https://issues.apache.org/jira/browse/SPARK-33328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33328: Assignee: (was: Apache Spark) > Fix Flaky HiveThriftHttpServerSuite > --- > > Key: SPARK-33328 > URL: https://issues.apache.org/jira/browse/SPARK-33328 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > Attachments: failure_rate.png > > > After launching successfully `HiveThriftServer2 started successfully`, the > test fails due to 500 error. > The failure rate is over 50%. (This is an example of the test case `JDBC > query execution` in that suite) > !failure_rate.png|width=508,height=321! > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/1516/testReport/] > {code:java} > 09:58:03.853 pool-1-thread-1 INFO HiveThriftHttpServerSuite: Trying to start > HiveThriftServer2: port=14541, mode=http, attempt=0 > 09:58:06.492 pool-1-thread-1 INFO HiveThriftHttpServerSuite: COMMAND: > WrappedArray(../../sbin/start-thriftserver.sh, --master, local, --hiveconf, > javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-6f4abc35-f09c-46e6-b6eb-8a310d557e28;create=true, > --hiveconf, > hive.metastore.warehouse.dir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-2329343a-3ad4-4bfd-943f-6b46984848b8, > --hiveconf, hive.server2.thrift.bind.host=localhost, --hiveconf, > hive.server2.transport.mode=http, --hiveconf, > hive.server2.logging.operation.log.location=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-77f3b359-1553-40e3-9d75-35c46d2d4d46, > --hiveconf, > hive.exec.local.scratchdir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-8923e61f-36da-4930-b035-6eb3712d41ab, > --hiveconf, 
hive.server2.thrift.http.port=14541, --driver-class-path, > /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-d54a7073-2f02-4331-a84d-bbb3b50a47ac, > --driver-java-options, -Dlog4j.debug, --conf, spark.ui.enabled=false) > 09:58:06.492 pool-1-thread-1 INFO HiveThriftHttpServerSuite: OUTPUT: starting > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to > /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/logs/spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-06.out > 09:58:38.688 pool-1-thread-1 INFO HiveThriftHttpServerSuite: > HiveThriftServer2 started successfully > 09:58:38.689 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite INFO > HiveThriftHttpServerSuite: > = TEST OUTPUT FOR o.a.s.sql.hive.thriftserver.HiveThriftHttpServerSuite: > 'JDBC query execution' = > 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite INFO > Utils: Supplied authorities: localhost:14541 > 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN > Utils: * JDBC param deprecation * > 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN > Utils: The use of hive.server2.transport.mode is deprecated. > 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN > Utils: Please use transportMode like so: > jdbc:hive2://:/dbName;transportMode= > 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN > Utils: * JDBC param deprecation * > 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN > Utils: The use of hive.server2.thrift.http.path is deprecated. 
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN > Utils: Please use httpPath like so: > jdbc:hive2://:/dbName;httpPath= > 09:58:38.692 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite INFO > Utils: Resolved authority: localhost:14541 > 09:58:38.818 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite > DEBUG RequestAddCookies: CookieSpec selected: default > 09:58:38.830 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite > DEBUG RequestAuthCache: Auth cache not set in the context > 09:58:38.832 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite > DEBUG PoolingHttpClientConnectionManager: Connection request: [route: > {}->http://localhost:14541][total available: 0; route allocated: 0 of 2; > total allocated: 0 of 20] > 09:58:38.846 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite > DEBUG PoolingHttpClientConnectionManager: Connection leased: [id: 0][route: > {}->http://localhost:14541][total available: 0; route allocated: 1 of 2; > total allocated: 1 of 20] >
[jira] [Resolved] (SPARK-38013) AQE can change bhj to smj if no extra shuffle introduce
[ https://issues.apache.org/jira/browse/SPARK-38013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-38013. --- Resolution: Won't Fix > AQE can change bhj to smj if no extra shuffle introduce > --- > > Key: SPARK-38013 > URL: https://issues.apache.org/jira/browse/SPARK-38013 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > An example to reproduce the bug. > {code:java} > create table t1 as select 1 c1, 2 c2; > create table t2 as select 1 c1, 2 c2; > create table t3 as select 1 c1, 2 c2; > set spark.sql.adaptive.autoBroadcastJoinThreshold=-1; > select /*+ merge(t3) */ * from t1 > left join ( > select c1 as c from t3 > ) t3 on t1.c1 = t3.c > left join ( > select /*+ repartition(c1) */ c1 from t2 > ) t2 on t1.c1 = t2.c1; > {code} > The key to reproducing this bug is that a BHJ converts to SMJ/SHJ without > introducing an extra shuffle, and AQE does not think the join can be planned as > BHJ. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38013) AQE can change bhj to smj if no extra shuffle introduce
[ https://issues.apache.org/jira/browse/SPARK-38013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482209#comment-17482209 ] XiDuo You commented on SPARK-38013: --- Seems this is allowed in AQE, so it is not a bug after all. > AQE can change bhj to smj if no extra shuffle introduce > --- > > Key: SPARK-38013 > URL: https://issues.apache.org/jira/browse/SPARK-38013 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > An example to reproduce the bug. > {code:java} > create table t1 as select 1 c1, 2 c2; > create table t2 as select 1 c1, 2 c2; > create table t3 as select 1 c1, 2 c2; > set spark.sql.adaptive.autoBroadcastJoinThreshold=-1; > select /*+ merge(t3) */ * from t1 > left join ( > select c1 as c from t3 > ) t3 on t1.c1 = t3.c > left join ( > select /*+ repartition(c1) */ c1 from t2 > ) t2 on t1.c1 = t2.c1; > {code} > The key to reproducing this bug is that a BHJ converts to SMJ/SHJ without > introducing an extra shuffle, and AQE does not think the join can be planned as > BHJ.
[jira] [Updated] (SPARK-38013) AQE can change bhj to smj if no extra shuffle introduce
[ https://issues.apache.org/jira/browse/SPARK-38013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-38013: -- Issue Type: Task (was: Bug) > AQE can change bhj to smj if no extra shuffle introduce > --- > > Key: SPARK-38013 > URL: https://issues.apache.org/jira/browse/SPARK-38013 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > An example to reproduce the bug. > {code:java} > create table t1 as select 1 c1, 2 c2; > create table t2 as select 1 c1, 2 c2; > create table t3 as select 1 c1, 2 c2; > set spark.sql.adaptive.autoBroadcastJoinThreshold=-1; > select /*+ merge(t3) */ * from t1 > left join ( > select c1 as c from t3 > ) t3 on t1.c1 = t3.c > left join ( > select /*+ repartition(c1) */ c1 from t2 > ) t2 on t1.c1 = t2.c1; > {code} > The key to reproducing this bug is that a BHJ converts to SMJ/SHJ without > introducing an extra shuffle, and AQE does not think the join can be planned as > BHJ.
[jira] [Resolved] (SPARK-30062) bug with DB2Driver using mode("overwrite") option("truncate",True)
[ https://issues.apache.org/jira/browse/SPARK-30062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao resolved SPARK-30062. Fix Version/s: 3.2.2 3.3 Resolution: Fixed > bug with DB2Driver using mode("overwrite") option("truncate",True) > -- > > Key: SPARK-30062 > URL: https://issues.apache.org/jira/browse/SPARK-30062 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Guy Huinen >Priority: Major > Labels: db2, pyspark > Fix For: 3.2.2, 3.3 > > > Using DB2Driver with mode("overwrite") and option("truncate", True) gives an SQL > error. > > {code:java} > dfClient.write\ > .format("jdbc")\ > .mode("overwrite")\ > .option('driver', 'com.ibm.db2.jcc.DB2Driver')\ > .option("url","jdbc:db2://")\ > .option("user","xxx")\ > .option("password","")\ > .option("dbtable","")\ > .option("truncate",True)\{code} > > This gives the error below. > In summary, I believe the semicolon is misplaced or malformed: > > {code:java} > EXPO.EXPO#CMR_STG;IMMEDIATE{code} > > > full error > {code:java} > An error occurred while calling o47.save. 
: > com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-104, > SQLSTATE=42601, SQLERRMC=END-OF-STATEMENT;LE EXPO.EXPO#CMR_STG;IMMEDIATE, > DRIVER=4.19.77 at com.ibm.db2.jcc.am.b4.a(b4.java:747) at > com.ibm.db2.jcc.am.b4.a(b4.java:66) at com.ibm.db2.jcc.am.b4.a(b4.java:135) > at com.ibm.db2.jcc.am.kh.c(kh.java:2788) at > com.ibm.db2.jcc.am.kh.d(kh.java:2776) at > com.ibm.db2.jcc.am.kh.b(kh.java:2143) at com.ibm.db2.jcc.t4.ab.i(ab.java:226) > at com.ibm.db2.jcc.t4.ab.c(ab.java:48) at com.ibm.db2.jcc.t4.p.b(p.java:38) > at com.ibm.db2.jcc.t4.av.h(av.java:124) at > com.ibm.db2.jcc.am.kh.ak(kh.java:2138) at > com.ibm.db2.jcc.am.kh.a(kh.java:3325) at com.ibm.db2.jcc.am.kh.c(kh.java:765) > at com.ibm.db2.jcc.am.kh.executeUpdate(kh.java:744) at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.truncateTable(JdbcUtils.scala:113) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:56) > at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at > py4j.Gateway.invoke(Gateway.java:282) at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at > py4j.commands.CallCommand.execute(CallCommand.java:79) at > py4j.GatewayConnection.run(GatewayConnection.java:238) at >
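The stack trace points at JdbcUtils.truncateTable, and the `EXPO.EXPO#CMR_STG;IMMEDIATE` fragment suggests the Db2-specific `IMMEDIATE` keyword ended up after a statement-terminating semicolon. A minimal Python sketch of the dialect-aware statement a fix needs to emit (the helper name and dialect handling here are illustrative, not Spark's actual JdbcDialect API):

```python
def truncate_query(table: str, dialect: str) -> str:
    # Db2 requires TRUNCATE TABLE <name> IMMEDIATE as one statement; appending
    # IMMEDIATE after a semicolon (as the SQLCODE=-104 error above suggests
    # happened) is a syntax error.
    if dialect == "db2":
        return f"TRUNCATE TABLE {table} IMMEDIATE"
    return f"TRUNCATE TABLE {table}"

# truncate_query("EXPO.EXPO#CMR_STG", "db2")
# -> "TRUNCATE TABLE EXPO.EXPO#CMR_STG IMMEDIATE"
```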
[jira] [Commented] (SPARK-37858) Throw Spark exceptions from AES functions
[ https://issues.apache.org/jira/browse/SPARK-37858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482197#comment-17482197 ] Apache Spark commented on SPARK-37858: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/35328 > Throw Spark exceptions from AES functions > - > > Key: SPARK-37858 > URL: https://issues.apache.org/jira/browse/SPARK-37858 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > > Currently, Spark SQL can throw Java exceptions from the > aes_encrypt()/aes_decrypt() functions, for instance: > {code:java} > java.lang.RuntimeException: javax.crypto.AEADBadTagException: Tag mismatch! > at > org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils.aesInternal(ExpressionImplUtils.java:93) > at > org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils.aesDecrypt(ExpressionImplUtils.java:43) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:354) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:136) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: javax.crypto.AEADBadTagException: Tag mismatch! > at > com.sun.crypto.provider.GaloisCounterMode.decryptFinal(GaloisCounterMode.java:620) > at > com.sun.crypto.provider.CipherCore.finalNoPadding(CipherCore.java:1116) > at > com.sun.crypto.provider.CipherCore.fillOutputBuffer(CipherCore.java:1053) > at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:853) > at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:446) > at javax.crypto.Cipher.doFinal(Cipher.java:2226) > at > org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils.aesInternal(ExpressionImplUtils.java:87) > ... 19 more > {code} > That might confuse non-Scala/Java users. Need to wrap such kind of exception > by Spark's exception. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
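The wrapping the ticket calls for is a standard re-raise pattern: catch the low-level crypto error and surface a Spark-branded error with an actionable message, keeping the original chained as the cause. A hedged Python sketch of the pattern (the class and message names are illustrative, not Spark's actual error framework):

```python
class SparkRuntimeException(Exception):
    """Stand-in for a user-facing Spark exception type."""

def _aes_internal(ciphertext: bytes, key: bytes) -> bytes:
    # Simulate the low-level failure mode from the stack trace above.
    raise ValueError("Tag mismatch!")  # stand-in for javax.crypto.AEADBadTagException

def aes_decrypt(ciphertext: bytes, key: bytes) -> bytes:
    try:
        return _aes_internal(ciphertext, key)
    except ValueError as exc:
        # Re-raise as a Spark exception so non-JVM users see a message pointing
        # at the likely cause instead of a raw Java stack trace; the original
        # exception stays chained for debugging.
        raise SparkRuntimeException(
            "aes_decrypt failed: ciphertext does not match the given key/mode"
        ) from exc
```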
[jira] [Commented] (SPARK-37858) Throw Spark exceptions from AES functions
[ https://issues.apache.org/jira/browse/SPARK-37858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482196#comment-17482196 ] Apache Spark commented on SPARK-37858: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/35328 > Throw Spark exceptions from AES functions > - > > Key: SPARK-37858 > URL: https://issues.apache.org/jira/browse/SPARK-37858 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > > Currently, Spark SQL can throw Java exceptions from the > aes_encrypt()/aes_decrypt() functions, for instance: > {code:java} > java.lang.RuntimeException: javax.crypto.AEADBadTagException: Tag mismatch! > at > org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils.aesInternal(ExpressionImplUtils.java:93) > at > org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils.aesDecrypt(ExpressionImplUtils.java:43) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:354) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:136) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: javax.crypto.AEADBadTagException: Tag mismatch! > at > com.sun.crypto.provider.GaloisCounterMode.decryptFinal(GaloisCounterMode.java:620) > at > com.sun.crypto.provider.CipherCore.finalNoPadding(CipherCore.java:1116) > at > com.sun.crypto.provider.CipherCore.fillOutputBuffer(CipherCore.java:1053) > at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:853) > at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:446) > at javax.crypto.Cipher.doFinal(Cipher.java:2226) > at > org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils.aesInternal(ExpressionImplUtils.java:87) > ... 19 more > {code} > That might confuse non-Scala/Java users. Need to wrap such kind of exception > by Spark's exception. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38013) AQE can change bhj to smj if no extra shuffle introduce
[ https://issues.apache.org/jira/browse/SPARK-38013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-38013: -- Summary: AQE can change bhj to smj if no extra shuffle introduce (was: Fix AQE can change bhj to smj if no extra shuffle introduce) > AQE can change bhj to smj if no extra shuffle introduce > --- > > Key: SPARK-38013 > URL: https://issues.apache.org/jira/browse/SPARK-38013 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > An example to reproduce the bug. > {code:java} > create table t1 as select 1 c1, 2 c2; > create table t2 as select 1 c1, 2 c2; > create table t3 as select 1 c1, 2 c2; > set spark.sql.adaptive.autoBroadcastJoinThreshold=-1; > select /*+ merge(t3) */ * from t1 > left join ( > select c1 as c from t3 > ) t3 on t1.c1 = t3.c > left join ( > select /*+ repartition(c1) */ c1 from t2 > ) t2 on t1.c1 = t2.c1; > {code} > The key to reproducing this bug is that a BHJ converts to SMJ/SHJ without > introducing an extra shuffle, and AQE does not think the join can be planned as > BHJ.
[jira] [Commented] (SPARK-37995) TPCDS 1TB q72 fails when spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly is false
[ https://issues.apache.org/jira/browse/SPARK-37995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482194#comment-17482194 ] Hyukjin Kwon commented on SPARK-37995: -- cc [~maryannxue] FYI > TPCDS 1TB q72 fails when > spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly is false > > > Key: SPARK-37995 > URL: https://issues.apache.org/jira/browse/SPARK-37995 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Kapil Singh >Priority: Major > Attachments: full-stacktrace.txt > > > TPCDS 1TB q72 fails in 3.2 Spark when > spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly is false. We > have been running with this config in 3.1 as well and it worked fine in that > version. This used to add a subquery dpp in q72. > Relevant stack trace > {code:java} > rror: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to > org.apache.spark.sql.execution.SparkPlan at > scala.collection.immutable.List.map(List.scala:293) at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:75) > at > org.apache.spark.sql.execution.SparkPlanInfo$.$anonfun$fromSparkPlan$3(SparkPlanInfo.scala:75) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > > > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:75) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.onUpdatePlan(AdaptiveSparkPlanExec.scala:708) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$2(AdaptiveSparkPlanExec.scala:239) > at scala.runtime.java8.JFunction1$mcVJ$sp.apply(JFunction1$mcVJ$sp.java:23) > at scala.Option.foreach(Option.scala:407) at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:239) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) 
at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:226) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:365) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37996) Contribution guide is stale
[ https://issues.apache.org/jira/browse/SPARK-37996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482193#comment-17482193 ] Hyukjin Kwon commented on SPARK-37996: -- Hm, yeah. I think now we always run the tests by default once you push a commit to your forked repo, so we won't need it anymore. Interested in submitting a PR? We should fix it in https://github.com/apache/spark-website > Contribution guide is stale > --- > > Key: SPARK-37996 > URL: https://issues.apache.org/jira/browse/SPARK-37996 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.2.0 >Reporter: Khalid Mammadov >Priority: Minor > > The contribution guide mentions the link below for testing on a local repo before > raising a PR, but the process has changed and the documentation does not reflect it. > https://spark.apache.org/developer-tools.html#github-workflow-tests > Only by digging into the git log of " > [.github/workflows/build_and_test.yml|https://github.com/apache/spark/commit/2974b70d1efd4b1c5cfe7e2467766f0a9a1fec82#diff-48c0ee97c53013d18d6bbae44648f7fab9af2e0bf5b0dc1ca761e18ec5c478f2]; > did I manage to find what the new process is. It was changed in > [https://github.com/apache/spark/pull/32092] but the documentation was not > updated. > I am happy to contribute a fix, but apparently > [https://spark.apache.org/developer-tools.html] is hosted on the Apache website > rather than in the Spark source code.
[jira] [Commented] (SPARK-37997) Allow query parameters to be passed into spark.read
[ https://issues.apache.org/jira/browse/SPARK-37997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482192#comment-17482192 ] Hyukjin Kwon commented on SPARK-37997: -- Can we just format it before passing to spark.read? > Allow query parameters to be passed into spark.read > --- > > Key: SPARK-37997 > URL: https://issues.apache.org/jira/browse/SPARK-37997 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: QFW >Priority: Minor > > This ticket is to allow query parameters to be used with spark.read. > While it is possible to inject some parameters into the query using string > concatenation, this doesn't work for all data types, for example binaries. In > this example, the parameter rowversion is a binary which needs to be passed > into the sql query. > {code:java} > _select_sql = f'SELECT * FROM dbo.Table WHERE RowVersion > {rowversion}' > df = spark.read.format("jdbc") \ > .option("url", > "jdbc:sqlserver://databaseserver.database.windows.net;databaseName=databasename") > \ > .option("query", _select_sql) \ > .option("username", "sql_username") \ > .option("password", "sql_password") \ > .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \ > .load() {code} > This results in the query looking like this: > {code:java} > SELECT * FROM dbo.Address WHERE RowVersion > > bytearray(b'\x00\x00\x00\x00\x02\xdf=\xf5') {code} > As far as I know, there is no way to do this currently.
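One concrete way to do the formatting the comment suggests, for the binary rowversion case: render the bytes as a literal the target database parses natively, then build the query string. A sketch for SQL Server's 0x-hex binary literal syntax (the helper function is ours, for illustration, not a Spark or JDBC API):

```python
def sqlserver_binary_literal(value: bytes) -> str:
    # SQL Server parses 0x-prefixed hex as a binary constant, so a rowversion
    # held as Python bytes can be inlined into the query text.
    return "0x" + value.hex().upper()

rowversion = bytes.fromhex("0000000002df3df5")
select_sql = (
    "SELECT * FROM dbo.Table WHERE RowVersion > "
    + sqlserver_binary_literal(rowversion)
)
# select_sql is now a plain string that can be passed via .option("query", select_sql)
```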
[jira] [Resolved] (SPARK-38000) Sort node incorrectly removed from the optimized logical plan
[ https://issues.apache.org/jira/browse/SPARK-38000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38000. -- Resolution: Cannot Reproduce > Sort node incorrectly removed from the optimized logical plan > - > > Key: SPARK-38000 > URL: https://issues.apache.org/jira/browse/SPARK-38000 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 > Environment: Tested on: > Ubuntu 18.04.2 LTS > OpenJDK 1.8.0_312 64-Bit Server VM (build 25.312-b07, mixed mode) >Reporter: Antoine Wendlinger >Priority: Major > Labels: correctness > > When using a fairly involved combination of joins, windows, cache and > orderBy, the sorting phase disappears from the optimized logical plan and the > resulting dataframe is not sorted. > You can find a reproduction of the bug in > [https://github.com/antoinewdg/spark-bug-report]. > Use {{sbt run}} to get the results. > The bug is very niche; I chose to report it because it looks like a > correctness issue, and may be a symptom of a larger one. > The bug affects only 3.2.0; tests on 3.1.2 show the result correctly sorted. > As far as I could test it, all steps in the reproduction are necessary for > the bug to happen: > * the join with an empty dataframe > * the distinct call on the empty dataframe > * the window function > * the cache after the order by > h2. Code > > {code:scala} > val players = (10 to 20).map(x => Player(id = x.toString)).toDS > val blacklist = sparkSession > .emptyDataset[BlacklistEntry] > .distinct() > val result = players > .join(blacklist, Seq("id"), "left_outer") > .withColumn("rank", > row_number().over(Window.partitionBy("id").orderBy("id"))) > .orderBy("id") > .cache() > result.show() > result.explain(true) > {code} > > h2. 
Output > > {code:java} > +---++ > | id|rank| > +---++ > | 15| 1| > | 11| 1| > | 16| 1| > | 18| 1| > | 17| 1| > | 19| 1| > | 20| 1| > | 10| 1| > | 12| 1| > | 13| 1| > | 14| 1| > +---++ > == Parsed Logical Plan == > 'Sort ['id ASC NULLS FIRST], true > +- Project [id#1, rank#10] >+- Project [id#1, rank#10, rank#10] > +- Window [row_number() windowspecdefinition(id#1, id#1 ASC NULLS > FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS rank#10], [id#1], [id#1 ASC NULLS FIRST] > +- Project [id#1] > +- Project [id#1] >+- Join LeftOuter, (id#1 = id#5) > :- LocalRelation [id#1] > +- Deduplicate [id#5] > +- LocalRelation , [id#5] > == Analyzed Logical Plan == > id: string, rank: int > Sort [id#1 ASC NULLS FIRST], true > +- Project [id#1, rank#10] >+- Project [id#1, rank#10, rank#10] > +- Window [row_number() windowspecdefinition(id#1, id#1 ASC NULLS > FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS rank#10], [id#1], [id#1 ASC NULLS FIRST] > +- Project [id#1] > +- Project [id#1] >+- Join LeftOuter, (id#1 = id#5) > :- LocalRelation [id#1] > +- Deduplicate [id#5] > +- LocalRelation , [id#5] > == Optimized Logical Plan == > InMemoryRelation [id#1, rank#10], StorageLevel(disk, memory, deserialized, 1 > replicas) >+- Window [row_number() windowspecdefinition(id#1, id#1 ASC NULLS FIRST, > specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS > rank#10], [id#1], [id#1 ASC NULLS FIRST] > +- *(1) Sort [id#1 ASC NULLS FIRST, id#1 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#1, 200), ENSURE_REQUIREMENTS, [id=#7] > +- LocalTableScan [id#1] > == Physical Plan == > InMemoryTableScan [id#1, rank#10] >+- InMemoryRelation [id#1, rank#10], StorageLevel(disk, memory, > deserialized, 1 replicas) > +- Window [row_number() windowspecdefinition(id#1, id#1 ASC NULLS > FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS rank#10], [id#1], [id#1 ASC NULLS FIRST] > +- *(1) 
Sort [id#1 ASC NULLS FIRST, id#1 ASC NULLS FIRST], false, > 0 >+- Exchange hashpartitioning(id#1, 200), ENSURE_REQUIREMENTS, > [id=#7] > +- LocalTableScan [id#1] > {code}
[jira] [Resolved] (SPARK-38028) Expose Arrow Vector from ArrowColumnVector
[ https://issues.apache.org/jira/browse/SPARK-38028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38028. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35326 [https://github.com/apache/spark/pull/35326] > Expose Arrow Vector from ArrowColumnVector > -- > > Key: SPARK-38028 > URL: https://issues.apache.org/jira/browse/SPARK-38028 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 3.3.0 > > > In some cases we need to work with Arrow Vectors behind ColumnVector using > Arrow APIs. For example, some Spark extension libraries need to consume Arrow > Vectors. For now, it is impossible as the Arrow Vector is private member in > ArrowColumnVector. We need to expose the Arrow Vector from ArrowColumnVector. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37980) Extend METADATA column to support row indices for file based data sources
[ https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482188#comment-17482188 ] Wenchen Fan commented on SPARK-37980: - I think it's possible for the parquet data sources because Spark uses very low-level Parquet APIs and we can do many customizations. > Extend METADATA column to support row indices for file based data sources > - > > Key: SPARK-37980 > URL: https://issues.apache.org/jira/browse/SPARK-37980 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3 >Reporter: Prakhar Jain >Priority: Major > > Spark recently added hidden metadata column support for File based > datasources as part of SPARK-37273. > We should extend it to support ROW_INDEX/ROW_POSITION also. > > Meaning of ROW_POSITION: > ROW_INDEX/ROW_POSITION is basically an index of a row within a file. E.g. 5th > row in the file will have ROW_INDEX 5. > > Use cases: > Row Indexes can be used in a variety of ways. A (fileName, rowIndex) tuple > uniquely identifies row in a table. This information can be used to mark rows > e.g. this can be used by indexer etc. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
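The proposed semantics can be pinned down with a small sketch: the n-th row of a file gets ROW_INDEX n, so the pair (file_name, row_index) is a unique row key. Plain Python for illustration only, not the Spark metadata-column API:

```python
def assign_row_indices(rows_by_file):
    # ROW_INDEX is a row's position within its file; combined with the file
    # name it uniquely identifies a row in the table.
    indexed = []
    for file_name, rows in rows_by_file.items():
        for row_index, row in enumerate(rows):
            indexed.append((file_name, row_index, row))
    return indexed

rows = assign_row_indices({"part-0.parquet": ["a", "b"], "part-1.parquet": ["c"]})
# (file_name, row_index) keys are distinct even though row indices repeat per file
```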
[jira] [Comment Edited] (SPARK-37980) Extend METADATA column to support row indices for file based data sources
[ https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482142#comment-17482142 ] Prakhar Jain edited comment on SPARK-37980 at 1/26/22, 1:58 AM: Yeah - this will need implementation for the underlying file format e.g. parquet/orc. We can start with parquet first and extend it to other formats after that. [~cloud_fan] Is it possible to add the support for parquet directly via Spark codebase? Will this need changes in parquet-mr? was (Author: prakharjain09): Yes - this needs implementation in the underlying datasources such as parquet/orc. Also Spark uses the underlying ParquetRecordReader from parquet-mr to read a parquet file. All the row group skipping/column index filtering happens as part of parquet-mr. So I guess this will need the row index support from parquet-mr. The other way is to replicate some of the parquet-mr RecordReader code in Spark - which is not ideal. > Extend METADATA column to support row indices for file based data sources > - > > Key: SPARK-37980 > URL: https://issues.apache.org/jira/browse/SPARK-37980 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3 >Reporter: Prakhar Jain >Priority: Major > > Spark recently added hidden metadata column support for File based > datasources as part of SPARK-37273. > We should extend it to support ROW_INDEX/ROW_POSITION also. > > Meaning of ROW_POSITION: > ROW_INDEX/ROW_POSITION is basically an index of a row within a file. E.g. 5th > row in the file will have ROW_INDEX 5. > > Use cases: > Row Indexes can be used in a variety of ways. A (fileName, rowIndex) tuple > uniquely identifies row in a table. This information can be used to mark rows > e.g. this can be used by indexer etc. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
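The (fileName, rowIndex) idea discussed above can be illustrated outside Spark. This is a minimal pure-Python sketch, not Spark code: the `file_name`/`row_index` field names follow the ticket's terminology, which at this point was only a proposal.

```python
# Illustrative sketch of the proposed ROW_INDEX/ROW_POSITION metadata field:
# a per-file row index, so that (file_name, row_index) uniquely identifies a
# row in a table even though row_index alone repeats across files.
# The in-memory dict below is a stand-in for a set of Parquet files.

def with_row_index(files):
    """Attach a per-file row index to every row."""
    out = []
    for file_name, rows in files.items():
        for row_index, row in enumerate(rows):
            out.append({"value": row, "file_name": file_name, "row_index": row_index})
    return out

files = {
    "part-0.parquet": ["a", "b", "c"],
    "part-1.parquet": ["d", "e"],
}
indexed = with_row_index(files)

# row_index alone repeats (0 appears in both files), but the
# (file_name, row_index) pair is unique across the whole "table".
keys = [(r["file_name"], r["row_index"]) for r in indexed]
assert len(keys) == len(set(keys))
```

This is the property the "use cases" section relies on: an indexer can mark individual rows by that pair without any table-level key.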
[jira] [Assigned] (SPARK-38029) Support K8S integration test in SBT
[ https://issues.apache.org/jira/browse/SPARK-38029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38029: Assignee: (was: Apache Spark) > Support K8S integration test in SBT > --- > > Key: SPARK-38029 > URL: https://issues.apache.org/jira/browse/SPARK-38029 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38029) Support K8S integration test in SBT
[ https://issues.apache.org/jira/browse/SPARK-38029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482178#comment-17482178 ] Apache Spark commented on SPARK-38029: -- User 'williamhyun' has created a pull request for this issue: https://github.com/apache/spark/pull/35327 > Support K8S integration test in SBT > --- > > Key: SPARK-38029 > URL: https://issues.apache.org/jira/browse/SPARK-38029 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38029) Support K8S integration test in SBT
[ https://issues.apache.org/jira/browse/SPARK-38029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38029: Assignee: Apache Spark > Support K8S integration test in SBT > --- > > Key: SPARK-38029 > URL: https://issues.apache.org/jira/browse/SPARK-38029 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: William Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38028) Expose Arrow Vector from ArrowColumnVector
[ https://issues.apache.org/jira/browse/SPARK-38028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38028: Assignee: Apache Spark (was: L. C. Hsieh) > Expose Arrow Vector from ArrowColumnVector > -- > > Key: SPARK-38028 > URL: https://issues.apache.org/jira/browse/SPARK-38028 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Minor > > In some cases we need to work with Arrow Vectors behind ColumnVector using > Arrow APIs. For example, some Spark extension libraries need to consume Arrow > Vectors. For now, it is impossible as the Arrow Vector is private member in > ArrowColumnVector. We need to expose the Arrow Vector from ArrowColumnVector. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38028) Expose Arrow Vector from ArrowColumnVector
[ https://issues.apache.org/jira/browse/SPARK-38028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38028: Assignee: L. C. Hsieh (was: Apache Spark) > Expose Arrow Vector from ArrowColumnVector > -- > > Key: SPARK-38028 > URL: https://issues.apache.org/jira/browse/SPARK-38028 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Minor > > In some cases we need to work with Arrow Vectors behind ColumnVector using > Arrow APIs. For example, some Spark extension libraries need to consume Arrow > Vectors. For now, it is impossible as the Arrow Vector is private member in > ArrowColumnVector. We need to expose the Arrow Vector from ArrowColumnVector. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38028) Expose Arrow Vector from ArrowColumnVector
[ https://issues.apache.org/jira/browse/SPARK-38028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482176#comment-17482176 ] Apache Spark commented on SPARK-38028: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/35326 > Expose Arrow Vector from ArrowColumnVector > -- > > Key: SPARK-38028 > URL: https://issues.apache.org/jira/browse/SPARK-38028 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Minor > > In some cases we need to work with Arrow Vectors behind ColumnVector using > Arrow APIs. For example, some Spark extension libraries need to consume Arrow > Vectors. For now, it is impossible as the Arrow Vector is private member in > ArrowColumnVector. We need to expose the Arrow Vector from ArrowColumnVector. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38029) Support K8S integration test in SBT
William Hyun created SPARK-38029: Summary: Support K8S integration test in SBT Key: SPARK-38029 URL: https://issues.apache.org/jira/browse/SPARK-38029 Project: Spark Issue Type: Test Components: Kubernetes, Tests Affects Versions: 3.3.0 Reporter: William Hyun -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38028) Expose Arrow Vector from ArrowColumnVector
[ https://issues.apache.org/jira/browse/SPARK-38028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-38028: --- Assignee: L. C. Hsieh > Expose Arrow Vector from ArrowColumnVector > -- > > Key: SPARK-38028 > URL: https://issues.apache.org/jira/browse/SPARK-38028 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Minor > > In some cases we need to work with Arrow Vectors behind ColumnVector using > Arrow APIs. For example, some Spark extension libraries need to consume Arrow > Vectors. For now, it is impossible as the Arrow Vector is private member in > ArrowColumnVector. We need to expose the Arrow Vector from ArrowColumnVector. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38028) Expose Arrow Vector from ArrowColumnVector
L. C. Hsieh created SPARK-38028: --- Summary: Expose Arrow Vector from ArrowColumnVector Key: SPARK-38028 URL: https://issues.apache.org/jira/browse/SPARK-38028 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: L. C. Hsieh In some cases we need to work with Arrow Vectors behind ColumnVector using Arrow APIs. For example, some Spark extension libraries need to consume Arrow Vectors. For now, it is impossible as the Arrow Vector is private member in ArrowColumnVector. We need to expose the Arrow Vector from ArrowColumnVector. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
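The change this ticket asks for is, in shape, an accessor on a private member. A minimal pure-Python sketch of that shape follows; it is not the actual Java `ArrowColumnVector`, and the accessor name is made up for illustration.

```python
# Sketch of the API-shape change in SPARK-38028: the column vector keeps its
# backing Arrow vector private, and a new accessor exposes it so extension
# libraries can hand the whole vector to Arrow-based APIs instead of reading
# value by value.

class FakeArrowVector:
    """Stand-in for an Arrow ValueVector."""
    def __init__(self, values):
        self.values = values

class ArrowColumnVectorSketch:
    def __init__(self, vector):
        self._vector = vector  # private backing vector, as before the change

    def get_int(self, row_id):
        # Existing per-value API: sufficient for Spark internals, but forces
        # element-wise access on consumers that want the vector itself.
        return self._vector.values[row_id]

    def get_value_vector(self):
        # The proposed addition (name hypothetical): expose the backing
        # vector directly, without copying.
        return self._vector

vec = ArrowColumnVectorSketch(FakeArrowVector([10, 20, 30]))
assert vec.get_int(1) == 20
assert vec.get_value_vector().values == [10, 20, 30]
```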
[jira] [Comment Edited] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.
[ https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482174#comment-17482174 ] Haejoon Lee edited comment on SPARK-38004 at 1/26/22, 1:34 AM: --- [~Saikrishna_Pujari] Thanks for reporting the issue! Actually, the ambiguity can be handled by setting `spark.conf.set("spark.sql.caseSensitive","true")`, so I think we can document this as a workaround for now. Are you interested in creating a PR? was (Author: itholic): [~Saikrishna_Pujari] Thanks for the report the issue! Actually the ambiguous issue can be handled by setting `spark.conf.set("spark.sql.caseSensitive","true")`, so I think we can documents this as a workaround for now. Do you mind to submit a PR ?? > read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns > but fails if the duplicate columns are case sensitive. > > > Key: SPARK-38004 > URL: https://issues.apache.org/jira/browse/SPARK-38004 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Saikrishna Pujari >Priority: Minor > > mangle_dupe_cols - default is True > So ideally it should have handled duplicate columns, but in case the columns > are case sensitive it fails as below. > AnalysisException: Reference '{{{}Sheet.col{}}}' is ambiguous, could be > Sheet.col, Sheet.col. > Where two columns are Col and cOL > In the best practices, there is a mention of not to use case sensitive > columns - > [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names] > Either the docs for read_excel/mangle_dupe_cols have to be updated about this > or it has to be handled. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.
[ https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-38004: Affects Version/s: 3.2.0 (was: 3.1.2) > read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns > but fails if the duplicate columns are case sensitive. > > > Key: SPARK-38004 > URL: https://issues.apache.org/jira/browse/SPARK-38004 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Saikrishna Pujari >Priority: Minor > > mangle_dupe_cols - default is True > So ideally it should have handled duplicate columns, but in case the columns > are case sensitive it fails as below. > AnalysisException: Reference '{{{}Sheet.col{}}}' is ambiguous, could be > Sheet.col, Sheet.col. > Where two columns are Col and cOL > In the best practices, there is a mention of not to use case sensitive > columns - > [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names] > Either the docs for read_excel/mangle_dupe_cols have to be updated about this > or it has to be handled. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.
[ https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482174#comment-17482174 ] Haejoon Lee edited comment on SPARK-38004 at 1/26/22, 1:33 AM: --- [~Saikrishna_Pujari] Thanks for reporting the issue! Actually, the ambiguity can be handled by setting `spark.conf.set("spark.sql.caseSensitive","true")`, so I think we can document this as a workaround for now. Would you mind submitting a PR? was (Author: itholic): [~Saikrishna_Pujari] Thanks for the report the issue! Actually the ambiguous issue can be handled by setting `spark.conf.set("spark.sql.caseSensitive","true")`, so I think we can documents this workaround for now. Do you mind to submit a PR ?? > read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns > but fails if the duplicate columns are case sensitive. > > > Key: SPARK-38004 > URL: https://issues.apache.org/jira/browse/SPARK-38004 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Saikrishna Pujari >Priority: Minor > > mangle_dupe_cols - default is True > So ideally it should have handled duplicate columns, but in case the columns > are case sensitive it fails as below. > AnalysisException: Reference '{{{}Sheet.col{}}}' is ambiguous, could be > Sheet.col, Sheet.col. > Where two columns are Col and cOL > In the best practices, there is a mention of not to use case sensitive > columns - > [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names] > Either the docs for read_excel/mangle_dupe_cols have to be updated about this > or it has to be handled. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.
[ https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482174#comment-17482174 ] Haejoon Lee edited comment on SPARK-38004 at 1/26/22, 1:33 AM: --- [~Saikrishna_Pujari] Thanks for reporting the issue! Actually, the ambiguity can be handled by setting `spark.conf.set("spark.sql.caseSensitive","true")`, so I think we can document this workaround for now. Would you mind submitting a PR? was (Author: itholic): [~Saikrishna_Pujari] Thanks for the report the issue! Setting `spark.conf.set("spark.sql.caseSensitive","true")` would make `mangle_dupe_cols` work, so I think we can documents this workaround for now. Do you want to submit a PR ?? > read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns > but fails if the duplicate columns are case sensitive. > > > Key: SPARK-38004 > URL: https://issues.apache.org/jira/browse/SPARK-38004 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Saikrishna Pujari >Priority: Minor > > mangle_dupe_cols - default is True > So ideally it should have handled duplicate columns, but in case the columns > are case sensitive it fails as below. > AnalysisException: Reference '{{{}Sheet.col{}}}' is ambiguous, could be > Sheet.col, Sheet.col. > Where two columns are Col and cOL > In the best practices, there is a mention of not to use case sensitive > columns - > [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names] > Either the docs for read_excel/mangle_dupe_cols have to be updated about this > or it has to be handled. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.
[ https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482174#comment-17482174 ] Haejoon Lee commented on SPARK-38004: - [~Saikrishna_Pujari] Thanks for reporting the issue! Setting `spark.conf.set("spark.sql.caseSensitive","true")` would make `mangle_dupe_cols` work, so I think we can document this workaround for now. > read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns > but fails if the duplicate columns are case sensitive. > > > Key: SPARK-38004 > URL: https://issues.apache.org/jira/browse/SPARK-38004 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Saikrishna Pujari >Priority: Minor > > mangle_dupe_cols - default is True > So ideally it should have handled duplicate columns, but in case the columns > are case sensitive it fails as below. > AnalysisException: Reference '{{{}Sheet.col{}}}' is ambiguous, could be > Sheet.col, Sheet.col. > Where two columns are Col and cOL > In the best practices, there is a mention of not to use case sensitive > columns - > [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names] > Either the docs for read_excel/mangle_dupe_cols have to be updated about this > or it has to be handled. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.
[ https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482174#comment-17482174 ] Haejoon Lee edited comment on SPARK-38004 at 1/26/22, 1:30 AM: --- [~Saikrishna_Pujari] Thanks for reporting the issue! Setting `spark.conf.set("spark.sql.caseSensitive","true")` would make `mangle_dupe_cols` work, so I think we can document this workaround for now. Do you want to submit a PR? was (Author: itholic): [~Saikrishna_Pujari] Thanks for the report the issue! Setting `spark.conf.set("spark.sql.caseSensitive","true")` would make `mangle_dupe_cols` work, so I think we can documents this workaround for now. > read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns > but fails if the duplicate columns are case sensitive. > > > Key: SPARK-38004 > URL: https://issues.apache.org/jira/browse/SPARK-38004 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Saikrishna Pujari >Priority: Minor > > mangle_dupe_cols - default is True > So ideally it should have handled duplicate columns, but in case the columns > are case sensitive it fails as below. > AnalysisException: Reference '{{{}Sheet.col{}}}' is ambiguous, could be > Sheet.col, Sheet.col. > Where two columns are Col and cOL > In the best practices, there is a mention of not to use case sensitive > columns - > [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names] > Either the docs for read_excel/mangle_dupe_cols have to be updated about this > or it has to be handled. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
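The behavior the reporter expected from `mangle_dupe_cols` — but applied case-insensitively, to match Spark's default case-insensitive analysis — can be sketched in plain Python. This is an illustration only, not the pandas-on-Spark implementation; the function name is made up.

```python
# pandas' mangle_dupe_cols renames exact duplicates ("col", "col" ->
# "col", "col.1"). With spark.sql.caseSensitive=false (the default),
# "Col" and "cOL" also collide, so this sketch keys the dedup counter
# on the casefolded name instead of the raw name.

def mangle_case_insensitive(columns):
    seen = {}   # casefolded name -> number of occurrences so far
    out = []
    for name in columns:
        key = name.casefold()
        n = seen.get(key, 0)
        # First occurrence keeps its name; later ones get a ".N" suffix.
        out.append(name if n == 0 else f"{name}.{n}")
        seen[key] = n + 1
    return out

assert mangle_case_insensitive(["Col", "cOL", "other"]) == ["Col", "cOL.1", "other"]
```

With names mangled this way, neither `spark.sql.caseSensitive=true` nor a documentation caveat would be needed to avoid the "Reference ... is ambiguous" AnalysisException.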
[jira] [Assigned] (SPARK-38015) Mark legacy file naming functions as deprecated in FileCommitProtocol
[ https://issues.apache.org/jira/browse/SPARK-38015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38015: Assignee: Cheng Su > Mark legacy file naming functions as deprecated in FileCommitProtocol > - > > Key: SPARK-38015 > URL: https://issues.apache.org/jira/browse/SPARK-38015 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Minor > > [FileCommitProtocol|https://github.com/apache/spark/blob/6bbfb45ffe75aa6c27a7bf3c3385a596637d1822/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala] > is the class to commit Spark job output (staging file & directory renaming, > etc). During Spark 3.2 development, we added new functions into this class to > allow more flexible output file naming (the PR detail is > [here|https://github.com/apache/spark/pull/33012]). We didn’t delete the > existing file naming functions (newTaskTempFile(ext) & > newTaskTempFileAbsPath(ext)), because we were aware of many other downstream > projects or codebases already implemented their own custom implementation for > FileCommitProtocol. Delete the existing functions would be a breaking change > for them when upgrading Spark version, and we would like to avoid this > unpleasant surprise for anyone if possible. But we also need to clean up > legacy as we evolve our codebase. > So for next step, I would like to propose: > * Spark 3.3 (now): Add @deprecate annotation to legacy functions in > FileCommitProtocol - > [newTaskTempFile(ext)|https://github.com/apache/spark/blob/6bbfb45ffe75aa6c27a7bf3c3385a596637d1822/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala#L98] > & > [newTaskTempFileAbsPath(ext)|https://github.com/apache/spark/blob/6bbfb45ffe75aa6c27a7bf3c3385a596637d1822/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala#L135]. 
> * Next Spark major release (or whenever people feel comfortable): delete the > legacy functions mentioned above from our codebase. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38015) Mark legacy file naming functions as deprecated in FileCommitProtocol
[ https://issues.apache.org/jira/browse/SPARK-38015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38015. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35311 [https://github.com/apache/spark/pull/35311] > Mark legacy file naming functions as deprecated in FileCommitProtocol > - > > Key: SPARK-38015 > URL: https://issues.apache.org/jira/browse/SPARK-38015 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Minor > Fix For: 3.3.0 > > > [FileCommitProtocol|https://github.com/apache/spark/blob/6bbfb45ffe75aa6c27a7bf3c3385a596637d1822/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala] > is the class to commit Spark job output (staging file & directory renaming, > etc). During Spark 3.2 development, we added new functions into this class to > allow more flexible output file naming (the PR detail is > [here|https://github.com/apache/spark/pull/33012]). We didn’t delete the > existing file naming functions (newTaskTempFile(ext) & > newTaskTempFileAbsPath(ext)), because we were aware of many other downstream > projects or codebases already implemented their own custom implementation for > FileCommitProtocol. Delete the existing functions would be a breaking change > for them when upgrading Spark version, and we would like to avoid this > unpleasant surprise for anyone if possible. But we also need to clean up > legacy as we evolve our codebase. 
> So for next step, I would like to propose: > * Spark 3.3 (now): Add @deprecate annotation to legacy functions in > FileCommitProtocol - > [newTaskTempFile(ext)|https://github.com/apache/spark/blob/6bbfb45ffe75aa6c27a7bf3c3385a596637d1822/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala#L98] > & > [newTaskTempFileAbsPath(ext)|https://github.com/apache/spark/blob/6bbfb45ffe75aa6c27a7bf3c3385a596637d1822/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala#L135]. > * Next Spark major release (or whenever people feel comfortable): delete the > legacy functions mentioned above from our codebase. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37793) Invalid LocalMergedBlockData cause task hang
[ https://issues.apache.org/jira/browse/SPARK-37793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482148#comment-17482148 ] Apache Spark commented on SPARK-37793: -- User 'otterc' has created a pull request for this issue: https://github.com/apache/spark/pull/35325 > Invalid LocalMergedBlockData cause task hang > > > Key: SPARK-37793 > URL: https://issues.apache.org/jira/browse/SPARK-37793 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.3.0 >Reporter: Cheng Pan >Priority: Critical > > When enable push-based shuffle, there is a chance that task hang > > {code:java} > 59Executor task launch worker for task 424.0 in stage 753.0 (TID 106778) > WAITING Lock(java.util.concurrent.ThreadPoolExecutor$Worker@1660371198}) > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2044) > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:753) > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:85) > org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) > scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) > scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.sort_addToSorter_0$(Unknown > Source) > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown > Source) > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.smj_findNextJoinRows_0$(Unknown > Source) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.hashAgg_doAggregateWithKeys_1$(Unknown > Source) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.hashAgg_doAggregateWithKeys_0$(Unknown > Source) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.hashAgg_doAggregateWithoutKey_0$(Unknown > Source) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.processNext(Unknown > Source) > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779) > scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140) > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) > org.apache.spark.scheduler.Task.run(Task.scala:136) > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507) > org.apache.spark.executor.Executor$TaskRunner$$Lambda$518/852390142.apply(Unknown > Source) > org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1470) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510) > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > java.lang.Thread.run(Thread.java:748) > {code} > ShuffleBlockFetcherIterator.scala:753 > {code:java} > while (result == null) { > val startFetchWait = System.nanoTime() > 753> result = results.take() > val fetchWaitTime = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - > startFetchWait) > shuffleMetrics.incFetchWaitTime(fetchWaitTime) > .. > } > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37793) Invalid LocalMergedBlockData cause task hang
[ https://issues.apache.org/jira/browse/SPARK-37793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482149#comment-17482149 ] Apache Spark commented on SPARK-37793: -- User 'otterc' has created a pull request for this issue: https://github.com/apache/spark/pull/35325 > Invalid LocalMergedBlockData cause task hang > > > Key: SPARK-37793 > URL: https://issues.apache.org/jira/browse/SPARK-37793 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.3.0 >Reporter: Cheng Pan >Priority: Critical > > When enable push-based shuffle, there is a chance that task hang > > {code:java} > 59Executor task launch worker for task 424.0 in stage 753.0 (TID 106778) > WAITING Lock(java.util.concurrent.ThreadPoolExecutor$Worker@1660371198}) > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2044) > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:753) > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:85) > org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) > scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) > scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.sort_addToSorter_0$(Unknown > Source) > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown > Source) > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.smj_findNextJoinRows_0$(Unknown > Source) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.hashAgg_doAggregateWithKeys_1$(Unknown > Source) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.hashAgg_doAggregateWithKeys_0$(Unknown > Source) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.hashAgg_doAggregateWithoutKey_0$(Unknown > Source) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.processNext(Unknown > Source) > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779) > scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140) > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) > org.apache.spark.scheduler.Task.run(Task.scala:136) > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507) > org.apache.spark.executor.Executor$TaskRunner$$Lambda$518/852390142.apply(Unknown > Source) > org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1470) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510) > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > java.lang.Thread.run(Thread.java:748) > {code} > ShuffleBlockFetcherIterator.scala:753 > {code:java} > while (result == null) { > val startFetchWait = System.nanoTime() > 753> result = results.take() > val fetchWaitTime = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - > startFetchWait) > shuffleMetrics.incFetchWaitTime(fetchWaitTime) > .. > } > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
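The hang above happens because `ShuffleBlockFetcherIterator` blocks indefinitely on `results.take()`: if the expected fetch result is never enqueued (here, because the local merged block data was invalid), the task waits forever. A minimal Python analogue of the pattern — not Spark's actual code — shows the failure mode and how a bounded wait would surface the stall instead of hanging:

```python
import queue
import threading

def fetch_result(results: queue.Queue, timeout_s: float = 5.0):
    """Analogue of the consumer loop around `results.take()` above.

    A bare blocking get (like Java's LinkedBlockingQueue.take) never returns
    if the producer drops the result on the floor. Waiting with a timeout
    turns the silent hang into a diagnosable error.
    """
    try:
        return results.get(timeout=timeout_s)
    except queue.Empty:
        raise TimeoutError("no fetch result arrived; the fetch may have been dropped")

# Happy path: a producer thread delivers the block shortly after we start waiting.
results = queue.Queue()
threading.Timer(0.1, lambda: results.put("block-data")).start()
print(fetch_result(results))  # prints "block-data"
```

The fix in the linked PR takes a different route (detecting the invalid merged block so a result is always enqueued); the timeout here is only to illustrate why the unbounded wait manifests as a hung task rather than a failure.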
[jira] [Commented] (SPARK-37675) Return PushMergedRemoteMetaFailedFetchResult if no available push-merged block
[ https://issues.apache.org/jira/browse/SPARK-37675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482146#comment-17482146 ] Apache Spark commented on SPARK-37675: -- User 'otterc' has created a pull request for this issue: https://github.com/apache/spark/pull/35325 > Return PushMergedRemoteMetaFailedFetchResult if no available push-merged block > -- > > Key: SPARK-37675 > URL: https://issues.apache.org/jira/browse/SPARK-37675 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37980) Extend METADATA column to support row indices for file based data sources
[ https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482142#comment-17482142 ] Prakhar Jain commented on SPARK-37980: -- Yes - this needs implementation in the underlying datasources such as parquet/orc. Also Spark uses the underlying ParquetRecordReader from parquet-mr to read a parquet file. All the row group skipping/column index filtering happens as part of parquet-mr. So I guess this will need the row index support from parquet-mr. The other way is to replicate some of the parquet-mr RecordReader code in Spark - which is not ideal. > Extend METADATA column to support row indices for file based data sources > - > > Key: SPARK-37980 > URL: https://issues.apache.org/jira/browse/SPARK-37980 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3 >Reporter: Prakhar Jain >Priority: Major > > Spark recently added hidden metadata column support for File based > datasources as part of SPARK-37273. > We should extend it to support ROW_INDEX/ROW_POSITION also. > > Meaning of ROW_POSITION: > ROW_INDEX/ROW_POSITION is basically an index of a row within a file. E.g. 5th > row in the file will have ROW_INDEX 5. > > Use cases: > Row Indexes can be used in a variety of ways. A (fileName, rowIndex) tuple > uniquely identifies row in a table. This information can be used to mark rows > e.g. this can be used by indexer etc. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
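The core idea in the ticket — that a (fileName, rowIndex) pair uniquely identifies a row across a table — can be sketched in a few lines. This is an illustration of the proposed semantics, not Spark's metadata-column API (the ticket's "5th row has ROW_INDEX 5" wording leaves the base ambiguous; 0-based indexing is assumed here):

```python
def with_row_index(rows_by_file):
    """Attach (_file, _row_index) to every row. Within one file the index is
    the row's position; combined with the file name it is unique table-wide,
    which is what lets an indexer mark individual rows."""
    out = []
    for file_name, rows in rows_by_file.items():
        for idx, row in enumerate(rows):  # idx restarts at 0 for each file
            out.append({"_file": file_name, "_row_index": idx, **row})
    return out
```

As the comment notes, doing this efficiently for Parquet is the hard part: row-group skipping inside parquet-mr means Spark does not naturally see absolute row positions, so the index either has to come from parquet-mr itself or from replicating its record-reader logic.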
[jira] [Updated] (SPARK-38022) Use relativePath for K8s remote file test in BasicTestsSuite
[ https://issues.apache.org/jira/browse/SPARK-38022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38022: -- Fix Version/s: (was: 3.2.2) > Use relativePath for K8s remote file test in BasicTestsSuite > > > Key: SPARK-38022 > URL: https://issues.apache.org/jira/browse/SPARK-38022 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > > *BEFORE* > {code:java} > $ build/sbt -Pkubernetes -Pkubernetes-integration-tests > -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17 > -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test" > ... > [info] KubernetesSuite: > ... > [info] - Run SparkRemoteFileTest using a remote data file *** FAILED *** (3 > minutes, 3 seconds) > [info] The code passed to eventually never returned normally. Attempted 190 > times over 3.01226506667 minutes. Last failure message: false was not > true. (KubernetesSuite.scala:452) > ... {code} > *AFTER* > {code:java} > $ build/sbt -Pkubernetes -Pkubernetes-integration-tests > -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17 > -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test" > ... > [info] KubernetesSuite: > ... > [info] - Run SparkRemoteFileTest using a remote data file (8 seconds, 608 > milliseconds){code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38023) ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as finished
[ https://issues.apache.org/jira/browse/SPARK-38023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38023: -- Fix Version/s: 3.2.2 (was: 3.2.1) > ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as > finished > > > Key: SPARK-38023 > URL: https://issues.apache.org/jira/browse/SPARK-38023 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.1.3, 3.2.1, 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0, 3.2.2 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38023) ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as finished
[ https://issues.apache.org/jira/browse/SPARK-38023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38023: - Assignee: Dongjoon Hyun > ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as > finished > > > Key: SPARK-38023 > URL: https://issues.apache.org/jira/browse/SPARK-38023 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.1.3, 3.2.1, 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38023) ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as finished
[ https://issues.apache.org/jira/browse/SPARK-38023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38023. --- Fix Version/s: 3.3.0 3.2.1 Resolution: Fixed Issue resolved by pull request 35321 [https://github.com/apache/spark/pull/35321] > ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as > finished > > > Key: SPARK-38023 > URL: https://issues.apache.org/jira/browse/SPARK-38023 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.1.3, 3.2.1, 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0, 3.2.1 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38027) Undefined link function causing error in GLM that uses Tweedie family
[ https://issues.apache.org/jira/browse/SPARK-38027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482128#comment-17482128 ] Evan Zamir commented on SPARK-38027: Looking into this further I think the issue is arising upon serializing the model either logging it or persisting it to disk. From my logs: 2022-01-25 14:21:33,664 root ERROR An error occurred while calling o1538.toString. : java.util.NoSuchElementException: Failed to find a default value for link at org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756) at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753) at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41) at org.apache.spark.ml.param.Params.$(params.scala:762) at org.apache.spark.ml.param.Params.$$(params.scala:762) at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41) at org.apache.spark.ml.regression.GeneralizedLinearRegressionModel.toString(GeneralizedLinearRegression.scala:1117) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) > Undefined link function causing error in GLM that uses Tweedie family > - > > Key: SPARK-38027 > URL: https://issues.apache.org/jira/browse/SPARK-38027 > Project: Spark > Issue Type: Bug > Components: ML 
>Affects Versions: 3.1.2 > Environment: Running on Mac OS X Monterey >Reporter: Evan Zamir >Priority: Major > Labels: GLM, pyspark > > I am trying to use the GLM regression with a Tweedie distribution so I can > model insurance use cases. I have set up a very simple example adapted from > the docs: > {code:python} > def create_fake_losses_data(self): > df = self._spark.createDataFrame([ > ("a", 100.0, 12, 1, Vectors.dense(0.0, 0.0)), > ("b", 0.0, 12, 1, Vectors.dense(1.0, 2.0)), > ("c", 0.0, 12, 1, Vectors.dense(0.0, 0.0)), > ("d", 2000.0, 12, 1, Vectors.dense(1.0, 1.0)), ], ["user", > "label", "offset", "weight", "features"]) > logging.info(df.collect()) > setattr(self, 'fake_data', df) > try: > glr = GeneralizedLinearRegression( > family="tweedie", variancePower=1.5, linkPower=-1, > offsetCol='offset') > glr.setRegParam(0.3) > model = glr.fit(df) > logging.info(model) > except Py4JJavaError as e: > print(e) > return self > {code} > This causes the following error: > *py4j.protocol.Py4JJavaError: An error occurred while calling o99.toString. 
> : java.util.NoSuchElementException: Failed to find a default value for link* > at > org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756) > at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753) > at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41) > at org.apache.spark.ml.param.Params.$(params.scala:762) > at org.apache.spark.ml.param.Params.$$(params.scala:762) > at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41) > at > org.apache.spark.ml.regression.GeneralizedLinearRegressionModel.toString(GeneralizedLinearRegression.scala:1117) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:282) > at >
[jira] [Commented] (SPARK-37896) ConstantColumnVector: a column vector with same values
[ https://issues.apache.org/jira/browse/SPARK-37896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482117#comment-17482117 ] Apache Spark commented on SPARK-37896: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/35324 > ConstantColumnVector: a column vector with same values > -- > > Key: SPARK-37896 > URL: https://issues.apache.org/jira/browse/SPARK-37896 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yaohua Zhao >Assignee: Yaohua Zhao >Priority: Major > Fix For: 3.3.0 > > > Introduce a new column vector named `ConstantColumnVector`, it represents a > column vector where every row has the same constant value. > It could help improve performance on hidden file metadata columnar file > format, since metadata fields for every row in each file are exactly the > same, we don't need to copy and keep multiple copies of data. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
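The space saving described above — metadata fields are identical for every row of a file, so one stored value can serve the whole column — can be sketched with a minimal Python analogue (Spark's `ConstantColumnVector` is a Java class with a much richer typed API; this only shows the idea):

```python
class ConstantColumnVector:
    """A column vector where every row holds the same value.

    Storage is O(1) regardless of row count: reads return the single stored
    constant instead of copying it into a per-row buffer, which is exactly
    what makes it a good fit for hidden file-metadata columns."""

    def __init__(self, num_rows: int, value):
        self._num_rows = num_rows
        self._value = value

    def get(self, row_id: int):
        if not 0 <= row_id < self._num_rows:
            raise IndexError(f"row {row_id} out of range [0, {self._num_rows})")
        return self._value  # same constant for every valid row
```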
[jira] [Created] (SPARK-38027) Undefined link function causing error in GLM that uses Tweedie family
Evan Zamir created SPARK-38027: -- Summary: Undefined link function causing error in GLM that uses Tweedie family Key: SPARK-38027 URL: https://issues.apache.org/jira/browse/SPARK-38027 Project: Spark Issue Type: Bug Components: ML Affects Versions: 3.1.2 Environment: Running on Mac OS X Monterey Reporter: Evan Zamir I am trying to use the GLM regression with a Tweedie distribution so I can model insurance use cases. I have set up a very simple example adapted from the docs: {code:python} def create_fake_losses_data(self): df = self._spark.createDataFrame([ ("a", 100.0, 12, 1, Vectors.dense(0.0, 0.0)), ("b", 0.0, 12, 1, Vectors.dense(1.0, 2.0)), ("c", 0.0, 12, 1, Vectors.dense(0.0, 0.0)), ("d", 2000.0, 12, 1, Vectors.dense(1.0, 1.0)), ], ["user", "label", "offset", "weight", "features"]) logging.info(df.collect()) setattr(self, 'fake_data', df) try: glr = GeneralizedLinearRegression( family="tweedie", variancePower=1.5, linkPower=-1, offsetCol='offset') glr.setRegParam(0.3) model = glr.fit(df) logging.info(model) except Py4JJavaError as e: print(e) return self {code} This causes the following error: *py4j.protocol.Py4JJavaError: An error occurred while calling o99.toString. 
: java.util.NoSuchElementException: Failed to find a default value for link* at org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756) at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753) at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41) at org.apache.spark.ml.param.Params.$(params.scala:762) at org.apache.spark.ml.param.Params.$$(params.scala:762) at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41) at org.apache.spark.ml.regression.GeneralizedLinearRegressionModel.toString(GeneralizedLinearRegression.scala:1117) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) I was under the assumption that the default value for link is None, if not defined otherwise. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
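The stack trace points at `Params.getOrDefault` raising because the `link` param has neither an explicit value nor a default — for `family="tweedie"` the link is specified via `linkPower`, so `link` is legitimately unset, yet `toString` reads it unconditionally. A minimal Python analogue of that lookup logic (not Spark ML's implementation) reproduces the mechanism:

```python
class Params:
    """Sketch of Spark ML's param lookup: explicit value wins, then the
    default; if neither exists the lookup raises instead of returning None,
    which is why printing/logging a tweedie model blows up on `link`."""

    def __init__(self):
        self._values = {}
        self._defaults = {}

    def set(self, name, value):
        self._values[name] = value

    def set_default(self, name, value):
        self._defaults[name] = value

    def get_or_default(self, name):
        if name in self._values:
            return self._values[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(f"Failed to find a default value for {name}")
```

This matches the reporter's observation: the assumption that an undefined `link` silently defaults to None does not hold — there is simply no default registered for it when the family is tweedie, so any code path that touches it (here, `toString` during logging/persisting) fails.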
[jira] [Commented] (SPARK-38026) Sorting in Executors summary table in Stages Page is broken
[ https://issues.apache.org/jira/browse/SPARK-38026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482080#comment-17482080 ] Thejdeep Gudivada commented on SPARK-38026: --- Duplicate of https://issues.apache.org/jira/browse/SPARK-35087 > Sorting in Executors summary table in Stages Page is broken > --- > > Key: SPARK-38026 > URL: https://issues.apache.org/jira/browse/SPARK-38026 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.2 >Reporter: Thejdeep Gudivada >Priority: Major > Attachments: image (5).png > > > Sorting of certain columns in the Executors Summary table in the Stages Page > is broken as it ignores the size units in the field value. > For example, shown in the attachment, sorting the Input Size / Records column > in a decreasing order shows the error. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38026) Sorting in Executors summary table in Stages Page is broken
[ https://issues.apache.org/jira/browse/SPARK-38026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejdeep Gudivada resolved SPARK-38026. --- Resolution: Duplicate > Sorting in Executors summary table in Stages Page is broken > --- > > Key: SPARK-38026 > URL: https://issues.apache.org/jira/browse/SPARK-38026 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.2 >Reporter: Thejdeep Gudivada >Priority: Major > Attachments: image (5).png > > > Sorting of certain columns in the Executors Summary table in the Stages Page > is broken as it ignores the size units in the field value. > For example, shown in the attachment, sorting the Input Size / Records column > in a decreasing order shows the error. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38026) Sorting in Executors summary table in Stages Page is broken
[ https://issues.apache.org/jira/browse/SPARK-38026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejdeep Gudivada updated SPARK-38026: -- Description: Sorting of certain columns in the Executors Summary table in the Stages Page is broken as it ignores the size units in the field value. For example, shown in the attachment, sorting the Input Size / Records column in a decreasing order shows the error. was: Sorting of certain columns in the Executors Summary table in the Stages Page is broken as it ignores the size units in the field value. For example, sorting the Input Size / Records column in a decreasing order shows the error. !image-2022-01-25-11-47-46-201.png! > Sorting in Executors summary table in Stages Page is broken > --- > > Key: SPARK-38026 > URL: https://issues.apache.org/jira/browse/SPARK-38026 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.2 >Reporter: Thejdeep Gudivada >Priority: Major > Attachments: image (5).png > > > Sorting of certain columns in the Executors Summary table in the Stages Page > is broken as it ignores the size units in the field value. > For example, shown in the attachment, sorting the Input Size / Records column > in a decreasing order shows the error. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38026) Sorting in Executors summary table in Stages Page is broken
Thejdeep Gudivada created SPARK-38026: - Summary: Sorting in Executors summary table in Stages Page is broken Key: SPARK-38026 URL: https://issues.apache.org/jira/browse/SPARK-38026 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.1.2 Reporter: Thejdeep Gudivada Attachments: image (5).png Sorting of certain columns in the Executors Summary table in the Stages Page is broken as it ignores the size units in the field value. For example, sorting the Input Size / Records column in a decreasing order shows the error. !image-2022-01-25-11-47-46-201.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
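The sorting bug described here is the classic lexicographic-vs-numeric problem: "9.0 KB" sorts after "1.0 GB" when the rendered strings are compared directly. The usual fix is to sort on the parsed byte count. A small illustrative parser (the unit set and binary multipliers are assumptions, not Spark UI's exact table code — the actual fix lives in the Stage page's JavaScript sorter):

```python
_UNITS = {"B": 1, "KB": 1024, "KiB": 1024, "MB": 1024 ** 2,
          "MiB": 1024 ** 2, "GB": 1024 ** 3, "GiB": 1024 ** 3}

def size_to_bytes(text: str) -> float:
    """Parse a human-readable size like '1.5 MB' into bytes so that table
    rows can be ordered numerically instead of as strings."""
    value, unit = text.split()
    return float(value) * _UNITS[unit]

rows = ["9.0 KB", "1.0 GB", "512.0 B"]
rows.sort(key=size_to_bytes)  # numeric order, not string order
```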
[jira] [Updated] (SPARK-38026) Sorting in Executors summary table in Stages Page is broken
[ https://issues.apache.org/jira/browse/SPARK-38026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejdeep Gudivada updated SPARK-38026: -- Attachment: image (5).png > Sorting in Executors summary table in Stages Page is broken > --- > > Key: SPARK-38026 > URL: https://issues.apache.org/jira/browse/SPARK-38026 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.2 >Reporter: Thejdeep Gudivada >Priority: Major > Attachments: image (5).png > > > Sorting of certain columns in the Executors Summary table in the Stages Page > is broken as it ignores the size units in the field value. > For example, sorting the Input Size / Records column in a decreasing order > shows the error. > !image-2022-01-25-11-47-46-201.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34372) Speculation results in broken CSV files in Amazon S3
[ https://issues.apache.org/jira/browse/SPARK-34372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482041#comment-17482041 ] Attila Zsolt Piros commented on SPARK-34372: hi [~daeheh]! Please look around here: https://spark.apache.org/docs/3.2.0/cloud-integration.html and switch to s3a. > Speculation results in broken CSV files in Amazon S3 > > > Key: SPARK-34372 > URL: https://issues.apache.org/jira/browse/SPARK-34372 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.7 > Environment: Amazon EMR with AMI version 5.32.0 >Reporter: Daehee Han >Priority: Minor > Labels: csv, s3, spark, speculation > > Hi, we've been experiencing some rows get corrupted while partitioned CSV > files were written to Amazon S3. Some records were found broken without any > error on Spark. Digging into the root cause, we found out Spark speculation > tried to upload a partition being uploaded slowly and ended up uploading only > a part of the partition, letting broken data uploaded to S3. > Here're stacktraces we've found. There are two executor involved - A: the > first executor which tried to upload the file, but it took much longer than > other executor (but still succeeded), which made spark speculation cut in and > kick off another executor B. Executor B started to upload the file too, but > was interrupted during uploading (killed: another attempt succeeded), and > ended up uploading only a part of the whole file. You can see in the log, the > file executor A uploaded (8461990 bytes originally) was overwritten by > executor B (uploaded only 3145728 bytes). 
> > Executor A: > {quote}21/01/28 17:22:21 INFO Executor: Running task 426.0 in stage 45.0 (TID > 13201) > 21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty > blocks including 10 local blocks and 460 remote blocks > 21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Started 46 remote > fetches in 18 ms > 21/01/28 17:22:21 INFO FileOutputCommitter: File Output Committer Algorithm > version is 2 > 21/01/28 17:22:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup > _temporary folders under output directory:false, ignore cleanup failures: > true > 21/01/28 17:22:21 INFO DirectFileOutputCommitter: Direct Write: ENABLED > 21/01/28 17:22:21 INFO SQLConfCommitterProvider: Using output committer class > 21/01/28 17:22:21 INFO INFO CSEMultipartUploadOutputStream: close > closed:false > s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv > 21/01/28 17:22:31 INFO DefaultMultipartUploadDispatcher: Completed multipart > upload of 1 parts 8461990 bytes > 21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: Finished uploading > \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. > Elapsed seconds: 10. > 21/01/28 17:22:31 INFO SparkHadoopMapRedUtil: No need to commit output of > task because needsTaskCommit=false: > attempt_20210128172219_0045_m_000426_13201 > 21/01/28 17:22:31 INFO Executor: Finished task 426.0 in stage 45.0 (TID > 13201). 
8782 bytes result sent to driver > {quote} > Executor B: > {quote}21/01/28 17:22:31 INFO CoarseGrainedExecutorBackend: Got assigned task > 13245 21/01/28 17:22:31 INFO Executor: Running task 426.1 in stage 45.0 (TID > 13245) > 21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty > blocks including 11 local blocks and 459 remote blocks > 21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Started 46 remote > fetches in 2 ms > 21/01/28 17:22:31 INFO FileOutputCommitter: File Output Committer Algorithm > version is 2 > 21/01/28 17:22:31 INFO FileOutputCommitter: FileOutputCommitter skip cleanup > _temporary folders under output directory:false, ignore cleanup failures: > true > 21/01/28 17:22:31 INFO DirectFileOutputCommitter: Direct Write: ENABLED > 21/01/28 17:22:31 INFO SQLConfCommitterProvider: Using output committer > class org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter > 21/01/28 17:22:31 INFO Executor: Executor is trying to kill task 426.1 in > stage 45.0 (TID 13245), reason: another attempt succeeded > 21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: close closed:false > s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv > 21/01/28 17:22:32 INFO DefaultMultipartUploadDispatcher: Completed multipart > upload of 1 parts 3145728 bytes > 21/01/28 17:22:32 INFO CSEMultipartUploadOutputStream: Finished uploading > \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. > Elapsed seconds: 0. > 21/01/28 17:22:32 ERROR Utils: Aborting task > com.univocity.parsers.common.TextWritingException: Error writing row. > Internal state
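The race described above (a killed speculative attempt overwriting a complete upload via a direct-write committer) is usually mitigated in configuration. Below is a hedged sketch, assembled from the Spark cloud-integration guide that the comment links to, not from the ticket itself; it builds the relevant conf entries as a plain dict so the sketch stays self-contained, and the keys would normally be passed via `SparkSession.builder.config(...)` or `spark-defaults.conf`.

```python
# Hedged sketch: Spark/Hadoop options commonly used to avoid partial uploads
# from speculative task attempts when writing to S3 through s3a. The option
# keys are real; how you apply them to a session is up to your deployment.

def safe_s3_write_conf():
    """Return conf entries for committing output to S3 via the s3a committers."""
    return {
        # Speculative attempts can race on the same output object when a
        # direct-write committer is in use; disabling speculation removes
        # the race entirely.
        "spark.speculation": "false",
        # Use the S3A staging/directory committer instead of direct write,
        # as recommended by the cloud-integration guide.
        "spark.hadoop.fs.s3a.committer.name": "directory",
        "spark.sql.sources.commitProtocolClass":
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
        # Only relevant for Parquet output, but harmless for CSV jobs.
        "spark.sql.parquet.output.committer.class":
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    }

conf = safe_s3_write_conf()
```

Disabling speculation alone sidesteps the corruption at the cost of losing straggler mitigation; switching to the s3a committers addresses the root cause (task commit is no longer a direct overwrite of the final object).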
[jira] [Commented] (SPARK-38025) Improve test suite ExternalCatalogSuite
[ https://issues.apache.org/jira/browse/SPARK-38025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17481992#comment-17481992 ] Apache Spark commented on SPARK-38025: -- User 'khalidmammadov' has created a pull request for this issue: https://github.com/apache/spark/pull/35323 > Improve test suite ExternalCatalogSuite > --- > > Key: SPARK-38025 > URL: https://issues.apache.org/jira/browse/SPARK-38025 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.3 >Reporter: Khalid Mammadov >Priority: Minor > > Test suite *ExternalCatalogSuite.scala* can be optimized by removing > repetitive code by replacing them with already available utility function > with some minor changes. This will reduce redundant code, simplify the suite > and improve readability. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38025) Improve test suite ExternalCatalogSuite
[ https://issues.apache.org/jira/browse/SPARK-38025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38025: Assignee: Apache Spark > Improve test suite ExternalCatalogSuite > --- > > Key: SPARK-38025 > URL: https://issues.apache.org/jira/browse/SPARK-38025 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.3 >Reporter: Khalid Mammadov >Assignee: Apache Spark >Priority: Minor > > Test suite *ExternalCatalogSuite.scala* can be optimized by removing > repetitive code by replacing them with already available utility function > with some minor changes. This will reduce redundant code, simplify the suite > and improve readability.

[jira] [Assigned] (SPARK-38025) Improve test suite ExternalCatalogSuite
[ https://issues.apache.org/jira/browse/SPARK-38025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38025: Assignee: (was: Apache Spark) > Improve test suite ExternalCatalogSuite > --- > > Key: SPARK-38025 > URL: https://issues.apache.org/jira/browse/SPARK-38025 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.3 >Reporter: Khalid Mammadov >Priority: Minor > > Test suite *ExternalCatalogSuite.scala* can be optimized by removing > repetitive code by replacing them with already available utility function > with some minor changes. This will reduce redundant code, simplify the suite > and improve readability.
[jira] [Updated] (SPARK-38022) Use relativePath for K8s remote file test in BasicTestsSuite
[ https://issues.apache.org/jira/browse/SPARK-38022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38022: -- Fix Version/s: 3.2.2 (was: 3.2.1) > Use relativePath for K8s remote file test in BasicTestsSuite > > > Key: SPARK-38022 > URL: https://issues.apache.org/jira/browse/SPARK-38022 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0, 3.2.2 > > > *BEFORE* > {code:java} > $ build/sbt -Pkubernetes -Pkubernetes-integration-tests > -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17 > -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test" > ... > [info] KubernetesSuite: > ... > [info] - Run SparkRemoteFileTest using a remote data file *** FAILED *** (3 > minutes, 3 seconds) > [info] The code passed to eventually never returned normally. Attempted 190 > times over 3.01226506667 minutes. Last failure message: false was not > true. (KubernetesSuite.scala:452) > ... {code} > *AFTER* > {code:java} > $ build/sbt -Pkubernetes -Pkubernetes-integration-tests > -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17 > -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test" > ... > [info] KubernetesSuite: > ... > [info] - Run SparkRemoteFileTest using a remote data file (8 seconds, 608 > milliseconds){code}
[jira] [Resolved] (SPARK-38022) Use relativePath for K8s remote file test in BasicTestsSuite
[ https://issues.apache.org/jira/browse/SPARK-38022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38022. --- Fix Version/s: 3.3.0 3.2.1 Resolution: Fixed Issue resolved by pull request 35318 [https://github.com/apache/spark/pull/35318] > Use relativePath for K8s remote file test in BasicTestsSuite > > > Key: SPARK-38022 > URL: https://issues.apache.org/jira/browse/SPARK-38022 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0, 3.2.1 > > > *BEFORE* > {code:java} > $ build/sbt -Pkubernetes -Pkubernetes-integration-tests > -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17 > -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test" > ... > [info] KubernetesSuite: > ... > [info] - Run SparkRemoteFileTest using a remote data file *** FAILED *** (3 > minutes, 3 seconds) > [info] The code passed to eventually never returned normally. Attempted 190 > times over 3.01226506667 minutes. Last failure message: false was not > true. (KubernetesSuite.scala:452) > ... {code} > *AFTER* > {code:java} > $ build/sbt -Pkubernetes -Pkubernetes-integration-tests > -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17 > -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test" > ... > [info] KubernetesSuite: > ... > [info] - Run SparkRemoteFileTest using a remote data file (8 seconds, 608 > milliseconds){code}
[jira] [Created] (SPARK-38025) Improve test suite ExternalCatalogSuite
Khalid Mammadov created SPARK-38025: --- Summary: Improve test suite ExternalCatalogSuite Key: SPARK-38025 URL: https://issues.apache.org/jira/browse/SPARK-38025 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.3 Reporter: Khalid Mammadov Test suite *ExternalCatalogSuite.scala* can be optimized by removing repetitive code by replacing them with already available utility function with some minor changes. This will reduce redundant code, simplify the suite and improve readability.
[jira] [Created] (SPARK-38024) add support for INFORMATION_SCHEMA or other catalog variant
Stephen Wilcoxon created SPARK-38024: Summary: add support for INFORMATION_SCHEMA or other catalog variant Key: SPARK-38024 URL: https://issues.apache.org/jira/browse/SPARK-38024 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Stephen Wilcoxon The ability to query the metadata (from SQL) can be extremely useful. There are ways to get at the metadata via python/scala/whatever but not from within SQL. Given that this is a widely adopted part of SQL92, it seems like a major omission not to be supported in Spark.
[jira] [Commented] (SPARK-16452) basic INFORMATION_SCHEMA support
[ https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17481935#comment-17481935 ] Stephen Wilcoxon commented on SPARK-16452: -- When will this be reexamined? The ability to query the metadata (from SQL) can be extremely useful. There are ways to get at the metadata via python/scala/whatever but not from within Spark SQL. > basic INFORMATION_SCHEMA support > > > Key: SPARK-16452 > URL: https://issues.apache.org/jira/browse/SPARK-16452 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Major > Labels: bulk-closed > Attachments: INFORMATION_SCHEMAsupport.pdf > > > INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a > few tables as defined in SQL92 standard to Spark SQL.
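Until a SQL92-style INFORMATION_SCHEMA lands, Spark exposes metadata through the Catalog API (`spark.catalog.listTables`, `spark.catalog.listColumns`) and `SHOW`/`DESCRIBE` statements. As a hedged illustration of what the missing view would contain, the sketch below maps catalog-style column metadata to `INFORMATION_SCHEMA.COLUMNS` rows; the `CatalogColumn` records here are hypothetical stand-ins for what `spark.catalog.listColumns` would return in a real session.

```python
# Sketch: deriving INFORMATION_SCHEMA.COLUMNS-shaped rows from catalog
# metadata. The input data is hypothetical; in a live session it would come
# from spark.catalog.listColumns(table, db).

from typing import List, NamedTuple

class CatalogColumn(NamedTuple):
    name: str
    dataType: str
    nullable: bool

def to_information_schema_columns(db: str, table: str,
                                  cols: List[CatalogColumn]):
    """Map catalog column metadata to SQL92 INFORMATION_SCHEMA.COLUMNS rows."""
    return [
        {
            "table_schema": db,
            "table_name": table,
            "column_name": c.name,
            "ordinal_position": i + 1,       # SQL92 positions are 1-based
            "data_type": c.dataType,
            "is_nullable": "YES" if c.nullable else "NO",
        }
        for i, c in enumerate(cols)
    ]

rows = to_information_schema_columns(
    "default", "events",
    [CatalogColumn("id", "bigint", False),
     CatalogColumn("ts", "timestamp", True)],
)
```

The point of the sketch is that the information already exists in the catalog; the feature request is about surfacing it as queryable SQL views rather than only via the programmatic API.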
[jira] [Commented] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.
[ https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17481887#comment-17481887 ] Saikrishna Pujari commented on SPARK-38004: --- [~itholic] I suppose we are going to address this as a documented improvement to add a note that the case sensitive columns are considered as different columns and we get ambiguity issues. same case columns will be handled part of mangle_dupe_cols() > read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns > but fails if the duplicate columns are case sensitive. > > > Key: SPARK-38004 > URL: https://issues.apache.org/jira/browse/SPARK-38004 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Saikrishna Pujari >Priority: Minor > > mangle_dupe_cols - default is True > So ideally it should have handled duplicate columns, but in case the columns > are case sensitive it fails as below. > AnalysisException: Reference '{{{}Sheet.col{}}}' is ambiguous, could be > Sheet.col, Sheet.col. > Where two columns are Col and cOL > In the best practices, there is a mention of not to use case sensitive > columns - > [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names] > Either the docs for read_excel/mangle_dupe_cols have to be updated about this > or it has to be handled.
[jira] [Updated] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.
[ https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saikrishna Pujari updated SPARK-38004: -- Description: mangle_dupe_cols - default is True So ideally it should have handled duplicate columns, but in case the columns are case sensitive it fails as below. AnalysisException: Reference '{{{}Sheet.col{}}}' is ambiguous, could be Sheet.col, Sheet.col. Where two columns are Col and cOL In the best practices, there is a mention of not to use case sensitive columns - [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names] Either the docs for read_excel/mangle_dupe_cols have to be updated about this or it has to be handled. was: mangle_dupe_cols - default is True So ideally it should have handled duplicate columns, but in case the columns are case sensitive it fails as below. AnalysisException: Reference '{{{}Sheet.col1{}}}' is ambiguous, could be Sheet.col1, Sheet.col1. Where two columns are Col and cOL In the best practices, there is a mention of not to use case sensitive columns - [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names] Either the docs for read_excel/mangle_dupe_cols have to be updated about this or it has to be handled. > read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns > but fails if the duplicate columns are case sensitive. > > > Key: SPARK-38004 > URL: https://issues.apache.org/jira/browse/SPARK-38004 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Saikrishna Pujari >Priority: Minor > > mangle_dupe_cols - default is True > So ideally it should have handled duplicate columns, but in case the columns > are case sensitive it fails as below. > AnalysisException: Reference '{{{}Sheet.col{}}}' is ambiguous, could be > Sheet.col, Sheet.col. 
> Where two columns are Col and cOL > In the best practices, there is a mention of not to use case sensitive > columns - > [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names] > Either the docs for read_excel/mangle_dupe_cols have to be updated about this > or it has to be handled.
[jira] [Updated] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.
[ https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saikrishna Pujari updated SPARK-38004: -- Issue Type: Documentation (was: Bug) > read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns > but fails if the duplicate columns are case sensitive. > > > Key: SPARK-38004 > URL: https://issues.apache.org/jira/browse/SPARK-38004 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Saikrishna Pujari >Priority: Minor > > mangle_dupe_cols - default is True > So ideally it should have handled duplicate columns, but in case the columns > are case sensitive it fails as below. > AnalysisException: Reference '{{{}Sheet.col1{}}}' is ambiguous, could be > Sheet.col1, Sheet.col1. > Where two columns are Col and cOL > In the best practices, there is a mention of not to use case sensitive > columns - > [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names] > Either the docs for read_excel/mangle_dupe_cols have to be updated about this > or it has to be handled.
[jira] [Updated] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.
[ https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saikrishna Pujari updated SPARK-38004: -- Priority: Minor (was: Major) > read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns > but fails if the duplicate columns are case sensitive. > > > Key: SPARK-38004 > URL: https://issues.apache.org/jira/browse/SPARK-38004 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Saikrishna Pujari >Priority: Minor > > mangle_dupe_cols - default is True > So ideally it should have handled duplicate columns, but in case the columns > are case sensitive it fails as below. > AnalysisException: Reference '{{{}Sheet.col1{}}}' is ambiguous, could be > Sheet.col1, Sheet.col1. > Where two columns are Col and cOL > In the best practices, there is a mention of not to use case sensitive > columns - > [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names] > Either the docs for read_excel/mangle_dupe_cols have to be updated about this > or it has to be handled.
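The mangling that `mangle_dupe_cols` performs only deduplicates exact repeats (`col`, `col.1`, ...), so names like `Col` and `cOL` pass through unchanged and later collide in Spark's case-insensitive analyzer. A hedged sketch of case-insensitive mangling one could apply to the column list before building the frame; the function name is ours for illustration, not a pandas or pandas-on-Spark API:

```python
from collections import defaultdict
from typing import List

def mangle_dupe_cols_ci(columns: List[str]) -> List[str]:
    """Rename columns that collide case-insensitively, pandas-style:
    the first occurrence keeps its name, later ones get .1, .2, ... suffixes."""
    seen = defaultdict(int)  # lowercase name -> occurrences so far
    out = []
    for name in columns:
        key = name.lower()
        if seen[key]:
            out.append(f"{name}.{seen[key]}")
        else:
            out.append(name)
        seen[key] += 1
    return out

# 'Col' and 'cOL' differ only by case, which Spark's default
# case-insensitive resolution treats as the same column.
print(mangle_dupe_cols_ci(["Col", "value", "cOL"]))  # → ['Col', 'value', 'cOL.1']
```

This is a workaround in user code; the ticket itself proposes either handling the collision in `read_excel` or documenting the limitation.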
[jira] [Resolved] (SPARK-37479) Migrate DROP NAMESPACE to use V2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37479. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35202 [https://github.com/apache/spark/pull/35202] > Migrate DROP NAMESPACE to use V2 command by default > --- > > Key: SPARK-37479 > URL: https://issues.apache.org/jira/browse/SPARK-37479 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: dch nguyen >Priority: Major > Fix For: 3.3.0 > >
[jira] [Assigned] (SPARK-37479) Migrate DROP NAMESPACE to use V2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37479: --- Assignee: dch nguyen > Migrate DROP NAMESPACE to use V2 command by default > --- > > Key: SPARK-37479 > URL: https://issues.apache.org/jira/browse/SPARK-37479 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: dch nguyen >Priority: Major >
[jira] [Updated] (SPARK-37999) Spark executor self-exiting due to driver disassociated in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-37999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petri updated SPARK-37999: -- Description: I have Spark driver running in a Kubernetes pod with client deploy-mode.I have created a headless K8S service with name 'lola' at port 7077 which targets the driver pod. Driver pod will launch successfully and tries to start an executor, but eventually the executor will fail with error: {code:java} Executor self-exiting due to : Driver lola.mni-system:7077 disassociated! Shutting down.{code} Then driver stays up and running and will attempt to start another executor which fails with same error and this goes on and on, driver spawning new failing executors. In the driver pod, I see only following errors (when using 'grep ERROR'): {code:java} 22/01/24 13:41:12 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.82.105: 22/01/24 13:41:56 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.82.106: 22/01/24 13:42:12 ERROR TaskSchedulerImpl: Lost executor 7 on 192.168.47.80: The executor with ID 7 (registered at 1643031697505 ms) was not found in the cluster at the polling time (1643031731509 ms) which is after the accepted detect delta time (3 ms) configured by `spark.kubernetes.executor.missingPodDetectDelta`. The executor may have been deleted but the driver missed the deletion event. Marking this executor as failed. 
22/01/24 13:42:38 ERROR TaskSchedulerImpl: Lost executor 3 on 192.168.82.103: 22/01/24 13:45:30 ERROR TaskSchedulerImpl: Lost executor 4 on 192.168.50.220:{code} Full log from the executor: {code:java} + source /opt/spark/bin/common.sh + cp /etc/group /tmp/group + cp /etc/passwd /tmp/passwd ++ id -u + myuid=1501 ++ id -g + mygid=0 + myuname=cspk + fsgid= + fsgrpname=cspk + set +e ++ getent passwd 1501 + uidentry= ++ cat /etc/machine-id cat: /etc/machine-id: No such file or directory + export SYSTEMID= + SYSTEMID= + set -e + '[' -z '' ']' + '[' -w /tmp/group ']' + echo cspk:x:: + cp /etc/passwd /tmp/passwd.template + '[' -z '' ']' + '[' -w /tmp/passwd.template ']' + echo 'cspk:x:1501:0:anonymous uid:/opt/spark:/bin/false' + envsubst + export LD_PRELOAD=/usr/lib64/libnss_wrapper.so + LD_PRELOAD=/usr/lib64/libnss_wrapper.so + export NSS_WRAPPER_PASSWD=/tmp/passwd + NSS_WRAPPER_PASSWD=/tmp/passwd + export NSS_WRAPPER_GROUP=/tmp/group + NSS_WRAPPER_GROUP=/tmp/group + SPARK_K8S_CMD=executor + case "$SPARK_K8S_CMD" in + shift 1 + SPARK_CLASSPATH='/var/local/streaming_engine/*:/opt/spark/jars/*' + env + grep SPARK_JAVA_OPT_ + sort -t_ -k4 -n + sed 's/[^=]*=\(.*\)/\1/g' + readarray -t SPARK_EXECUTOR_JAVA_OPTS + env + sort -t_ -k4 -n + grep SPARK_AUTH_OPT_ + sed 's/[^=]*=\(.*\)/\1/g' + readarray -t SPARK_AUTH_OPTS + env + grep SPARK_NET_CRYPTO_OPT_ + sort -t_ -k4 -n + sed 's/[^=]*=\(.*\)/\1/g' + readarray -t SPARK_NET_CRYPTO_OPTS + '[' -n '' ']' + '[' -z ']' + set +x TLS Not enabled for WebServer + CMD=(${JAVA_HOME}/bin/java $EXTRAJAVAOPTS "${SPARK_EXECUTOR_JAVA_OPTS[@]}" "${SPARK_AUTH_OPTS[@]}" "${SPARK_NET_CRYPTO_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP) + exec /usr/bin/tini -s -- 
/etc/alternatives/jre_openjdk//bin/java -Dcom.nokia.rtna.jmx1= -Dcom.nokia.rtna.jmx2=10100 -Dlog4j.configurationFile=http://192.168.80.89:/log4j2.xml -Dlog4j.configuration=http://192.168.80.89:/log4j2.xml -Dcom.nokia.rtna.app=LolaStreamingApp -Dspark.driver.port=7077 -Xms4096m -Xmx4096m -cp '/var/local/streaming_engine/*:/opt/spark/jars/*' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://coarsegrainedschedu...@lola.mni-system:7077 --executor-id 10 --cores 3 --app-id spark-application-1643031611044 --hostname 192.168.82.121 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/var/local/streaming_engine/log4j-slf4j-impl-2.13.3.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] WARNING: An illegal reflective access operation has occurred WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/var/local/streaming_engine/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int) WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release {"type":"log",
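For client-mode drivers on Kubernetes, "Driver ... disassociated" usually means the executors cannot keep a stable connection to the address the driver advertised. A hedged sketch of the settings typically checked in this setup, mirroring the report's headless service (`lola` in namespace `mni-system`, port 7077); the helper function is ours for illustration and the values would be passed to `SparkSession.builder.config(...)`:

```python
# Hedged sketch: client-mode driver networking conf for Kubernetes.
# The service/namespace names come from the report; adjust for your cluster.

def k8s_client_mode_conf(service: str = "lola",
                         namespace: str = "mni-system",
                         port: int = 7077) -> dict:
    fqdn = f"{service}.{namespace}"
    return {
        # Executors dial back to spark.driver.host:spark.driver.port, so the
        # headless service DNS name must resolve from the executor pods.
        "spark.driver.host": fqdn,
        "spark.driver.port": str(port),
        # Bind to all interfaces inside the driver pod while advertising
        # the service name to executors.
        "spark.driver.bindAddress": "0.0.0.0",
        # The block-manager port must also be reachable (exposed on the
        # same headless service) or executors drop the connection later.
        "spark.driver.blockManager.port": "7078",
    }

conf = k8s_client_mode_conf()
```

If these resolve and connect correctly, the next suspects from the driver log are executor pod deletions (note the `spark.kubernetes.executor.missingPodDetectDelta` message above), e.g. pods evicted or OOM-killed before they register.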
[jira] [Assigned] (SPARK-38023) ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as finished
[ https://issues.apache.org/jira/browse/SPARK-38023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38023: Assignee: (was: Apache Spark) > ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as > finished > > > Key: SPARK-38023 > URL: https://issues.apache.org/jira/browse/SPARK-38023 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.1.3, 3.2.1, 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major >