[jira] [Commented] (SPARK-36476) cloudpickle: ValueError: Cell is empty

2022-01-25 Thread Pedro Larroy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482308#comment-17482308
 ] 

Pedro Larroy commented on SPARK-36476:
--

This seems to happen due to an interaction with the "dill" package, and only on 
Python 3.7.

 

This was explained here, and I verified the reproduction in my codebase: 
https://stackoverflow.com/questions/69360462/conflict-between-dill-and-pickle-while-using-pyspark
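
For context, a minimal standalone sketch (not the pyspark/cloudpickle code path, 
just an illustration) of why reading {{cell_contents}} from an empty closure cell 
raises this error; dill's {{save_cell}} performs exactly that read in the 
traceback below.

{code:python}
# Illustration only: an empty closure cell raises ValueError on access, which
# is the same read dill's save_cell does (f = obj.cell_contents).
def outer():
    def inner():
        return x              # closes over x, so a cell for x exists at def time

    cell = inner.__closure__[0]
    try:
        cell.cell_contents    # x has not been assigned yet -> the cell is empty
    except ValueError as err:
        print(err)            # prints: Cell is empty
    x = 1                     # assigning x fills the cell
    print(cell.cell_contents) # prints: 1


outer()
{code}

For what it's worth, Python 3.8 added {{types.CellType}} and native support for 
constructing empty cells, which may be why the error only shows up on 3.7.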

> cloudpickle: ValueError: Cell is empty
> --
>
> Key: SPARK-36476
> URL: https://issues.apache.org/jira/browse/SPARK-36476
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Oliver Mannion
>Priority: Major
>
> {code:java}
>   File 
> "/Users/tekumara/code/awesome-spark-app/.venv/lib/python3.7/site-packages/pyspark/serializers.py",
>  line 437, in dumps
> return cloudpickle.dumps(obj, pickle_protocol)
>   File 
> "/Users/tekumara/code/awesome-spark-app/.venv/lib/python3.7/site-packages/pyspark/cloudpickle/cloudpickle_fast.py",
>  line 101, in dumps
> cp.dump(obj)
>   File 
> "/Users/tekumara/code/awesome-spark-app/.venv/lib/python3.7/site-packages/pyspark/cloudpickle/cloudpickle_fast.py",
>  line 540, in dump
> return Pickler.dump(self, obj)
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 
> 437, in dump
> self.save(obj)
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 
> 504, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 
> 789, in save_tuple
> save(element)
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 
> 504, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "/Users/tekumara/code/awesome-spark-app/.venv/lib/python3.7/site-packages/pyspark/cloudpickle/cloudpickle_fast.py",
>  line 722, in save_function
> *self._dynamic_function_reduce(obj), obj=obj
>   File 
> "/Users/tekumara/code/awesome-spark-app/.venv/lib/python3.7/site-packages/pyspark/cloudpickle/cloudpickle_fast.py",
>  line 659, in _save_reduce_pickle5
> dictitems=dictitems, obj=obj
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 
> 638, in save_reduce
> save(args)
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 
> 504, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 
> 789, in save_tuple
> save(element)
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 
> 504, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 
> 774, in save_tuple
> save(element)
>   File "/Users/tekumara/.pyenv/versions/3.7.9/lib/python3.7/pickle.py", line 
> 504, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "/Users/tekumara/code/awesome-spark-app/.venv/lib/python3.7/site-packages/dill/_dill.py",
>  line 1226, in save_cell
> f = obj.cell_contents
> ValueError: Cell is empty
> {code}
> This doesn't occur in Spark 3.0.0, so it was possibly introduced when 
> cloudpickle was upgraded to 1.5.0 (see 
> https://issues.apache.org/jira/browse/SPARK-32094).
> It also doesn't occur in Spark 3.1.2 with Python 3.8.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33326) Partition Parameters are not updated even after ANALYZE TABLE command

2022-01-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482294#comment-17482294
 ] 

Apache Spark commented on SPARK-33326:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/35329

> Partition Parameters are not updated even after ANALYZE TABLE command
> -
>
> Key: SPARK-33326
> URL: https://issues.apache.org/jira/browse/SPARK-33326
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Daniel Bondor
>Priority: Major
>
> Here are the reproduction steps:
> {code:java}
> scala> spark.sql("CREATE TABLE t (a string,b string) PARTITIONED BY (p 
> string) STORED AS PARQUET")
> Hive Session ID = d44e21ee-2d5c-48ab-91bf-26cb25775486
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('aaa', 'bbb')")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('ccc', 'ddd')")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("ANALYZE TABLE t PARTITION(p='p1') COMPUTE STATISTICS")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESCRIBE FORMATTED t PARTITION (p='p1')").show(50, false)
> ...
> |Partition Parameters |{rawDataSize=0, numFiles=1, numFilesErasureCoded=0, 
> transient_lastDdlTime=1604404640, totalSize=532, 
> COLUMN_STATS_ACCURATE={"BASIC_STATS":"true","COLUMN_STATS":{"a":"true","b":"true"}},
>  numRows=0}| |
> ...
> |Partition Statistics |1064 bytes, 2 rows | |
> ...
> {code}
> My expectation would be that the Partition Parameters should be updated after 
> ANALYZE TABLE.
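
As a quick side-by-side check, a small PySpark sketch (assuming an active 
{{spark}} session and the table and partition created in the steps above):

{code:python}
# Pull only the two DESCRIBE FORMATTED rows of interest so the stale
# "Partition Parameters" can be compared directly with "Partition Statistics".
spark.sql("DESCRIBE FORMATTED t PARTITION (p='p1')") \
    .where("col_name IN ('Partition Parameters', 'Partition Statistics')") \
    .show(truncate=False)
# After ANALYZE TABLE, "Partition Statistics" reports 2 rows while
# "Partition Parameters" still shows numRows=0, which is the mismatch above.
{code}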



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33326) Partition Parameters are not updated even after ANALYZE TABLE command

2022-01-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33326:


Assignee: Apache Spark

> Partition Parameters are not updated even after ANALYZE TABLE command
> -
>
> Key: SPARK-33326
> URL: https://issues.apache.org/jira/browse/SPARK-33326
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Daniel Bondor
>Assignee: Apache Spark
>Priority: Major
>
> Here are the reproduction steps:
> {code:java}
> scala> spark.sql("CREATE TABLE t (a string,b string) PARTITIONED BY (p 
> string) STORED AS PARQUET")
> Hive Session ID = d44e21ee-2d5c-48ab-91bf-26cb25775486
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('aaa', 'bbb')")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('ccc', 'ddd')")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("ANALYZE TABLE t PARTITION(p='p1') COMPUTE STATISTICS")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESCRIBE FORMATTED t PARTITION (p='p1')").show(50, false)
> ...
> |Partition Parameters |{rawDataSize=0, numFiles=1, numFilesErasureCoded=0, 
> transient_lastDdlTime=1604404640, totalSize=532, 
> COLUMN_STATS_ACCURATE={"BASIC_STATS":"true","COLUMN_STATS":{"a":"true","b":"true"}},
>  numRows=0}| |
> ...
> |Partition Statistics |1064 bytes, 2 rows | |
> ...
> {code}
> My expectation would be that the Partition Parameters should be updated after 
> ANALYZE TABLE.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33326) Partition Parameters are not updated even after ANALYZE TABLE command

2022-01-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33326:


Assignee: (was: Apache Spark)

> Partition Parameters are not updated even after ANALYZE TABLE command
> -
>
> Key: SPARK-33326
> URL: https://issues.apache.org/jira/browse/SPARK-33326
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Daniel Bondor
>Priority: Major
>
> Here are the reproduction steps:
> {code:java}
> scala> spark.sql("CREATE TABLE t (a string,b string) PARTITIONED BY (p 
> string) STORED AS PARQUET")
> Hive Session ID = d44e21ee-2d5c-48ab-91bf-26cb25775486
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('aaa', 'bbb')")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('ccc', 'ddd')")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("ANALYZE TABLE t PARTITION(p='p1') COMPUTE STATISTICS")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESCRIBE FORMATTED t PARTITION (p='p1')").show(50, false)
> ...
> |Partition Parameters |{rawDataSize=0, numFiles=1, numFilesErasureCoded=0, 
> transient_lastDdlTime=1604404640, totalSize=532, 
> COLUMN_STATS_ACCURATE={"BASIC_STATS":"true","COLUMN_STATS":{"a":"true","b":"true"}},
>  numRows=0}| |
> ...
> |Partition Statistics |1064 bytes, 2 rows | |
> ...
> {code}
> My expectation would be that the Partition Parameters should be updated after 
> ANALYZE TABLE.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38032) Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL

2022-01-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38032.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35331
[https://github.com/apache/spark/pull/35331]

> Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL
> -
>
> Key: SPARK-38032
> URL: https://issues.apache.org/jira/browse/SPARK-38032
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, SQL
>Affects Versions: 3.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.3.0
>
>
> We had better test the latest PyArrow version. 6.0.1 is now released, but 
> we're still using < 5.0.0 in 
> https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala
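
A tiny local check against the proposed upper bound (a sketch; the {{packaging}} 
dependency is an assumption for illustration, not part of the Spark change):

{code:python}
# Verify the locally installed pyarrow falls inside the version window the
# SQL-side Python UDF tests are meant to cover ("< 7.0.0" per this issue).
import pyarrow
from packaging.version import Version  # assumption: packaging is installed

installed = Version(pyarrow.__version__)
assert installed < Version("7.0.0"), f"pyarrow {installed} is above the tested bound"
print(f"pyarrow {installed} is within the tested range")
{code}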



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38032) Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL

2022-01-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38032:


Assignee: Hyukjin Kwon

> Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL
> -
>
> Key: SPARK-38032
> URL: https://issues.apache.org/jira/browse/SPARK-38032
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, SQL
>Affects Versions: 3.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> We had better test the latest PyArrow version. 6.0.1 is now released, but 
> we're still using < 5.0.0 in 
> https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38031) Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, Python 3.9)

2022-01-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38031.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35330
[https://github.com/apache/spark/pull/35330]

> Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, 
> Python 3.9)
> -
>
> Key: SPARK-38031
> URL: https://issues.apache.org/jira/browse/SPARK-38031
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.3.0
>
>
> Update the chart generated by SPARK-32722.
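
For reference, the kind of conversion the chart documents, as a minimal Pandas 
UDF sketch (assumes an active {{spark}} session with pandas and PyArrow 
installed):

{code:python}
# A pandas_udf whose pandas Series input/output is mapped to a Spark SQL type
# ("long" here); the chart enumerates these mappings per pandas/pyarrow version.
import pandas as pd
from pyspark.sql.functions import pandas_udf


@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1


spark.range(3).select(plus_one("id")).show()
{code}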



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38031) Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, Python 3.9)

2022-01-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38031:


Assignee: Hyukjin Kwon

> Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, 
> Python 3.9)
> -
>
> Key: SPARK-38031
> URL: https://issues.apache.org/jira/browse/SPARK-38031
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> Update the chart generated by SPARK-32722.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37946) Use error classes in the execution errors related to partitions

2022-01-25 Thread Yuto Akutsu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482281#comment-17482281
 ] 

Yuto Akutsu commented on SPARK-37946:
-

[~maxgekk] I will work on this.

> Use error classes in the execution errors related to partitions
> ---
>
> Key: SPARK-37946
> URL: https://issues.apache.org/jira/browse/SPARK-37946
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * unableToDeletePartitionPathError
> * unableToCreatePartitionPathError
> * unableToRenamePartitionPathError
> * notADatasourceRDDPartitionError
> * cannotClearPartitionDirectoryError
> * failedToCastValueToDataTypeForPartitionColumnError
> * unsupportedPartitionTransformError
> * cannotCreateJDBCTableWithPartitionsError
> * requestedPartitionsMismatchTablePartitionsError
> * dynamicPartitionKeyNotAmongWrittenPartitionPathsError
> * cannotRemovePartitionDirError
> * alterTableWithDropPartitionAndPurgeUnsupportedError
> * invalidPartitionFilterError
> * getPartitionMetadataByFilterError
> * illegalLocationClauseForViewPartitionError
> * partitionColumnNotFoundInSchemaError
> * cannotAddMultiPartitionsOnNonatomicPartitionTableError
> * cannotDropMultiPartitionsOnNonatomicPartitionTableError
> * truncateMultiPartitionUnsupportedError
> * dynamicPartitionOverwriteUnsupportedByTableError
> * writePartitionExceedConfigSizeWhenDynamicPartitionError
> to use error classes. Throw an implementation of SparkThrowable, and write 
> a test for every error in QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37937) Use error classes in the parsing errors of lateral join

2022-01-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482278#comment-17482278
 ] 

Apache Spark commented on SPARK-37937:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/35328

> Use error classes in the parsing errors of lateral join
> ---
>
> Key: SPARK-37937
> URL: https://issues.apache.org/jira/browse/SPARK-37937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryParsingErrors:
> * lateralJoinWithNaturalJoinUnsupportedError
> * lateralJoinWithUsingJoinUnsupportedError
> * unsupportedLateralJoinTypeError
> * invalidLateralJoinRelationError
> to use error classes. Throw an implementation of SparkThrowable, and write 
> a test for every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37937) Use error classes in the parsing errors of lateral join

2022-01-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37937:


Assignee: Apache Spark

> Use error classes in the parsing errors of lateral join
> ---
>
> Key: SPARK-37937
> URL: https://issues.apache.org/jira/browse/SPARK-37937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Migrate the following errors in QueryParsingErrors:
> * lateralJoinWithNaturalJoinUnsupportedError
> * lateralJoinWithUsingJoinUnsupportedError
> * unsupportedLateralJoinTypeError
> * invalidLateralJoinRelationError
> to use error classes. Throw an implementation of SparkThrowable, and write 
> a test for every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37937) Use error classes in the parsing errors of lateral join

2022-01-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37937:


Assignee: (was: Apache Spark)

> Use error classes in the parsing errors of lateral join
> ---
>
> Key: SPARK-37937
> URL: https://issues.apache.org/jira/browse/SPARK-37937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryParsingErrors:
> * lateralJoinWithNaturalJoinUnsupportedError
> * lateralJoinWithUsingJoinUnsupportedError
> * unsupportedLateralJoinTypeError
> * invalidLateralJoinRelationError
> to use error classes. Throw an implementation of SparkThrowable, and write 
> a test for every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37937) Use error classes in the parsing errors of lateral join

2022-01-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482277#comment-17482277
 ] 

Apache Spark commented on SPARK-37937:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/35328

> Use error classes in the parsing errors of lateral join
> ---
>
> Key: SPARK-37937
> URL: https://issues.apache.org/jira/browse/SPARK-37937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryParsingErrors:
> * lateralJoinWithNaturalJoinUnsupportedError
> * lateralJoinWithUsingJoinUnsupportedError
> * unsupportedLateralJoinTypeError
> * invalidLateralJoinRelationError
> to use error classes. Throw an implementation of SparkThrowable, and write 
> a test for every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38030) Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1

2022-01-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38030:


Assignee: (was: Apache Spark)

> Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
> -
>
> Key: SPARK-38030
> URL: https://issues.apache.org/jira/browse/SPARK-38030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Shardul Mahadik
>Priority: Major
>
> One of our user queries failed in Spark 3.1.1 when using AQE, with the 
> stacktrace shown below (some parts of the plan have been redacted, but the 
> structure is preserved).
> Debugging this issue, we found that the failure was within AQE calling 
> [QueryPlan.canonicalized|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L402].
> The query contains a cast over a column with non-nullable struct fields. 
> Canonicalization [removes nullability 
> information|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L45]
>  from the child {{AttributeReference}} of the Cast, however it does not 
> remove nullability information from the Cast's target dataType. This causes 
> the 
> [checkInputDataTypes|https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L290]
>  to return false because the child is now nullable and cast target data type 
> is not, leading to {{resolved=false}} and hence the {{UnresolvedException}}.
> {code:java}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree:
> Exchange RoundRobinPartitioning(1), REPARTITION_BY_NUM, [id=#232]
> +- Union
>:- Project [cast(columnA#30) as struct<...>]
>:  +- BatchScan[columnA#30] hive.tbl 
>+- Project [cast(columnA#35) as struct<...>]
>   +- BatchScan[columnA#35] hive.tbl
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:475)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:464)
>   at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:87)
>   at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:58)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:405)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:373)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:372)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:404)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:184)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:179)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:279)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3696)
>   at 

[jira] [Commented] (SPARK-38030) Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1

2022-01-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482271#comment-17482271
 ] 

Apache Spark commented on SPARK-38030:
--

User 'shardulm94' has created a pull request for this issue:
https://github.com/apache/spark/pull/35332

> Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
> -
>
> Key: SPARK-38030
> URL: https://issues.apache.org/jira/browse/SPARK-38030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Shardul Mahadik
>Priority: Major
>
> One of our user queries failed in Spark 3.1.1 when using AQE, with the 
> stacktrace shown below (some parts of the plan have been redacted, but the 
> structure is preserved).
> Debugging this issue, we found that the failure was within AQE calling 
> [QueryPlan.canonicalized|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L402].
> The query contains a cast over a column with non-nullable struct fields. 
> Canonicalization [removes nullability 
> information|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L45]
>  from the child {{AttributeReference}} of the Cast, however it does not 
> remove nullability information from the Cast's target dataType. This causes 
> the 
> [checkInputDataTypes|https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L290]
>  to return false because the child is now nullable and cast target data type 
> is not, leading to {{resolved=false}} and hence the {{UnresolvedException}}.
> {code:java}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree:
> Exchange RoundRobinPartitioning(1), REPARTITION_BY_NUM, [id=#232]
> +- Union
>:- Project [cast(columnA#30) as struct<...>]
>:  +- BatchScan[columnA#30] hive.tbl 
>+- Project [cast(columnA#35) as struct<...>]
>   +- BatchScan[columnA#35] hive.tbl
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:475)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:464)
>   at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:87)
>   at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:58)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:405)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:373)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:372)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:404)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:184)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:179)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:279)
>   at 

[jira] [Assigned] (SPARK-38030) Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1

2022-01-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38030:


Assignee: Apache Spark

> Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
> -
>
> Key: SPARK-38030
> URL: https://issues.apache.org/jira/browse/SPARK-38030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Shardul Mahadik
>Assignee: Apache Spark
>Priority: Major
>
> One of our user queries failed in Spark 3.1.1 when using AQE, with the 
> stacktrace shown below (some parts of the plan have been redacted, but the 
> structure is preserved).
> Debugging this issue, we found that the failure was within AQE calling 
> [QueryPlan.canonicalized|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L402].
> The query contains a cast over a column with non-nullable struct fields. 
> Canonicalization [removes nullability 
> information|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L45]
>  from the child {{AttributeReference}} of the Cast, however it does not 
> remove nullability information from the Cast's target dataType. This causes 
> the 
> [checkInputDataTypes|https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L290]
>  to return false because the child is now nullable and cast target data type 
> is not, leading to {{resolved=false}} and hence the {{UnresolvedException}}.
> {code:java}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree:
> Exchange RoundRobinPartitioning(1), REPARTITION_BY_NUM, [id=#232]
> +- Union
>:- Project [cast(columnA#30) as struct<...>]
>:  +- BatchScan[columnA#30] hive.tbl 
>+- Project [cast(columnA#35) as struct<...>]
>   +- BatchScan[columnA#35] hive.tbl
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:475)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:464)
>   at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:87)
>   at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:58)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:405)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:373)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:372)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:404)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:184)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:179)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:279)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3696)
>   at 

[jira] [Assigned] (SPARK-38032) Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL

2022-01-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38032:


Assignee: (was: Apache Spark)

> Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL
> -
>
> Key: SPARK-38032
> URL: https://issues.apache.org/jira/browse/SPARK-38032
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, SQL
>Affects Versions: 3.3
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> We had better test the latest PyArrow version. 6.0.1 is now released, but 
> we're still using < 5.0.0 in 
> https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38032) Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL

2022-01-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482242#comment-17482242
 ] 

Apache Spark commented on SPARK-38032:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35331

> Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL
> -
>
> Key: SPARK-38032
> URL: https://issues.apache.org/jira/browse/SPARK-38032
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, SQL
>Affects Versions: 3.3
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> We had better test the latest PyArrow version. 6.0.1 is now released, but 
> we're still using < 5.0.0 in 
> https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38032) Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL

2022-01-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38032:


Assignee: Apache Spark

> Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL
> -
>
> Key: SPARK-38032
> URL: https://issues.apache.org/jira/browse/SPARK-38032
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, SQL
>Affects Versions: 3.3
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> We had better test the latest PyArrow version. 6.0.1 is now released, but 
> we're still using < 5.0.0 in 
> https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38032) Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL

2022-01-25 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-38032:


 Summary: Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL
 Key: SPARK-38032
 URL: https://issues.apache.org/jira/browse/SPARK-38032
 Project: Spark
  Issue Type: Test
  Components: PySpark, SQL
Affects Versions: 3.3
Reporter: Hyukjin Kwon


We had better test the latest PyArrow version. 6.0.1 is now released, but we're 
still using < 5.0.0 in 
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38031) Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, Python 3.9)

2022-01-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482239#comment-17482239
 ] 

Apache Spark commented on SPARK-38031:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35330

> Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, 
> Python 3.9)
> -
>
> Key: SPARK-38031
> URL: https://issues.apache.org/jira/browse/SPARK-38031
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Update the chart generated by SPARK-32722.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38031) Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, Python 3.9)

2022-01-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38031:


Assignee: (was: Apache Spark)

> Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, 
> Python 3.9)
> -
>
> Key: SPARK-38031
> URL: https://issues.apache.org/jira/browse/SPARK-38031
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Update the chart generated by SPARK-32722.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38031) Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, Python 3.9)

2022-01-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38031:


Assignee: Apache Spark

> Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, 
> Python 3.9)
> -
>
> Key: SPARK-38031
> URL: https://issues.apache.org/jira/browse/SPARK-38031
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> Update the chart generated by SPARK-32722.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38031) Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, Python 3.9)

2022-01-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482237#comment-17482237
 ] 

Apache Spark commented on SPARK-38031:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35330

> Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, 
> Python 3.9)
> -
>
> Key: SPARK-38031
> URL: https://issues.apache.org/jira/browse/SPARK-38031
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Update the chart generated by SPARK-32722.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38003) Differentiate scalar and table function lookup in LookupFunctions

2022-01-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-38003:
---

Assignee: Allison Wang

> Differentiate scalar and table function lookup in LookupFunctions
> -
>
> Key: SPARK-38003
> URL: https://issues.apache.org/jira/browse/SPARK-38003
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>
> Currently, the LookupFunctions rule looks up unresolved scalar functions 
> using both the scalar function registry and the table function registry. We 
> should differentiate scalar and table function lookup in the Analyzer rule 
> LookupFunctions.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38003) Differentiate scalar and table function lookup in LookupFunctions

2022-01-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38003.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35304
[https://github.com/apache/spark/pull/35304]

> Differentiate scalar and table function lookup in LookupFunctions
> -
>
> Key: SPARK-38003
> URL: https://issues.apache.org/jira/browse/SPARK-38003
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, the LookupFunctions rule looks up unresolved scalar functions 
> using both the scalar function registry and the table function registry. We 
> should differentiate scalar and table function lookup in the Analyzer rule 
> LookupFunctions.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38031) Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, Python 3.9)

2022-01-25 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-38031:


 Summary: Update document type conversion for Pandas UDFs (pyarrow 
6.0.1, pandas 1.4.0, Python 3.9)
 Key: SPARK-38031
 URL: https://issues.apache.org/jira/browse/SPARK-38031
 Project: Spark
  Issue Type: Test
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


Update the chart generated by SPARK-32722.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38030) Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1

2022-01-25 Thread Shardul Mahadik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482233#comment-17482233
 ] 

Shardul Mahadik commented on SPARK-38030:
-

I plan to create a PR to change the canonicalization behavior of {{Cast}} so 
that nullability information is also removed from the target data type of 
{{Cast}} during canonicalization. However, the canonicalization implementation 
has changed drastically between Spark 3.1.1 and master, so I will probably 
create two PRs: one for master and one for branch-3.1.
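
For illustration only, the mismatch can be sketched with plain PySpark types 
(the real check is {{Cast.checkInputDataTypes}} in Catalyst; this just shows why 
the two types stop lining up once nullability is stripped on one side only):

{code:python}
# Sketch: canonicalization strips nullability from the child attribute's type
# but not from the Cast's target type, so the two no longer match.
from pyspark.sql.types import StructField, StructType, StringType

# Target type as written in the plan: a struct with a non-nullable field.
target = StructType([StructField("f", StringType(), nullable=False)])
# Child attribute after canonicalization: nullability stripped, everything nullable.
canonicalized_child = StructType([StructField("f", StringType(), nullable=True)])

# Casting a nullable field to a non-nullable one is not resolvable, which is
# why the canonicalized Cast ends up with resolved=false.
print(target == canonicalized_child)  # False
{code}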

> Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
> -
>
> Key: SPARK-38030
> URL: https://issues.apache.org/jira/browse/SPARK-38030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Shardul Mahadik
>Priority: Major
>
> One of our user queries failed in Spark 3.1.1 when using AQE, with the 
> stacktrace shown below (some parts of the plan have been redacted, but the 
> structure is preserved).
> Debugging this issue, we found that the failure was within AQE calling 
> [QueryPlan.canonicalized|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L402].
> The query contains a cast over a column with non-nullable struct fields. 
> Canonicalization [removes nullability 
> information|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L45]
>  from the child {{AttributeReference}} of the Cast, however it does not 
> remove nullability information from the Cast's target dataType. This causes 
> the 
> [checkInputDataTypes|https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L290]
>  to return false because the child is now nullable and cast target data type 
> is not, leading to {{resolved=false}} and hence the {{UnresolvedException}}.
> {code:java}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree:
> Exchange RoundRobinPartitioning(1), REPARTITION_BY_NUM, [id=#232]
> +- Union
>:- Project [cast(columnA#30) as struct<...>]
>:  +- BatchScan[columnA#30] hive.tbl 
>+- Project [cast(columnA#35) as struct<...>]
>   +- BatchScan[columnA#35] hive.tbl
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:475)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:464)
>   at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:87)
>   at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:58)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:405)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:373)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:372)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:404)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:184)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
>   at 

[jira] [Resolved] (SPARK-37948) Disable mapreduce.fileoutputcommitter.algorithm.version=2 by default

2022-01-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37948.
--
Resolution: Won't Fix

> Disable mapreduce.fileoutputcommitter.algorithm.version=2 by default
> 
>
> Key: SPARK-37948
> URL: https://issues.apache.org/jira/browse/SPARK-37948
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: hujiahua
>Priority: Major
>
> The Hadoop MR v2 commit algorithm has a correctness issue described in 
> SPARK-33019, which changed the default to 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1. 
> But some Spark users like me were unaware of this correctness issue and had 
> used the v2 commit algorithm in Spark 2.x for performance reasons. After 
> upgrading to Spark 3.x, we hit this correctness issue in a production 
> environment, which caused a very serious failure. The trigger probability of 
> this issue was higher in Spark 3.x, and I didn't delve into the specific 
> reasons. So I propose we disable 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 by default: if 
> users are using the v2 commit algorithm, fail the job and warn them about 
> this correctness issue. Or users can choose to force v2 usage through a new 
> configuration.
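
For anyone who wants to pin the committer explicitly in the meantime, a small 
sketch (the config key is the one referenced in this issue; the app name is 
illustrative):

{code:python}
# Set the file output committer algorithm explicitly instead of relying on the
# default. v1 is the default restored by SPARK-33019; v2 is faster but has the
# correctness issue described there.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("committer-v1")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
    .getOrCreate()
)
print(spark.sparkContext.getConf().get(
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"))
{code}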



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38030) Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1

2022-01-25 Thread Shardul Mahadik (Jira)
Shardul Mahadik created SPARK-38030:
---

 Summary: Query with cast containing non-nullable columns fails 
with AQE on Spark 3.1.1
 Key: SPARK-38030
 URL: https://issues.apache.org/jira/browse/SPARK-38030
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
Reporter: Shardul Mahadik


One of our user queries failed in Spark 3.1.1 when using AQE, with the 
stacktrace shown below (some parts of the plan have been redacted, but the 
structure is preserved).

Debugging this issue, we found that the failure was within AQE calling 
[QueryPlan.canonicalized|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L402].

The query contains a cast over a column with non-nullable struct fields. 
Canonicalization [removes nullability 
information|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L45]
 from the child {{AttributeReference}} of the Cast, however it does not remove 
nullability information from the Cast's target dataType. This causes the 
[checkInputDataTypes|https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L290]
 to return false because the child is now nullable and cast target data type is 
not, leading to {{resolved=false}} and hence the {{UnresolvedException}}.
{code:java}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
Exchange RoundRobinPartitioning(1), REPARTITION_BY_NUM, [id=#232]
+- Union
   :- Project [cast(columnA#30) as struct<...>]
   :  +- BatchScan[columnA#30] hive.tbl 
   +- Project [cast(columnA#35) as struct<...>]
  +- BatchScan[columnA#35] hive.tbl

  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
  at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:475)
  at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:464)
  at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:87)
  at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:58)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:301)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:405)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:373)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:372)
  at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:404)
  at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
  at scala.collection.immutable.List.map(List.scala:298)
  at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447)
  at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
  at scala.collection.immutable.List.map(List.scala:298)
  at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447)
  at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:184)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
  at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:179)
  at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:279)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3696)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2722)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at 

[jira] [Assigned] (SPARK-33328) Fix Flaky HiveThriftHttpServerSuite

2022-01-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33328:


Assignee: Apache Spark

> Fix Flaky HiveThriftHttpServerSuite
> ---
>
> Key: SPARK-33328
> URL: https://issues.apache.org/jira/browse/SPARK-33328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
> Attachments: failure_rate.png
>
>
> After `HiveThriftServer2 started successfully` is logged, the test fails due 
> to a 500 error.
> The failure rate is over 50%. (The figure below shows the test case `JDBC 
> query execution` in that suite as an example.)
>  !failure_rate.png|width=508,height=321!
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/1516/testReport/]
> {code:java}
> 09:58:03.853 pool-1-thread-1 INFO HiveThriftHttpServerSuite: Trying to start 
> HiveThriftServer2: port=14541, mode=http, attempt=0
> 09:58:06.492 pool-1-thread-1 INFO HiveThriftHttpServerSuite: COMMAND: 
> WrappedArray(../../sbin/start-thriftserver.sh, --master, local, --hiveconf, 
> javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-6f4abc35-f09c-46e6-b6eb-8a310d557e28;create=true,
>  --hiveconf, 
> hive.metastore.warehouse.dir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-2329343a-3ad4-4bfd-943f-6b46984848b8,
>  --hiveconf, hive.server2.thrift.bind.host=localhost, --hiveconf, 
> hive.server2.transport.mode=http, --hiveconf, 
> hive.server2.logging.operation.log.location=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-77f3b359-1553-40e3-9d75-35c46d2d4d46,
>  --hiveconf, 
> hive.exec.local.scratchdir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-8923e61f-36da-4930-b035-6eb3712d41ab,
>  --hiveconf, hive.server2.thrift.http.port=14541, --driver-class-path, 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-d54a7073-2f02-4331-a84d-bbb3b50a47ac,
>  --driver-java-options, -Dlog4j.debug, --conf, spark.ui.enabled=false)
> 09:58:06.492 pool-1-thread-1 INFO HiveThriftHttpServerSuite: OUTPUT: starting 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/logs/spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-06.out
> 09:58:38.688 pool-1-thread-1 INFO HiveThriftHttpServerSuite: 
> HiveThriftServer2 started successfully
> 09:58:38.689 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite INFO 
> HiveThriftHttpServerSuite:
> = TEST OUTPUT FOR o.a.s.sql.hive.thriftserver.HiveThriftHttpServerSuite: 
> 'JDBC query execution' =
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite INFO 
> Utils: Supplied authorities: localhost:14541
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN 
> Utils: * JDBC param deprecation *
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN 
> Utils: The use of hive.server2.transport.mode is deprecated.
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN 
> Utils: Please use transportMode like so: 
> jdbc:hive2://:/dbName;transportMode=
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN 
> Utils: * JDBC param deprecation *
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN 
> Utils: The use of hive.server2.thrift.http.path is deprecated.
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN 
> Utils: Please use httpPath like so: 
> jdbc:hive2://:/dbName;httpPath=
> 09:58:38.692 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite INFO 
> Utils: Resolved authority: localhost:14541
> 09:58:38.818 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite 
> DEBUG RequestAddCookies: CookieSpec selected: default
> 09:58:38.830 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite 
> DEBUG RequestAuthCache: Auth cache not set in the context
> 09:58:38.832 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite 
> DEBUG PoolingHttpClientConnectionManager: Connection request: [route: 
> {}->http://localhost:14541][total available: 0; route allocated: 0 of 2; 
> total allocated: 0 of 20]
> 09:58:38.846 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite 
> DEBUG PoolingHttpClientConnectionManager: Connection leased: [id: 0][route: 
> {}->http://localhost:14541][total available: 0; route allocated: 1 of 2; 
> total 

[jira] [Commented] (SPARK-33328) Fix Flaky HiveThriftHttpServerSuite

2022-01-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482215#comment-17482215
 ] 

Apache Spark commented on SPARK-33328:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/35329

> Fix Flaky HiveThriftHttpServerSuite
> ---
>
> Key: SPARK-33328
> URL: https://issues.apache.org/jira/browse/SPARK-33328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Attachments: failure_rate.png
>
>
> After `HiveThriftServer2 started successfully` is logged, the test fails due 
> to a 500 error.
> The failure rate is over 50%. (The figure below shows the test case `JDBC 
> query execution` in that suite as an example.)
>  !failure_rate.png|width=508,height=321!
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/1516/testReport/]
> {code:java}
> 09:58:03.853 pool-1-thread-1 INFO HiveThriftHttpServerSuite: Trying to start 
> HiveThriftServer2: port=14541, mode=http, attempt=0
> 09:58:06.492 pool-1-thread-1 INFO HiveThriftHttpServerSuite: COMMAND: 
> WrappedArray(../../sbin/start-thriftserver.sh, --master, local, --hiveconf, 
> javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-6f4abc35-f09c-46e6-b6eb-8a310d557e28;create=true,
>  --hiveconf, 
> hive.metastore.warehouse.dir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-2329343a-3ad4-4bfd-943f-6b46984848b8,
>  --hiveconf, hive.server2.thrift.bind.host=localhost, --hiveconf, 
> hive.server2.transport.mode=http, --hiveconf, 
> hive.server2.logging.operation.log.location=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-77f3b359-1553-40e3-9d75-35c46d2d4d46,
>  --hiveconf, 
> hive.exec.local.scratchdir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-8923e61f-36da-4930-b035-6eb3712d41ab,
>  --hiveconf, hive.server2.thrift.http.port=14541, --driver-class-path, 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-d54a7073-2f02-4331-a84d-bbb3b50a47ac,
>  --driver-java-options, -Dlog4j.debug, --conf, spark.ui.enabled=false)
> 09:58:06.492 pool-1-thread-1 INFO HiveThriftHttpServerSuite: OUTPUT: starting 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/logs/spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-06.out
> 09:58:38.688 pool-1-thread-1 INFO HiveThriftHttpServerSuite: 
> HiveThriftServer2 started successfully
> 09:58:38.689 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite INFO 
> HiveThriftHttpServerSuite:
> = TEST OUTPUT FOR o.a.s.sql.hive.thriftserver.HiveThriftHttpServerSuite: 
> 'JDBC query execution' =
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite INFO 
> Utils: Supplied authorities: localhost:14541
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN 
> Utils: * JDBC param deprecation *
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN 
> Utils: The use of hive.server2.transport.mode is deprecated.
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN 
> Utils: Please use transportMode like so: 
> jdbc:hive2://:/dbName;transportMode=
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN 
> Utils: * JDBC param deprecation *
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN 
> Utils: The use of hive.server2.thrift.http.path is deprecated.
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN 
> Utils: Please use httpPath like so: 
> jdbc:hive2://:/dbName;httpPath=
> 09:58:38.692 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite INFO 
> Utils: Resolved authority: localhost:14541
> 09:58:38.818 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite 
> DEBUG RequestAddCookies: CookieSpec selected: default
> 09:58:38.830 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite 
> DEBUG RequestAuthCache: Auth cache not set in the context
> 09:58:38.832 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite 
> DEBUG PoolingHttpClientConnectionManager: Connection request: [route: 
> {}->http://localhost:14541][total available: 0; route allocated: 0 of 2; 
> total allocated: 0 of 20]
> 09:58:38.846 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite 
> DEBUG PoolingHttpClientConnectionManager: Connection leased: [id: 0][route: 
> 

[jira] [Assigned] (SPARK-33328) Fix Flaky HiveThriftHttpServerSuite

2022-01-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33328:


Assignee: (was: Apache Spark)

> Fix Flaky HiveThriftHttpServerSuite
> ---
>
> Key: SPARK-33328
> URL: https://issues.apache.org/jira/browse/SPARK-33328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Attachments: failure_rate.png
>
>
> After `HiveThriftServer2 started successfully` is logged, the test fails due 
> to a 500 error.
> The failure rate is over 50%. (The figure below shows the test case `JDBC 
> query execution` in that suite as an example.)
>  !failure_rate.png|width=508,height=321!
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/1516/testReport/]
> {code:java}
> 09:58:03.853 pool-1-thread-1 INFO HiveThriftHttpServerSuite: Trying to start 
> HiveThriftServer2: port=14541, mode=http, attempt=0
> 09:58:06.492 pool-1-thread-1 INFO HiveThriftHttpServerSuite: COMMAND: 
> WrappedArray(../../sbin/start-thriftserver.sh, --master, local, --hiveconf, 
> javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-6f4abc35-f09c-46e6-b6eb-8a310d557e28;create=true,
>  --hiveconf, 
> hive.metastore.warehouse.dir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-2329343a-3ad4-4bfd-943f-6b46984848b8,
>  --hiveconf, hive.server2.thrift.bind.host=localhost, --hiveconf, 
> hive.server2.transport.mode=http, --hiveconf, 
> hive.server2.logging.operation.log.location=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-77f3b359-1553-40e3-9d75-35c46d2d4d46,
>  --hiveconf, 
> hive.exec.local.scratchdir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-8923e61f-36da-4930-b035-6eb3712d41ab,
>  --hiveconf, hive.server2.thrift.http.port=14541, --driver-class-path, 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/target/tmp/spark-d54a7073-2f02-4331-a84d-bbb3b50a47ac,
>  --driver-java-options, -Dlog4j.debug, --conf, spark.ui.enabled=false)
> 09:58:06.492 pool-1-thread-1 INFO HiveThriftHttpServerSuite: OUTPUT: starting 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/logs/spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-06.out
> 09:58:38.688 pool-1-thread-1 INFO HiveThriftHttpServerSuite: 
> HiveThriftServer2 started successfully
> 09:58:38.689 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite INFO 
> HiveThriftHttpServerSuite:
> = TEST OUTPUT FOR o.a.s.sql.hive.thriftserver.HiveThriftHttpServerSuite: 
> 'JDBC query execution' =
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite INFO 
> Utils: Supplied authorities: localhost:14541
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN 
> Utils: * JDBC param deprecation *
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN 
> Utils: The use of hive.server2.transport.mode is deprecated.
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN 
> Utils: Please use transportMode like so: 
> jdbc:hive2://:/dbName;transportMode=
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN 
> Utils: * JDBC param deprecation *
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN 
> Utils: The use of hive.server2.thrift.http.path is deprecated.
> 09:58:38.691 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite WARN 
> Utils: Please use httpPath like so: 
> jdbc:hive2://:/dbName;httpPath=
> 09:58:38.692 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite INFO 
> Utils: Resolved authority: localhost:14541
> 09:58:38.818 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite 
> DEBUG RequestAddCookies: CookieSpec selected: default
> 09:58:38.830 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite 
> DEBUG RequestAuthCache: Auth cache not set in the context
> 09:58:38.832 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite 
> DEBUG PoolingHttpClientConnectionManager: Connection request: [route: 
> {}->http://localhost:14541][total available: 0; route allocated: 0 of 2; 
> total allocated: 0 of 20]
> 09:58:38.846 pool-1-thread-1-ScalaTest-running-HiveThriftHttpServerSuite 
> DEBUG PoolingHttpClientConnectionManager: Connection leased: [id: 0][route: 
> {}->http://localhost:14541][total available: 0; route allocated: 1 of 2; 
> total allocated: 1 of 20]
> 

[jira] [Resolved] (SPARK-38013) AQE can change bhj to smj if no extra shuffle introduce

2022-01-25 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-38013.
---
Resolution: Won't Fix

> AQE can change bhj to smj if no extra shuffle introduce
> ---
>
> Key: SPARK-38013
> URL: https://issues.apache.org/jira/browse/SPARK-38013
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> An example to reproduce the bug.
> {code:java}
> create table t1 as select 1 c1, 2 c2;
> create table t2 as select 1 c1, 2 c2;
> create table t3 as select 1 c1, 2 c2;
> set spark.sql.adaptive.autoBroadcastJoinThreshold=-1;
> select /*+ merge(t3) */ * from t1
> left join (
> select c1 as c from t3
> ) t3 on t1.c1 = t3.c
> left join (
> select /*+ repartition(c1) */ c1 from t2
> ) t2 on t1.c1 = t2.c1;
> {code}
> The key to reproducing this bug is that a bhj converts to smj/shj without 
> introducing an extra shuffle, and AQE does not think the join can be planned 
> as bhj.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38013) AQE can change bhj to smj if no extra shuffle introduce

2022-01-25 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482209#comment-17482209
 ] 

XiDuo You commented on SPARK-38013:
---

Seems it is allowed in AQE, so this is not a bug.

> AQE can change bhj to smj if no extra shuffle introduce
> ---
>
> Key: SPARK-38013
> URL: https://issues.apache.org/jira/browse/SPARK-38013
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> An example to reproduce the bug.
> {code:java}
> create table t1 as select 1 c1, 2 c2;
> create table t2 as select 1 c1, 2 c2;
> create table t3 as select 1 c1, 2 c2;
> set spark.sql.adaptive.autoBroadcastJoinThreshold=-1;
> select /*+ merge(t3) */ * from t1
> left join (
> select c1 as c from t3
> ) t3 on t1.c1 = t3.c
> left join (
> select /*+ repartition(c1) */ c1 from t2
> ) t2 on t1.c1 = t2.c1;
> {code}
> The key to reproducing this bug is that a bhj converts to smj/shj without 
> introducing an extra shuffle, and AQE does not think the join can be planned 
> as bhj.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38013) AQE can change bhj to smj if no extra shuffle introduce

2022-01-25 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-38013:
--
Issue Type: Task  (was: Bug)

> AQE can change bhj to smj if no extra shuffle introduce
> ---
>
> Key: SPARK-38013
> URL: https://issues.apache.org/jira/browse/SPARK-38013
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> An example to reproduce the bug.
> {code:java}
> create table t1 as select 1 c1, 2 c2;
> create table t2 as select 1 c1, 2 c2;
> create table t3 as select 1 c1, 2 c2;
> set spark.sql.adaptive.autoBroadcastJoinThreshold=-1;
> select /*+ merge(t3) */ * from t1
> left join (
> select c1 as c from t3
> ) t3 on t1.c1 = t3.c
> left join (
> select /*+ repartition(c1) */ c1 from t2
> ) t2 on t1.c1 = t2.c1;
> {code}
> The key to reproducing this bug is that a bhj converts to smj/shj without 
> introducing an extra shuffle, and AQE does not think the join can be planned 
> as bhj.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30062) bug with DB2Driver using mode("overwrite") option("truncate",True)

2022-01-25 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-30062.

Fix Version/s: 3.2.2
   3.3
   Resolution: Fixed

> bug with DB2Driver using mode("overwrite") option("truncate",True)
> --
>
> Key: SPARK-30062
> URL: https://issues.apache.org/jira/browse/SPARK-30062
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Guy Huinen
>Priority: Major
>  Labels: db2, pyspark
> Fix For: 3.2.2, 3.3
>
>
> Using DB2Driver with mode("overwrite") and option("truncate", True) gives an 
> SQL error
>  
> {code:java}
> dfClient.write\
>  .format("jdbc")\
>  .mode("overwrite")\
>  .option('driver', 'com.ibm.db2.jcc.DB2Driver')\
>  .option("url","jdbc:db2://")\
>  .option("user","xxx")\
>  .option("password","")\
>  .option("dbtable","")\
>  .option("truncate",True)\
>  .save() {code}
>  
>  gives the error below
> In summary, I believe the semicolon is misplaced or malformed
>  
> {code:java}
> EXPO.EXPO#CMR_STG;IMMEDIATE{code}
>  
>  
> full error
> {code:java}
> An error occurred while calling o47.save. : 
> com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-104, 
> SQLSTATE=42601, SQLERRMC=END-OF-STATEMENT;LE EXPO.EXPO#CMR_STG;IMMEDIATE, 
> DRIVER=4.19.77 at com.ibm.db2.jcc.am.b4.a(b4.java:747) at 
> com.ibm.db2.jcc.am.b4.a(b4.java:66) at com.ibm.db2.jcc.am.b4.a(b4.java:135) 
> at com.ibm.db2.jcc.am.kh.c(kh.java:2788) at 
> com.ibm.db2.jcc.am.kh.d(kh.java:2776) at 
> com.ibm.db2.jcc.am.kh.b(kh.java:2143) at com.ibm.db2.jcc.t4.ab.i(ab.java:226) 
> at com.ibm.db2.jcc.t4.ab.c(ab.java:48) at com.ibm.db2.jcc.t4.p.b(p.java:38) 
> at com.ibm.db2.jcc.t4.av.h(av.java:124) at 
> com.ibm.db2.jcc.am.kh.ak(kh.java:2138) at 
> com.ibm.db2.jcc.am.kh.a(kh.java:3325) at com.ibm.db2.jcc.am.kh.c(kh.java:765) 
> at com.ibm.db2.jcc.am.kh.executeUpdate(kh.java:744) at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.truncateTable(JdbcUtils.scala:113)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:56)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) 
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at 
> py4j.Gateway.invoke(Gateway.java:282) at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at 
> py4j.commands.CallCommand.execute(CallCommand.java:79) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> 

[jira] [Commented] (SPARK-37858) Throw Spark exceptions from AES functions

2022-01-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482197#comment-17482197
 ] 

Apache Spark commented on SPARK-37858:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/35328

> Throw Spark exceptions from AES functions
> -
>
> Key: SPARK-37858
> URL: https://issues.apache.org/jira/browse/SPARK-37858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, Spark SQL can throw Java exceptions from the 
> aes_encrypt()/aes_decrypt() functions, for instance:
> {code:java}
> java.lang.RuntimeException: javax.crypto.AEADBadTagException: Tag mismatch!
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils.aesInternal(ExpressionImplUtils.java:93)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils.aesDecrypt(ExpressionImplUtils.java:43)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:354)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:136)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: javax.crypto.AEADBadTagException: Tag mismatch!
>   at 
> com.sun.crypto.provider.GaloisCounterMode.decryptFinal(GaloisCounterMode.java:620)
>   at 
> com.sun.crypto.provider.CipherCore.finalNoPadding(CipherCore.java:1116)
>   at 
> com.sun.crypto.provider.CipherCore.fillOutputBuffer(CipherCore.java:1053)
>   at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:853)
>   at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:446)
>   at javax.crypto.Cipher.doFinal(Cipher.java:2226)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils.aesInternal(ExpressionImplUtils.java:87)
>   ... 19 more
> {code}
> That might confuse non-Scala/Java users. We need to wrap this kind of 
> exception in a Spark-specific exception.
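As an illustration, a minimal PySpark sketch of how the raw Java error 
currently surfaces to a Python user (the example DataFrame and the two 
16-character keys are arbitrary; this assumes a build that already ships the 
AES functions, i.e. 3.3.0):
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Spark",)], ["value"])

# Encrypt with one key, then decrypt with a different one: the JVM raises
# javax.crypto.AEADBadTagException, which reaches the Python caller as a
# plain Java stack trace instead of a Spark-specific error.
encrypted = df.select(F.expr("aes_encrypt(value, '0000111122223333')").alias("c"))
encrypted.select(F.expr("aes_decrypt(c, 'abcdefghijklmnop')")).show()
{code}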



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37858) Throw Spark exceptions from AES functions

2022-01-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482196#comment-17482196
 ] 

Apache Spark commented on SPARK-37858:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/35328

> Throw Spark exceptions from AES functions
> -
>
> Key: SPARK-37858
> URL: https://issues.apache.org/jira/browse/SPARK-37858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, Spark SQL can throw Java exceptions from the 
> aes_encrypt()/aes_decrypt() functions, for instance:
> {code:java}
> java.lang.RuntimeException: javax.crypto.AEADBadTagException: Tag mismatch!
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils.aesInternal(ExpressionImplUtils.java:93)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils.aesDecrypt(ExpressionImplUtils.java:43)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:354)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:136)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: javax.crypto.AEADBadTagException: Tag mismatch!
>   at 
> com.sun.crypto.provider.GaloisCounterMode.decryptFinal(GaloisCounterMode.java:620)
>   at 
> com.sun.crypto.provider.CipherCore.finalNoPadding(CipherCore.java:1116)
>   at 
> com.sun.crypto.provider.CipherCore.fillOutputBuffer(CipherCore.java:1053)
>   at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:853)
>   at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:446)
>   at javax.crypto.Cipher.doFinal(Cipher.java:2226)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionImplUtils.aesInternal(ExpressionImplUtils.java:87)
>   ... 19 more
> {code}
> That might confuse non-Scala/Java users. We need to wrap this kind of 
> exception in a Spark-specific exception.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38013) AQE can change bhj to smj if no extra shuffle introduce

2022-01-25 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-38013:
--
Summary: AQE can change bhj to smj if no extra shuffle introduce  (was: Fix 
AQE can change bhj to smj if no extra shuffle introduce)

> AQE can change bhj to smj if no extra shuffle introduce
> ---
>
> Key: SPARK-38013
> URL: https://issues.apache.org/jira/browse/SPARK-38013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> An example to reproduce the bug.
> {code:java}
> create table t1 as select 1 c1, 2 c2;
> create table t2 as select 1 c1, 2 c2;
> create table t3 as select 1 c1, 2 c2;
> set spark.sql.adaptive.autoBroadcastJoinThreshold=-1;
> select /*+ merge(t3) */ * from t1
> left join (
> select c1 as c from t3
> ) t3 on t1.c1 = t3.c
> left join (
> select /*+ repartition(c1) */ c1 from t2
> ) t2 on t1.c1 = t2.c1;
> {code}
> The key to reproducing this bug is that a bhj converts to smj/shj without 
> introducing an extra shuffle, and AQE does not think the join can be planned 
> as bhj.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37995) TPCDS 1TB q72 fails when spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly is false

2022-01-25 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482194#comment-17482194
 ] 

Hyukjin Kwon commented on SPARK-37995:
--

cc [~maryannxue] FYI

> TPCDS 1TB q72 fails when 
> spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly is false
> 
>
> Key: SPARK-37995
> URL: https://issues.apache.org/jira/browse/SPARK-37995
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Kapil Singh
>Priority: Major
> Attachments: full-stacktrace.txt
>
>
> TPCDS 1TB q72 fails in Spark 3.2 when 
> spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly is false. We 
> have been running with this config in 3.1 as well and it worked fine in that 
> version. This config used to add a subquery DPP filter in q72.
> Relevant stack trace
> {code:java}
> Error: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to 
> org.apache.spark.sql.execution.SparkPlan  at 
> scala.collection.immutable.List.map(List.scala:293)  at 
> org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:75)
>   at 
> org.apache.spark.sql.execution.SparkPlanInfo$.$anonfun$fromSparkPlan$3(SparkPlanInfo.scala:75)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
> 
> 
> at 
> org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:75)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.onUpdatePlan(AdaptiveSparkPlanExec.scala:708)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$2(AdaptiveSparkPlanExec.scala:239)
>   at scala.runtime.java8.JFunction1$mcVJ$sp.apply(JFunction1$mcVJ$sp.java:23) 
>  at scala.Option.foreach(Option.scala:407)  at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:239)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)  at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:226)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:365)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37996) Contribution guide is stale

2022-01-25 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482193#comment-17482193
 ] 

Hyukjin Kwon commented on SPARK-37996:
--

Hm, yeah. I think the tests now always run by default once you push a commit 
to your forked repo, so we won't need it anymore. Interested in submitting a PR? 
We should fix it in https://github.com/apache/spark-website

> Contribution guide is stale
> ---
>
> Key: SPARK-37996
> URL: https://issues.apache.org/jira/browse/SPARK-37996
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: Khalid Mammadov
>Priority: Minor
>
> The contribution guide mentions the link below for testing on a local repo 
> before raising a PR, but the process has changed and the documentation does 
> not reflect it.
> https://spark.apache.org/developer-tools.html#github-workflow-tests
> Only by digging into the git log of 
> [.github/workflows/build_and_test.yml|https://github.com/apache/spark/commit/2974b70d1efd4b1c5cfe7e2467766f0a9a1fec82#diff-48c0ee97c53013d18d6bbae44648f7fab9af2e0bf5b0dc1ca761e18ec5c478f2]
>  did I manage to find what the new process is. It was changed in 
> [https://github.com/apache/spark/pull/32092] but the documentation was not 
> updated.
> I am happy to contribute a fix, but apparently 
> [https://spark.apache.org/developer-tools.html] is hosted on the Apache 
> website rather than in the Spark source code.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37997) Allow query parameters to be passed into spark.read

2022-01-25 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482192#comment-17482192
 ] 

Hyukjin Kwon commented on SPARK-37997:
--

Can we just format it before passing to spark.read?
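For example, a rough sketch of that idea (SQL Server specific; the 0x 
hex-literal syntax and the table/column names from the report are assumptions 
about the target database):
{code:python}
# Render the binary rowversion as a T-SQL hex literal before building the
# query string, instead of interpolating the raw Python bytes object.
rowversion = b"\x00\x00\x00\x00\x02\xdf\x3d\xf5"
rowversion_literal = "0x" + rowversion.hex().upper()

_select_sql = f"SELECT * FROM dbo.Table WHERE RowVersion > {rowversion_literal}"
# -> SELECT * FROM dbo.Table WHERE RowVersion > 0x0000000002DF3DF5
{code}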

> Allow query parameters to be passed into spark.read
> ---
>
> Key: SPARK-37997
> URL: https://issues.apache.org/jira/browse/SPARK-37997
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: QFW
>Priority: Minor
>
> This ticket is to allow query parameters to be used with spark.read.
> While it is possible to inject some parameters into the query using string 
> concatenation, this doesn't work for all data types, for example binaries. In 
> this example, the parameter rowversion is a binary which needs to be passed 
> into the SQL query. 
> {code:java}
> _select_sql = f'SELECT * FROM dbo.Table WHERE RowVersion > {rowversion}'
> df = spark.read.format("jdbc") \
>     .option("url", 
> "jdbc:sqlserver://databaseserver.database.windows.net;databaseName=databasename")
>  \
>     .option("query", _select_sql) \
>     .option("username", "sql_username") \
>     .option("password", "sql_password") \
>     .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
>     .load() {code}
> This results in the query looking like this...
> {code:java}
> SELECT * FROM dbo.Address WHERE RowVersion > 
> bytearray(b'\x00\x00\x00\x00\x02\xdf=\xf5') {code}
> As far as I know, there is no way to do this currently.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38000) Sort node incorrectly removed from the optimized logical plan

2022-01-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38000.
--
Resolution: Cannot Reproduce

> Sort node incorrectly removed from the optimized logical plan
> -
>
> Key: SPARK-38000
> URL: https://issues.apache.org/jira/browse/SPARK-38000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: Tested on:
> Ubuntu 18.04.2 LTS
> OpenJDK 1.8.0_312 64-Bit Server VM (build 25.312-b07, mixed mode)
>Reporter: Antoine Wendlinger
>Priority: Major
>  Labels: correctness
>
> When using a fairly involved combination of joins, windows, cache and 
> orderBy, the sorting phase disappears from the optimized logical plan and the 
> resulting dataframe is not sorted.
> You can find a reproduction of the bug in 
> [https://github.com/antoinewdg/spark-bug-report].
> Use {{sbt run}} to get the results.
> The bug is very niche, I chose to report it because it looks like a 
> correctness issue, and may be a symptom of a larger one.
> The bug affects only 3.2.0, tests on 3.1.2 show the result correctly sorted.
> As far as I could test it, all steps in the reproduction are necessary for 
> the bug to happen:
>  * the join with an empty dataframe
>  * the distinct call on the empty dataframe
>  * the window function
>  * the cache after the order by
> h2. Code
>  
> {code:scala}
>   val players = (10 to 20).map(x => Player(id = x.toString)).toDS
>   val blacklist = sparkSession
>     .emptyDataset[BlacklistEntry]
>     .distinct()
>   val result = players
>     .join(blacklist, Seq("id"), "left_outer")
>     .withColumn("rank", 
> row_number().over(Window.partitionBy("id").orderBy("id")))
>     .orderBy("id")
>     .cache()
>   result.show()
>   result.explain(true)
> {code}
>  
> h2. Output
>  
> {code:java}
> +---++
> | id|rank|
> +---++
> | 15|   1|
> | 11|   1|
> | 16|   1|
> | 18|   1|
> | 17|   1|
> | 19|   1|
> | 20|   1|
> | 10|   1|
> | 12|   1|
> | 13|   1|
> | 14|   1|
> +---++
> == Parsed Logical Plan ==
> 'Sort ['id ASC NULLS FIRST], true
> +- Project [id#1, rank#10]
>+- Project [id#1, rank#10, rank#10]
>   +- Window [row_number() windowspecdefinition(id#1, id#1 ASC NULLS 
> FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) 
> AS rank#10], [id#1], [id#1 ASC NULLS FIRST]
>  +- Project [id#1]
> +- Project [id#1]
>+- Join LeftOuter, (id#1 = id#5)
>   :- LocalRelation [id#1]
>   +- Deduplicate [id#5]
>  +- LocalRelation , [id#5]
> == Analyzed Logical Plan ==
> id: string, rank: int
> Sort [id#1 ASC NULLS FIRST], true
> +- Project [id#1, rank#10]
>+- Project [id#1, rank#10, rank#10]
>   +- Window [row_number() windowspecdefinition(id#1, id#1 ASC NULLS 
> FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) 
> AS rank#10], [id#1], [id#1 ASC NULLS FIRST]
>  +- Project [id#1]
> +- Project [id#1]
>+- Join LeftOuter, (id#1 = id#5)
>   :- LocalRelation [id#1]
>   +- Deduplicate [id#5]
>  +- LocalRelation , [id#5]
> == Optimized Logical Plan ==
> InMemoryRelation [id#1, rank#10], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>+- Window [row_number() windowspecdefinition(id#1, id#1 ASC NULLS FIRST, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
> rank#10], [id#1], [id#1 ASC NULLS FIRST]
>   +- *(1) Sort [id#1 ASC NULLS FIRST, id#1 ASC NULLS FIRST], false, 0
>  +- Exchange hashpartitioning(id#1, 200), ENSURE_REQUIREMENTS, [id=#7]
> +- LocalTableScan [id#1]
> == Physical Plan ==
> InMemoryTableScan [id#1, rank#10]
>+- InMemoryRelation [id#1, rank#10], StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>  +- Window [row_number() windowspecdefinition(id#1, id#1 ASC NULLS 
> FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) 
> AS rank#10], [id#1], [id#1 ASC NULLS FIRST]
> +- *(1) Sort [id#1 ASC NULLS FIRST, id#1 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(id#1, 200), ENSURE_REQUIREMENTS, 
> [id=#7]
>   +- LocalTableScan [id#1]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38028) Expose Arrow Vector from ArrowColumnVector

2022-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38028.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35326
[https://github.com/apache/spark/pull/35326]

> Expose Arrow Vector from ArrowColumnVector
> --
>
> Key: SPARK-38028
> URL: https://issues.apache.org/jira/browse/SPARK-38028
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 3.3.0
>
>
> In some cases we need to work with Arrow Vectors behind ColumnVector using 
> Arrow APIs. For example, some Spark extension libraries need to consume Arrow 
> Vectors. For now, it is impossible as the Arrow Vector is a private member in 
> ArrowColumnVector. We need to expose the Arrow Vector from ArrowColumnVector.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37980) Extend METADATA column to support row indices for file based data sources

2022-01-25 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482188#comment-17482188
 ] 

Wenchen Fan commented on SPARK-37980:
-

I think it's possible for the parquet data sources because Spark uses very 
low-level Parquet APIs and we can do many customizations.

> Extend METADATA column to support row indices for file based data sources
> -
>
> Key: SPARK-37980
> URL: https://issues.apache.org/jira/browse/SPARK-37980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3
>Reporter: Prakhar Jain
>Priority: Major
>
> Spark recently added hidden metadata column support for file based 
> datasources as part of SPARK-37273.
> We should extend it to support ROW_INDEX/ROW_POSITION also.
>  
> Meaning of ROW_POSITION:
> ROW_INDEX/ROW_POSITION is basically the index of a row within a file. E.g. 
> the 5th row in the file will have ROW_INDEX 5.
>  
> Use cases: 
> Row indexes can be used in a variety of ways. A (fileName, rowIndex) tuple 
> uniquely identifies a row in a table. This information can be used to mark 
> rows, e.g. by an indexer.
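For context, a short sketch of how this could look from the user side (the 
{{row_index}} field is the proposal here, not an existing column; the other 
{{_metadata}} fields come from SPARK-37273, and the path is a placeholder):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/tmp/example")  # placeholder path

# Existing hidden metadata column:
df.select("_metadata.file_name").show()

# Proposed extension: a per-file row position, so that the pair
# (_metadata.file_name, _metadata.row_index) uniquely identifies a row.
df.select("_metadata.file_name", "_metadata.row_index").show()
{code}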



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37980) Extend METADATA column to support row indices for file based data sources

2022-01-25 Thread Prakhar Jain (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482142#comment-17482142
 ] 

Prakhar Jain edited comment on SPARK-37980 at 1/26/22, 1:58 AM:


Yeah - this will need an implementation in the underlying file format, e.g. 
parquet/orc. We can start with parquet first and extend it to other formats 
after that.

[~cloud_fan]  Is it possible to add the support for parquet directly in the 
Spark codebase? Will this need changes in parquet-mr?


was (Author: prakharjain09):
Yes - this needs implementation in the underlying datasources such as 
parquet/orc. Also Spark uses the underlying ParquetRecordReader from parquet-mr 
to read a parquet file. All the row group skipping/column index filtering 
happens as part of parquet-mr. So I guess this will need the row index support 
from parquet-mr. The other way is to replicate some of the parquet-mr 
RecordReader code in Spark - which is not ideal.

> Extend METADATA column to support row indices for file based data sources
> -
>
> Key: SPARK-37980
> URL: https://issues.apache.org/jira/browse/SPARK-37980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3
>Reporter: Prakhar Jain
>Priority: Major
>
> Spark recently added hidden metadata column support for file based 
> datasources as part of SPARK-37273.
> We should extend it to support ROW_INDEX/ROW_POSITION also.
>  
> Meaning of ROW_POSITION:
> ROW_INDEX/ROW_POSITION is basically the index of a row within a file. E.g. 
> the 5th row in the file will have ROW_INDEX 5.
>  
> Use cases: 
> Row indexes can be used in a variety of ways. A (fileName, rowIndex) tuple 
> uniquely identifies a row in a table. This information can be used to mark 
> rows, e.g. by an indexer.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38029) Support K8S integration test in SBT

2022-01-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38029:


Assignee: (was: Apache Spark)

> Support K8S integration test in SBT
> ---
>
> Key: SPARK-38029
> URL: https://issues.apache.org/jira/browse/SPARK-38029
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38029) Support K8S integration test in SBT

2022-01-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482178#comment-17482178
 ] 

Apache Spark commented on SPARK-38029:
--

User 'williamhyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35327

> Support K8S integration test in SBT
> ---
>
> Key: SPARK-38029
> URL: https://issues.apache.org/jira/browse/SPARK-38029
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38029) Support K8S integration test in SBT

2022-01-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38029:


Assignee: Apache Spark

> Support K8S integration test in SBT
> ---
>
> Key: SPARK-38029
> URL: https://issues.apache.org/jira/browse/SPARK-38029
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38028) Expose Arrow Vector from ArrowColumnVector

2022-01-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38028:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Expose Arrow Vector from ArrowColumnVector
> --
>
> Key: SPARK-38028
> URL: https://issues.apache.org/jira/browse/SPARK-38028
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> In some cases we need to work with Arrow Vectors behind ColumnVector using 
> Arrow APIs. For example, some Spark extension libraries need to consume Arrow 
> Vectors. For now, it is impossible as the Arrow Vector is a private member in 
> ArrowColumnVector. We need to expose the Arrow Vector from ArrowColumnVector.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38028) Expose Arrow Vector from ArrowColumnVector

2022-01-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38028:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Expose Arrow Vector from ArrowColumnVector
> --
>
> Key: SPARK-38028
> URL: https://issues.apache.org/jira/browse/SPARK-38028
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>
> In some cases we need to work with Arrow Vectors behind ColumnVector using 
> Arrow APIs. For example, some Spark extension libraries need to consume Arrow 
> Vectors. For now, it is impossible as the Arrow Vector is a private member in 
> ArrowColumnVector. We need to expose the Arrow Vector from ArrowColumnVector.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38028) Expose Arrow Vector from ArrowColumnVector

2022-01-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482176#comment-17482176
 ] 

Apache Spark commented on SPARK-38028:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/35326

> Expose Arrow Vector from ArrowColumnVector
> --
>
> Key: SPARK-38028
> URL: https://issues.apache.org/jira/browse/SPARK-38028
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>
> In some cases we need to work with Arrow Vectors behind ColumnVector using 
> Arrow APIs. For example, some Spark extension libraries need to consume Arrow 
> Vectors. For now, it is impossible as the Arrow Vector is a private member in 
> ArrowColumnVector. We need to expose the Arrow Vector from ArrowColumnVector.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38029) Support K8S integration test in SBT

2022-01-25 Thread William Hyun (Jira)
William Hyun created SPARK-38029:


 Summary: Support K8S integration test in SBT
 Key: SPARK-38029
 URL: https://issues.apache.org/jira/browse/SPARK-38029
 Project: Spark
  Issue Type: Test
  Components: Kubernetes, Tests
Affects Versions: 3.3.0
Reporter: William Hyun






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38028) Expose Arrow Vector from ArrowColumnVector

2022-01-25 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-38028:
---

Assignee: L. C. Hsieh

> Expose Arrow Vector from ArrowColumnVector
> --
>
> Key: SPARK-38028
> URL: https://issues.apache.org/jira/browse/SPARK-38028
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>
> In some cases we need to work with Arrow Vectors behind ColumnVector using 
> Arrow APIs. For example, some Spark extension libraries need to consume Arrow 
> Vectors. For now, it is impossible as the Arrow Vector is a private member in 
> ArrowColumnVector. We need to expose the Arrow Vector from ArrowColumnVector.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38028) Expose Arrow Vector from ArrowColumnVector

2022-01-25 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-38028:
---

 Summary: Expose Arrow Vector from ArrowColumnVector
 Key: SPARK-38028
 URL: https://issues.apache.org/jira/browse/SPARK-38028
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: L. C. Hsieh


In some cases we need to work with Arrow Vectors behind ColumnVector using 
Arrow APIs. For example, some Spark extension libraries need to consume Arrow 
Vectors. For now, it is impossible as the Arrow Vector is a private member in 
ArrowColumnVector. We need to expose the Arrow Vector from ArrowColumnVector.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.

2022-01-25 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482174#comment-17482174
 ] 

Haejoon Lee edited comment on SPARK-38004 at 1/26/22, 1:34 AM:
---

[~Saikrishna_Pujari] Thanks for reporting the issue!

Actually, the ambiguity can be handled by setting 
`spark.conf.set("spark.sql.caseSensitive","true")`, so I think we can document 
this as a workaround for now (a small sketch of the effect is shown below).

Are you interested in creating a PR?
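
A minimal sketch of how the `spark.sql.caseSensitive` workaround behaves, written 
in Scala against plain DataFrames (the column names are made up; the original 
report uses pandas-on-Spark's `read_excel`, but the resolution mechanism is the same):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Two columns whose names differ only by case, mimicking the duplicated
// spreadsheet columns from this report.
val df = spark.range(1).selectExpr("id AS Col", "id AS cOL")

// With the default case-insensitive resolution this reference is ambiguous and
// fails with an AnalysisException like the one quoted in the description:
// df.select("Col").show()

// Workaround: switch to case-sensitive resolution, then the reference is unique.
spark.conf.set("spark.sql.caseSensitive", "true")
df.select("Col").show()
{code}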


was (Author: itholic):
[~Saikrishna_Pujari] Thanks for the report the issue!

Actually the ambiguous issue can be handled by setting 
`spark.conf.set("spark.sql.caseSensitive","true")`, so I think we can documents 
this as a workaround for now.

Do you mind to submit a PR ??

> read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns 
> but fails if the duplicate columns are case sensitive.
> 
>
> Key: SPARK-38004
> URL: https://issues.apache.org/jira/browse/SPARK-38004
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Saikrishna Pujari
>Priority: Minor
>
> mangle_dupe_cols - default is True
> So ideally it should have handled duplicate columns, but in case the columns 
> are case sensitive it fails as below.
> AnalysisException: Reference '{{{}Sheet.col{}}}' is ambiguous, could be 
> Sheet.col, Sheet.col.
> Where two columns are Col and cOL
> In the best practices, there is a mention of not to use case sensitive 
> columns - 
> [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names]
> Either the docs for read_excel/mangle_dupe_cols have to be updated about this 
> or it has to be handled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.

2022-01-25 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-38004:

Affects Version/s: 3.2.0
   (was: 3.1.2)

> read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns 
> but fails if the duplicate columns are case sensitive.
> 
>
> Key: SPARK-38004
> URL: https://issues.apache.org/jira/browse/SPARK-38004
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Saikrishna Pujari
>Priority: Minor
>
> mangle_dupe_cols - default is True
> So ideally it should have handled duplicate columns, but in case the columns 
> are case sensitive it fails as below.
> AnalysisException: Reference '{{{}Sheet.col{}}}' is ambiguous, could be 
> Sheet.col, Sheet.col.
> Where two columns are Col and cOL
> In the best practices, there is a mention of not to use case sensitive 
> columns - 
> [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names]
> Either the docs for read_excel/mangle_dupe_cols have to be updated about this 
> or it has to be handled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.

2022-01-25 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482174#comment-17482174
 ] 

Haejoon Lee edited comment on SPARK-38004 at 1/26/22, 1:33 AM:
---

[~Saikrishna_Pujari] Thanks for reporting the issue!

Actually, the ambiguity can be handled by setting 
`spark.conf.set("spark.sql.caseSensitive","true")`, so I think we can document 
this as a workaround for now.

Would you mind submitting a PR?


was (Author: itholic):
[~Saikrishna_Pujari] Thanks for the report the issue!

Actually the ambiguous issue can be handled by setting 
`spark.conf.set("spark.sql.caseSensitive","true")`, so I think we can documents 
this workaround for now.

Do you mind to submit a PR ??

> read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns 
> but fails if the duplicate columns are case sensitive.
> 
>
> Key: SPARK-38004
> URL: https://issues.apache.org/jira/browse/SPARK-38004
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Saikrishna Pujari
>Priority: Minor
>
> mangle_dupe_cols - default is True
> So ideally it should have handled duplicate columns, but in case the columns 
> are case sensitive it fails as below.
> AnalysisException: Reference '{{{}Sheet.col{}}}' is ambiguous, could be 
> Sheet.col, Sheet.col.
> Where two columns are Col and cOL
> In the best practices, there is a mention of not to use case sensitive 
> columns - 
> [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names]
> Either the docs for read_excel/mangle_dupe_cols have to be updated about this 
> or it has to be handled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.

2022-01-25 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482174#comment-17482174
 ] 

Haejoon Lee edited comment on SPARK-38004 at 1/26/22, 1:33 AM:
---

[~Saikrishna_Pujari] Thanks for reporting the issue!

Actually, the ambiguity can be handled by setting 
`spark.conf.set("spark.sql.caseSensitive","true")`, so I think we can document 
this workaround for now.

Would you mind submitting a PR?


was (Author: itholic):
[~Saikrishna_Pujari] Thanks for the report the issue! Setting 
`spark.conf.set("spark.sql.caseSensitive","true")` would make 
`mangle_dupe_cols` work, so I think we can documents this workaround for now. 
Do you want to submit a PR ??

> read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns 
> but fails if the duplicate columns are case sensitive.
> 
>
> Key: SPARK-38004
> URL: https://issues.apache.org/jira/browse/SPARK-38004
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Saikrishna Pujari
>Priority: Minor
>
> mangle_dupe_cols - default is True
> So ideally it should have handled duplicate columns, but in case the columns 
> are case sensitive it fails as below.
> AnalysisException: Reference '{{{}Sheet.col{}}}' is ambiguous, could be 
> Sheet.col, Sheet.col.
> Where two columns are Col and cOL
> In the best practices, there is a mention of not to use case sensitive 
> columns - 
> [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names]
> Either the docs for read_excel/mangle_dupe_cols have to be updated about this 
> or it has to be handled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.

2022-01-25 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482174#comment-17482174
 ] 

Haejoon Lee commented on SPARK-38004:
-

[~Saikrishna_Pujari] Thanks for reporting the issue! Setting 
`spark.conf.set("spark.sql.caseSensitive","true")` would make 
`mangle_dupe_cols` work, so I think we can document this workaround for now.

> read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns 
> but fails if the duplicate columns are case sensitive.
> 
>
> Key: SPARK-38004
> URL: https://issues.apache.org/jira/browse/SPARK-38004
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Saikrishna Pujari
>Priority: Minor
>
> mangle_dupe_cols - default is True
> So ideally it should have handled duplicate columns, but in case the columns 
> are case sensitive it fails as below.
> AnalysisException: Reference '{{{}Sheet.col{}}}' is ambiguous, could be 
> Sheet.col, Sheet.col.
> Where two columns are Col and cOL
> In the best practices, there is a mention of not to use case sensitive 
> columns - 
> [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names]
> Either the docs for read_excel/mangle_dupe_cols have to be updated about this 
> or it has to be handled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.

2022-01-25 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482174#comment-17482174
 ] 

Haejoon Lee edited comment on SPARK-38004 at 1/26/22, 1:30 AM:
---

[~Saikrishna_Pujari] Thanks for reporting the issue! Setting 
`spark.conf.set("spark.sql.caseSensitive","true")` would make 
`mangle_dupe_cols` work, so I think we can document this workaround for now. 
Would you like to submit a PR?


was (Author: itholic):
[~Saikrishna_Pujari] Thanks for the report the issue! Setting 
`spark.conf.set("spark.sql.caseSensitive","true")` would make 
`mangle_dupe_cols` work, so I think we can documents this workaround for now.

> read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns 
> but fails if the duplicate columns are case sensitive.
> 
>
> Key: SPARK-38004
> URL: https://issues.apache.org/jira/browse/SPARK-38004
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Saikrishna Pujari
>Priority: Minor
>
> mangle_dupe_cols - default is True
> So ideally it should have handled duplicate columns, but in case the columns 
> are case sensitive it fails as below.
> AnalysisException: Reference '{{{}Sheet.col{}}}' is ambiguous, could be 
> Sheet.col, Sheet.col.
> Where two columns are Col and cOL
> In the best practices, there is a mention of not to use case sensitive 
> columns - 
> [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names]
> Either the docs for read_excel/mangle_dupe_cols have to be updated about this 
> or it has to be handled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38015) Mark legacy file naming functions as deprecated in FileCommitProtocol

2022-01-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38015:


Assignee: Cheng Su

> Mark legacy file naming functions as deprecated in FileCommitProtocol
> -
>
> Key: SPARK-38015
> URL: https://issues.apache.org/jira/browse/SPARK-38015
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
>
> [FileCommitProtocol|https://github.com/apache/spark/blob/6bbfb45ffe75aa6c27a7bf3c3385a596637d1822/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala]
>  is the class to commit Spark job output (staging file & directory renaming, 
> etc). During Spark 3.2 development, we added new functions into this class to 
> allow more flexible output file naming (the PR detail is 
> [here|https://github.com/apache/spark/pull/33012]). We didn’t delete the 
> existing file naming functions (newTaskTempFile(ext) & 
> newTaskTempFileAbsPath(ext)), because we were aware that many other downstream 
> projects or codebases had already implemented their own custom implementations of 
> FileCommitProtocol. Deleting the existing functions would be a breaking change 
> for them when upgrading the Spark version, and we would like to avoid this 
> unpleasant surprise for anyone if possible. But we also need to clean up 
> legacy code as we evolve our codebase.
> So for the next step, I would like to propose:
>  * Spark 3.3 (now): Add a @deprecated annotation to the legacy functions in 
> FileCommitProtocol - 
> [newTaskTempFile(ext)|https://github.com/apache/spark/blob/6bbfb45ffe75aa6c27a7bf3c3385a596637d1822/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala#L98]
>  & 
> [newTaskTempFileAbsPath(ext)|https://github.com/apache/spark/blob/6bbfb45ffe75aa6c27a7bf3c3385a596637d1822/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala#L135].
>  * Next Spark major release (or whenever people feel comfortable): delete the 
> legacy functions mentioned above from our codebase.
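
A minimal sketch of what the deprecation proposed above could look like; the 
deprecation messages and the `3.3.0` version string are assumptions, and the 
replacement-method references are abbreviated:

{code:java}
import org.apache.hadoop.mapreduce.TaskAttemptContext

// Sketch only: mirrors the shape of FileCommitProtocol's legacy methods and
// marks them deprecated in favour of the FileNameSpec-based variants.
abstract class FileCommitProtocolSketch {
  @deprecated("use the newTaskTempFile overload that takes a FileNameSpec", "3.3.0")
  def newTaskTempFile(
      taskContext: TaskAttemptContext, dir: Option[String], ext: String): String

  @deprecated("use the newTaskTempFileAbsPath overload that takes a FileNameSpec", "3.3.0")
  def newTaskTempFileAbsPath(
      taskContext: TaskAttemptContext, absoluteDir: String, ext: String): String
}
{code}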



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38015) Mark legacy file naming functions as deprecated in FileCommitProtocol

2022-01-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38015.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35311
[https://github.com/apache/spark/pull/35311]

> Mark legacy file naming functions as deprecated in FileCommitProtocol
> -
>
> Key: SPARK-38015
> URL: https://issues.apache.org/jira/browse/SPARK-38015
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.3.0
>
>
> [FileCommitProtocol|https://github.com/apache/spark/blob/6bbfb45ffe75aa6c27a7bf3c3385a596637d1822/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala]
>  is the class to commit Spark job output (staging file & directory renaming, 
> etc). During Spark 3.2 development, we added new functions into this class to 
> allow more flexible output file naming (the PR detail is 
> [here|https://github.com/apache/spark/pull/33012]). We didn’t delete the 
> existing file naming functions (newTaskTempFile(ext) & 
> newTaskTempFileAbsPath(ext)), because we were aware that many other downstream 
> projects or codebases had already implemented their own custom implementations of 
> FileCommitProtocol. Deleting the existing functions would be a breaking change 
> for them when upgrading the Spark version, and we would like to avoid this 
> unpleasant surprise for anyone if possible. But we also need to clean up 
> legacy code as we evolve our codebase.
> So for the next step, I would like to propose:
>  * Spark 3.3 (now): Add a @deprecated annotation to the legacy functions in 
> FileCommitProtocol - 
> [newTaskTempFile(ext)|https://github.com/apache/spark/blob/6bbfb45ffe75aa6c27a7bf3c3385a596637d1822/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala#L98]
>  & 
> [newTaskTempFileAbsPath(ext)|https://github.com/apache/spark/blob/6bbfb45ffe75aa6c27a7bf3c3385a596637d1822/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala#L135].
>  * Next Spark major release (or whenever people feel comfortable): delete the 
> legacy functions mentioned above from our codebase.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37793) Invalid LocalMergedBlockData cause task hang

2022-01-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482148#comment-17482148
 ] 

Apache Spark commented on SPARK-37793:
--

User 'otterc' has created a pull request for this issue:
https://github.com/apache/spark/pull/35325

> Invalid LocalMergedBlockData cause task hang
> 
>
> Key: SPARK-37793
> URL: https://issues.apache.org/jira/browse/SPARK-37793
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.3.0
>Reporter: Cheng Pan
>Priority: Critical
>
> When push-based shuffle is enabled, there is a chance that a task hangs.
>  
> {code:java}
> 59Executor task launch worker for task 424.0 in stage 753.0 (TID 106778)  
> WAITING Lock(java.util.concurrent.ThreadPoolExecutor$Worker@1660371198})
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2044)
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:753)
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:85)
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
> scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.sort_addToSorter_0$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.smj_findNextJoinRows_0$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.hashAgg_doAggregateWithKeys_1$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.hashAgg_doAggregateWithKeys_0$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.hashAgg_doAggregateWithoutKey_0$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
> org.apache.spark.scheduler.Task.run(Task.scala:136)
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
> org.apache.spark.executor.Executor$TaskRunner$$Lambda$518/852390142.apply(Unknown
>  Source)
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1470)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {code}
> ShuffleBlockFetcherIterator.scala:753
> {code:java}
> while (result == null) {
>   val startFetchWait = System.nanoTime()
> 753>  result = results.take()
>   val fetchWaitTime = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - 
> startFetchWait)
>   shuffleMetrics.incFetchWaitTime(fetchWaitTime)
>   ..
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37793) Invalid LocalMergedBlockData cause task hang

2022-01-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482149#comment-17482149
 ] 

Apache Spark commented on SPARK-37793:
--

User 'otterc' has created a pull request for this issue:
https://github.com/apache/spark/pull/35325

> Invalid LocalMergedBlockData cause task hang
> 
>
> Key: SPARK-37793
> URL: https://issues.apache.org/jira/browse/SPARK-37793
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.3.0
>Reporter: Cheng Pan
>Priority: Critical
>
> When push-based shuffle is enabled, there is a chance that a task hangs.
>  
> {code:java}
> 59Executor task launch worker for task 424.0 in stage 753.0 (TID 106778)  
> WAITING Lock(java.util.concurrent.ThreadPoolExecutor$Worker@1660371198})
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2044)
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:753)
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:85)
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
> scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.sort_addToSorter_0$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.smj_findNextJoinRows_0$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.hashAgg_doAggregateWithKeys_1$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.hashAgg_doAggregateWithKeys_0$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.hashAgg_doAggregateWithoutKey_0$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
> org.apache.spark.scheduler.Task.run(Task.scala:136)
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
> org.apache.spark.executor.Executor$TaskRunner$$Lambda$518/852390142.apply(Unknown
>  Source)
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1470)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {code}
> ShuffleBlockFetcherIterator.scala:753
> {code:java}
> while (result == null) {
>   val startFetchWait = System.nanoTime()
> 753>  result = results.take()
>   val fetchWaitTime = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - 
> startFetchWait)
>   shuffleMetrics.incFetchWaitTime(fetchWaitTime)
>   ..
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37793) Invalid LocalMergedBlockData cause task hang

2022-01-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482147#comment-17482147
 ] 

Apache Spark commented on SPARK-37793:
--

User 'otterc' has created a pull request for this issue:
https://github.com/apache/spark/pull/35325

> Invalid LocalMergedBlockData cause task hang
> 
>
> Key: SPARK-37793
> URL: https://issues.apache.org/jira/browse/SPARK-37793
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.3.0
>Reporter: Cheng Pan
>Priority: Critical
>
> When push-based shuffle is enabled, there is a chance that a task hangs.
>  
> {code:java}
> 59Executor task launch worker for task 424.0 in stage 753.0 (TID 106778)  
> WAITING Lock(java.util.concurrent.ThreadPoolExecutor$Worker@1660371198})
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2044)
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:753)
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:85)
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
> scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.sort_addToSorter_0$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.smj_findNextJoinRows_0$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.hashAgg_doAggregateWithKeys_1$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.hashAgg_doAggregateWithKeys_0$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.hashAgg_doAggregateWithoutKey_0$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
> org.apache.spark.scheduler.Task.run(Task.scala:136)
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
> org.apache.spark.executor.Executor$TaskRunner$$Lambda$518/852390142.apply(Unknown
>  Source)
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1470)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {code}
> ShuffleBlockFetcherIterator.scala:753
> {code:java}
> while (result == null) {
>   val startFetchWait = System.nanoTime()
> 753>  result = results.take()
>   val fetchWaitTime = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - 
> startFetchWait)
>   shuffleMetrics.incFetchWaitTime(fetchWaitTime)
>   ..
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37675) Return PushMergedRemoteMetaFailedFetchResult if no available push-merged block

2022-01-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482146#comment-17482146
 ] 

Apache Spark commented on SPARK-37675:
--

User 'otterc' has created a pull request for this issue:
https://github.com/apache/spark/pull/35325

> Return PushMergedRemoteMetaFailedFetchResult if no available push-merged block
> --
>
> Key: SPARK-37675
> URL: https://issues.apache.org/jira/browse/SPARK-37675
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37675) Return PushMergedRemoteMetaFailedFetchResult if no available push-merged block

2022-01-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482145#comment-17482145
 ] 

Apache Spark commented on SPARK-37675:
--

User 'otterc' has created a pull request for this issue:
https://github.com/apache/spark/pull/35325

> Return PushMergedRemoteMetaFailedFetchResult if no available push-merged block
> --
>
> Key: SPARK-37675
> URL: https://issues.apache.org/jira/browse/SPARK-37675
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37980) Extend METADATA column to support row indices for file based data sources

2022-01-25 Thread Prakhar Jain (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482142#comment-17482142
 ] 

Prakhar Jain commented on SPARK-37980:
--

Yes - this needs implementation in the underlying data sources such as 
parquet/orc. Also, Spark uses the underlying ParquetRecordReader from parquet-mr 
to read a parquet file, and all the row-group skipping/column-index filtering 
happens inside parquet-mr, so I guess this will need row index support from 
parquet-mr. The other option is to replicate some of the parquet-mr 
RecordReader code in Spark - which is not ideal.

> Extend METADATA column to support row indices for file based data sources
> -
>
> Key: SPARK-37980
> URL: https://issues.apache.org/jira/browse/SPARK-37980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3
>Reporter: Prakhar Jain
>Priority: Major
>
> Spark recently added hidden metadata column support for File based 
> datasources as part of  SPARK-37273.
> We should extend it to support ROW_INDEX/ROW_POSITION also.
>  
> Meaning of  ROW_POSITION:
> ROW_INDEX/ROW_POSITION is basically the index of a row within a file; e.g. the 5th 
> row in the file will have ROW_INDEX 5.
>  
> Use cases: 
> Row indexes can be used in a variety of ways. A (fileName, rowIndex) tuple 
> uniquely identifies a row in a table. This information can be used to mark rows, 
> e.g. by an indexer.
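
A sketch of how the proposed field could be consumed: `_metadata.file_path` already 
exists from SPARK-37273, while `_metadata.row_index` is the hypothetical new field 
this issue asks for (the field name and the path below are assumptions):

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val df = spark.read.parquet("/tmp/example_table")   // path is only an example

df.select(
    col("*"),
    col("_metadata.file_path"),    // existing hidden metadata field
    col("_metadata.row_index"))    // proposed field: position of the row within its file
  .show(truncate = false)
{code}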



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38022) Use relativePath for K8s remote file test in BasicTestsSuite

2022-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38022:
--
Fix Version/s: (was: 3.2.2)

> Use relativePath for K8s remote file test in BasicTestsSuite
> 
>
> Key: SPARK-38022
> URL: https://issues.apache.org/jira/browse/SPARK-38022
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>
> *BEFORE*
> {code:java}
> $ build/sbt -Pkubernetes -Pkubernetes-integration-tests 
> -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17
>  -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test"
> ...
> [info] KubernetesSuite:
> ...
> [info] - Run SparkRemoteFileTest using a remote data file *** FAILED *** (3 
> minutes, 3 seconds)
> [info]   The code passed to eventually never returned normally. Attempted 190 
> times over 3.01226506667 minutes. Last failure message: false was not 
> true. (KubernetesSuite.scala:452)
> ... {code}
> *AFTER*
> {code:java}
> $ build/sbt -Pkubernetes -Pkubernetes-integration-tests 
> -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17
>  -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test"
> ...
> [info] KubernetesSuite:
> ...
> [info] - Run SparkRemoteFileTest using a remote data file (8 seconds, 608 
> milliseconds){code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38023) ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as finished

2022-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38023:
--
Fix Version/s: 3.2.2
   (was: 3.2.1)

> ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as 
> finished
> 
>
> Key: SPARK-38023
> URL: https://issues.apache.org/jira/browse/SPARK-38023
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.1.3, 3.2.1, 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38023) ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as finished

2022-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38023:
-

Assignee: Dongjoon Hyun

> ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as 
> finished
> 
>
> Key: SPARK-38023
> URL: https://issues.apache.org/jira/browse/SPARK-38023
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.1.3, 3.2.1, 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38023) ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as finished

2022-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38023.
---
Fix Version/s: 3.3.0
   3.2.1
   Resolution: Fixed

Issue resolved by pull request 35321
[https://github.com/apache/spark/pull/35321]

> ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as 
> finished
> 
>
> Key: SPARK-38023
> URL: https://issues.apache.org/jira/browse/SPARK-38023
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.1.3, 3.2.1, 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0, 3.2.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38027) Undefined link function causing error in GLM that uses Tweedie family

2022-01-25 Thread Evan Zamir (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482128#comment-17482128
 ] 

Evan Zamir commented on SPARK-38027:


Looking into this further, I think the issue arises when serializing the 
model, either when logging it or when persisting it to disk. From my logs:

2022-01-25 14:21:33,664 root ERROR An error occurred while calling 
o1538.toString.
: java.util.NoSuchElementException: Failed to find a default value for link
at 
org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756)
at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753)
at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41)
at org.apache.spark.ml.param.Params.$(params.scala:762)
at org.apache.spark.ml.param.Params.$$(params.scala:762)
at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41)
at 
org.apache.spark.ml.regression.GeneralizedLinearRegressionModel.toString(GeneralizedLinearRegression.scala:1117)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)


> Undefined link function causing error in GLM that uses Tweedie family
> -
>
> Key: SPARK-38027
> URL: https://issues.apache.org/jira/browse/SPARK-38027
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.1.2
> Environment: Running on Mac OS X Monterey
>Reporter: Evan Zamir
>Priority: Major
>  Labels: GLM, pyspark
>
> I am trying to use the GLM regression with a Tweedie distribution so I can 
> model insurance use cases. I have set up a very simple example adapted from 
> the docs:
> {code:python}
> def create_fake_losses_data(self):
> df = self._spark.createDataFrame([
> ("a", 100.0, 12, 1, Vectors.dense(0.0, 0.0)),
> ("b", 0.0, 12, 1, Vectors.dense(1.0, 2.0)),
> ("c", 0.0, 12, 1, Vectors.dense(0.0, 0.0)),
> ("d", 2000.0, 12, 1, Vectors.dense(1.0, 1.0)), ], ["user", 
> "label", "offset", "weight", "features"])
> logging.info(df.collect())
> setattr(self, 'fake_data', df)
> try:
> glr = GeneralizedLinearRegression(
> family="tweedie", variancePower=1.5, linkPower=-1, 
> offsetCol='offset')
> glr.setRegParam(0.3)
> model = glr.fit(df)
> logging.info(model)
> except Py4JJavaError as e:
> print(e)
> return self
> {code}
> This causes the following error:
> *py4j.protocol.Py4JJavaError: An error occurred while calling o99.toString.
> : java.util.NoSuchElementException: Failed to find a default value for link*
> at 
> org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756)
> at scala.Option.getOrElse(Option.scala:189)
> at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756)
> at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753)
> at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41)
> at org.apache.spark.ml.param.Params.$(params.scala:762)
> at org.apache.spark.ml.param.Params.$$(params.scala:762)
> at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41)
> at 
> org.apache.spark.ml.regression.GeneralizedLinearRegressionModel.toString(GeneralizedLinearRegression.scala:1117)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at 
> 

[jira] [Commented] (SPARK-37896) ConstantColumnVector: a column vector with same values

2022-01-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482117#comment-17482117
 ] 

Apache Spark commented on SPARK-37896:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/35324

> ConstantColumnVector: a column vector with same values
> --
>
> Key: SPARK-37896
> URL: https://issues.apache.org/jira/browse/SPARK-37896
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yaohua Zhao
>Assignee: Yaohua Zhao
>Priority: Major
> Fix For: 3.3.0
>
>
> Introduce a new column vector named `ConstantColumnVector`, it represents a 
> column vector where every row has the same constant value.
> It could help improve performance on hidden file metadata columnar file 
> format, since metadata fields for every row in each file are exactly the 
> same, we don't need to copy and keep multiple copies of data.
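
A tiny conceptual sketch of the idea (this is not Spark's actual class, just an 
illustration of why a constant column vector avoids per-row storage):

{code:java}
// Every row of the column returns the same value, so no per-row buffer is kept.
final class ConstantColumn[T](numRows: Int, constant: T) {
  def get(rowId: Int): T = {
    require(rowId >= 0 && rowId < numRows, s"row $rowId out of range")
    constant
  }
}

// Example: a hidden file-name metadata column that is identical for every row
// read from the same file.
val fileName = new ConstantColumn(numRows = 1000, constant = "part-00000.parquet")
println(fileName.get(0))     // part-00000.parquet
println(fileName.get(999))   // part-00000.parquet
{code}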



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38027) Undefined link function causing error in GLM that uses Tweedie family

2022-01-25 Thread Evan Zamir (Jira)
Evan Zamir created SPARK-38027:
--

 Summary: Undefined link function causing error in GLM that uses 
Tweedie family
 Key: SPARK-38027
 URL: https://issues.apache.org/jira/browse/SPARK-38027
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 3.1.2
 Environment: Running on Mac OS X Monterey
Reporter: Evan Zamir


I am trying to use the GLM regression with a Tweedie distribution so I can 
model insurance use cases. I have set up a very simple example adapted from the 
docs:


{code:python}
def create_fake_losses_data(self):
df = self._spark.createDataFrame([
("a", 100.0, 12, 1, Vectors.dense(0.0, 0.0)),
("b", 0.0, 12, 1, Vectors.dense(1.0, 2.0)),
("c", 0.0, 12, 1, Vectors.dense(0.0, 0.0)),
("d", 2000.0, 12, 1, Vectors.dense(1.0, 1.0)), ], ["user", "label", 
"offset", "weight", "features"])
logging.info(df.collect())
setattr(self, 'fake_data', df)
try:
glr = GeneralizedLinearRegression(
family="tweedie", variancePower=1.5, linkPower=-1, 
offsetCol='offset')
glr.setRegParam(0.3)
model = glr.fit(df)
logging.info(model)
except Py4JJavaError as e:
print(e)
return self
{code}

This causes the following error:

*py4j.protocol.Py4JJavaError: An error occurred while calling o99.toString.
: java.util.NoSuchElementException: Failed to find a default value for link*
at 
org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756)
at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753)
at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41)
at org.apache.spark.ml.param.Params.$(params.scala:762)
at org.apache.spark.ml.param.Params.$$(params.scala:762)
at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41)
at 
org.apache.spark.ml.regression.GeneralizedLinearRegressionModel.toString(GeneralizedLinearRegression.scala:1117)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)


I was under the assumption that the default value for link is None, if not 
defined otherwise.
 
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38026) Sorting in Executors summary table in Stages Page is broken

2022-01-25 Thread Thejdeep Gudivada (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482080#comment-17482080
 ] 

Thejdeep Gudivada commented on SPARK-38026:
---

Duplicate of https://issues.apache.org/jira/browse/SPARK-35087

 

> Sorting in Executors summary table in Stages Page is broken
> ---
>
> Key: SPARK-38026
> URL: https://issues.apache.org/jira/browse/SPARK-38026
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.2
>Reporter: Thejdeep Gudivada
>Priority: Major
> Attachments: image (5).png
>
>
> Sorting of certain columns in the Executors Summary table in the Stages Page 
> is broken as it ignores the size units in the field value.
> For example, shown in the attachment, sorting the Input Size / Records column 
> in a decreasing order shows the error.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38026) Sorting in Executors summary table in Stages Page is broken

2022-01-25 Thread Thejdeep Gudivada (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejdeep Gudivada resolved SPARK-38026.
---
Resolution: Duplicate

> Sorting in Executors summary table in Stages Page is broken
> ---
>
> Key: SPARK-38026
> URL: https://issues.apache.org/jira/browse/SPARK-38026
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.2
>Reporter: Thejdeep Gudivada
>Priority: Major
> Attachments: image (5).png
>
>
> Sorting of certain columns in the Executors Summary table in the Stages Page 
> is broken as it ignores the size units in the field value.
> For example, shown in the attachment, sorting the Input Size / Records column 
> in a decreasing order shows the error.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38026) Sorting in Executors summary table in Stages Page is broken

2022-01-25 Thread Thejdeep Gudivada (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejdeep Gudivada updated SPARK-38026:
--
Description: 
Sorting of certain columns in the Executors Summary table in the Stages Page is 
broken as it ignores the size units in the field value.

For example, as shown in the attachment, sorting the Input Size / Records column 
in decreasing order shows the error.

  was:
Sorting of certain columns in the Executors Summary table in the Stages Page is 
broken as it ignores the size units in the field value.

For example, sorting the Input Size / Records column in a decreasing order 
shows the error.

!image-2022-01-25-11-47-46-201.png!


> Sorting in Executors summary table in Stages Page is broken
> ---
>
> Key: SPARK-38026
> URL: https://issues.apache.org/jira/browse/SPARK-38026
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.2
>Reporter: Thejdeep Gudivada
>Priority: Major
> Attachments: image (5).png
>
>
> Sorting of certain columns in the Executors Summary table in the Stages Page 
> is broken as it ignores the size units in the field value.
> For example, shown in the attachment, sorting the Input Size / Records column 
> in a decreasing order shows the error.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38026) Sorting in Executors summary table in Stages Page is broken

2022-01-25 Thread Thejdeep Gudivada (Jira)
Thejdeep Gudivada created SPARK-38026:
-

 Summary: Sorting in Executors summary table in Stages Page is 
broken
 Key: SPARK-38026
 URL: https://issues.apache.org/jira/browse/SPARK-38026
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.1.2
Reporter: Thejdeep Gudivada
 Attachments: image (5).png

Sorting of certain columns in the Executors Summary table in the Stages Page is 
broken as it ignores the size units in the field value.

For example, sorting the Input Size / Records column in a decreasing order 
shows the error.

!image-2022-01-25-11-47-46-201.png!
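
To illustrate the problem, a small sketch of string-based versus byte-based ordering 
for rendered sizes; the unit strings and factors below are assumptions for the 
example, not the Web UI's actual formatting code:

{code:java}
// Rendered sizes as shown in the table vs. the byte counts a correct sort needs.
val units = Map("B" -> 1L, "KiB" -> 1024L, "MiB" -> 1024L * 1024, "GiB" -> 1024L * 1024 * 1024)

def toBytes(rendered: String): Long = {
  val Array(num, unit) = rendered.trim.split("\\s+")
  (num.toDouble * units(unit)).toLong
}

val shown = Seq("512.0 KiB", "8.1 MiB", "2.0 GiB", "900.0 B")
println(shown.sorted)          // lexicographic order: the bug reported here
println(shown.sortBy(toBytes)) // order by actual size: the expected behaviour
{code}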



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38026) Sorting in Executors summary table in Stages Page is broken

2022-01-25 Thread Thejdeep Gudivada (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejdeep Gudivada updated SPARK-38026:
--
Attachment: image (5).png

> Sorting in Executors summary table in Stages Page is broken
> ---
>
> Key: SPARK-38026
> URL: https://issues.apache.org/jira/browse/SPARK-38026
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.2
>Reporter: Thejdeep Gudivada
>Priority: Major
> Attachments: image (5).png
>
>
> Sorting of certain columns in the Executors Summary table in the Stages Page 
> is broken as it ignores the size units in the field value.
> For example, sorting the Input Size / Records column in a decreasing order 
> shows the error.
> !image-2022-01-25-11-47-46-201.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34372) Speculation results in broken CSV files in Amazon S3

2022-01-25 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482041#comment-17482041
 ] 

Attila Zsolt Piros commented on SPARK-34372:


hi [~daeheh]! Please look around here: 
https://spark.apache.org/docs/3.2.0/cloud-integration.html and switch to s3a.
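
For reference, a sketch of the kind of setup that page describes: switching the job 
to the `s3a://` connector and one of the S3A cloud committers. The exact properties 
depend on the Hadoop/Spark build (the `spark-hadoop-cloud` module must be on the 
classpath), so treat the values below as an illustrative starting point rather than 
a verified fix for this EMR setup:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-committer-example")
  // Pick an S3A committer ("directory", "partitioned" or "magic").
  .config("spark.hadoop.fs.s3a.committer.name", "directory")
  // Route Spark SQL writes through the cloud-aware commit protocol.
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()

// Note the s3a:// scheme instead of EMR's s3:// filesystem; the bucket name is made up.
spark.range(10).write.csv("s3a://my-bucket/example-output")
{code}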



> Speculation results in broken CSV files in Amazon S3
> 
>
> Key: SPARK-34372
> URL: https://issues.apache.org/jira/browse/SPARK-34372
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.7
> Environment: Amazon EMR with AMI version 5.32.0
>Reporter: Daehee Han
>Priority: Minor
>  Labels: csv, s3, spark, speculation
>
> Hi, we've been experiencing some rows getting corrupted while partitioned CSV 
> files were written to Amazon S3. Some records were found broken without any 
> error from Spark. Digging into the root cause, we found that Spark speculation 
> re-attempted a partition that was being uploaded slowly and ended up uploading only 
> a part of the partition, leaving broken data in S3.
> Here are the stack traces we've found. There are two executors involved - A: the 
> first executor, which tried to upload the file but took much longer than the 
> other executors (yet still succeeded), which made Spark speculation cut in and 
> kick off another executor, B. Executor B started to upload the file too, but 
> was interrupted during the upload (killed: another attempt succeeded), and 
> ended up uploading only a part of the whole file. You can see in the log that the 
> file executor A uploaded (8461990 bytes originally) was overwritten by 
> executor B (which uploaded only 3145728 bytes).
>  
> Executor A:
> {quote}21/01/28 17:22:21 INFO Executor: Running task 426.0 in stage 45.0 (TID 
> 13201) 
>  21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty 
> blocks including 10 local blocks and 460 remote blocks 
>  21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Started 46 remote 
> fetches in 18 ms 
>  21/01/28 17:22:21 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2 
>  21/01/28 17:22:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> true 
>  21/01/28 17:22:21 INFO DirectFileOutputCommitter: Direct Write: ENABLED 
>  21/01/28 17:22:21 INFO SQLConfCommitterProvider: Using output committer class
>  21/01/28 17:22:21 INFO  INFO CSEMultipartUploadOutputStream: close 
> closed:false 
> s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv
>  21/01/28 17:22:31 INFO DefaultMultipartUploadDispatcher: Completed multipart 
> upload of 1 parts 8461990 bytes 
>  21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: Finished uploading 
> \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. 
> Elapsed seconds: 10. 
>  21/01/28 17:22:31 INFO SparkHadoopMapRedUtil: No need to commit output of 
> task because needsTaskCommit=false: 
> attempt_20210128172219_0045_m_000426_13201 
>  21/01/28 17:22:31 INFO Executor: Finished task 426.0 in stage 45.0 (TID 
> 13201). 8782 bytes result sent to driver
> {quote}
> Executor B:
> {quote}21/01/28 17:22:31 INFO CoarseGrainedExecutorBackend: Got assigned task 
> 13245 21/01/28 17:22:31 INFO Executor: Running task 426.1 in stage 45.0 (TID 
> 13245) 
>  21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty 
> blocks including 11 local blocks and 459 remote blocks 
>  21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Started 46 remote 
> fetches in 2 ms 
>  21/01/28 17:22:31 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2 
>  21/01/28 17:22:31 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> true 
>  21/01/28 17:22:31 INFO DirectFileOutputCommitter: Direct Write: ENABLED 
>  21/01/28 17:22:31 INFO SQLConfCommitterProvider: Using output committer 
> class org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter 
>  21/01/28 17:22:31 INFO Executor: Executor is trying to kill task 426.1 in 
> stage 45.0 (TID 13245), reason: another attempt succeeded 
>  21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: close closed:false 
> s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv 
>  21/01/28 17:22:32 INFO DefaultMultipartUploadDispatcher: Completed multipart 
> upload of 1 parts 3145728 bytes 
>  21/01/28 17:22:32 INFO CSEMultipartUploadOutputStream: Finished uploading 
> \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. 
> Elapsed seconds: 0. 
>  21/01/28 17:22:32 ERROR Utils: Aborting task 
> com.univocity.parsers.common.TextWritingException: Error writing row. 
> Internal state 

[jira] [Commented] (SPARK-38025) Improve test suite ExternalCatalogSuite

2022-01-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17481992#comment-17481992
 ] 

Apache Spark commented on SPARK-38025:
--

User 'khalidmammadov' has created a pull request for this issue:
https://github.com/apache/spark/pull/35323

> Improve test suite ExternalCatalogSuite
> ---
>
> Key: SPARK-38025
> URL: https://issues.apache.org/jira/browse/SPARK-38025
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.3
>Reporter: Khalid Mammadov
>Priority: Minor
>
> Test suite *ExternalCatalogSuite.scala* can be optimized by removing 
> repetitive code and replacing it with an already available utility function, 
> with some minor changes. This will reduce redundant code, simplify the suite, 
> and improve readability.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38025) Improve test suite ExternalCatalogSuite

2022-01-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38025:


Assignee: Apache Spark

> Improve test suite ExternalCatalogSuite
> ---
>
> Key: SPARK-38025
> URL: https://issues.apache.org/jira/browse/SPARK-38025
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.3
>Reporter: Khalid Mammadov
>Assignee: Apache Spark
>Priority: Minor
>
> Test suite *ExternalCatalogSuite.scala* can be optimized by removing 
> repetitive code and replacing it with an already available utility function, 
> with some minor changes. This will reduce redundant code, simplify the suite, 
> and improve readability.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38025) Improve test suite ExternalCatalogSuite

2022-01-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38025:


Assignee: (was: Apache Spark)

> Improve test suite ExternalCatalogSuite
> ---
>
> Key: SPARK-38025
> URL: https://issues.apache.org/jira/browse/SPARK-38025
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.3
>Reporter: Khalid Mammadov
>Priority: Minor
>
> Test suite *ExternalCatalogSuite.scala* can be optimized by removing 
> repetitive code and replacing it with an already available utility function, 
> with some minor changes. This will reduce redundant code, simplify the suite, 
> and improve readability.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38022) Use relativePath for K8s remote file test in BasicTestsSuite

2022-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38022:
--
Fix Version/s: 3.2.2
   (was: 3.2.1)

> Use relativePath for K8s remote file test in BasicTestsSuite
> 
>
> Key: SPARK-38022
> URL: https://issues.apache.org/jira/browse/SPARK-38022
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>
> *BEFORE*
> {code:java}
> $ build/sbt -Pkubernetes -Pkubernetes-integration-tests 
> -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17
>  -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test"
> ...
> [info] KubernetesSuite:
> ...
> [info] - Run SparkRemoteFileTest using a remote data file *** FAILED *** (3 
> minutes, 3 seconds)
> [info]   The code passed to eventually never returned normally. Attempted 190 
> times over 3.01226506667 minutes. Last failure message: false was not 
> true. (KubernetesSuite.scala:452)
> ... {code}
> *AFTER*
> {code:java}
> $ build/sbt -Pkubernetes -Pkubernetes-integration-tests 
> -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17
>  -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test"
> ...
> [info] KubernetesSuite:
> ...
> [info] - Run SparkRemoteFileTest using a remote data file (8 seconds, 608 
> milliseconds){code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38022) Use relativePath for K8s remote file test in BasicTestsSuite

2022-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38022.
---
Fix Version/s: 3.3.0
   3.2.1
   Resolution: Fixed

Issue resolved by pull request 35318
[https://github.com/apache/spark/pull/35318]

> Use relativePath for K8s remote file test in BasicTestsSuite
> 
>
> Key: SPARK-38022
> URL: https://issues.apache.org/jira/browse/SPARK-38022
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0, 3.2.1
>
>
> *BEFORE*
> {code:java}
> $ build/sbt -Pkubernetes -Pkubernetes-integration-tests 
> -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17
>  -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test"
> ...
> [info] KubernetesSuite:
> ...
> [info] - Run SparkRemoteFileTest using a remote data file *** FAILED *** (3 
> minutes, 3 seconds)
> [info]   The code passed to eventually never returned normally. Attempted 190 
> times over 3.01226506667 minutes. Last failure message: false was not 
> true. (KubernetesSuite.scala:452)
> ... {code}
> *AFTER*
> {code:java}
> $ build/sbt -Pkubernetes -Pkubernetes-integration-tests 
> -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17
>  -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test"
> ...
> [info] KubernetesSuite:
> ...
> [info] - Run SparkRemoteFileTest using a remote data file (8 seconds, 608 
> milliseconds){code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38025) Improve test suite ExternalCatalogSuite

2022-01-25 Thread Khalid Mammadov (Jira)
Khalid Mammadov created SPARK-38025:
---

 Summary: Improve test suite ExternalCatalogSuite
 Key: SPARK-38025
 URL: https://issues.apache.org/jira/browse/SPARK-38025
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.3
Reporter: Khalid Mammadov


Test suite *ExternalCatalogSuite.scala* can be optimized by removing repetitive 
code and replacing it with an already available utility function, with some 
minor changes. This will reduce redundant code, simplify the suite, and improve 
readability.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38024) add support for INFORMATION_SCHEMA or other catalog variant

2022-01-25 Thread Stephen Wilcoxon (Jira)
Stephen Wilcoxon created SPARK-38024:


 Summary: add support for INFORMATION_SCHEMA or other catalog 
variant
 Key: SPARK-38024
 URL: https://issues.apache.org/jira/browse/SPARK-38024
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Stephen Wilcoxon


The ability to query metadata from SQL can be extremely useful. There are ways 
to get at the metadata via Python/Scala/other APIs, but not from within SQL.

Given that this is a widely adopted part of SQL92, it seems like a major 
omission that it is not supported in Spark.
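
For context, a hedged sketch of what is available today versus what this ticket 
asks for; the database and table names below are placeholders:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Available today: catalog commands and the Catalog API, outside pure SQL.
spark.sql("SHOW TABLES IN default").show()
spark.sql("DESCRIBE TABLE default.some_table").show()      # placeholder table
for col in spark.catalog.listColumns("some_table", dbName="default"):
    print(col.name, col.dataType)

# What the ticket asks for - not supported by Spark SQL at the time of writing:
# spark.sql("SELECT table_name, column_name FROM information_schema.columns")
{code}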



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16452) basic INFORMATION_SCHEMA support

2022-01-25 Thread Stephen Wilcoxon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17481935#comment-17481935
 ] 

Stephen Wilcoxon commented on SPARK-16452:
--

When will this be reexamined? The ability to query the metadata (from SQL) can 
be extremely useful. There are ways to get at the metadata via 
Python/Scala/other APIs, but not from within Spark SQL.

> basic INFORMATION_SCHEMA support
> 
>
> Key: SPARK-16452
> URL: https://issues.apache.org/jira/browse/SPARK-16452
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Major
>  Labels: bulk-closed
> Attachments: INFORMATION_SCHEMAsupport.pdf
>
>
> INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a 
> few tables as defined in SQL92 standard to Spark SQL.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.

2022-01-25 Thread Saikrishna Pujari (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17481887#comment-17481887
 ] 

Saikrishna Pujari commented on SPARK-38004:
---

[~itholic] I suppose we are going to address this as a documentation 
improvement, adding a note that columns which differ only in case are considered 
different columns and lead to ambiguity issues. Same-case duplicate columns will 
be handled as part of mangle_dupe_cols().
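
A minimal sketch of the problem and one possible workaround, assuming a workbook 
data.xlsx whose sheet has two columns that differ only in case (Col and cOL); 
the file name and the renaming scheme are illustrative, not what 
mangle_dupe_cols itself does:

{code:python}
import pandas as pd
import pyspark.pandas as ps

# Read with plain pandas first; "data.xlsx" is a placeholder workbook whose
# sheet has the columns Col and cOL.
pdf = pd.read_excel("data.xlsx", sheet_name="Sheet")

# Spark SQL is case-insensitive by default (spark.sql.caseSensitive=false), so
# Col and cOL collide once the frame is backed by Spark. Deduplicate names that
# differ only in case before handing the frame to pandas-on-Spark.
seen = {}
deduped = []
for name in pdf.columns:
    key = name.lower()
    count = seen.get(key, 0)
    deduped.append(name if count == 0 else f"{name}.{count}")
    seen[key] = count + 1
pdf.columns = deduped

psdf = ps.from_pandas(pdf)   # no ambiguous-reference error now
print(psdf.columns)
{code}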

> read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns 
> but fails if the duplicate columns are case sensitive.
> 
>
> Key: SPARK-38004
> URL: https://issues.apache.org/jira/browse/SPARK-38004
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Saikrishna Pujari
>Priority: Minor
>
> mangle_dupe_cols - the default is True.
> So ideally it should handle duplicate columns, but if the columns differ only 
> in case it fails as below.
> AnalysisException: Reference '{{Sheet.col}}' is ambiguous, could be 
> Sheet.col, Sheet.col.
> The two columns are Col and cOL.
> The best practices guide mentions not to use case-sensitive 
> columns: 
> [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names]
> Either the docs for read_excel/mangle_dupe_cols have to be updated to mention 
> this, or it has to be handled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.

2022-01-25 Thread Saikrishna Pujari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saikrishna Pujari updated SPARK-38004:
--
Description: 
mangle_dupe_cols - the default is True.
So ideally it should handle duplicate columns, but if the columns differ only in 
case it fails as below.

AnalysisException: Reference '{{Sheet.col}}' is ambiguous, could be 
Sheet.col, Sheet.col.

The two columns are Col and cOL.

The best practices guide mentions not to use case-sensitive columns: 
[https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names]

Either the docs for read_excel/mangle_dupe_cols have to be updated to mention 
this, or it has to be handled.

  was:
mangle_dupe_cols - the default is True.
So ideally it should handle duplicate columns, but if the columns differ only in 
case it fails as below.

AnalysisException: Reference '{{Sheet.col1}}' is ambiguous, could be 
Sheet.col1, Sheet.col1.

The two columns are Col and cOL.

The best practices guide mentions not to use case-sensitive columns: 
[https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names]

Either the docs for read_excel/mangle_dupe_cols have to be updated to mention 
this, or it has to be handled.


> read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns 
> but fails if the duplicate columns are case sensitive.
> 
>
> Key: SPARK-38004
> URL: https://issues.apache.org/jira/browse/SPARK-38004
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Saikrishna Pujari
>Priority: Minor
>
> mangle_dupe_cols - the default is True.
> So ideally it should handle duplicate columns, but if the columns differ only 
> in case it fails as below.
> AnalysisException: Reference '{{Sheet.col}}' is ambiguous, could be 
> Sheet.col, Sheet.col.
> The two columns are Col and cOL.
> The best practices guide mentions not to use case-sensitive 
> columns: 
> [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names]
> Either the docs for read_excel/mangle_dupe_cols have to be updated to mention 
> this, or it has to be handled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.

2022-01-25 Thread Saikrishna Pujari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saikrishna Pujari updated SPARK-38004:
--
Issue Type: Documentation  (was: Bug)

> read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns 
> but fails if the duplicate columns are case sensitive.
> 
>
> Key: SPARK-38004
> URL: https://issues.apache.org/jira/browse/SPARK-38004
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Saikrishna Pujari
>Priority: Minor
>
> mangle_dupe_cols - the default is True.
> So ideally it should handle duplicate columns, but if the columns differ only 
> in case it fails as below.
> AnalysisException: Reference '{{Sheet.col1}}' is ambiguous, could be 
> Sheet.col1, Sheet.col1.
> The two columns are Col and cOL.
> The best practices guide mentions not to use case-sensitive 
> columns: 
> [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names]
> Either the docs for read_excel/mangle_dupe_cols have to be updated to mention 
> this, or it has to be handled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.

2022-01-25 Thread Saikrishna Pujari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saikrishna Pujari updated SPARK-38004:
--
Priority: Minor  (was: Major)

> read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns 
> but fails if the duplicate columns are case sensitive.
> 
>
> Key: SPARK-38004
> URL: https://issues.apache.org/jira/browse/SPARK-38004
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Saikrishna Pujari
>Priority: Minor
>
> mangle_dupe_cols - the default is True.
> So ideally it should handle duplicate columns, but if the columns differ only 
> in case it fails as below.
> AnalysisException: Reference '{{Sheet.col1}}' is ambiguous, could be 
> Sheet.col1, Sheet.col1.
> The two columns are Col and cOL.
> The best practices guide mentions not to use case-sensitive 
> columns: 
> [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names]
> Either the docs for read_excel/mangle_dupe_cols have to be updated to mention 
> this, or it has to be handled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37479) Migrate DROP NAMESPACE to use V2 command by default

2022-01-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37479.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35202
[https://github.com/apache/spark/pull/35202]

> Migrate DROP NAMESPACE to use V2 command by default
> ---
>
> Key: SPARK-37479
> URL: https://issues.apache.org/jira/browse/SPARK-37479
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37479) Migrate DROP NAMESPACE to use V2 command by default

2022-01-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37479:
---

Assignee: dch nguyen

> Migrate DROP NAMESPACE to use V2 command by default
> ---
>
> Key: SPARK-37479
> URL: https://issues.apache.org/jira/browse/SPARK-37479
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37999) Spark executor self-exiting due to driver disassociated in Kubernetes

2022-01-25 Thread Petri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Petri updated SPARK-37999:
--
Description: 
I have a Spark driver running in a Kubernetes pod with client deploy-mode. I 
have created a headless K8s Service named 'lola' at port 7077 which targets the 
driver pod.
The driver pod launches successfully and tries to start an executor, but 
eventually the executor fails with the error:
{code:java}
Executor self-exiting due to : Driver lola.mni-system:7077 disassociated! 
Shutting down.{code}
The driver then stays up and running and attempts to start another executor, 
which fails with the same error, and this goes on and on, with the driver 
spawning new failing executors.
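
For reference, a minimal sketch of the client-mode networking settings that 
usually have to line up with such a headless Service; the service name 
lola.mni-system and port 7077 come from the setup above, everything else is 
illustrative:

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc")
    .config("spark.kubernetes.namespace", "mni-system")
    # Executors dial back to this address, so it must resolve to the driver
    # pod from the executor pods (the headless Service in front of the driver).
    .config("spark.driver.host", "lola.mni-system")
    .config("spark.driver.port", "7077")            # must match the Service port
    .config("spark.driver.bindAddress", "0.0.0.0")  # bind inside the pod itself
    .getOrCreate()
)
{code}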

In the driver pod, I see only the following errors (when using 'grep ERROR'):
{code:java}
22/01/24 13:41:12 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.82.105:
22/01/24 13:41:56 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.82.106:
22/01/24 13:42:12 ERROR TaskSchedulerImpl: Lost executor 7 on 192.168.47.80: 
The executor with ID 7 (registered at 1643031697505 ms) was not found in the 
cluster at the polling time (1643031731509 ms) which is after the accepted 
detect delta time (3 ms) configured by 
`spark.kubernetes.executor.missingPodDetectDelta`. The executor may have been 
deleted but the driver missed the deletion event. Marking this executor as 
failed.
22/01/24 13:42:38 ERROR TaskSchedulerImpl: Lost executor 3 on 192.168.82.103:
22/01/24 13:45:30 ERROR TaskSchedulerImpl: Lost executor 4 on 
192.168.50.220:{code}
 

Full log from the executor:
{code:java}
+ source /opt/spark/bin/common.sh
+ cp /etc/group /tmp/group
+ cp /etc/passwd /tmp/passwd
++ id -u
+ myuid=1501
++ id -g
+ mygid=0
+ myuname=cspk
+ fsgid=
+ fsgrpname=cspk
+ set +e
++ getent passwd 1501
+ uidentry=
++ cat /etc/machine-id
cat: /etc/machine-id: No such file or directory
+ export SYSTEMID=
+ SYSTEMID=
+ set -e
+ '[' -z '' ']'
+ '[' -w /tmp/group ']'
+ echo cspk:x::
+ cp /etc/passwd /tmp/passwd.template
+ '[' -z '' ']'
+ '[' -w /tmp/passwd.template ']'
+ echo 'cspk:x:1501:0:anonymous uid:/opt/spark:/bin/false'
+ envsubst
+ export LD_PRELOAD=/usr/lib64/libnss_wrapper.so
+ LD_PRELOAD=/usr/lib64/libnss_wrapper.so
+ export NSS_WRAPPER_PASSWD=/tmp/passwd
+ NSS_WRAPPER_PASSWD=/tmp/passwd
+ export NSS_WRAPPER_GROUP=/tmp/group
+ NSS_WRAPPER_GROUP=/tmp/group
+ SPARK_K8S_CMD=executor
+ case "$SPARK_K8S_CMD" in
+ shift 1
+ SPARK_CLASSPATH='/var/local/streaming_engine/*:/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ env
+ sort -t_ -k4 -n
+ grep SPARK_AUTH_OPT_
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_AUTH_OPTS
+ env
+ grep SPARK_NET_CRYPTO_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_NET_CRYPTO_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ set +x
TLS Not enabled for WebServer
+ CMD=(${JAVA_HOME}/bin/java $EXTRAJAVAOPTS "${SPARK_EXECUTOR_JAVA_OPTS[@]}" 
"${SPARK_AUTH_OPTS[@]}" "${SPARK_NET_CRYPTO_OPTS[@]}" 
-Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH" 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
$SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores 
$SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname 
$SPARK_EXECUTOR_POD_IP)
+ exec /usr/bin/tini -s -- /etc/alternatives/jre_openjdk//bin/java 
-Dcom.nokia.rtna.jmx1= -Dcom.nokia.rtna.jmx2=10100 
-Dlog4j.configurationFile=http://192.168.80.89:/log4j2.xml 
-Dlog4j.configuration=http://192.168.80.89:/log4j2.xml 
-Dcom.nokia.rtna.app=LolaStreamingApp -Dspark.driver.port=7077 -Xms4096m 
-Xmx4096m -cp '/var/local/streaming_engine/*:/opt/spark/jars/*' 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://coarsegrainedschedu...@lola.mni-system:7077 --executor-id 10 --cores 3 
--app-id spark-application-1643031611044 --hostname 192.168.82.121
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/var/local/streaming_engine/log4j-slf4j-impl-2.13.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
(file:/var/local/streaming_engine/spark-unsafe_2.12-3.1.2.jar) to constructor 
java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of 
org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
{"type":"log", 

[jira] [Updated] (SPARK-37999) Spark executor self-exiting due to driver disassociated in Kubernetes

2022-01-25 Thread Petri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Petri updated SPARK-37999:
--
Description: 
I have a Spark driver running in a Kubernetes pod with client deploy-mode. I 
have created a headless K8s Service named 'lola' at port 7077 which targets the 
driver pod.
The driver pod launches successfully and tries to start an executor, but 
eventually the executor fails with the error:
{code:java}
Executor self-exiting due to : Driver lola.mni-system:7077 disassociated! 
Shutting down.{code}
The driver then stays up and running and attempts to start another executor, 
which fails with the same error, and this goes on and on, with the driver 
spawning new failing executors.

In the driver pod, I see only the following errors (when using 'grep ERROR'):
{code:java}
22/01/24 13:41:12 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.82.105:
22/01/24 13:41:56 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.82.106:
22/01/24 13:42:12 ERROR TaskSchedulerImpl: Lost executor 7 on 192.168.47.80: 
The executor with ID 7 (registered at 1643031697505 ms) was not found in the 
cluster at the polling time (1643031731509 ms) which is after the accepted 
detect delta time (3 ms) configured by 
`spark.kubernetes.executor.missingPodDetectDelta`. The executor may have been 
deleted but the driver missed the deletion event. Marking this executor as 
failed.
22/01/24 13:42:38 ERROR TaskSchedulerImpl: Lost executor 3 on 192.168.82.103:
22/01/24 13:45:30 ERROR TaskSchedulerImpl: Lost executor 4 on 
192.168.50.220:{code}
 

Full log from the executor:
{code:java}
+ source /opt/spark/bin/common.sh
+ cp /etc/group /tmp/group
+ cp /etc/passwd /tmp/passwd
++ id -u
+ myuid=1501
++ id -g
+ mygid=0
+ myuname=cspk
+ fsgid=
+ fsgrpname=cspk
+ set +e
++ getent passwd 1501
+ uidentry=
++ cat /etc/machine-id
cat: /etc/machine-id: No such file or directory
+ export SYSTEMID=
+ SYSTEMID=
+ set -e
+ '[' -z '' ']'
+ '[' -w /tmp/group ']'
+ echo cspk:x::
+ cp /etc/passwd /tmp/passwd.template
+ '[' -z '' ']'
+ '[' -w /tmp/passwd.template ']'
+ echo 'cspk:x:1501:0:anonymous uid:/opt/spark:/bin/false'
+ envsubst
+ export LD_PRELOAD=/usr/lib64/libnss_wrapper.so
+ LD_PRELOAD=/usr/lib64/libnss_wrapper.so
+ export NSS_WRAPPER_PASSWD=/tmp/passwd
+ NSS_WRAPPER_PASSWD=/tmp/passwd
+ export NSS_WRAPPER_GROUP=/tmp/group
+ NSS_WRAPPER_GROUP=/tmp/group
+ SPARK_K8S_CMD=executor
+ case "$SPARK_K8S_CMD" in
+ shift 1
+ SPARK_CLASSPATH='/var/local/streaming_engine/*:/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ env
+ sort -t_ -k4 -n
+ grep SPARK_AUTH_OPT_
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_AUTH_OPTS
+ env
+ grep SPARK_NET_CRYPTO_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_NET_CRYPTO_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ set +x
TLS Not enabled for WebServer
+ CMD=(${JAVA_HOME}/bin/java $EXTRAJAVAOPTS "${SPARK_EXECUTOR_JAVA_OPTS[@]}" 
"${SPARK_AUTH_OPTS[@]}" "${SPARK_NET_CRYPTO_OPTS[@]}" 
-Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH" 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
$SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores 
$SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname 
$SPARK_EXECUTOR_POD_IP)
+ exec /usr/bin/tini -s -- /etc/alternatives/jre_openjdk//bin/java 
-Dcom.nokia.rtna.jmx1= -Dcom.nokia.rtna.jmx2=10100 
-Dlog4j.configurationFile=http://192.168.80.89:/log4j2.xml 
-Dlog4j.configuration=http://192.168.80.89:/log4j2.xml 
-Dcom.nokia.rtna.app=LolaStreamingApp -Dspark.driver.port=7077 -Xms4096m 
-Xmx4096m -cp '/var/local/streaming_engine/*:/opt/spark/jars/*' 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://coarsegrainedschedu...@lola.mni-system:7077 --executor-id 10 --cores 3 
--app-id spark-application-1643031611044 --hostname 192.168.82.121
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/var/local/streaming_engine/log4j-slf4j-impl-2.13.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
(file:/var/local/streaming_engine/spark-unsafe_2.12-3.1.2.jar) to constructor 
java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of 
org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
{"type":"log", 

[jira] [Assigned] (SPARK-38023) ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as finished

2022-01-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38023:


Assignee: (was: Apache Spark)

> ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as 
> finished
> 
>
> Key: SPARK-38023
> URL: https://issues.apache.org/jira/browse/SPARK-38023
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.1.3, 3.2.1, 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


