[jira] [Updated] (SPARK-42929) make mapInPandas / mapInArrow support "is_barrier"

2024-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-42929:
---
Labels: pull-request-available  (was: )

> make mapInPandas / mapInArrow support "is_barrier"
> --
>
> Key: SPARK-42929
> URL: https://issues.apache.org/jira/browse/SPARK-42929
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> make mapInPandas / mapInArrow support "is_barrier"
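
A minimal, hypothetical sketch (not taken from the ticket) of what the Python-side call is expected to look like once the barrier flag is supported: it assumes Spark 3.5+ and that the `barrier` keyword argument is the Python counterpart of the "is_barrier" flag in the title.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def plus_one(batches):
    # receives an iterator of pandas DataFrames and yields transformed batches
    for pdf in batches:
        yield pdf.assign(v=pdf.v + 1)

df = spark.range(10).toDF("v")

# barrier=True asks Spark to schedule all tasks of this stage together (barrier mode)
df.mapInPandas(plus_one, schema="v long", barrier=True).show()
{code}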



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-47194) Upgrade log4j2 to 2.23.0

2024-02-27 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821543#comment-17821543
 ] 

Yang Jie edited comment on SPARK-47194 at 2/28/24 7:19 AM:
---

It seems that the `-Dlog4j2.debug` option may not be working in 2.23.0, so perhaps 
we should skip this upgrade. I have tested the following steps:

1. run `dev/make-distribution.sh --tgz` to build a Spark client
2. add `log4j2.properties` and `spark-defaults.conf` with the same content as the 
test case `Verify logging configuration is picked from the provided 
SPARK_CONF_DIR/log4j2.properties`

```
log4j2.properties 
 # This log4j config file is for integration test SparkConfPropagateSuite.
rootLogger.level = debug
rootLogger.appenderRef.stdout.ref = console

appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d\{HH:mm:ss.SSS} %p %c: %maxLen\{%m}{512}%n%ex\{8}%n
```

```
spark-defaults.conf

spark.driver.extraJavaOptions -Dlog4j2.debug
spark.executor.extraJavaOptions -Dlog4j2.debug
spark.kubernetes.executor.deleteOnTermination false
```

3. run `bin/run-example SparkPi`

When using log4j 2.22.1, we get the following log output:

```
...

TRACE StatusLogger DefaultConfiguration cleaning Appenders from 1 LoggerConfigs.
DEBUG StatusLogger Stopped 
org.apache.logging.log4j.core.config.DefaultConfiguration@384ad17b OK
TRACE StatusLogger Reregistering MBeans after reconfigure. 
Selector=org.apache.logging.log4j.core.selector.ClassLoaderContextSelector@5852c06f
TRACE StatusLogger Reregistering context (1/1): '5ffd2b27' 
org.apache.logging.log4j.core.LoggerContext@31190526
TRACE StatusLogger Unregistering but no MBeans found matching 
'org.apache.logging.log4j2:type=5ffd2b27'
TRACE StatusLogger Unregistering but no MBeans found matching 
'org.apache.logging.log4j2:type=5ffd2b27,component=StatusLogger'
TRACE StatusLogger Unregistering but no MBeans found matching 
'org.apache.logging.log4j2:type=5ffd2b27,component=ContextSelector'
TRACE StatusLogger Unregistering but no MBeans found matching 
'org.apache.logging.log4j2:type=5ffd2b27,component=Loggers,name=*'
TRACE StatusLogger Unregistering but no MBeans found matching 
'org.apache.logging.log4j2:type=5ffd2b27,component=Appenders,name=*'
TRACE StatusLogger Unregistering but no MBeans found matching 
'org.apache.logging.log4j2:type=5ffd2b27,component=AsyncAppenders,name=*'
TRACE StatusLogger Unregistering but no MBeans found matching 
'org.apache.logging.log4j2:type=5ffd2b27,component=AsyncLoggerRingBuffer'
TRACE StatusLogger Unregistering but no MBeans found matching 
'org.apache.logging.log4j2:type=5ffd2b27,component=Loggers,name=*,subtype=RingBuffer'
DEBUG StatusLogger Registering MBean org.apache.logging.log4j2:type=5ffd2b27
DEBUG StatusLogger Registering MBean 
org.apache.logging.log4j2:type=5ffd2b27,component=StatusLogger
DEBUG StatusLogger Registering MBean 
org.apache.logging.log4j2:type=5ffd2b27,component=ContextSelector
DEBUG StatusLogger Registering MBean 
org.apache.logging.log4j2:type=5ffd2b27,component=Loggers,name=
DEBUG StatusLogger Registering MBean 
org.apache.logging.log4j2:type=5ffd2b27,component=Appenders,name=console
TRACE StatusLogger Using default SystemClock for timestamps.
DEBUG StatusLogger org.apache.logging.log4j.core.util.SystemClock supports 
precise timestamps.
TRACE StatusLogger Using DummyNanoClock for nanosecond timestamps.
DEBUG StatusLogger Reconfiguration complete for context[name=5ffd2b27] at URI 
/Users/yangjie01/Tools/4.0/spark-4.0.0-SNAPSHOT-bin-3.3.6/conf/log4j2.properties
 (org.apache.logging.log4j.core.LoggerContext@31190526) with optional 
ClassLoader: null
DEBUG StatusLogger Shutdown hook enabled. Registering a new one.
...
```

But when using log4j 2.23.0, no logs related to `StatusLogger` are printed. 

So let's skip this upgrade.


was (Author: luciferyang):
It seems that the `-Dlog4j2.debug` option may not be working in 2.23.0, so perhaps 
we should skip this upgrade. I have tested the following steps:

1. run `dev/make-distribution.sh --tgz` to build a Spark client
2. add `log4j2.properties` and `spark-defaults.conf` with the same content as the 
test case `Verify logging configuration is picked from the provided 
SPARK_CONF_DIR/log4j2.properties`

```
log4j2.properties 

# This log4j config file is for integration test SparkConfPropagateSuite.
rootLogger.level = debug
rootLogger.appenderRef.stdout.ref = console

appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d\{HH:mm:ss.SSS} %p %c: 
%maxLen\{%m}{512}%n%ex\{8}%n
```

```
spark-defaults.conf

spark.driver.extraJavaOptions -Dlog4j2.debug
spark.executor.extraJavaOptions -Dlog4j2.debug
spark.kubernetes.executor.deleteOnTermination false

[jira] [Resolved] (SPARK-47194) Upgrade log4j2 to 2.23.0

2024-02-27 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-47194.
--
Resolution: Won't Fix

It seems that the `-Dlog4j2.debug` option may not be working in 2.23.0, so perhaps 
we should skip this upgrade. I have tested the following steps:

1. run `dev/make-distribution.sh --tgz` to build a Spark client
2. add `log4j2.properties` and `spark-defaults.conf` with the same content as the 
test case `Verify logging configuration is picked from the provided 
SPARK_CONF_DIR/log4j2.properties`

```
log4j2.properties 

# This log4j config file is for integration test SparkConfPropagateSuite.
rootLogger.level = debug
rootLogger.appenderRef.stdout.ref = console

appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d\{HH:mm:ss.SSS} %p %c: 
%maxLen\{%m}{512}%n%ex\{8}%n
```

```
spark-defaults.conf

spark.driver.extraJavaOptions -Dlog4j2.debug
spark.executor.extraJavaOptions -Dlog4j2.debug
spark.kubernetes.executor.deleteOnTermination false
```

3. run `bin/run-example SparkPi`

When using log4j 2.22.1, we get the following log output:

```
...

TRACE StatusLogger DefaultConfiguration cleaning Appenders from 1 LoggerConfigs.
DEBUG StatusLogger Stopped 
org.apache.logging.log4j.core.config.DefaultConfiguration@384ad17b OK
TRACE StatusLogger Reregistering MBeans after reconfigure. 
Selector=org.apache.logging.log4j.core.selector.ClassLoaderContextSelector@5852c06f
TRACE StatusLogger Reregistering context (1/1): '5ffd2b27' 
org.apache.logging.log4j.core.LoggerContext@31190526
TRACE StatusLogger Unregistering but no MBeans found matching 
'org.apache.logging.log4j2:type=5ffd2b27'
TRACE StatusLogger Unregistering but no MBeans found matching 
'org.apache.logging.log4j2:type=5ffd2b27,component=StatusLogger'
TRACE StatusLogger Unregistering but no MBeans found matching 
'org.apache.logging.log4j2:type=5ffd2b27,component=ContextSelector'
TRACE StatusLogger Unregistering but no MBeans found matching 
'org.apache.logging.log4j2:type=5ffd2b27,component=Loggers,name=*'
TRACE StatusLogger Unregistering but no MBeans found matching 
'org.apache.logging.log4j2:type=5ffd2b27,component=Appenders,name=*'
TRACE StatusLogger Unregistering but no MBeans found matching 
'org.apache.logging.log4j2:type=5ffd2b27,component=AsyncAppenders,name=*'
TRACE StatusLogger Unregistering but no MBeans found matching 
'org.apache.logging.log4j2:type=5ffd2b27,component=AsyncLoggerRingBuffer'
TRACE StatusLogger Unregistering but no MBeans found matching 
'org.apache.logging.log4j2:type=5ffd2b27,component=Loggers,name=*,subtype=RingBuffer'
DEBUG StatusLogger Registering MBean org.apache.logging.log4j2:type=5ffd2b27
DEBUG StatusLogger Registering MBean 
org.apache.logging.log4j2:type=5ffd2b27,component=StatusLogger
DEBUG StatusLogger Registering MBean 
org.apache.logging.log4j2:type=5ffd2b27,component=ContextSelector
DEBUG StatusLogger Registering MBean 
org.apache.logging.log4j2:type=5ffd2b27,component=Loggers,name=
DEBUG StatusLogger Registering MBean 
org.apache.logging.log4j2:type=5ffd2b27,component=Appenders,name=console
TRACE StatusLogger Using default SystemClock for timestamps.
DEBUG StatusLogger org.apache.logging.log4j.core.util.SystemClock supports 
precise timestamps.
TRACE StatusLogger Using DummyNanoClock for nanosecond timestamps.
DEBUG StatusLogger Reconfiguration complete for context[name=5ffd2b27] at URI 
/Users/yangjie01/Tools/4.0/spark-4.0.0-SNAPSHOT-bin-3.3.6/conf/log4j2.properties
 (org.apache.logging.log4j.core.LoggerContext@31190526) with optional 
ClassLoader: null
DEBUG StatusLogger Shutdown hook enabled. Registering a new one.
...
```

But when using log4j 2.23.0, no logs related to `StatusLogger` are printed. 

cc @dongjoon-hyun 

> Upgrade log4j2 to 2.23.0
> 
>
> Key: SPARK-47194
> URL: https://issues.apache.org/jira/browse/SPARK-47194
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47199) Add prefix into TemporaryDirectory to avoid flakiness

2024-02-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47199:
-

Assignee: Hyukjin Kwon

> Add prefix into TemporaryDirectory to avoid flakiness
> -
>
> Key: SPARK-47199
> URL: https://issues.apache.org/jira/browse/SPARK-47199
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
>
> Sometimes the tests fail because the temporary directory names are the same
> (https://github.com/apache/spark/actions/runs/8066850485/job/22036007390).
> {code}
> File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line ?, in 
> pyspark.sql.dataframe.DataFrame.writeStream
> Failed example:
> with tempfile.TemporaryDirectory() as d:
> # Create a table with Rate source.
> df.writeStream.toTable(
> "my_table", checkpointLocation=d)
> Exception raised:
> Traceback (most recent call last):
>   File "/usr/lib/python3.11/doctest.py", line 1353, in __run
> exec(compile(example.source, filename, "single",
>   File "", line 
> 1, in 
> with tempfile.TemporaryDirectory() as d:
>   File "/usr/lib/python3.11/tempfile.py", line 1043, in __exit__
> self.cleanup()
>   File "/usr/lib/python3.11/tempfile.py", line 1047, in cleanup
> self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors)
>   File "/usr/lib/python3.11/tempfile.py", line 1029, in _rmtree
> _rmtree(name, onerror=onerror)
>   File "/usr/lib/python3.11/shutil.py", line 738, in rmtree
> onerror(os.rmdir, path, sys.exc_info())
>   File "/usr/lib/python3.11/shutil.py", line 736, in rmtree
> os.rmdir(path, dir_fd=dir_fd)
> OSError: [Errno 39] Directory not empty: 
> '/__w/spark/spark/python/target/4f062b09-213f-4ac2-a10a-2d704990141b/tmp29irqweq'
> {code}
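
A minimal sketch of the idea in the title, assuming the hypothetical prefix string used here (the prefix actually chosen in the pull request may differ): passing a recognizable `prefix` to `tempfile.TemporaryDirectory` gives each doctest its own clearly named directory and avoids clashes between concurrently running tests.

{code:python}
import tempfile

# Hypothetical prefix; the real prefix used by the PySpark tests may differ.
with tempfile.TemporaryDirectory(prefix="pyspark_writeStream_") as d:
    print(d)  # e.g. /tmp/pyspark_writeStream_ab12cd34, unique per invocation
{code}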



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47199) Add prefix into TemporaryDirectory to avoid flakiness

2024-02-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47199.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45298
[https://github.com/apache/spark/pull/45298]

> Add prefix into TemporaryDirectory to avoid flakiness
> -
>
> Key: SPARK-47199
> URL: https://issues.apache.org/jira/browse/SPARK-47199
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Sometimes the tests fail because the temporary directory names are the same
> (https://github.com/apache/spark/actions/runs/8066850485/job/22036007390).
> {code}
> File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line ?, in 
> pyspark.sql.dataframe.DataFrame.writeStream
> Failed example:
> with tempfile.TemporaryDirectory() as d:
> # Create a table with Rate source.
> df.writeStream.toTable(
> "my_table", checkpointLocation=d)
> Exception raised:
> Traceback (most recent call last):
>   File "/usr/lib/python3.11/doctest.py", line 1353, in __run
> exec(compile(example.source, filename, "single",
>   File "", line 
> 1, in 
> with tempfile.TemporaryDirectory() as d:
>   File "/usr/lib/python3.11/tempfile.py", line 1043, in __exit__
> self.cleanup()
>   File "/usr/lib/python3.11/tempfile.py", line 1047, in cleanup
> self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors)
>   File "/usr/lib/python3.11/tempfile.py", line 1029, in _rmtree
> _rmtree(name, onerror=onerror)
>   File "/usr/lib/python3.11/shutil.py", line 738, in rmtree
> onerror(os.rmdir, path, sys.exc_info())
>   File "/usr/lib/python3.11/shutil.py", line 736, in rmtree
> os.rmdir(path, dir_fd=dir_fd)
> OSError: [Errno 39] Directory not empty: 
> '/__w/spark/spark/python/target/4f062b09-213f-4ac2-a10a-2d704990141b/tmp29irqweq'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45200) Spark 3.4.0 always using default log4j profile

2024-02-27 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-45200.
--
Resolution: Not A Problem

> Spark 3.4.0 always using default log4j profile
> --
>
> Key: SPARK-45200
> URL: https://issues.apache.org/jira/browse/SPARK-45200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Jitin Dominic
>Priority: Major
>  Labels: pull-request-available
>
> I've been using Spark core 3.2.2 and was upgrading to 3.4.0.
> On execution of my Java code with 3.4.0, it generates an extra set of logs, but I 
> don't face this issue with 3.2.2.
>  
> I noticed that logs says _Using Spark's default log4j profile: 
> org/apache/spark/log4j2-defaults.properties._
>  
> Is this a bug, or is there a configuration to disable the use of the default 
> log4j profile?
> I didn't see anything in the documentation.
>  
> +*Output:*+
> {code:java}
> Using Spark's default log4j profile: 
> org/apache/spark/log4j2-defaults.properties
> 23/09/18 20:05:08 INFO SparkContext: Running Spark version 3.4.0
> 23/09/18 20:05:08 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 23/09/18 20:05:08 INFO ResourceUtils: 
> ==
> 23/09/18 20:05:08 INFO ResourceUtils: No custom resources configured for 
> spark.driver.
> 23/09/18 20:05:08 INFO ResourceUtils: 
> ==
> 23/09/18 20:05:08 INFO SparkContext: Submitted application: XYZ
> 23/09/18 20:05:08 INFO ResourceProfile: Default ResourceProfile created, 
> executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , 
> memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: 
> offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: 
> cpus, amount: 1.0)
> 23/09/18 20:05:08 INFO ResourceProfile: Limiting resource is cpu
> 23/09/18 20:05:08 INFO ResourceProfileManager: Added ResourceProfile id: 0
> 23/09/18 20:05:08 INFO SecurityManager: Changing view acls to: jd
> 23/09/18 20:05:08 INFO SecurityManager: Changing modify acls to: jd
> 23/09/18 20:05:08 INFO SecurityManager: Changing view acls groups to: 
> 23/09/18 20:05:08 INFO SecurityManager: Changing modify acls groups to: 
> 23/09/18 20:05:08 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: jd; groups with view 
> permissions: EMPTY; users with modify permissions: jd; groups with modify 
> permissions: EMPTY
> 23/09/18 20:05:08 INFO Utils: Successfully started service 'sparkDriver' on 
> port 39155.
> 23/09/18 20:05:08 INFO SparkEnv: Registering MapOutputTracker
> 23/09/18 20:05:08 INFO SparkEnv: Registering BlockManagerMaster
> 23/09/18 20:05:08 INFO BlockManagerMasterEndpoint: Using 
> org.apache.spark.storage.DefaultTopologyMapper for getting topology 
> information
> 23/09/18 20:05:08 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint 
> up
> 23/09/18 20:05:08 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
> 23/09/18 20:05:08 INFO DiskBlockManager: Created local directory at 
> /tmp/blockmgr-226d599c-1511-4fae-b0e7-aae81b684012
> 23/09/18 20:05:08 INFO MemoryStore: MemoryStore started with capacity 2004.6 
> MiB
> 23/09/18 20:05:08 INFO SparkEnv: Registering OutputCommitCoordinator
> 23/09/18 20:05:08 INFO JettyUtils: Start Jetty 0.0.0.0:4040 for SparkUI
> 23/09/18 20:05:09 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 23/09/18 20:05:09 INFO Executor: Starting executor ID driver on host jd
> 23/09/18 20:05:09 INFO Executor: Starting executor with user classpath 
> (userClassPathFirst = false): ''
> 23/09/18 20:05:09 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 32819.
> 23/09/18 20:05:09 INFO NettyBlockTransferService: Server created on jd:32819
> 23/09/18 20:05:09 INFO BlockManager: Using 
> org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
> policy
> 23/09/18 20:05:09 INFO BlockManagerMaster: Registering BlockManager 
> BlockManagerId(driver, jd, 32819, None)
> 23/09/18 20:05:09 INFO BlockManagerMasterEndpoint: Registering block manager 
> jd:32819 with 2004.6 MiB RAM, BlockManagerId(driver, jd, 32819, None)
> 23/09/18 20:05:09 INFO BlockManagerMaster: Registered BlockManager 
> BlockManagerId(driver, jd, 32819, None)
> 23/09/18 20:05:09 INFO BlockManager: Initialized BlockManager: 
> BlockManagerId(driver, jd, 32819, None)
> {code}
>  
>  
>  
> I'm using the Spark core dependency in one of my jars; the jar code contains the 
> following:
>  
> *+Code:+*
> {code:java}
> SparkSession
>           

[jira] [Resolved] (SPARK-46766) ZSTD Buffer Pool Support For AVRO datasource

2024-02-27 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-46766.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

resolved by https://github.com/apache/spark/pull/44792

> ZSTD Buffer Pool Support For AVRO datasource
> 
>
> Key: SPARK-46766
> URL: https://issues.apache.org/jira/browse/SPARK-46766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46766) ZSTD Buffer Pool Support For AVRO datasource

2024-02-27 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-46766:


Assignee: Kent Yao

> ZSTD Buffer Pool Support For AVRO datasource
> 
>
> Key: SPARK-46766
> URL: https://issues.apache.org/jira/browse/SPARK-46766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell

2024-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47197:
---
Labels: pull-request-available  (was: )

> Failed to connect HiveMetastore when using iceberg with HiveCatalog on 
> spark-sql or spark-shell
> ---
>
> Key: SPARK-47197
> URL: https://issues.apache.org/jira/browse/SPARK-47197
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 3.2.3, 3.5.1
>Reporter: YUBI LEE
>Priority: Major
>  Labels: pull-request-available
>
> I can't connect to a kerberized HiveMetastore when using Iceberg with 
> HiveCatalog on spark-sql or spark-shell.
> I think this issue is caused by the fact that there is no way to get 
> HIVE_DELEGATION_TOKEN when using spark-sql or spark-shell.
> ([https://github.com/apache/spark/blob/v3.5.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/security/HiveDelegationTokenProvider.scala#L78-L83])
>  
> {code:java}
>     val currentToken = 
> UserGroupInformation.getCurrentUser().getCredentials().getToken(tokenAlias)
>     currentToken == null && UserGroupInformation.isSecurityEnabled &&
>       hiveConf(hadoopConf).getTrimmed("hive.metastore.uris", "").nonEmpty &&
>       (SparkHadoopUtil.get.isProxyUser(UserGroupInformation.getCurrentUser()) 
> ||
>         (!Utils.isClientMode(sparkConf) && !sparkConf.contains(KEYTAB))) 
> {code}
> There should be a way to force getting HIVE_DELEGATION_TOKEN even when using 
> spark-sql or spark-shell.
> A possible way is to get HIVE_DELEGATION_TOKEN when the configuration below is 
> set:
> {code:java}
> spark.security.credentials.hive.enabled   true {code}
>  
> {code:java}
> 24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) 
> (machine1.example.com executor 2): 
> org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive 
> Metastore
> ...
> Caused by: MetaException(message:Could not connect to meta store using any of 
> the URIs provided. Most recent failure: 
> org.apache.thrift.transport.TTransportException: GSS initiate failed {code}
>  
>  
> {code:java}
> spark-sql> select * from temp.test_hive_catalog;
> ...
> ...
> 24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) 
> (machine1.example.com executor 2): 
> org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive 
> Metastore
>         at 
> org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:84)
>         at 
> org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:34)
>         at org.apache.iceberg.ClientPoolImpl.get(ClientPoolImpl.java:125)
>         at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:56)
>         at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51)
>         at 
> org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:122)
>         at 
> org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:158)
>         at 
> org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97)
>         at 
> org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80)
>         at 
> org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47)
>         at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:124)
>         at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:111)
>         at 
> org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.overlayTableProperties(HiveIcebergStorageHandler.java:276)
>         at 
> org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.configureInputJobProperties(HiveIcebergStorageHandler.java:86)
>         at 
> org.apache.spark.sql.hive.HiveTableUtil$.configureJobPropertiesForStorageHandler(TableReader.scala:426)
>         at 
> org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:456)
>         at 
> org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:342)
>         at 
> org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:342)
>         at 
> org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:181)
>         at 
> org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:181)
>         at scala.Option.foreach(Option.scala:407)
>         at 
> org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$6(HadoopRDD.scala:181)
>         at scala.Option.getOrElse(Option.scala:189)
>         at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:178)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:247)
>         at 

[jira] [Assigned] (SPARK-47192) Convert _LEGACY_ERROR_TEMP_0035 (unsupported hive feature)

2024-02-27 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-47192:


Assignee: Serge Rielau

> Convert _LEGACY_ERROR_TEMP_0035 (unsupported hive feature)
> --
>
> Key: SPARK-47192
> URL: https://issues.apache.org/jira/browse/SPARK-47192
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Serge Rielau
>Assignee: Serge Rielau
>Priority: Major
>  Labels: pull-request-available
>
> Old:
> > GRANT ROLE;
> _LEGACY_ERROR_TEMP_0035
> Operation not allowed: grant role. (line 1, pos 0)
>  
> New: 
> error class: HIVE_OPERATION_NOT_SUPPORTED
> The Hive operation  is not supported. (line 1, pos 0)
>  
> sqlstate: 0A000



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47192) Convert _LEGACY_ERROR_TEMP_0035 (unsupported hive feature)

2024-02-27 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-47192.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45291
[https://github.com/apache/spark/pull/45291]

> Convert _LEGACY_ERROR_TEMP_0035 (unsupported hive feature)
> --
>
> Key: SPARK-47192
> URL: https://issues.apache.org/jira/browse/SPARK-47192
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Serge Rielau
>Assignee: Serge Rielau
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Old:
> > GRANT ROLE;
> _LEGACY_ERROR_TEMP_0035
> Operation not allowed: grant role. (line 1, pos 0)
>  
> New: 
> error class: HIVE_OPERATION_NOT_SUPPORTED
> The Hive operation  is not supported. (line 1, pos 0)
>  
> sqlstate: 0A000



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47206) Add official image Dockerfile for Apache Spark 3.5.1

2024-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47206:
---
Labels: pull-request-available  (was: )

> Add official image Dockerfile for Apache Spark 3.5.1
> 
>
> Key: SPARK-47206
> URL: https://issues.apache.org/jira/browse/SPARK-47206
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Docker
>Affects Versions: 3.5.1
>Reporter: Yikun Jiang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47206) Add official image Dockerfile for Apache Spark 3.5.1

2024-02-27 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-47206:

Summary: Add official image Dockerfile for Apache Spark 3.5.1  (was: 
[SPARK-45169] Add official image Dockerfile for Apache Spark 3.5.1)

> Add official image Dockerfile for Apache Spark 3.5.1
> 
>
> Key: SPARK-47206
> URL: https://issues.apache.org/jira/browse/SPARK-47206
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Docker
>Affects Versions: 3.5.1
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47206) [SPARK-45169] Add official image Dockerfile for Apache Spark 3.5.1

2024-02-27 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-47206:
---

 Summary: [SPARK-45169] Add official image Dockerfile for Apache 
Spark 3.5.1
 Key: SPARK-47206
 URL: https://issues.apache.org/jira/browse/SPARK-47206
 Project: Spark
  Issue Type: Bug
  Components: Spark Docker
Affects Versions: 3.5.1
Reporter: Yikun Jiang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47205) Upgrade docker-java to 3.3.5

2024-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47205:
---
Labels: pull-request-available  (was: )

> Upgrade docker-java to 3.3.5
> 
>
> Key: SPARK-47205
> URL: https://issues.apache.org/jira/browse/SPARK-47205
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Docker
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47155) Fix incorrect error class in create_data_source.py

2024-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47155:
---
Labels: pull-request-available  (was: )

> Fix incorrect error class in create_data_source.py
> --
>
> Key: SPARK-47155
> URL: https://issues.apache.org/jira/browse/SPARK-47155
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
>
> This is part of the SPARK-44076 effort to enable Python users to develop data 
> sources in Python and make data sources accessible to the wider Python community.
> The error classes "PYTHON_DATA_SOURCE_CREATE_ERROR" and 
> "PYTHON_DATA_SOURCE_METHOD_NOT_IMPLEMENTED" used in create_data_source.py do not 
> exist and will throw an "error class not found" error, which makes the error 
> message confusing to the customer. We need to add the corresponding error classes 
> to error-class.py or use existing error classes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47205) Upgrade docker-java to 3.3.5

2024-02-27 Thread Kent Yao (Jira)
Kent Yao created SPARK-47205:


 Summary: Upgrade docker-java to 3.3.5
 Key: SPARK-47205
 URL: https://issues.apache.org/jira/browse/SPARK-47205
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Spark Docker
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2024-02-27 Thread Goutam Ghosh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821516#comment-17821516
 ] 

Goutam Ghosh edited comment on SPARK-21918 at 2/28/24 5:58 AM:
---

Can the patch by [~angerszhuuu] be validated?


was (Author: goutamghosh):
Can the path by [~angerszhuuu] be validated ? 

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>Priority: Major
>  Labels: bulk-closed
>
> I'm testing the Spark Thrift Server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should share the Hive object only inside the 
> thread so that the metastore client in Hive can be associated with the right 
> user.
> We can fix it by passing the Hive object of the parent thread to the child 
> thread when running the SQL.
> I already have an initial patch for review, and I'm glad to work on it if 
> anyone could assign it to me.
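
A language-agnostic sketch of the per-thread sharing described above, written in Python purely for illustration; `make_client` is a hypothetical stand-in for `Hive.get(conf)` and is not part of any real API.

{code:python}
import threading

_local = threading.local()

def make_client(conf):
    # hypothetical factory standing in for Hive.get(conf)
    return object()

def get_client(conf):
    # Cache the client per thread instead of in a field shared by every thread,
    # so each session thread keeps a client associated with the right user.
    client = getattr(_local, "client", None)
    if client is None:
        client = make_client(conf)
        _local.client = client
    return client
{code}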



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2024-02-27 Thread Goutam Ghosh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821516#comment-17821516
 ] 

Goutam Ghosh commented on SPARK-21918:
--

Can the patch by [~angerszhuuu] be validated?

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>Priority: Major
>  Labels: bulk-closed
>
> I'm testing the Spark Thrift Server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should share the Hive object only inside the 
> thread so that the metastore client in Hive can be associated with the right 
> user.
> We can fix it by passing the Hive object of the parent thread to the child 
> thread when running the SQL.
> I already have an initial patch for review, and I'm glad to work on it if 
> anyone could assign it to me.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47191) avoid unnecessary relation lookup when uncaching table/view

2024-02-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47191.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45289
[https://github.com/apache/spark/pull/45289]

> avoid unnecessary relation lookup when uncaching table/view
> ---
>
> Key: SPARK-47191
> URL: https://issues.apache.org/jira/browse/SPARK-47191
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47191) avoid unnecessary relation lookup when uncaching table/view

2024-02-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47191:
---

Assignee: Wenchen Fan

> avoid unnecessary relation lookup when uncaching table/view
> ---
>
> Key: SPARK-47191
> URL: https://issues.apache.org/jira/browse/SPARK-47191
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47187) Fix hive compress output config does not work

2024-02-27 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-47187.
---
Fix Version/s: 3.4.3
   Resolution: Fixed

Issue resolved by pull request 45286
[https://github.com/apache/spark/pull/45286]

> Fix hive compress output config does not work
> -
>
> Key: SPARK-47187
> URL: https://issues.apache.org/jira/browse/SPARK-47187
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.2
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47187) Fix hive compress output config does not work

2024-02-27 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You reassigned SPARK-47187:
-

Assignee: XiDuo You

> Fix hive compress output config does not work
> -
>
> Key: SPARK-47187
> URL: https://issues.apache.org/jira/browse/SPARK-47187
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.2
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47202) AttributeError: module 'pandas' has no attribute 'Timstamp'

2024-02-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47202:


Assignee: Arzav Jain

> AttributeError: module 'pandas' has no attribute 'Timstamp'
> ---
>
> Key: SPARK-47202
> URL: https://issues.apache.org/jira/browse/SPARK-47202
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.5.1
>Reporter: Arzav Jain
>Assignee: Arzav Jain
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> When using the pyspark.sql.types.TimestampType, if your value is a 
> datetime.datetime object with a tzinfo, [this 
> typo|https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/types.py#L996]
>  breaks things.
>  
> I believe [this 
> commit|https://github.com/apache/spark/commit/46949e692e863992f4c50bdd482d5216d4fd9221]
>  introduced the bug 9 months ago
>  
> Full stack trace below:
>  
> {code:java}
> File "/databricks/spark/python/pyspark/worker.py", line 1490, in main 
> process() File "/databricks/spark/python/pyspark/worker.py", line 1482, in 
> process serializer.dump_stream(out_iter, outfile) File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 531, in 
> dump_stream return ArrowStreamSerializer.dump_stream( File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 107, in 
> dump_stream for batch in iterator: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 525, in 
> init_stream_yield_batches batch = self._create_batch(series) File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 511, in 
> _create_batch arrs.append(self._create_array(s, t, 
> arrow_cast=self._arrow_cast)) File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 284, in 
> _create_array series = conv(series) File 
> "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1060, in 
>  return lambda pser: pser.apply( # type: ignore[return-value] File 
> "/databricks/python/lib/python3.10/site-packages/pandas/core/series.py", line 
> 4771, in apply return SeriesApply(self, func, convert_dtype, args, 
> kwargs).apply() File 
> "/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line 
> 1123, in apply return self.apply_standard() File 
> "/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line 
> 1174, in apply_standard mapped = lib.map_infer( File "pandas/_libs/lib.pyx", 
> line 2924, in pandas._libs.lib.map_infer File 
> "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1061, in 
>  lambda x: conv(x) if x is not None else None # type: ignore[misc] 
> File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 889, in 
> convert_array return [ File 
> "/databricks/spark/python/pyspark/sql/pandas/types.py", line 890, in 
>  _element_conv(v) if v is not None else None # type: ignore[misc] 
> File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1010, in 
> convert_struct return { File 
> "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1011, in 
>  name: conv(v) if conv is not None and v is not None else v File 
> "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1032, in 
> convert_timestamp ts = pd.Timstamp(value) File 
> "/databricks/python/lib/python3.10/site-packages/pandas/__init__.py", line 
> 264, in __getattr__ raise AttributeError(f"module 'pandas' has no attribute 
> '{name}'") AttributeError: module 'pandas' has no attribute 'Timstamp'
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47202) AttributeError: module 'pandas' has no attribute 'Timstamp'

2024-02-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47202.
--
Fix Version/s: 3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 45301
[https://github.com/apache/spark/pull/45301]

> AttributeError: module 'pandas' has no attribute 'Timstamp'
> ---
>
> Key: SPARK-47202
> URL: https://issues.apache.org/jira/browse/SPARK-47202
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.5.1
>Reporter: Arzav Jain
>Assignee: Arzav Jain
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.2, 4.0.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> When using the pyspark.sql.types.TimestampType, if your value is a 
> datetime.datetime object with a tzinfo, [this 
> typo|https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/types.py#L996]
>  breaks things.
>  
> I believe [this 
> commit|https://github.com/apache/spark/commit/46949e692e863992f4c50bdd482d5216d4fd9221]
>  introduced the bug 9 months ago
>  
> Full stack trace below:
>  
> {code:java}
> File "/databricks/spark/python/pyspark/worker.py", line 1490, in main 
> process() File "/databricks/spark/python/pyspark/worker.py", line 1482, in 
> process serializer.dump_stream(out_iter, outfile) File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 531, in 
> dump_stream return ArrowStreamSerializer.dump_stream( File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 107, in 
> dump_stream for batch in iterator: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 525, in 
> init_stream_yield_batches batch = self._create_batch(series) File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 511, in 
> _create_batch arrs.append(self._create_array(s, t, 
> arrow_cast=self._arrow_cast)) File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 284, in 
> _create_array series = conv(series) File 
> "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1060, in 
>  return lambda pser: pser.apply( # type: ignore[return-value] File 
> "/databricks/python/lib/python3.10/site-packages/pandas/core/series.py", line 
> 4771, in apply return SeriesApply(self, func, convert_dtype, args, 
> kwargs).apply() File 
> "/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line 
> 1123, in apply return self.apply_standard() File 
> "/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line 
> 1174, in apply_standard mapped = lib.map_infer( File "pandas/_libs/lib.pyx", 
> line 2924, in pandas._libs.lib.map_infer File 
> "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1061, in 
>  lambda x: conv(x) if x is not None else None # type: ignore[misc] 
> File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 889, in 
> convert_array return [ File 
> "/databricks/spark/python/pyspark/sql/pandas/types.py", line 890, in 
>  _element_conv(v) if v is not None else None # type: ignore[misc] 
> File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1010, in 
> convert_struct return { File 
> "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1011, in 
>  name: conv(v) if conv is not None and v is not None else v File 
> "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1032, in 
> convert_timestamp ts = pd.Timstamp(value) File 
> "/databricks/python/lib/python3.10/site-packages/pandas/__init__.py", line 
> 264, in __getattr__ raise AttributeError(f"module 'pandas' has no attribute 
> '{name}'") AttributeError: module 'pandas' has no attribute 'Timstamp'
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47144) Fix Spark Connect collation issue

2024-02-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47144.
-
Resolution: Fixed

Issue resolved by pull request 45233
[https://github.com/apache/spark/pull/45233]

> Fix Spark Connect collation issue
> -
>
> Key: SPARK-47144
> URL: https://issues.apache.org/jira/browse/SPARK-47144
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Assignee: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" is failing when 
> connecting to the server using Spark Connect:
> {code:java}
> pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support 
> convert string(UCS_BASIC_LCASE) to connect proto types.{code}
> When using the default collation "UCS_BASIC", the error does not occur.
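
A hedged reproduction sketch against a Spark Connect session, using the collated expression quoted above; the `sc://localhost` endpoint is a placeholder.

{code:python}
from pyspark.sql import SparkSession

# Placeholder Spark Connect endpoint; replace with a real server address.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

# Fails with InvalidPlanInput before the fix, because the collated string type
# could not be converted to Connect proto types.
spark.sql("SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'").show()
{code}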



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47144) Fix Spark Connect collation issue

2024-02-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47144:
---

Assignee: Nikola Mandic

> Fix Spark Connect collation issue
> -
>
> Key: SPARK-47144
> URL: https://issues.apache.org/jira/browse/SPARK-47144
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Assignee: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" is failing when 
> connecting to the server using Spark Connect:
> {code:java}
> pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support 
> convert string(UCS_BASIC_LCASE) to connect proto types.{code}
> When using the default collation "UCS_BASIC", the error does not occur.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47063) CAST long to timestamp has different behavior for codegen vs interpreted

2024-02-27 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47063.
--
Fix Version/s: 3.4.3
   3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 45294
[https://github.com/apache/spark/pull/45294]

> CAST long to timestamp has different behavior for codegen vs interpreted
> 
>
> Key: SPARK-47063
> URL: https://issues.apache.org/jira/browse/SPARK-47063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2
>Reporter: Robert Joseph Evans
>Assignee: Pablo Langa Blanco
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.3, 3.5.2, 4.0.0
>
>
> It probably impacts a lot more versions of the code than this, but I verified 
> it on 3.4.2. This also appears to be related to 
> https://issues.apache.org/jira/browse/SPARK-39209
> {code:java}
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", 
> "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> ++-++
> |v                   |ts                           |unix_micros(ts)     |
> ++-++
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |199000          |
> ++-++
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 
> 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as 
> ts").selectExpr("*", "unix_micros(ts)").show(false)
> ++---+---+
> |v                   |ts                 |unix_micros(ts)|
> ++---+---+
> |9223372036854775807 |1969-12-31 23:59:59|-100       |
> |-9223372036854775808|1970-01-01 00:00:00|0              |
> |0                   |1970-01-01 00:00:00|0              |
> |1990                |1970-01-01 00:33:10|199000     |
> ++---+---+
> {code}
> It looks like InMemoryTableScanExec is not doing code generation for the 
> expressions, but the ProjectExec after the repartition is.
> If I disable code gen I get the same answer in both cases.
> {code:java}
> scala> spark.conf.set("spark.sql.codegen.wholeStage", false)
> scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN")
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", 
> "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> ++-++
> |v                   |ts                           |unix_micros(ts)     |
> ++-++
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |199000          |
> ++-++
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 
> 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as 
> ts").selectExpr("*", "unix_micros(ts)").show(false)
> ++-++
> |v                   |ts                           |unix_micros(ts)     |
> ++-++
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |199000          |
> ++-++
> {code}
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L1627]
> is the code used in codegen, but
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L687]
> is what is used outside of code gen.
> Apparently `SECONDS.toMicros` truncates the value on an overflow, but the 
> 

[jira] [Assigned] (SPARK-47063) CAST long to timestamp has different behavior for codegen vs interpreted

2024-02-27 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-47063:


Assignee: Pablo Langa Blanco

> CAST long to timestamp has different behavior for codegen vs interpreted
> 
>
> Key: SPARK-47063
> URL: https://issues.apache.org/jira/browse/SPARK-47063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2
>Reporter: Robert Joseph Evans
>Assignee: Pablo Langa Blanco
>Priority: Major
>  Labels: pull-request-available
>
> It probably impacts a lot more versions of the code than this, but I verified 
> it on 3.4.2. This also appears to be related to 
> https://issues.apache.org/jira/browse/SPARK-39209
> {code:java}
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", 
> "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> ++-++
> |v                   |ts                           |unix_micros(ts)     |
> ++-++
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |199000          |
> ++-++
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 
> 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as 
> ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-------------------+---------------+
> |v                   |ts                 |unix_micros(ts)|
> +--------------------+-------------------+---------------+
> |9223372036854775807 |1969-12-31 23:59:59|-1000000       |
> |-9223372036854775808|1970-01-01 00:00:00|0              |
> |0                   |1970-01-01 00:00:00|0              |
> |1990                |1970-01-01 00:33:10|1990000000     |
> +--------------------+-------------------+---------------+
> {code}
> It looks like InMemoryTableScanExec is not doing code generation for the 
> expressions, but the ProjectExec after the repartition is.
> If I disable code gen I get the same answer in both cases.
> {code:java}
> scala> spark.conf.set("spark.sql.codegen.wholeStage", false)
> scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN")
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", 
> "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 
> 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as 
> ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> {code}
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L1627]
> is the code used in codegen, but
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L687]
> is what is used outside of codegen.
> Apparently `SECONDS.toMicros` saturates (clamps to Long.MaxValue/MinValue) on 
> an overflow, but the codegen does not.
> {code:java}
> scala> Long.MaxValue
> res11: Long = 9223372036854775807
> scala> java.util.concurrent.TimeUnit.SECONDS.toMicros(Long.MaxValue)
> res12: Long = 
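For reference, a small self-contained Scala sketch (illustrative only, not
Spark's code) of the difference described above: `TimeUnit.SECONDS.toMicros`
saturates at Long.MaxValue on overflow, while a plain 64-bit multiplication,
which is effectively what the observed codegen results correspond to, wraps
around.
{code:scala}
import java.util.concurrent.TimeUnit

// Interpreted-style conversion: TimeUnit saturates on overflow.
val saturated = TimeUnit.SECONDS.toMicros(Long.MaxValue)  // 9223372036854775807

// Codegen-style conversion: plain multiplication wraps around.
val wrapped = Long.MaxValue * 1000000L                    // -1000000

println(s"saturated=$saturated wrapped=$wrapped")
{code}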

[jira] [Resolved] (SPARK-47204) [CORE] Check whether enabled checksum before delete checksum file

2024-02-27 Thread Binjie Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Binjie Yang resolved SPARK-47204.
-
Resolution: Not A Problem

We check whether the checksum file exists before trying to delete it.

> [CORE] Check whether enabled checksum before delete checksum file
> -
>
> Key: SPARK-47204
> URL: https://issues.apache.org/jira/browse/SPARK-47204
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Binjie Yang
>Priority: Minor
> Fix For: 4.0.0
>
>
> We should check whether the shuffle checksum feature is enabled before we try 
> to find and delete this file. Otherwise, we will see an "Error deleting 
> checksum" warning in the log.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47204) [CORE] Check whether enabled checksum before delete checksum file

2024-02-27 Thread Binjie Yang (Jira)
Binjie Yang created SPARK-47204:
---

 Summary: [CORE] Check whether enabled checksum before delete 
checksum file
 Key: SPARK-47204
 URL: https://issues.apache.org/jira/browse/SPARK-47204
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Binjie Yang
 Fix For: 4.0.0


We should check whether the shuffle checksum feature is enabled before we try to 
find and delete this file. Otherwise, we will see an "Error deleting checksum" 
warning in the log.
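For illustration only (not the actual Spark code), a minimal Scala sketch of
the guard being proposed: only look for and delete the checksum file when the
shuffle checksum feature is enabled. The names below are placeholders for the
real Spark internals.
{code:scala}
import java.io.File

// Hypothetical sketch: gate the checksum-file cleanup on the config flag so
// that a spurious "Error deleting checksum" warning is never logged.
def deleteChecksumIfEnabled(checksumEnabled: Boolean, checksumFile: File): Unit = {
  if (checksumEnabled && checksumFile.exists() && !checksumFile.delete()) {
    // Only now is an "Error deleting checksum" warning meaningful.
    System.err.println(s"Error deleting checksum file ${checksumFile.getPath}")
  }
}
{code}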



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47202) AttributeError: module 'pandas' has no attribute 'Timstamp'

2024-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47202:
---
Labels: pull-request-available  (was: )

> AttributeError: module 'pandas' has no attribute 'Timstamp'
> ---
>
> Key: SPARK-47202
> URL: https://issues.apache.org/jira/browse/SPARK-47202
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.5.1
>Reporter: Arzav Jain
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> When using the pyspark.sql.types.TimestampType, if your value is a 
> datetime.datetime object with a tzinfo, [this 
> typo|https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/types.py#L996]
>  breaks things.
>  
> I believe [this 
> commit|https://github.com/apache/spark/commit/46949e692e863992f4c50bdd482d5216d4fd9221]
>  introduced the bug 9 months ago
>  
> Full stack trace below:
>  
> {code:java}
> File "/databricks/spark/python/pyspark/worker.py", line 1490, in main 
> process() File "/databricks/spark/python/pyspark/worker.py", line 1482, in 
> process serializer.dump_stream(out_iter, outfile) File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 531, in 
> dump_stream return ArrowStreamSerializer.dump_stream( File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 107, in 
> dump_stream for batch in iterator: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 525, in 
> init_stream_yield_batches batch = self._create_batch(series) File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 511, in 
> _create_batch arrs.append(self._create_array(s, t, 
> arrow_cast=self._arrow_cast)) File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 284, in 
> _create_array series = conv(series) File 
> "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1060, in 
>  return lambda pser: pser.apply( # type: ignore[return-value] File 
> "/databricks/python/lib/python3.10/site-packages/pandas/core/series.py", line 
> 4771, in apply return SeriesApply(self, func, convert_dtype, args, 
> kwargs).apply() File 
> "/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line 
> 1123, in apply return self.apply_standard() File 
> "/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line 
> 1174, in apply_standard mapped = lib.map_infer( File "pandas/_libs/lib.pyx", 
> line 2924, in pandas._libs.lib.map_infer File 
> "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1061, in 
>  lambda x: conv(x) if x is not None else None # type: ignore[misc] 
> File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 889, in 
> convert_array return [ File 
> "/databricks/spark/python/pyspark/sql/pandas/types.py", line 890, in 
>  _element_conv(v) if v is not None else None # type: ignore[misc] 
> File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1010, in 
> convert_struct return { File 
> "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1011, in 
>  name: conv(v) if conv is not None and v is not None else v File 
> "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1032, in 
> convert_timestamp ts = pd.Timstamp(value) File 
> "/databricks/python/lib/python3.10/site-packages/pandas/__init__.py", line 
> 264, in __getattr__ raise AttributeError(f"module 'pandas' has no attribute 
> '{name}'") AttributeError: module 'pandas' has no attribute 'Timstamp'
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47202) AttributeError: module 'pandas' has no attribute 'Timstamp'

2024-02-27 Thread Arzav Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arzav Jain updated SPARK-47202:
---
Description: 
When using the pyspark.sql.types.TimestampType, if your value is a 
datetime.datetime object with a tzinfo, [this 
typo|https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/types.py#L996]
 breaks things.

 

I believe [this 
commit|https://github.com/apache/spark/commit/46949e692e863992f4c50bdd482d5216d4fd9221]
 introduced the bug 9 months ago

 

Full stack trace below:

 
{code:java}
File "/databricks/spark/python/pyspark/worker.py", line 1490, in main process() 
File "/databricks/spark/python/pyspark/worker.py", line 1482, in process 
serializer.dump_stream(out_iter, outfile) File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 531, in 
dump_stream return ArrowStreamSerializer.dump_stream( File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 107, in 
dump_stream for batch in iterator: File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 525, in 
init_stream_yield_batches batch = self._create_batch(series) File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 511, in 
_create_batch arrs.append(self._create_array(s, t, 
arrow_cast=self._arrow_cast)) File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 284, in 
_create_array series = conv(series) File 
"/databricks/spark/python/pyspark/sql/pandas/types.py", line 1060, in  
return lambda pser: pser.apply( # type: ignore[return-value] File 
"/databricks/python/lib/python3.10/site-packages/pandas/core/series.py", line 
4771, in apply return SeriesApply(self, func, convert_dtype, args, 
kwargs).apply() File 
"/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line 
1123, in apply return self.apply_standard() File 
"/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line 
1174, in apply_standard mapped = lib.map_infer( File "pandas/_libs/lib.pyx", 
line 2924, in pandas._libs.lib.map_infer File 
"/databricks/spark/python/pyspark/sql/pandas/types.py", line 1061, in  
lambda x: conv(x) if x is not None else None # type: ignore[misc] File 
"/databricks/spark/python/pyspark/sql/pandas/types.py", line 889, in 
convert_array return [ File 
"/databricks/spark/python/pyspark/sql/pandas/types.py", line 890, in  
_element_conv(v) if v is not None else None # type: ignore[misc] File 
"/databricks/spark/python/pyspark/sql/pandas/types.py", line 1010, in 
convert_struct return { File 
"/databricks/spark/python/pyspark/sql/pandas/types.py", line 1011, in 
 name: conv(v) if conv is not None and v is not None else v File 
"/databricks/spark/python/pyspark/sql/pandas/types.py", line 1032, in 
convert_timestamp ts = pd.Timstamp(value) File 
"/databricks/python/lib/python3.10/site-packages/pandas/__init__.py", line 264, 
in __getattr__ raise AttributeError(f"module 'pandas' has no attribute 
'{name}'") AttributeError: module 'pandas' has no attribute 'Timstamp'
 {code}
 

  was:
When using the pyspark.sql.types.TimestampType, if your value is a 
datetime.datetime object with a tzinfo, [this 
typo|https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/types.py#L996]
 breaks things.

 

Full stack trace below:

 
{code:java}
File "/databricks/spark/python/pyspark/worker.py", line 1490, in main process() 
File "/databricks/spark/python/pyspark/worker.py", line 1482, in process 
serializer.dump_stream(out_iter, outfile) File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 531, in 
dump_stream return ArrowStreamSerializer.dump_stream( File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 107, in 
dump_stream for batch in iterator: File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 525, in 
init_stream_yield_batches batch = self._create_batch(series) File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 511, in 
_create_batch arrs.append(self._create_array(s, t, 
arrow_cast=self._arrow_cast)) File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 284, in 
_create_array series = conv(series) File 
"/databricks/spark/python/pyspark/sql/pandas/types.py", line 1060, in  
return lambda pser: pser.apply( # type: ignore[return-value] File 
"/databricks/python/lib/python3.10/site-packages/pandas/core/series.py", line 
4771, in apply return SeriesApply(self, func, convert_dtype, args, 
kwargs).apply() File 
"/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line 
1123, in apply return self.apply_standard() File 
"/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line 
1174, in apply_standard mapped = lib.map_infer( File "pandas/_libs/lib.pyx", 
line 2924, in pandas._libs.lib.map_infer File 
"/databricks/spark/python/pyspark/sql/pandas/types.py", line 1061, in  
lambda x: conv(x) if 

[jira] [Assigned] (SPARK-47201) sameSemantics check input types

2024-02-27 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-47201:
-

Assignee: Ruifeng Zheng

> sameSemantics check input types
> ---
>
> Key: SPARK-47201
> URL: https://issues.apache.org/jira/browse/SPARK-47201
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47201) sameSemantics check input types

2024-02-27 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-47201.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45300
[https://github.com/apache/spark/pull/45300]

> sameSemantics check input types
> ---
>
> Key: SPARK-47201
> URL: https://issues.apache.org/jira/browse/SPARK-47201
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47202) AttributeError: module 'pandas' has no attribute 'Timstamp'

2024-02-27 Thread Arzav Jain (Jira)
Arzav Jain created SPARK-47202:
--

 Summary: AttributeError: module 'pandas' has no attribute 
'Timstamp'
 Key: SPARK-47202
 URL: https://issues.apache.org/jira/browse/SPARK-47202
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.5.1
Reporter: Arzav Jain


When using the pyspark.sql.types.TimestampType, if your value is a 
datetime.datetime object with a tzinfo, [this 
typo|https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/types.py#L996]
 breaks things.

 

Full stack trace below:

 
{code:java}
File "/databricks/spark/python/pyspark/worker.py", line 1490, in main process() 
File "/databricks/spark/python/pyspark/worker.py", line 1482, in process 
serializer.dump_stream(out_iter, outfile) File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 531, in 
dump_stream return ArrowStreamSerializer.dump_stream( File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 107, in 
dump_stream for batch in iterator: File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 525, in 
init_stream_yield_batches batch = self._create_batch(series) File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 511, in 
_create_batch arrs.append(self._create_array(s, t, 
arrow_cast=self._arrow_cast)) File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 284, in 
_create_array series = conv(series) File 
"/databricks/spark/python/pyspark/sql/pandas/types.py", line 1060, in  
return lambda pser: pser.apply( # type: ignore[return-value] File 
"/databricks/python/lib/python3.10/site-packages/pandas/core/series.py", line 
4771, in apply return SeriesApply(self, func, convert_dtype, args, 
kwargs).apply() File 
"/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line 
1123, in apply return self.apply_standard() File 
"/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line 
1174, in apply_standard mapped = lib.map_infer( File "pandas/_libs/lib.pyx", 
line 2924, in pandas._libs.lib.map_infer File 
"/databricks/spark/python/pyspark/sql/pandas/types.py", line 1061, in  
lambda x: conv(x) if x is not None else None # type: ignore[misc] File 
"/databricks/spark/python/pyspark/sql/pandas/types.py", line 889, in 
convert_array return [ File 
"/databricks/spark/python/pyspark/sql/pandas/types.py", line 890, in  
_element_conv(v) if v is not None else None # type: ignore[misc] File 
"/databricks/spark/python/pyspark/sql/pandas/types.py", line 1010, in 
convert_struct return { File 
"/databricks/spark/python/pyspark/sql/pandas/types.py", line 1011, in 
 name: conv(v) if conv is not None and v is not None else v File 
"/databricks/spark/python/pyspark/sql/pandas/types.py", line 1032, in 
convert_timestamp ts = pd.Timstamp(value) File 
"/databricks/python/lib/python3.10/site-packages/pandas/__init__.py", line 264, 
in __getattr__ raise AttributeError(f"module 'pandas' has no attribute 
'{name}'") AttributeError: module 'pandas' has no attribute 'Timstamp'
 {code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47201) sameSemantics check input types

2024-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47201:
---
Labels: pull-request-available  (was: )

> sameSemantics check input types
> ---
>
> Key: SPARK-47201
> URL: https://issues.apache.org/jira/browse/SPARK-47201
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47201) sameSemantics check input types

2024-02-27 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-47201:
-

 Summary: sameSemantics check input types
 Key: SPARK-47201
 URL: https://issues.apache.org/jira/browse/SPARK-47201
 Project: Spark
  Issue Type: Improvement
  Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47186) Improve the debuggability for docker integration test

2024-02-27 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47186.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45284
[https://github.com/apache/spark/pull/45284]

> Improve the debuggability for docker integration test
> -
>
> Key: SPARK-47186
> URL: https://issues.apache.org/jira/browse/SPARK-47186
> Project: Spark
>  Issue Type: Test
>  Components: Spark Docker
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47196) Fix `core` module to succeed SBT tests

2024-02-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47196.
---
Fix Version/s: 3.4.3
   Resolution: Fixed

Issue resolved by pull request 45295
[https://github.com/apache/spark/pull/45295]

> Fix `core` module to succeed SBT tests
> --
>
> Key: SPARK-47196
> URL: https://issues.apache.org/jira/browse/SPARK-47196
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.2, 3.4.0, 3.4.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.3
>
>
> This happens at branch-3.4 only. branch-3.3/branch-3.5/master are okay.
> {code:java}
> $ build/sbt "core/testOnly *.DAGSchedulerSuite"
> [info] DAGSchedulerSuite:
> [info] - [SPARK-3353] parent stage should have lower stage id *** FAILED *** 
> (439 milliseconds)
> [info]   java.lang.IllegalStateException: Could not initialize plugin: 
> interface org.mockito.plugins.MockMaker (alternate: null)
> ...
> [info] *** 1 SUITE ABORTED ***
> [info] *** 118 TESTS FAILED ***
> [error] Error during tests:
> [error]   org.apache.spark.scheduler.DAGSchedulerSuite
> [error] (core / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
> [error] Total time: 48 s, completed Feb 27, 2024, 1:26:27 PM {code}
>  
> MAVEN
> {code:java}
> $ build/mvn dependency:tree -pl core | grep byte-buddy
> ...
> [INFO] |  +- net.bytebuddy:byte-buddy:jar:1.12.10:test
> [INFO] |  +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test
> {code}
> SBT
> {code:java}
> $ build/sbt "core/test:dependencyTree" | grep byte-buddy
> [info]   | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18)
> [info]   | | | | +-net.bytebuddy:byte-buddy:1.12.18
> {code}
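For illustration only (the actual fix is in the linked PR), aligning the SBT
test classpath with what Maven resolves could look like the override below;
the coordinates and versions come from the outputs above, everything else is
an assumption.
{code:scala}
// Hypothetical sbt snippet: pin byte-buddy so Mockito's MockMaker sees the
// same version under SBT as under Maven.
dependencyOverrides ++= Seq(
  "net.bytebuddy" % "byte-buddy"       % "1.12.10",
  "net.bytebuddy" % "byte-buddy-agent" % "1.12.10"
)
{code}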



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47196) Fix `core` module to succeed SBT tests

2024-02-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47196:
-

Assignee: Dongjoon Hyun

> Fix `core` module to succeed SBT tests
> --
>
> Key: SPARK-47196
> URL: https://issues.apache.org/jira/browse/SPARK-47196
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.2, 3.4.0, 3.4.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> This happens at branch-3.4 only. branch-3.3/branch-3.5/master are okay.
> {code:java}
> $ build/sbt "core/testOnly *.DAGSchedulerSuite"
> [info] DAGSchedulerSuite:
> [info] - [SPARK-3353] parent stage should have lower stage id *** FAILED *** 
> (439 milliseconds)
> [info]   java.lang.IllegalStateException: Could not initialize plugin: 
> interface org.mockito.plugins.MockMaker (alternate: null)
> ...
> [info] *** 1 SUITE ABORTED ***
> [info] *** 118 TESTS FAILED ***
> [error] Error during tests:
> [error]   org.apache.spark.scheduler.DAGSchedulerSuite
> [error] (core / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
> [error] Total time: 48 s, completed Feb 27, 2024, 1:26:27 PM {code}
>  
> MAVEN
> {code:java}
> $ build/mvn dependency:tree -pl core | grep byte-buddy
> ...
> [INFO] |  +- net.bytebuddy:byte-buddy:jar:1.12.10:test
> [INFO] |  +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test
> {code}
> SBT
> {code:java}
> $ build/sbt "core/test:dependencyTree" | grep byte-buddy
> [info]   | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18)
> [info]   | | | | +-net.bytebuddy:byte-buddy:1.12.18
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47200) Handle and classify errors from ForEachBatchSink user function

2024-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47200:
---
Labels: pull-request-available  (was: )

> Handle and classify errors from ForEachBatchSink user function
> --
>
> Key: SPARK-47200
> URL: https://issues.apache.org/jira/browse/SPARK-47200
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: B. Micheal Okutubo
>Priority: Major
>  Labels: pull-request-available
>
> Any exception can be thrown from the user-provided function for 
> ForEachBatchSink. We want to classify this class of errors, including errors 
> from both Python and Scala functions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47200) Handle and classify errors from ForEachBatchSink user function

2024-02-27 Thread B. Micheal Okutubo (Jira)
B. Micheal Okutubo created SPARK-47200:
--

 Summary: Handle and classify errors from ForEachBatchSink user 
function
 Key: SPARK-47200
 URL: https://issues.apache.org/jira/browse/SPARK-47200
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: B. Micheal Okutubo


Any exception can be thrown from the user-provided function for 
ForEachBatchSink. We want to classify this class of errors, including errors 
from both Python and Scala functions.
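As a rough illustration of the intent (the names below are hypothetical, not
Spark APIs), one way to classify user-function failures is to wrap the call
and rethrow under a dedicated error class:
{code:scala}
// Hypothetical sketch: wrap the user-provided foreachBatch function so any
// exception it throws is surfaced under a single, classifiable error.
class ForeachBatchUserFuncException(cause: Throwable)
  extends RuntimeException(s"[FOREACH_BATCH_USER_FUNCTION_ERROR] ${cause.getMessage}", cause)

def runForeachBatchUserFunc[T](batch: T, batchId: Long)(fn: (T, Long) => Unit): Unit = {
  try {
    fn(batch, batchId) // user code: may throw anything (Scala or, via the Python worker, Python)
  } catch {
    case e: Exception => throw new ForeachBatchUserFuncException(e)
  }
}
{code}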



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47199) Add prefix into TemporaryDirectory to avoid flakiness

2024-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47199:
---
Labels: pull-request-available  (was: )

> Add prefix into TemporaryDirectory to avoid flakiness
> -
>
> Key: SPARK-47199
> URL: https://issues.apache.org/jira/browse/SPARK-47199
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
>
> Sometimes the tests fail because the temporary directory names are the same 
> (https://github.com/apache/spark/actions/runs/8066850485/job/22036007390).
> {code}
> File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line ?, in 
> pyspark.sql.dataframe.DataFrame.writeStream
> Failed example:
> with tempfile.TemporaryDirectory() as d:
> # Create a table with Rate source.
> df.writeStream.toTable(
> "my_table", checkpointLocation=d)
> Exception raised:
> Traceback (most recent call last):
>   File "/usr/lib/python3.11/doctest.py", line 1353, in __run
> exec(compile(example.source, filename, "single",
>   File "", line 
> 1, in 
> with tempfile.TemporaryDirectory() as d:
>   File "/usr/lib/python3.11/tempfile.py", line 1043, in __exit__
> self.cleanup()
>   File "/usr/lib/python3.11/tempfile.py", line 1047, in cleanup
> self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors)
>   File "/usr/lib/python3.11/tempfile.py", line 1029, in _rmtree
> _rmtree(name, onerror=onerror)
>   File "/usr/lib/python3.11/shutil.py", line 738, in rmtree
> onerror(os.rmdir, path, sys.exc_info())
>   File "/usr/lib/python3.11/shutil.py", line 736, in rmtree
> os.rmdir(path, dir_fd=dir_fd)
> OSError: [Errno 39] Directory not empty: 
> '/__w/spark/spark/python/target/4f062b09-213f-4ac2-a10a-2d704990141b/tmp29irqweq'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47199) Add prefix into TemporaryDirectory to avoid flakiness

2024-02-27 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-47199:


 Summary: Add prefix into TemporaryDirectory to avoid flakiness
 Key: SPARK-47199
 URL: https://issues.apache.org/jira/browse/SPARK-47199
 Project: Spark
  Issue Type: Test
  Components: PySpark, Tests
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon


Sometimes the tests fail because the temporary directory names are the same 
(https://github.com/apache/spark/actions/runs/8066850485/job/22036007390).

{code}
File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line ?, in 
pyspark.sql.dataframe.DataFrame.writeStream
Failed example:
with tempfile.TemporaryDirectory() as d:
# Create a table with Rate source.
df.writeStream.toTable(
"my_table", checkpointLocation=d)
Exception raised:
Traceback (most recent call last):
  File "/usr/lib/python3.11/doctest.py", line 1353, in __run
exec(compile(example.source, filename, "single",
  File "", line 1, 
in 
with tempfile.TemporaryDirectory() as d:
  File "/usr/lib/python3.11/tempfile.py", line 1043, in __exit__
self.cleanup()
  File "/usr/lib/python3.11/tempfile.py", line 1047, in cleanup
self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors)
  File "/usr/lib/python3.11/tempfile.py", line 1029, in _rmtree
_rmtree(name, onerror=onerror)
  File "/usr/lib/python3.11/shutil.py", line 738, in rmtree
onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python3.11/shutil.py", line 736, in rmtree
os.rmdir(path, dir_fd=dir_fd)
OSError: [Errno 39] Directory not empty: 
'/__w/spark/spark/python/target/4f062b09-213f-4ac2-a10a-2d704990141b/tmp29irqweq'
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47198) Is it possible to dynamically add backend service to ingress with Kubernetes?

2024-02-27 Thread melin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

melin updated SPARK-47198:
--
Description: 
Spark on K8s runs multiple Spark apps at the same time. A proxy/[sparkappid] 
path forwards to a different Spark app UI console based on the sparkappid. 
Spark apps are dynamically added and removed, so the ingress needs to add the 
Spark svc dynamically.

[sparkappid]_svc == spark svc name

[https://matthewpalmer.net/kubernetes-app-developer/articles/kubernetes-ingress-guide-nginx-example.html]

  was:
Spark on K8s runs multiple Spark apps at the same time. A proxy/[sparkappid] 
path forwards to a different Spark app UI console based on the sparkappid. 
Spark apps are dynamically added and removed, so the ingress needs to add the 
Spark svc dynamically.

sparkappid == spark svc name

[https://matthewpalmer.net/kubernetes-app-developer/articles/kubernetes-ingress-guide-nginx-example.html]


> Is it possible to dynamically add backend service to ingress with Kubernetes?
> -
>
> Key: SPARK-47198
> URL: https://issues.apache.org/jira/browse/SPARK-47198
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: melin
>Priority: Major
>
> Spark on K8s runs multiple Spark apps at the same time. A proxy/[sparkappid] 
> path forwards to a different Spark app UI console based on the sparkappid. 
> Spark apps are dynamically added and removed, so the ingress needs to add the 
> Spark svc dynamically.
> [sparkappid]_svc == spark svc name
> [https://matthewpalmer.net/kubernetes-app-developer/articles/kubernetes-ingress-guide-nginx-example.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47198) Is it possible to dynamically add backend service to ingress with Kubernetes?

2024-02-27 Thread melin (Jira)
melin created SPARK-47198:
-

 Summary: Is it possible to dynamically add backend service to 
ingress with Kubernetes?
 Key: SPARK-47198
 URL: https://issues.apache.org/jira/browse/SPARK-47198
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 4.0.0
Reporter: melin


Spark on K8s runs multiple Spark apps at the same time. A proxy/[sparkappid] 
path forwards to a different Spark app UI console based on the sparkappid. 
Spark apps are dynamically added and removed, so the ingress needs to add the 
Spark svc dynamically.

sparkappid == spark svc name

[https://matthewpalmer.net/kubernetes-app-developer/articles/kubernetes-ingress-guide-nginx-example.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47114) In the spark driver pod. Failed to access the krb5 file

2024-02-27 Thread melin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

melin resolved SPARK-47114.
---
Resolution: Resolved

> In the spark driver pod. Failed to access the krb5 file
> ---
>
> Key: SPARK-47114
> URL: https://issues.apache.org/jira/browse/SPARK-47114
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.4.1
>Reporter: melin
>Priority: Major
>
> Spark runs in Kubernetes and accesses an external HDFS cluster (Kerberos). Pod 
> error logs:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: KrbException: krb5.conf 
> loading failed{code}
> This error generally occurs when the krb5 file cannot be found
> [~yao] [~Qin Yao] 
> {code:java}
> ./bin/spark-submit \
>     --master k8s://https://172.18.5.44:6443 \
>     --deploy-mode cluster \
>     --name spark-pi \
>     --class org.apache.spark.examples.SparkPi \
>     --conf spark.executor.instances=1 \
>     --conf spark.kubernetes.submission.waitAppCompletion=true \
>     --conf spark.kubernetes.driver.pod.name=spark-xxx \
>     --conf spark.kubernetes.executor.podNamePrefix=spark-executor-xxx \
>     --conf spark.kubernetes.driver.label.profile=production \
>     --conf spark.kubernetes.executor.label.profile=production \
>     --conf spark.kubernetes.namespace=superior \
>     --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
>     --conf 
> spark.kubernetes.container.image=registry.cn-hangzhou.aliyuncs.com/melin1204/spark-jobserver:3.4.0
>  \
>     --conf 
> spark.kubernetes.file.upload.path=hdfs://cdh1:8020/user/superior/kubernetes/ \
>     --conf spark.kubernetes.container.image.pullPolicy=Always \
>     --conf spark.kubernetes.container.image.pullSecrets=docker-reg-demos \
>     --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf  \
>     --conf spark.kerberos.principal=superior/ad...@datacyber.com  \
>     --conf spark.kerberos.keytab=/root/superior.keytab  \
>     
> file:///root/spark-3.4.2-bin-hadoop3/examples/jars/spark-examples_2.12-3.4.2.jar
>   5{code}
> {code:java}
> (base) [root@cdh1 ~]# kubectl logs spark-xxx -n superior
> Exception in thread "main" java.lang.IllegalArgumentException: Can't get 
> Kerberos realm
>         at 
> org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:71)
>         at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:315)
>         at 
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:300)
>         at 
> org.apache.hadoop.security.UserGroupInformation.isAuthenticationMethodEnabled(UserGroupInformation.java:395)
>         at 
> org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:389)
>         at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:1119)
>         at 
> org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:385)
>         at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>         at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
>         at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
>         at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
>         at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:)
>         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120)
>         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.IllegalArgumentException: KrbException: krb5.conf 
> loading failed
>         at 
> java.security.jgss/javax.security.auth.kerberos.KerberosPrincipal.(Unknown
>  Source)
>         at 
> org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:120)
>         at 
> org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:69)
>         ... 13 more
> (base) [root@cdh1 ~]# kubectl describe pod spark-xxx -n superior
> Name:             spark-xxx
> Namespace:        superior
> Priority:         0
> Service Account:  spark
> Node:             cdh2/172.18.5.45
> Start Time:       Wed, 21 Feb 2024 15:48:08 +0800
> Labels:           profile=production
>                   spark-app-name=spark-pi
>                   spark-app-selector=spark-728e24e49f9040fa86b04c521463020b
>                   spark-role=driver
>                   spark-version=3.4.2
> Annotations:      
> Status:           Failed
> IP:               10.244.1.4
> IPs:
>   IP:  10.244.1.4
> Containers:
>   spark-kubernetes-driver:
>     Container ID:  
> 

[jira] [Commented] (SPARK-47114) In the spark driver pod. Failed to access the krb5 file

2024-02-27 Thread melin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821469#comment-17821469
 ] 

melin commented on SPARK-47114:
---

The default JRE 17 does not support Kerberos; switch to a JDK.

> In the spark driver pod. Failed to access the krb5 file
> ---
>
> Key: SPARK-47114
> URL: https://issues.apache.org/jira/browse/SPARK-47114
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.4.1
>Reporter: melin
>Priority: Major
>
> Spark runs in Kubernetes and accesses an external HDFS cluster (Kerberos). Pod 
> error logs:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: KrbException: krb5.conf 
> loading failed{code}
> This error generally occurs when the krb5 file cannot be found
> [~yao] [~Qin Yao] 
> {code:java}
> ./bin/spark-submit \
>     --master k8s://https://172.18.5.44:6443 \
>     --deploy-mode cluster \
>     --name spark-pi \
>     --class org.apache.spark.examples.SparkPi \
>     --conf spark.executor.instances=1 \
>     --conf spark.kubernetes.submission.waitAppCompletion=true \
>     --conf spark.kubernetes.driver.pod.name=spark-xxx \
>     --conf spark.kubernetes.executor.podNamePrefix=spark-executor-xxx \
>     --conf spark.kubernetes.driver.label.profile=production \
>     --conf spark.kubernetes.executor.label.profile=production \
>     --conf spark.kubernetes.namespace=superior \
>     --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
>     --conf 
> spark.kubernetes.container.image=registry.cn-hangzhou.aliyuncs.com/melin1204/spark-jobserver:3.4.0
>  \
>     --conf 
> spark.kubernetes.file.upload.path=hdfs://cdh1:8020/user/superior/kubernetes/ \
>     --conf spark.kubernetes.container.image.pullPolicy=Always \
>     --conf spark.kubernetes.container.image.pullSecrets=docker-reg-demos \
>     --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf  \
>     --conf spark.kerberos.principal=superior/ad...@datacyber.com  \
>     --conf spark.kerberos.keytab=/root/superior.keytab  \
>     
> file:///root/spark-3.4.2-bin-hadoop3/examples/jars/spark-examples_2.12-3.4.2.jar
>   5{code}
> {code:java}
> (base) [root@cdh1 ~]# kubectl logs spark-xxx -n superior
> Exception in thread "main" java.lang.IllegalArgumentException: Can't get 
> Kerberos realm
>         at 
> org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:71)
>         at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:315)
>         at 
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:300)
>         at 
> org.apache.hadoop.security.UserGroupInformation.isAuthenticationMethodEnabled(UserGroupInformation.java:395)
>         at 
> org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:389)
>         at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:1119)
>         at 
> org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:385)
>         at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>         at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
>         at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
>         at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
>         at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:)
>         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120)
>         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.IllegalArgumentException: KrbException: krb5.conf 
> loading failed
>         at 
> java.security.jgss/javax.security.auth.kerberos.KerberosPrincipal.(Unknown
>  Source)
>         at 
> org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:120)
>         at 
> org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:69)
>         ... 13 more
> (base) [root@cdh1 ~]# kubectl describe pod spark-xxx -n superior
> Name:             spark-xxx
> Namespace:        superior
> Priority:         0
> Service Account:  spark
> Node:             cdh2/172.18.5.45
> Start Time:       Wed, 21 Feb 2024 15:48:08 +0800
> Labels:           profile=production
>                   spark-app-name=spark-pi
>                   spark-app-selector=spark-728e24e49f9040fa86b04c521463020b
>                   spark-role=driver
>                   spark-version=3.4.2
> Annotations:      
> Status:           Failed
> IP:               10.244.1.4
> IPs:
>   IP:  10.244.1.4
> Containers:
>   spark-kubernetes-driver:
>     Container ID:  
> 

[jira] [Updated] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell

2024-02-27 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE updated SPARK-47197:
-
Description: 
I can't connect to kerberized HiveMetastore when using iceberg with HiveCatalog 
on spark-sql or spark-shell.

I think this issue is caused by the fact that there is no way to get 
HIVE_DELEGATION_TOKEN when using spark-sql or spark-shell.

([https://github.com/apache/spark/blob/v3.5.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/security/HiveDelegationTokenProvider.scala#L78-L83)]

 
{code:java}
    val currentToken = 
UserGroupInformation.getCurrentUser().getCredentials().getToken(tokenAlias)
    currentToken == null && UserGroupInformation.isSecurityEnabled &&
      hiveConf(hadoopConf).getTrimmed("hive.metastore.uris", "").nonEmpty &&
      (SparkHadoopUtil.get.isProxyUser(UserGroupInformation.getCurrentUser()) ||
        (!Utils.isClientMode(sparkConf) && !sparkConf.contains(KEYTAB))) {code}
There should be a way to force getting HIVE_DELEGATION_TOKEN even when using 
spark-sql or spark-shell.

A possible way is to get HIVE_DELEGATION_TOKEN if the configuration below is set:
{code:java}
spark.security.credentials.hive.enabled   true {code}
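A minimal Scala sketch of that proposal, assuming the same inputs the quoted
delegationTokensRequired check has in scope; `forceEnabled` models
spark.security.credentials.hive.enabled=true, and everything here is
illustrative, not the actual Spark code.
{code:scala}
// Hypothetical sketch: also fetch the Hive delegation token when the user
// explicitly enables the Hive credential provider, even in client mode
// without a keytab (the spark-sql / spark-shell case).
def shouldFetchHiveToken(
    currentTokenMissing: Boolean,
    securityEnabled: Boolean,
    metastoreUriConfigured: Boolean,
    forceEnabled: Boolean,          // spark.security.credentials.hive.enabled
    isProxyUser: Boolean,
    clusterModeWithoutKeytab: Boolean): Boolean = {
  currentTokenMissing && securityEnabled && metastoreUriConfigured &&
    (forceEnabled || isProxyUser || clusterModeWithoutKeytab)
}
{code}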
 
{code:java}
24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) 
(machine1.example.com executor 2): 
org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive 
Metastore
...
Caused by: MetaException(message:Could not connect to meta store using any of 
the URIs provided. Most recent failure: 
org.apache.thrift.transport.TTransportException: GSS initiate failed {code}
 

 
{code:java}
spark-sql> select * from temp.test_hive_catalog;
...
...
24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) 
(machine1.example.com executor 2): 
org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive 
Metastore
        at 
org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:84)
        at 
org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:34)
        at org.apache.iceberg.ClientPoolImpl.get(ClientPoolImpl.java:125)
        at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:56)
        at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51)
        at 
org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:122)
        at 
org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:158)
        at 
org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97)
        at 
org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80)
        at 
org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47)
        at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:124)
        at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:111)
        at 
org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.overlayTableProperties(HiveIcebergStorageHandler.java:276)
        at 
org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.configureInputJobProperties(HiveIcebergStorageHandler.java:86)
        at 
org.apache.spark.sql.hive.HiveTableUtil$.configureJobPropertiesForStorageHandler(TableReader.scala:426)
        at 
org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:456)
        at 
org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:342)
        at 
org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:342)
        at 
org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:181)
        at 
org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:181)
        at scala.Option.foreach(Option.scala:407)
        at 
org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$6(HadoopRDD.scala:181)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:178)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:247)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at 

[jira] [Updated] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell

2024-02-27 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE updated SPARK-47197:
-
Summary: Failed to connect HiveMetastore when using iceberg with 
HiveCatalog on spark-sql or spark-shell  (was: Failed to connect HiveMetastore 
when using iceberg with HiveCatalog by spark-sql or spark-shell)

> Failed to connect HiveMetastore when using iceberg with HiveCatalog on 
> spark-sql or spark-shell
> ---
>
> Key: SPARK-47197
> URL: https://issues.apache.org/jira/browse/SPARK-47197
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 3.2.3, 3.5.1
>Reporter: YUBI LEE
>Priority: Major
>
> I can't connect to kerberized HiveMetastore when using iceberg with 
> HiveCatalog by spark-sql or spark-shell.
> I think this issue is caused by the fact that there is no way to get 
> HIVE_DELEGATION_TOKEN when using spark-sql or spark-shell.
> ([https://github.com/apache/spark/blob/v3.5.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/security/HiveDelegationTokenProvider.scala#L78-L83)]
>  
> {code:java}
>     val currentToken = 
> UserGroupInformation.getCurrentUser().getCredentials().getToken(tokenAlias)
>     currentToken == null && UserGroupInformation.isSecurityEnabled &&
>       hiveConf(hadoopConf).getTrimmed("hive.metastore.uris", "").nonEmpty &&
>       (SparkHadoopUtil.get.isProxyUser(UserGroupInformation.getCurrentUser()) 
> ||
>         (!Utils.isClientMode(sparkConf) && !sparkConf.contains(KEYTAB))) 
> {code}
> There should be a way to force getting HIVE_DELEGATION_TOKEN even when using 
> spark-sql or spark-shell.
> A possible way is to get HIVE_DELEGATION_TOKEN if the configuration below is 
> set:
> {code:java}
> spark.security.credentials.hive.enabled   true {code}
>  
> {code:java}
> 24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) 
> (machine1.example.com executor 2): 
> org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive 
> Metastore
> ...
> Caused by: MetaException(message:Could not connect to meta store using any of 
> the URIs provided. Most recent failure: 
> org.apache.thrift.transport.TTransportException: GSS initiate failed {code}
>  
>  
> {code:java}
> spark-sql> select * from temp.test_hive_catalog;
> ...
> ...
> 24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) 
> (machine1.example.com executor 2): 
> org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive 
> Metastore
>         at 
> org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:84)
>         at 
> org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:34)
>         at org.apache.iceberg.ClientPoolImpl.get(ClientPoolImpl.java:125)
>         at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:56)
>         at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51)
>         at 
> org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:122)
>         at 
> org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:158)
>         at 
> org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97)
>         at 
> org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80)
>         at 
> org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47)
>         at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:124)
>         at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:111)
>         at 
> org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.overlayTableProperties(HiveIcebergStorageHandler.java:276)
>         at 
> org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.configureInputJobProperties(HiveIcebergStorageHandler.java:86)
>         at 
> org.apache.spark.sql.hive.HiveTableUtil$.configureJobPropertiesForStorageHandler(TableReader.scala:426)
>         at 
> org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:456)
>         at 
> org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:342)
>         at 
> org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:342)
>         at 
> org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:181)
>         at 
> org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:181)
>         at scala.Option.foreach(Option.scala:407)
>         at 
> org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$6(HadoopRDD.scala:181)
>         at scala.Option.getOrElse(Option.scala:189)
>         at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:178)
>         at 

[jira] [Updated] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog by spark-sql or spark-shell

2024-02-27 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE updated SPARK-47197:
-
Component/s: SQL

> Failed to connect HiveMetastore when using iceberg with HiveCatalog by 
> spark-sql or spark-shell
> ---
>
> Key: SPARK-47197
> URL: https://issues.apache.org/jira/browse/SPARK-47197
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 3.2.3, 3.5.1
>Reporter: YUBI LEE
>Priority: Major
>
> I can't connect to kerberized HiveMetastore when using iceberg with 
> HiveCatalog by spark-sql or spark-shell.
> I think this issue is caused by the fact that there is no way to get 
> HIVE_DELEGATION_TOKEN when using spark-sql or spark-shell.
> ([https://github.com/apache/spark/blob/v3.5.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/security/HiveDelegationTokenProvider.scala#L78-L83)]
>  
> {code:java}
>     val currentToken = 
> UserGroupInformation.getCurrentUser().getCredentials().getToken(tokenAlias)
>     currentToken == null && UserGroupInformation.isSecurityEnabled &&
>       hiveConf(hadoopConf).getTrimmed("hive.metastore.uris", "").nonEmpty &&
>       (SparkHadoopUtil.get.isProxyUser(UserGroupInformation.getCurrentUser()) 
> ||
>         (!Utils.isClientMode(sparkConf) && !sparkConf.contains(KEYTAB))) 
> {code}
> There should be a way to force getting HIVE_DELEGATION_TOKEN even when using 
> spark-sql or spark-shell.
> A possible way is to get HIVE_DELEGATION_TOKEN if the configuration below is 
> set:
> {code:java}
> spark.security.credentials.hive.enabled   true {code}
>  
> {code:java}
> 24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) 
> (machine1.example.com executor 2): 
> org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive 
> Metastore
> ...
> Caused by: MetaException(message:Could not connect to meta store using any of 
> the URIs provided. Most recent failure: 
> org.apache.thrift.transport.TTransportException: GSS initiate failed {code}
>  
>  
> {code:java}
> spark-sql> select * from temp.test_hive_catalog;
> ...
> ...
> 24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) 
> (machine1.example.com executor 2): 
> org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive 
> Metastore
>         at 
> org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:84)
>         at 
> org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:34)
>         at org.apache.iceberg.ClientPoolImpl.get(ClientPoolImpl.java:125)
>         at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:56)
>         at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51)
>         at 
> org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:122)
>         at 
> org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:158)
>         at 
> org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97)
>         at 
> org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80)
>         at 
> org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47)
>         at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:124)
>         at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:111)
>         at 
> org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.overlayTableProperties(HiveIcebergStorageHandler.java:276)
>         at 
> org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.configureInputJobProperties(HiveIcebergStorageHandler.java:86)
>         at 
> org.apache.spark.sql.hive.HiveTableUtil$.configureJobPropertiesForStorageHandler(TableReader.scala:426)
>         at 
> org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:456)
>         at 
> org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:342)
>         at 
> org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:342)
>         at 
> org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:181)
>         at 
> org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:181)
>         at scala.Option.foreach(Option.scala:407)
>         at 
> org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$6(HadoopRDD.scala:181)
>         at scala.Option.getOrElse(Option.scala:189)
>         at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:178)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:247)
>         at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243)
>         at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96)
>     

[jira] [Commented] (SPARK-47172) Upgrade Transport block cipher mode to GCM

2024-02-27 Thread Steve Weis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821449#comment-17821449
 ] 

Steve Weis commented on SPARK-47172:


Hi [~mridul]. I'd like to get your input on this now that there is TLS support 
for RPC calls. The "AES based encryption" that uses AES-CTR is not really 
needed anymore.

It's not a one-line change to fix because Apache Commons 
CryptoInputStream/CryptoOutputStream don't support GCM. We can move to the JCE 
CipherInputStream/CipherOutputStream, but it's not a drop-in replacement.
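
For illustration, a rough sketch of what JCE-based GCM wrapping could look like 
(assumptions: key/IV management, header authentication, and the Netty stream 
integration are all omitted; this is a discussion aid, not the proposed patch):
{code:java}
// Sketch only: wrap an OutputStream with AES/GCM/NoPadding via the JCE streams.
// A fresh 96-bit IV is generated per message and sent in the clear; the 128-bit
// auth tag is what adds the extra 16 bytes per payload mentioned in the ticket.
import java.security.SecureRandom
import javax.crypto.{Cipher, CipherOutputStream}
import javax.crypto.spec.{GCMParameterSpec, SecretKeySpec}

def gcmWrap(out: java.io.OutputStream, key: Array[Byte]): CipherOutputStream = {
  val iv = new Array[Byte](12)
  new SecureRandom().nextBytes(iv)
  val cipher = Cipher.getInstance("AES/GCM/NoPadding")
  cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
    new GCMParameterSpec(128, iv))
  out.write(iv)                       // receiver reads the IV before decrypting
  new CipherOutputStream(out, cipher)
}
{code}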

I'm wondering if we can just plan to deprecate the ad hoc transport encryption 
and standardize on TLS.

> Upgrade Transport block cipher mode to GCM
> --
>
> Key: SPARK-47172
> URL: https://issues.apache.org/jira/browse/SPARK-47172
> Project: Spark
>  Issue Type: Improvement
>  Components: Security
>Affects Versions: 3.4.2, 3.5.0
>Reporter: Steve Weis
>Priority: Minor
>
> The cipher transformation currently used for encrypting RPC calls is an 
> unauthenticated mode (AES/CTR/NoPadding). This needs to be upgraded to an 
> authenticated mode (AES/GCM/NoPadding) to prevent ciphertext from being 
> modified in transit.
> The relevant line is here: 
> [https://github.com/apache/spark/blob/a939a7d0fd9c6b23c879cbee05275c6fbc939e38/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java#L220]
> GCM is relatively more computationally expensive than CTR and adds a 16-byte 
> block of authentication tag data to each payload. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog by spark-sql or spark-shell

2024-02-27 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE updated SPARK-47197:
-
Description: 
I can't connect to a kerberized HiveMetastore when using Iceberg with HiveCatalog 
via spark-sql or spark-shell.

I think this issue is caused by the fact that there is no way to get 
HIVE_DELEGATION_TOKEN when using spark-sql or spark-shell.

([https://github.com/apache/spark/blob/v3.5.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/security/HiveDelegationTokenProvider.scala#L78-L83])

 
{code:java}
    val currentToken = 
UserGroupInformation.getCurrentUser().getCredentials().getToken(tokenAlias)
    currentToken == null && UserGroupInformation.isSecurityEnabled &&
      hiveConf(hadoopConf).getTrimmed("hive.metastore.uris", "").nonEmpty &&
      (SparkHadoopUtil.get.isProxyUser(UserGroupInformation.getCurrentUser()) ||
        (!Utils.isClientMode(sparkConf) && !sparkConf.contains(KEYTAB))) {code}
There should be a way to force obtaining HIVE_DELEGATION_TOKEN even when using 
spark-sql or spark-shell.

A possible approach is to obtain HIVE_DELEGATION_TOKEN when the configuration below is set:
{code:java}
spark.security.credentials.hive.enabled   true {code}
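
For illustration only, a sketch of how the delegationTokensRequired check quoted 
above could honor such an explicit opt-in (it reuses the names from that snippet; 
the exact config semantics and the real patch would need discussion):
{code:java}
// Hypothetical sketch, not an actual patch: treat an explicitly set
// spark.security.credentials.hive.enabled=true as forcing token retrieval,
// bypassing the client-mode/keytab short-circuit.
override def delegationTokensRequired(
    sparkConf: SparkConf,
    hadoopConf: Configuration): Boolean = {
  val forced = sparkConf.getOption("spark.security.credentials.hive.enabled")
    .exists(_.toBoolean)
  val currentToken =
    UserGroupInformation.getCurrentUser().getCredentials().getToken(tokenAlias)
  currentToken == null && UserGroupInformation.isSecurityEnabled &&
    hiveConf(hadoopConf).getTrimmed("hive.metastore.uris", "").nonEmpty &&
    (forced ||
      SparkHadoopUtil.get.isProxyUser(UserGroupInformation.getCurrentUser()) ||
      (!Utils.isClientMode(sparkConf) && !sparkConf.contains(KEYTAB)))
}
{code}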
 
{code:java}
24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) 
(machine1.example.com executor 2): 
org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive 
Metastore
...
Caused by: MetaException(message:Could not connect to meta store using any of 
the URIs provided. Most recent failure: 
org.apache.thrift.transport.TTransportException: GSS initiate failed {code}
 

 
{code:java}
spark-sql> select * from temp.test_hive_catalog;
...
...
24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) 
(machine1.example.com executor 2): 
org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive 
Metastore
        at 
org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:84)
        at 
org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:34)
        at org.apache.iceberg.ClientPoolImpl.get(ClientPoolImpl.java:125)
        at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:56)
        at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51)
        at 
org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:122)
        at 
org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:158)
        at 
org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97)
        at 
org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80)
        at 
org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47)
        at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:124)
        at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:111)
        at 
org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.overlayTableProperties(HiveIcebergStorageHandler.java:276)
        at 
org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.configureInputJobProperties(HiveIcebergStorageHandler.java:86)
        at 
org.apache.spark.sql.hive.HiveTableUtil$.configureJobPropertiesForStorageHandler(TableReader.scala:426)
        at 
org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:456)
        at 
org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:342)
        at 
org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:342)
        at 
org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:181)
        at 
org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:181)
        at scala.Option.foreach(Option.scala:407)
        at 
org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$6(HadoopRDD.scala:181)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:178)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:247)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at 

[jira] [Updated] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog by spark-sql or spark-shell

2024-02-27 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE updated SPARK-47197:
-
Description: 
I can't connect to a kerberized HiveMetastore when using Iceberg with HiveCatalog 
via spark-sql or spark-shell.

I think this issue is caused by the fact that there is no way to get 
HIVE_DELEGATION_TOKEN when using spark-sql or spark-shell.

([https://github.com/apache/spark/blob/v3.5.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/security/HiveDelegationTokenProvider.scala#L78-L83])

 
{code:java}
    val currentToken = 
UserGroupInformation.getCurrentUser().getCredentials().getToken(tokenAlias)
    currentToken == null && UserGroupInformation.isSecurityEnabled &&
      hiveConf(hadoopConf).getTrimmed("hive.metastore.uris", "").nonEmpty &&
      (SparkHadoopUtil.get.isProxyUser(UserGroupInformation.getCurrentUser()) ||
        (!Utils.isClientMode(sparkConf) && !sparkConf.contains(KEYTAB))) {code}
There should be a way to force obtaining HIVE_DELEGATION_TOKEN even when using 
spark-sql or spark-shell.

A possible approach is to obtain HIVE_DELEGATION_TOKEN when the configuration below is set:
{code:java}
spark.security.credentials.hive.enabled   true {code}
 

 

 
{code:java}
spark-sql> select * from temp.test_hive_catalog;
...
...
24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) 
(machine1.example.com executor 2): 
org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive 
Metastore
        at 
org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:84)
        at 
org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:34)
        at org.apache.iceberg.ClientPoolImpl.get(ClientPoolImpl.java:125)
        at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:56)
        at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51)
        at 
org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:122)
        at 
org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:158)
        at 
org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97)
        at 
org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80)
        at 
org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47)
        at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:124)
        at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:111)
        at 
org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.overlayTableProperties(HiveIcebergStorageHandler.java:276)
        at 
org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.configureInputJobProperties(HiveIcebergStorageHandler.java:86)
        at 
org.apache.spark.sql.hive.HiveTableUtil$.configureJobPropertiesForStorageHandler(TableReader.scala:426)
        at 
org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:456)
        at 
org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:342)
        at 
org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:342)
        at 
org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:181)
        at 
org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:181)
        at scala.Option.foreach(Option.scala:407)
        at 
org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$6(HadoopRDD.scala:181)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:178)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:247)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at 

[jira] [Created] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog by spark-sql or spark-shell

2024-02-27 Thread YUBI LEE (Jira)
YUBI LEE created SPARK-47197:


 Summary: Failed to connect HiveMetastore when using iceberg with 
HiveCatalog by spark-sql or spark-shell
 Key: SPARK-47197
 URL: https://issues.apache.org/jira/browse/SPARK-47197
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 3.5.1, 3.2.3
Reporter: YUBI LEE


I can't connect to a kerberized HiveMetastore when using Iceberg with HiveCatalog 
via spark-sql or spark-shell.

I think this issue is caused by the fact that there is no way to get 
HIVE_DELEGATION_TOKEN when using spark-sql or spark-shell.

([https://github.com/apache/spark/blob/v3.5.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/security/HiveDelegationTokenProvider.scala#L78-L83])

 
{code:java}
    val currentToken = 
UserGroupInformation.getCurrentUser().getCredentials().getToken(tokenAlias)
    currentToken == null && UserGroupInformation.isSecurityEnabled &&
      hiveConf(hadoopConf).getTrimmed("hive.metastore.uris", "").nonEmpty &&
      (SparkHadoopUtil.get.isProxyUser(UserGroupInformation.getCurrentUser()) ||
        (!Utils.isClientMode(sparkConf) && !sparkConf.contains(KEYTAB))) {code}
There should be a way to force obtaining HIVE_DELEGATION_TOKEN even when using 
spark-sql or spark-shell.

A possible approach is to obtain HIVE_DELEGATION_TOKEN when the configuration below is set:
{code:java}
spark.security.credentials.hive.enabled   true {code}
 

 

 
{code:java}
spark-sql> select * from temp.test_hive_catalog;
...
...
24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) 
(machine1.example.com executor 2): 
org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive 
Metastore
        at 
org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:84)
        at 
org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:34)
        at org.apache.iceberg.ClientPoolImpl.get(ClientPoolImpl.java:125)
        at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:56)
        at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51)
        at 
org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:122)
        at 
org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:158)
        at 
org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97)
        at 
org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80)
        at 
org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47)
        at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:124)
        at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:111)
        at 
org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.overlayTableProperties(HiveIcebergStorageHandler.java:276)
        at 
org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.configureInputJobProperties(HiveIcebergStorageHandler.java:86)
        at 
org.apache.spark.sql.hive.HiveTableUtil$.configureJobPropertiesForStorageHandler(TableReader.scala:426)
        at 
org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:456)
        at 
org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:342)
        at 
org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:342)
        at 
org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:181)
        at 
org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:181)
        at scala.Option.foreach(Option.scala:407)
        at 
org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$6(HadoopRDD.scala:181)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:178)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:247)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at 

[jira] [Commented] (SPARK-24815) Structured Streaming should support dynamic allocation

2024-02-27 Thread Mich Talebzadeh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821433#comment-17821433
 ] 

Mich Talebzadeh commented on SPARK-24815:
-

Some thoughts on this, if I may.

 

This enhancement request provides a solid foundation for improving dynamic 
allocation in Structured Streaming. Adding more specific details, outlining 
potential benefits, and addressing potential challenges can further strengthen 
the proposal and increase its chances of being implemented.

So these are my thoughts:



- Pluggable Dynamic Allocation: This suggestion shows good design principles, 
allowing for flexibility and future improvements. We should elaborate on the 
benefits of a pluggable approach, such as customization and integration with 
external resource management tools (see the sketch after this list).

- Separate Algorithm for Structured Streaming: This is crucial for adapting 
allocation strategies to the unique nature of streaming workloads. Also, 
outlining how a separate algorithm might differ from the batch counterpart 
could be useful.

- Warning for Enabled Core Dynamic Allocation: This is a valuable warning to 
prevent accidental misuse and raise awareness among users. Also consider 
suggesting the warning level (e.g. info, warning, error) and potential content 
to provide clarity.

- Challenges and Trade-offs: Briefly mention potential challenges or trade-offs 
associated with implementing these proposals. Suggesting relevant discussions, 
resources, or alternative approaches could strengthen the enhancement request.
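
Purely as an illustration of the pluggable point above (a hypothetical interface, 
not an existing Spark API), something along these lines would let Structured 
Streaming supply its own policy while batch keeps the current behavior:
{code:java}
// Hypothetical sketch only -- names and shape are illustrative, not Spark's API.
trait ExecutorAllocationPolicy {
  /** Desired executor count, given the current backlog and executor set. */
  def targetExecutors(pendingTasks: Int, runningExecutors: Int): Int
}

// A streaming-oriented policy might key off micro-batch duration or source lag
// instead of task backlog; this placeholder simply scales with the backlog.
class StreamingAllocationPolicy(tasksPerExecutor: Int) extends ExecutorAllocationPolicy {
  override def targetExecutors(pendingTasks: Int, runningExecutors: Int): Int =
    math.max(1, math.ceil(pendingTasks.toDouble / tasksPerExecutor).toInt)
}
{code}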

 

> Structured Streaming should support dynamic allocation
> --
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Karthik Palaniappan
>Priority: Minor
>  Labels: pull-request-available
>
> For batch jobs, dynamic allocation is very useful for adding and removing 
> containers to match the actual workload. On multi-tenant clusters, it ensures 
> that a Spark job is taking no more resources than necessary. In cloud 
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured 
> streaming job, the batch dynamic allocation algorithm kicks in. It requests 
> more executors if the task backlog is a certain size, and removes executors 
> if they idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a 
> particular implementation in SparkContext.scala (this should be a separate 
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the 
> batch algorithm. Eventually, continuous processing might need its own 
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when 
> Core's dynamic allocation is enabled



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47196) Fix `core` module to succeed SBT tests

2024-02-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47196:
--
Description: 
This happens at branch-3.4 only. branch-3.3/branch-3.5/master are okay.
{code:java}
$ build/sbt "core/testOnly *.DAGSchedulerSuite"
[info] DAGSchedulerSuite:
[info] - [SPARK-3353] parent stage should have lower stage id *** FAILED *** 
(439 milliseconds)
[info]   java.lang.IllegalStateException: Could not initialize plugin: 
interface org.mockito.plugins.MockMaker (alternate: null)
...
[info] *** 1 SUITE ABORTED ***
[info] *** 118 TESTS FAILED ***
[error] Error during tests:
[error] org.apache.spark.scheduler.DAGSchedulerSuite
[error] (core / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 48 s, completed Feb 27, 2024, 1:26:27 PM {code}
 

MAVEN
{code:java}
$ build/mvn dependency:tree -pl core | grep byte-buddy
...
[INFO] |  +- net.bytebuddy:byte-buddy:jar:1.12.10:test
[INFO] |  +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test
{code}
SBT
{code:java}
$ build/sbt "core/test:dependencyTree" | grep byte-buddy
[info]   | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18)
[info]   | | | | +-net.bytebuddy:byte-buddy:1.12.18
{code}
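
If the eviction shown above is what breaks Mockito's MockMaker under SBT, one 
conventional way to pin the version on the SBT side (an illustrative sketch only; 
the actual fix may differ) is a dependencyOverrides entry in the build definition:
{code:java}
// Hypothetical sketch: align SBT with the byte-buddy 1.12.10 that Maven resolves.
dependencyOverrides ++= Seq(
  "net.bytebuddy" % "byte-buddy"       % "1.12.10",
  "net.bytebuddy" % "byte-buddy-agent" % "1.12.10"
)
{code}
Maven already resolves 1.12.10 on the test classpath, so aligning SBT to the same 
version should make the two builds consistent.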

  was:
This happens at branch-3.4 only.
{code:java}
$ build/sbt "core/testOnly *.DAGSchedulerSuite"
[info] DAGSchedulerSuite:
[info] - [SPARK-3353] parent stage should have lower stage id *** FAILED *** 
(439 milliseconds)
[info]   java.lang.IllegalStateException: Could not initialize plugin: 
interface org.mockito.plugins.MockMaker (alternate: null)
...
[info] *** 1 SUITE ABORTED ***
[info] *** 118 TESTS FAILED ***
[error] Error during tests:
[error] org.apache.spark.scheduler.DAGSchedulerSuite
[error] (core / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 48 s, completed Feb 27, 2024, 1:26:27 PM {code}
 

MAVEN
{code:java}
$ build/mvn dependency:tree -pl core | grep byte-buddy
...
[INFO] |  +- net.bytebuddy:byte-buddy:jar:1.12.10:test
[INFO] |  +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test
{code}
SBT
{code:java}
$ build/sbt "core/test:dependencyTree" | grep byte-buddy
[info]   | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18)
[info]   | | | | +-net.bytebuddy:byte-buddy:1.12.18
{code}


> Fix `core` module to succeed SBT tests
> --
>
> Key: SPARK-47196
> URL: https://issues.apache.org/jira/browse/SPARK-47196
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.2, 3.4.0, 3.4.1
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> This happens at branch-3.4 only. branch-3.3/branch-3.5/master are okay.
> {code:java}
> $ build/sbt "core/testOnly *.DAGSchedulerSuite"
> [info] DAGSchedulerSuite:
> [info] - [SPARK-3353] parent stage should have lower stage id *** FAILED *** 
> (439 milliseconds)
> [info]   java.lang.IllegalStateException: Could not initialize plugin: 
> interface org.mockito.plugins.MockMaker (alternate: null)
> ...
> [info] *** 1 SUITE ABORTED ***
> [info] *** 118 TESTS FAILED ***
> [error] Error during tests:
> [error]   org.apache.spark.scheduler.DAGSchedulerSuite
> [error] (core / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
> [error] Total time: 48 s, completed Feb 27, 2024, 1:26:27 PM {code}
>  
> MAVEN
> {code:java}
> $ build/mvn dependency:tree -pl core | grep byte-buddy
> ...
> [INFO] |  +- net.bytebuddy:byte-buddy:jar:1.12.10:test
> [INFO] |  +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test
> {code}
> SBT
> {code:java}
> $ build/sbt "core/test:dependencyTree" | grep byte-buddy
> [info]   | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18)
> [info]   | | | | +-net.bytebuddy:byte-buddy:1.12.18
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47196) Fix `core` module to succeed SBT tests

2024-02-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47196:
--
Description: 
This happens at branch-3.4 only.
{code:java}
$ build/sbt "core/testOnly *.DAGSchedulerSuite"
[info] DAGSchedulerSuite:
[info] - [SPARK-3353] parent stage should have lower stage id *** FAILED *** 
(439 milliseconds)
[info]   java.lang.IllegalStateException: Could not initialize plugin: 
interface org.mockito.plugins.MockMaker (alternate: null)
...
[info] *** 1 SUITE ABORTED ***
[info] *** 118 TESTS FAILED ***
[error] Error during tests:
[error] org.apache.spark.scheduler.DAGSchedulerSuite
[error] (core / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 48 s, completed Feb 27, 2024, 1:26:27 PM {code}
 

MAVEN
{code:java}
$ build/mvn dependency:tree -pl core | grep byte-buddy
...
[INFO] |  +- net.bytebuddy:byte-buddy:jar:1.12.10:test
[INFO] |  +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test
{code}
SBT
{code:java}
$ build/sbt "core/test:dependencyTree" | grep byte-buddy
[info]   | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18)
[info]   | | | | +-net.bytebuddy:byte-buddy:1.12.18
{code}

  was:
MAVEN

{code}
$ build/mvn dependency:tree -pl core | grep byte-buddy
...
[INFO] |  +- net.bytebuddy:byte-buddy:jar:1.12.10:test
[INFO] |  +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test
{code}

SBT

{code}
$ build/sbt "core/test:dependencyTree" | grep byte-buddy
[info]   | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18)
[info]   | | | | +-net.bytebuddy:byte-buddy:1.12.18
{code}


> Fix `core` module to succeed SBT tests
> --
>
> Key: SPARK-47196
> URL: https://issues.apache.org/jira/browse/SPARK-47196
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.2, 3.4.0, 3.4.1
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> This happens at branch-3.4 only.
> {code:java}
> $ build/sbt "core/testOnly *.DAGSchedulerSuite"
> [info] DAGSchedulerSuite:
> [info] - [SPARK-3353] parent stage should have lower stage id *** FAILED *** 
> (439 milliseconds)
> [info]   java.lang.IllegalStateException: Could not initialize plugin: 
> interface org.mockito.plugins.MockMaker (alternate: null)
> ...
> [info] *** 1 SUITE ABORTED ***
> [info] *** 118 TESTS FAILED ***
> [error] Error during tests:
> [error]   org.apache.spark.scheduler.DAGSchedulerSuite
> [error] (core / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
> [error] Total time: 48 s, completed Feb 27, 2024, 1:26:27 PM {code}
>  
> MAVEN
> {code:java}
> $ build/mvn dependency:tree -pl core | grep byte-buddy
> ...
> [INFO] |  +- net.bytebuddy:byte-buddy:jar:1.12.10:test
> [INFO] |  +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test
> {code}
> SBT
> {code:java}
> $ build/sbt "core/test:dependencyTree" | grep byte-buddy
> [info]   | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18)
> [info]   | | | | +-net.bytebuddy:byte-buddy:1.12.18
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47196) Fix `core` module to succeed SBT tests

2024-02-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47196:
--
Description: 
MAVEN

{code}
$ build/mvn dependency:tree -pl core | grep byte-buddy
...
[INFO] |  +- net.bytebuddy:byte-buddy:jar:1.12.10:test
[INFO] |  +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test
{code}

SBT

{code}
$ build/sbt "core/test:dependencyTree" | grep byte-buddy
[info]   | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18)
[info]   | | | | +-net.bytebuddy:byte-buddy:1.12.18
{code}

> Fix `core` module to succeed SBT tests
> --
>
> Key: SPARK-47196
> URL: https://issues.apache.org/jira/browse/SPARK-47196
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.2, 3.4.0, 3.4.1
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> MAVEN
> {code}
> $ build/mvn dependency:tree -pl core | grep byte-buddy
> ...
> [INFO] |  +- net.bytebuddy:byte-buddy:jar:1.12.10:test
> [INFO] |  +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test
> {code}
> SBT
> {code}
> $ build/sbt "core/test:dependencyTree" | grep byte-buddy
> [info]   | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18)
> [info]   | | | | +-net.bytebuddy:byte-buddy:1.12.18
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47177) Cached SQL plan do not display final AQE plan in explain string

2024-02-27 Thread Ziqi Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziqi Liu updated SPARK-47177:
-
Description: 
The AQE plan is expected to display the final plan after execution. This is not 
true for a cached SQL plan: it will show the initial plan instead. This behavior 
change was introduced in [https://github.com/apache/spark/pull/40812], which tried 
to fix the concurrency issue with cached plans.

*In short, the plan used to execute and the plan used to explain are not the 
same instance, thus causing the inconsistency.*

 

I don't have a clear idea how to fix this yet:
 * maybe we just add a coarse-granularity lock in explain?
 * make innerChildren a function: clone the initial plan, and on every call check 
whether the original AQE plan is finalized (making the final flag atomic first, 
of course); if it is not, return the cloned initial plan, and if it is finalized, 
clone the final plan and return that one (see the sketch below). This still won't 
reflect the AQE plan in real time in a concurrent situation, but at least we have 
an initial version and a final version.
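
A minimal, self-contained sketch of the second idea (the names are hypothetical; 
this is not AdaptiveSparkPlanExec's real API, only the cloning discipline it describes):
{code:java}
// Sketch: explain() always gets a cloned snapshot -- the final plan once it is
// published, otherwise the initial plan -- and never the mutable instance that
// is being executed concurrently.
final class AdaptivePlanHolder[P](initial: P, copyPlan: P => P) {
  @volatile private var finalPlan: Option[P] = None  // written once by the execution thread

  def markFinal(plan: P): Unit = { finalPlan = Some(plan) }

  def planForExplain(): P = finalPlan match {
    case Some(p) => copyPlan(p)       // finalized: show the real final plan
    case None    => copyPlan(initial) // still running: snapshot of the initial plan
  }
}
{code}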

 

A simple repro:
{code:java}
d1 = spark.range(1000).withColumn("key", expr("id % 
100")).groupBy("key").agg({"key": "count"})
cached_d2 = d1.cache()
df = cached_d2.filter("key > 10")
df.collect() {code}
{code:java}
>>> df.explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(1) Filter (isnotnull(key#4L) AND (key#4L > 10))
   +- TableCacheQueryStage 0
      +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), 
(key#4L > 10)]
            +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, 
memory, deserialized, 1 replicas)
                  +- AdaptiveSparkPlan isFinalPlan=false
                     +- HashAggregate(keys=[key#4L], functions=[count(key#4L)])
                        +- Exchange hashpartitioning(key#4L, 200), 
ENSURE_REQUIREMENTS, [plan_id=24]
                           +- HashAggregate(keys=[key#4L], 
functions=[partial_count(key#4L)])
                              +- Project [(id#2L % 100) AS key#4L]
                                 +- Range (0, 1000, step=1, splits=10)
+- == Initial Plan ==
   Filter (isnotnull(key#4L) AND (key#4L > 10))
   +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), (key#4L > 
10)]
         +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, 
memory, deserialized, 1 replicas)
               +- AdaptiveSparkPlan isFinalPlan=false
                  +- HashAggregate(keys=[key#4L], functions=[count(key#4L)])
                     +- Exchange hashpartitioning(key#4L, 200), 
ENSURE_REQUIREMENTS, [plan_id=24]
                        +- HashAggregate(keys=[key#4L], 
functions=[partial_count(key#4L)])
                           +- Project [(id#2L % 100) AS key#4L]
                              +- Range (0, 1000, step=1, splits=10){code}

  was:
AQE plan is expected to display final plan after execution. This is not true 
for cached SQL plan: it will show the initial plan instead. This behavior 
change is introduced in [https://github.com/apache/spark/pull/40812] it tried 
to fix the concurrency issue with cached plan. 

*In short, the plan used to executed and the plan used to explain is not the 
same instance, thus causing the inconsistency.*

 

I don't have a clear idea how yet
 * maybe we just a coarse granularity lock in explain?
 * make innerChildren a function: clone the initial plan, every time checked 
for whether the original AQE plan is finalized (making the final flag atomic 
first, of course), if no return the cloned initial plan, if it's finalized, 
clone the final plan and return that one. But still this won't be able to 
reflect the AQE plan in real time, in a concurrent situation, but at least we 
have initial version and final version.

 

A simple repro:
{code:java}
d1 = spark.range(1000).withColumn("key", expr("id % 
100")).groupBy("key").agg({"key": "count"})
cached_d2 = d1.cache()
df = cached_d2.withColumn("key2", expr("key % 
10")).groupBy("key2").agg({"key2": "count"})
df.collect() {code}
{code:java}
>>> df.explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(2) HashAggregate(keys=[key2#36L], functions=[count(key2#36L)])
   +- AQEShuffleRead coalesced
      +- ShuffleQueryStage 1
         +- Exchange hashpartitioning(key2#36L, 200), ENSURE_REQUIREMENTS, 
[plan_id=83]
            +- *(1) HashAggregate(keys=[key2#36L], 
functions=[partial_count(key2#36L)])
               +- *(1) Project [(key#27L % 10) AS key2#36L]
                  +- TableCacheQueryStage 0
                     +- InMemoryTableScan [key#27L]
                           +- InMemoryRelation [key#27L, count(key)#33L], 
StorageLevel(disk, memory, deserialized, 1 replicas)
                                 +- AdaptiveSparkPlan isFinalPlan=false
                                    +- HashAggregate(keys=[key#4L], 
functions=[count(key#4L)])
             

[jira] [Updated] (SPARK-47196) Fix `core` module to succeed SBT tests

2024-02-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47196:
--
Affects Version/s: 3.4.1
   3.4.0

> Fix `core` module to succeed SBT tests
> --
>
> Key: SPARK-47196
> URL: https://issues.apache.org/jira/browse/SPARK-47196
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.2, 3.4.0, 3.4.1
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43157) TreeNode tags can become corrupted and hang driver when the dataset is cached

2024-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-43157:
---
Labels: pull-request-available  (was: )

> TreeNode tags can become corrupted and hang driver when the dataset is cached
> -
>
> Key: SPARK-43157
> URL: https://issues.apache.org/jira/browse/SPARK-43157
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 3.5.0
>Reporter: Rob Reeves
>Assignee: Rob Reeves
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.1, 3.5.0
>
>
> If a cached dataset is used by multiple other datasets materialized in 
> separate threads it can corrupt the TreeNode.tags map in any of the cached 
> plan nodes. This will hang the driver forever. This happens because 
> TreeNode.tags is not thread-safe. How this happens:
>  # Multiple datasets are materialized at the same time in different threads 
> that reference the same cached dataset
>  # AdaptiveSparkPlanExec.onUpdatePlan will call ExplainMode.fromString
>  # ExplainUtils uses the TreeNode.tags map to store the operator Id for every 
> node in the plan. This is usually okay because the plan is cloned. When there 
> is an InMemoryScanExec the InMemoryRelation.cachedPlan is not cloned so 
> multiple threads can set the operator Id.
> Making the TreeNode.tags field thread-safe does not solve this problem 
> because there is still a correctness issue. The threads may be overwriting 
> each other's operator Ids, which could be different.
> Example stack trace of the infinite loop:
> {code:scala}
> scala.collection.mutable.HashTable.resize(HashTable.scala:265)
> scala.collection.mutable.HashTable.addEntry0(HashTable.scala:158)
> scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:170)
> scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> scala.collection.mutable.HashMap.put(HashMap.scala:126)
> scala.collection.mutable.HashMap.update(HashMap.scala:131)
> org.apache.spark.sql.catalyst.trees.TreeNode.setTagValue(TreeNode.scala:108)
> org.apache.spark.sql.execution.ExplainUtils$.setOpId$1(ExplainUtils.scala:134)
> …
> org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:175)
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.onUpdatePlan(AdaptiveSparkPlanExec.scala:662){code}
> Example to show the cachedPlan object is not cloned:
> {code:java}
> import org.apache.spark.sql.execution.SparkPlan
> import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec
> import spark.implicits._
> def findCacheOperator(plan: SparkPlan): Option[InMemoryTableScanExec] = {
>   if (plan.isInstanceOf[InMemoryTableScanExec]) {
>     Some(plan.asInstanceOf[InMemoryTableScanExec])
>   } else if (plan.children.isEmpty && plan.subqueries.isEmpty) {
>     None
>   } else {
>     (plan.subqueries.flatMap(p => findCacheOperator(p)) ++
>       plan.children.flatMap(findCacheOperator)).headOption
>   }
> }
> val df = spark.range(10).filter($"id" < 100).cache()
> val df1 = df.limit(1)
> val df2 = df.limit(1)
> // Get the cache operator (InMemoryTableScanExec) in each plan
> val plan1 = findCacheOperator(df1.queryExecution.executedPlan).get
> val plan2 = findCacheOperator(df2.queryExecution.executedPlan).get
> // Check if InMemoryTableScanExec references point to the same object
> println(plan1.eq(plan2))
> // returns false
> // Check if InMemoryRelation references point to the same object
> println(plan1.relation.eq(plan2.relation))
> // returns false
> // Check if the cached SparkPlan references point to the same object
> println(plan1.relation.cachedPlan.eq(plan2.relation.cachedPlan))
> // returns true
> // This shows that the cloned plan2 still has references to the original 
> plan1 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47177) Cached SQL plan do not display final AQE plan in explain string

2024-02-27 Thread Ziqi Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziqi Liu updated SPARK-47177:
-
Description: 
The AQE plan is expected to display the final plan after execution. This is not 
true for a cached SQL plan: it will show the initial plan instead. This behavior 
change was introduced in [https://github.com/apache/spark/pull/40812], which tried 
to fix the concurrency issue with cached plans.

*In short, the plan used to execute and the plan used to explain are not the 
same instance, thus causing the inconsistency.*

 

I don't have a clear idea how to fix this yet:
 * maybe we just add a coarse-granularity lock in explain?
 * make innerChildren a function: clone the initial plan, and on every call check 
whether the original AQE plan is finalized (making the final flag atomic first, 
of course); if it is not, return the cloned initial plan, and if it is finalized, 
clone the final plan and return that one. This still won't reflect the AQE plan 
in real time in a concurrent situation, but at least we have an initial version 
and a final version.

 

A simple repro:
{code:java}
d1 = spark.range(1000).withColumn("key", expr("id % 
100")).groupBy("key").agg({"key": "count"})
cached_d2 = d1.cache()
df = cached_d2.withColumn("key2", expr("key % 
10")).groupBy("key2").agg({"key2": "count"})
df.collect() {code}
{code:java}
>>> df.explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(2) HashAggregate(keys=[key2#36L], functions=[count(key2#36L)])
   +- AQEShuffleRead coalesced
      +- ShuffleQueryStage 1
         +- Exchange hashpartitioning(key2#36L, 200), ENSURE_REQUIREMENTS, 
[plan_id=83]
            +- *(1) HashAggregate(keys=[key2#36L], 
functions=[partial_count(key2#36L)])
               +- *(1) Project [(key#27L % 10) AS key2#36L]
                  +- TableCacheQueryStage 0
                     +- InMemoryTableScan [key#27L]
                           +- InMemoryRelation [key#27L, count(key)#33L], 
StorageLevel(disk, memory, deserialized, 1 replicas)
                                 +- AdaptiveSparkPlan isFinalPlan=false
                                    +- HashAggregate(keys=[key#4L], 
functions=[count(key#4L)])
                                       +- Exchange hashpartitioning(key#4L, 
200), ENSURE_REQUIREMENTS, [plan_id=33]
                                          +- HashAggregate(keys=[key#4L], 
functions=[partial_count(key#4L)])
                                             +- Project [(id#2L % 100) AS 
key#4L]
                                                +- Range (0, 1000, step=1, 
splits=10)
+- == Initial Plan ==
   HashAggregate(keys=[key2#36L], functions=[count(key2#36L)])
   +- Exchange hashpartitioning(key2#36L, 200), ENSURE_REQUIREMENTS, 
[plan_id=30]
      +- HashAggregate(keys=[key2#36L], functions=[partial_count(key2#36L)])
         +- Project [(key#27L % 10) AS key2#36L]
            +- InMemoryTableScan [key#27L]
                  +- InMemoryRelation [key#27L, count(key)#33L], 
StorageLevel(disk, memory, deserialized, 1 replicas)
                        +- AdaptiveSparkPlan isFinalPlan=false
                           +- HashAggregate(keys=[key#4L], 
functions=[count(key#4L)])
                              +- Exchange hashpartitioning(key#4L, 200), 
ENSURE_REQUIREMENTS, [plan_id=33]
                                 +- HashAggregate(keys=[key#4L], 
functions=[partial_count(key#4L)])
                                    +- Project [(id#2L % 100) AS key#4L]
                                       +- Range (0, 1000, step=1, splits=10) 
{code}

  was:
AQE plan is expected to display final plan after execution. This is not true 
for cached SQL plan: it will show the initial plan instead. This behavior 
change is introduced in [https://github.com/apache/spark/pull/40812] it tried 
to fix the concurrency issue with cached plan. 

*In short, the plan used to executed and the plan used to explain is not the 
same instance, thus causing the inconsistency.*

 

I don't have a clear idea how yet, maybe we just a coarse granularity lock in 
explain?

 

A simple repro:
{code:java}
d1 = spark.range(1000).withColumn("key", expr("id % 
100")).groupBy("key").agg({"key": "count"})
cached_d2 = d1.cache()
df = cached_d2.withColumn("key2", expr("key % 
10")).groupBy("key2").agg({"key2": "count"})
df.collect() {code}
{code:java}
>>> df.explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(2) HashAggregate(keys=[key2#36L], functions=[count(key2#36L)])
   +- AQEShuffleRead coalesced
      +- ShuffleQueryStage 1
         +- Exchange hashpartitioning(key2#36L, 200), ENSURE_REQUIREMENTS, 
[plan_id=83]
            +- *(1) HashAggregate(keys=[key2#36L], 
functions=[partial_count(key2#36L)])
               +- *(1) Project [(key#27L % 10) AS key2#36L]
                  +- TableCacheQueryStage 0
                     +- InMemoryTableScan [key#27L]
                           +- 

[jira] [Updated] (SPARK-47177) Cached SQL plan do not display final AQE plan in explain string

2024-02-27 Thread Ziqi Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziqi Liu updated SPARK-47177:
-
Description: 
The AQE plan is expected to display the final plan after execution. This is not 
true for a cached SQL plan: it will show the initial plan instead. This behavior 
change was introduced in [https://github.com/apache/spark/pull/40812], which tried 
to fix the concurrency issue with cached plans.

*In short, the plan used to execute and the plan used to explain are not the 
same instance, thus causing the inconsistency.*

 

I don't have a clear idea how to fix this yet; maybe we just add a 
coarse-granularity lock in explain?

 

A simple repro:
{code:java}
d1 = spark.range(1000).withColumn("key", expr("id % 
100")).groupBy("key").agg({"key": "count"})
cached_d2 = d1.cache()
df = cached_d2.withColumn("key2", expr("key % 
10")).groupBy("key2").agg({"key2": "count"})
df.collect() {code}
{code:java}
>>> df.explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(2) HashAggregate(keys=[key2#36L], functions=[count(key2#36L)])
   +- AQEShuffleRead coalesced
      +- ShuffleQueryStage 1
         +- Exchange hashpartitioning(key2#36L, 200), ENSURE_REQUIREMENTS, 
[plan_id=83]
            +- *(1) HashAggregate(keys=[key2#36L], 
functions=[partial_count(key2#36L)])
               +- *(1) Project [(key#27L % 10) AS key2#36L]
                  +- TableCacheQueryStage 0
                     +- InMemoryTableScan [key#27L]
                           +- InMemoryRelation [key#27L, count(key)#33L], 
StorageLevel(disk, memory, deserialized, 1 replicas)
                                 +- AdaptiveSparkPlan isFinalPlan=false
                                    +- HashAggregate(keys=[key#4L], 
functions=[count(key#4L)])
                                       +- Exchange hashpartitioning(key#4L, 
200), ENSURE_REQUIREMENTS, [plan_id=33]
                                          +- HashAggregate(keys=[key#4L], 
functions=[partial_count(key#4L)])
                                             +- Project [(id#2L % 100) AS 
key#4L]
                                                +- Range (0, 1000, step=1, 
splits=10)
+- == Initial Plan ==
   HashAggregate(keys=[key2#36L], functions=[count(key2#36L)])
   +- Exchange hashpartitioning(key2#36L, 200), ENSURE_REQUIREMENTS, 
[plan_id=30]
      +- HashAggregate(keys=[key2#36L], functions=[partial_count(key2#36L)])
         +- Project [(key#27L % 10) AS key2#36L]
            +- InMemoryTableScan [key#27L]
                  +- InMemoryRelation [key#27L, count(key)#33L], 
StorageLevel(disk, memory, deserialized, 1 replicas)
                        +- AdaptiveSparkPlan isFinalPlan=false
                           +- HashAggregate(keys=[key#4L], 
functions=[count(key#4L)])
                              +- Exchange hashpartitioning(key#4L, 200), 
ENSURE_REQUIREMENTS, [plan_id=33]
                                 +- HashAggregate(keys=[key#4L], 
functions=[partial_count(key#4L)])
                                    +- Project [(id#2L % 100) AS key#4L]
                                       +- Range (0, 1000, step=1, splits=10) 
{code}

  was:
AQE plan is expected to display final plan after execution. This is not true 
for cached SQL plan: it will show the initial plan instead. This behavior 
change is introduced in [https://github.com/apache/spark/pull/40812] it tried 
to fix the concurrency issue with cached plan.  I don't have a clear idea how 
yet, maybe we can check whether the AQE plan is finalized(make the final flag 
atomic first, of course), if not we can return the cloned one, otherwise it's 
thread-safe to return the final one, since it's immutable.

 

A simple repro:
{code:java}
d1 = spark.range(1000).withColumn("key", expr("id % 
100")).groupBy("key").agg({"key": "count"})
cached_d2 = d1.cache()
df = cached_d2.withColumn("key2", expr("key % 
10")).groupBy("key2").agg({"key2": "count"})
df.collect() {code}
{code:java}
>>> df.explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(2) HashAggregate(keys=[key2#36L], functions=[count(key2#36L)])
   +- AQEShuffleRead coalesced
      +- ShuffleQueryStage 1
         +- Exchange hashpartitioning(key2#36L, 200), ENSURE_REQUIREMENTS, 
[plan_id=83]
            +- *(1) HashAggregate(keys=[key2#36L], 
functions=[partial_count(key2#36L)])
               +- *(1) Project [(key#27L % 10) AS key2#36L]
                  +- TableCacheQueryStage 0
                     +- InMemoryTableScan [key#27L]
                           +- InMemoryRelation [key#27L, count(key)#33L], 
StorageLevel(disk, memory, deserialized, 1 replicas)
                                 +- AdaptiveSparkPlan isFinalPlan=false
                                    +- HashAggregate(keys=[key#4L], 
functions=[count(key#4L)])
                                       +- Exchange hashpartitioning(key#4L, 
200), ENSURE_REQUIREMENTS, [plan_id=33]
                 

[jira] [Updated] (SPARK-47196) Fix `core` module to succeed SBT tests

2024-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47196:
---
Labels: pull-request-available  (was: )

> Fix `core` module to succeed SBT tests
> --
>
> Key: SPARK-47196
> URL: https://issues.apache.org/jira/browse/SPARK-47196
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.2
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47196) Fix `core` module to succeed SBT tests

2024-02-27 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47196:
-

 Summary: Fix `core` module to succeed SBT tests
 Key: SPARK-47196
 URL: https://issues.apache.org/jira/browse/SPARK-47196
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.2
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-47193) Converting dataframe to rdd results in data loss

2024-02-27 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821393#comment-17821393
 ] 

Bruce Robbins edited comment on SPARK-47193 at 2/27/24 8:48 PM:


Running this in Spark 3.5.0 in local mode on my laptop, I get
{noformat}
df count = 8
...
rdd count = 8
{noformat}
What is your environment and Spark configuration?

By the way, the "{{...}}" above are messages like
{noformat}
24/02/27 11:34:51 WARN CSVHeaderChecker: CSV header does not conform to the 
schema.
 Header: UserId, LocationId, LocationName, CreatedDate, Status
 Schema: UserId, LocationId, LocationName, Status, CreatedDate
Expected: Status but found: CreatedDate
CSV file: file:userLocation.csv
{noformat}
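
As an aside, if the header/schema column-order mismatch turns out to be relevant, 
here is a hedged sketch (reusing the reporter's userLocation.csv and 
MyUserLocationMessage, and assuming {{import spark.implicits._}} as in the 
attached code) of making that mismatch fail fast instead of being silently bound 
by position:
{noformat}
// Sketch only -- not a confirmed fix for the reported data loss. With the default
// enforceSchema=true, columns are bound by position and the file header is only
// warned about; enforceSchema=false makes the read fail on a header/schema mismatch.
import org.apache.spark.sql.Encoders

val userLocation = spark.read
  .option("header", "true")
  .option("comment", "#")
  .option("nullValue", "null")
  .option("enforceSchema", "false")
  .schema(Encoders.product[MyUserLocationMessage].schema)
  .csv("userLocation.csv")
  .as[MyUserLocationMessage]
{noformat}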



was (Author: bersprockets):
Running this in Spark 3.5.0 in local mode on my laptop, I get
{noformat}
df count = 8
...
rdd count = 8
{noformat}
What is your environment and Spark configuration?

By the way, the {{...}} above are messages like
{noformat}
24/02/27 11:34:51 WARN CSVHeaderChecker: CSV header does not conform to the 
schema.
 Header: UserId, LocationId, LocationName, CreatedDate, Status
 Schema: UserId, LocationId, LocationName, Status, CreatedDate
Expected: Status but found: CreatedDate
CSV file: file:userLocation.csv
{noformat}


> Converting dataframe to rdd results in data loss
> 
>
> Key: SPARK-47193
> URL: https://issues.apache.org/jira/browse/SPARK-47193
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Ivan Bova
>Priority: Critical
>  Labels: correctness
> Attachments: device.csv, deviceClass.csv, deviceType.csv, 
> language.csv, location.csv, location1.csv, timeZoneLookup.csv, user.csv, 
> userLocation.csv, userProfile.csv
>
>
> I have 10 csv files and need to create a mapping from them. After all of the 
> joins, the dataframe contains all expected rows, but the rdd from this dataframe 
> contains only half of them.
> {code:java}
> case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: 
> String, LastName: String, LanguageId: Option[Int])
> case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String)
> case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], 
> UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: 
> Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: 
> Option[Int])
> case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String)
> case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String)
> case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: 
> Option[Double], Longitude: Option[Double], Radius: Option[Double], 
> CreatedDate: Timestamp)
> case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String)
> case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: 
> String, Status: Int, CreatedDate: Timestamp)
> case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: 
> Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp])
> case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], 
> Address1: String, Address2: String, City: String, State: String, Country: 
> String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus: 
> Option[Int], Location1Enabled: Option[Boolean], LocationKey: String, 
> UpdatedDateTime: Timestamp, CreatedDate: Timestamp, Feature1Enabled: 
> Option[Boolean], Level: Option[Int], TimeZone: Option[Int])
> val userProfile = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyUserProfileMessage].schema).csv("userProfile.csv").as[MyUserProfileMessage]
> val language = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLanguageMessage].schema).csv("language.csv").as[MyLanguageMessage]
> val device = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceMessage].schema).csv("device.csv").as[MyDeviceMessage]
> val deviceClass = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceClassMessage].schema).csv("deviceClass.csv").as[MyDeviceClassMessage]
> val deviceType = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceTypeMessage].schema).csv("deviceType.csv").as[MyDeviceTypeMessage]
> val location1 = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLocation1].schema).csv("location1.csv").as[MyLocation1]
> val 

[jira] [Updated] (SPARK-47063) CAST long to timestamp has different behavior for codegen vs interpreted

2024-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47063:
---
Labels: pull-request-available  (was: )

> CAST long to timestamp has different behavior for codegen vs interpreted
> 
>
> Key: SPARK-47063
> URL: https://issues.apache.org/jira/browse/SPARK-47063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2
>Reporter: Robert Joseph Evans
>Priority: Major
>  Labels: pull-request-available
>
> It probably impacts a lot more versions of the code than this, but I verified 
> it on 3.4.2. This also appears to be related to 
> https://issues.apache.org/jira/browse/SPARK-39209
> {code:java}
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", 
> "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 
> 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as 
> ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-------------------+---------------+
> |v                   |ts                 |unix_micros(ts)|
> +--------------------+-------------------+---------------+
> |9223372036854775807 |1969-12-31 23:59:59|-1000000       |
> |-9223372036854775808|1970-01-01 00:00:00|0              |
> |0                   |1970-01-01 00:00:00|0              |
> |1990                |1970-01-01 00:33:10|1990000000     |
> +--------------------+-------------------+---------------+
> {code}
> It looks like InMemoryTableScanExec is not doing code generation for the 
> expressions, but the ProjectExec after the repartition is.
> If I disable code gen I get the same answer in both cases.
> {code:java}
> scala> spark.conf.set("spark.sql.codegen.wholeStage", false)
> scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN")
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", 
> "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 
> 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as 
> ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> {code}
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L1627]
> is the code used in codegen, but
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L687]
> is what is used outside of code gen.
> Apparently `SECONDS.toMicros` truncates the value on an overflow, but the 
> codegen does not.
> {code:java}
> scala> Long.MaxValue
> res11: Long = 9223372036854775807
> scala> java.util.concurrent.TimeUnit.SECONDS.toMicros(Long.MaxValue)
> res12: Long = 9223372036854775807
> scala> 

[jira] [Resolved] (SPARK-43256) Assign a name to the error class _LEGACY_ERROR_TEMP_2021

2024-02-27 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43256.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45198
[https://github.com/apache/spark/pull/45198]

> Assign a name to the error class _LEGACY_ERROR_TEMP_2021
> 
>
> Key: SPARK-43256
> URL: https://issues.apache.org/jira/browse/SPARK-43256
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Assignee: A G
>Priority: Minor
>  Labels: pull-request-available, starter
> Fix For: 4.0.0
>
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2021* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't 
> exist yet. Check the exception fields by using {*}checkError(){*}. The latter function 
> checks only the valuable error fields and avoids depending on the error text 
> message. In this way, tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate other 
> tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error, see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is not 
> clear. Propose a solution to users for how to avoid and fix such errors.
> Please, look at the PR below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]
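
As a rough illustration of the checkError() pattern described above, a hypothetical test sketch (the error class name, query, exception type, and parameter names are placeholders, not the actual ones for _LEGACY_ERROR_TEMP_2021; it assumes a suite that mixes in Spark's SQL test helpers):

{code:java}
// Hypothetical sketch only: SOME_ERROR_CLASS, the query, the exception type,
// and the parameters are placeholders for whatever the renamed error needs.
test("SOME_ERROR_CLASS: triggered from user code") {
  checkError(
    exception = intercept[org.apache.spark.SparkRuntimeException] {
      sql("SELECT some_expression_that_triggers_the_error").collect()
    },
    errorClass = "SOME_ERROR_CLASS",
    parameters = Map("objectName" -> "`some_object`"))
}
{code}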



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43256) Assign a name to the error class _LEGACY_ERROR_TEMP_2021

2024-02-27 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43256:


Assignee: A G

> Assign a name to the error class _LEGACY_ERROR_TEMP_2021
> 
>
> Key: SPARK-43256
> URL: https://issues.apache.org/jira/browse/SPARK-43256
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Assignee: A G
>Priority: Minor
>  Labels: pull-request-available, starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2021* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't 
> exist yet. Check the exception fields by using {*}checkError(){*}. The latter function 
> checks only the valuable error fields and avoids depending on the error text 
> message. In this way, tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate other 
> tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error, see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is not 
> clear. Propose a solution to users for how to avoid and fix such errors.
> Please, look at the PR below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47193) Converting dataframe to rdd results in data loss

2024-02-27 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821393#comment-17821393
 ] 

Bruce Robbins commented on SPARK-47193:
---

Running this in Spark 3.5.0 in local mode on my laptop, I get
{noformat}
df count = 8
...
rdd count = 8
{noformat}
What is your environment and Spark configuration?

By the way, the {{...}} above are messages like
{noformat}
24/02/27 11:34:51 WARN CSVHeaderChecker: CSV header does not conform to the 
schema.
 Header: UserId, LocationId, LocationName, CreatedDate, Status
 Schema: UserId, LocationId, LocationName, Status, CreatedDate
Expected: Status but found: CreatedDate
CSV file: file:userLocation.csv
{noformat}
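
If the header/schema column-order mismatch flagged above is relevant to your run (an assumption, not a confirmed cause), one hedged way to surface it is the CSV reader's enforceSchema option, reusing the MyUserLocationMessage case class and read options from the description below:

{code:java}
// Sketch only: assumes the MyUserLocationMessage case class from the issue
// description is in scope. With a user-supplied schema, CSV columns are mapped
// by position, so a header order that differs from the case class field order
// is only warned about by default. enforceSchema=false makes the read fail on
// such a mismatch instead of silently proceeding.
val userLocationChecked = spark.read
  .option("header", "true")
  .option("comment", "#")
  .option("nullValue", "null")
  .option("enforceSchema", "false")
  .schema(Encoders.product[MyUserLocationMessage].schema)
  .csv("userLocation.csv")
  .as[MyUserLocationMessage]
{code}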


> Converting dataframe to rdd results in data loss
> 
>
> Key: SPARK-47193
> URL: https://issues.apache.org/jira/browse/SPARK-47193
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Ivan Bova
>Priority: Critical
>  Labels: correctness
> Attachments: device.csv, deviceClass.csv, deviceType.csv, 
> language.csv, location.csv, location1.csv, timeZoneLookup.csv, user.csv, 
> userLocation.csv, userProfile.csv
>
>
> I have 10 CSV files and need to create a mapping from them. After all of the 
> joins, the dataframe contains all expected rows, but the rdd from this dataframe 
> contains only half of them.
> {code:java}
> case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: 
> String, LastName: String, LanguageId: Option[Int])
> case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String)
> case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], 
> UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: 
> Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: 
> Option[Int])
> case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String)
> case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String)
> case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: 
> Option[Double], Longitude: Option[Double], Radius: Option[Double], 
> CreatedDate: Timestamp)
> case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String)
> case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: 
> String, Status: Int, CreatedDate: Timestamp)
> case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: 
> Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp])
> case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], 
> Address1: String, Address2: String, City: String, State: String, Country: 
> String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus: 
> Option[Int], Location1Enabled: Option[Boolean], LocationKey: String, 
> UpdatedDateTime: Timestamp, CreatedDate: Timestamp, Feature1Enabled: 
> Option[Boolean], Level: Option[Int], TimeZone: Option[Int])
> val userProfile = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyUserProfileMessage].schema).csv("userProfile.csv").as[MyUserProfileMessage]
> val language = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLanguageMessage].schema).csv("language.csv").as[MyLanguageMessage]
> val device = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceMessage].schema).csv("device.csv").as[MyDeviceMessage]
> val deviceClass = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceClassMessage].schema).csv("deviceClass.csv").as[MyDeviceClassMessage]
> val deviceType = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceTypeMessage].schema).csv("deviceType.csv").as[MyDeviceTypeMessage]
> val location1 = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLocation1].schema).csv("location1.csv").as[MyLocation1]
> val timeZoneLookup = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyTimeZoneLookupMessage].schema).csv("timeZoneLookup.csv").as[MyTimeZoneLookupMessage]
> val userLocation = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyUserLocationMessage].schema).csv("userLocation.csv").as[MyUserLocationMessage]
> val user = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> 

[jira] [Commented] (SPARK-44389) ExecutorDeadException when using decommissioning without external shuffle service

2024-02-27 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821373#comment-17821373
 ] 

Dongjoon Hyun commented on SPARK-44389:
---

I guess this was fixed by reverting SPARK-43043 via SPARK-44630.

Could you try Apache Spark 3.4.2?



> ExecutorDeadException when using decommissioning without external shuffle 
> service
> -
>
> Key: SPARK-44389
> URL: https://issues.apache.org/jira/browse/SPARK-44389
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Volodymyr Kot
>Priority: Major
>
> Hey, we are trying to use executor decommissioning without an external shuffle 
> service. We are trying to understand:
>  # How often should we expect to see ExecutorDeadException? How is 
> information about changes to the location of blocks propagated?
>  # Whether the task should be re-submitted if we hit that during 
> decommissioning?
>  
> Current behavior that we observe:
>  # Executor 1 is decommissioned
>  # Driver successfully removes executor 1's block manager 
> [here|https://github.com/apache/spark/blob/87a5442f7ed96b11051d8a9333476d080054e5a0/core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala#L44]
>  # A task is started on executor 2
>  # We hit `ExecutorDeadException` on executor 2 when trying to fetch blocks 
> from executor 1 
> [here|https://github.com/apache/spark/blob/87a5442f7ed96b11051d8a9333476d080054e5a0/core/src/main/scala/org/apache/spark/network/netty/NettyBlockTransferService.scala#L139-L140]
>  # Task on executor 2 fails
>  # Stage fails
>  # Stage is re-submitted and succeeds
> As far as we understand, this happens because executor 2 has stale [map 
> status 
> cache|https://github.com/apache/spark/blob/87a5442f7ed96b11051d8a9333476d080054e5a0/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L1235-L1236]
> Is that expected behavior? Shouldn't the task be retried in that case instead 
> of the whole stage failing and being retried? This makes Spark job execution 
> longer, especially if there are a lot of decommission events.
>  
> Also found [this 
> comment|https://github.palantir.build/foundry/spark/blob/aad028ae02011b079e8812f7e63869323cc1ed78/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L113-L115],
>  which makes sense for FetchFailures w/o decommissioning, but with 
> decommissioning data could have been migrated - and we need to fetch a new 
> location. Maybe it makes sense to special case this codepath to check whether 
> executor was decommissioned? Since 
> https://issues.apache.org/jira/browse/SPARK-40979 we already store that 
> information.
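
For reference, a minimal sketch of the kind of decommissioning setup described above (block migration enabled, no external shuffle service); the values are illustrative, not a recommendation:

{code:java}
// Illustrative configuration sketch only; values are not a recommendation.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.decommission.enabled", "true")
  .set("spark.storage.decommission.enabled", "true")
  .set("spark.storage.decommission.shuffleBlocks.enabled", "true") // migrate shuffle blocks
  .set("spark.storage.decommission.rddBlocks.enabled", "true")     // migrate cached RDD blocks
  .set("spark.shuffle.service.enabled", "false")                   // no external shuffle service
{code}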



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44478) Executor decommission causes stage failure

2024-02-27 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821370#comment-17821370
 ] 

Dongjoon Hyun edited comment on SPARK-44478 at 2/27/24 6:51 PM:


Do you still see this issue with Apache Spark 3.4.2 or 3.5.1, [~dhuett]?


was (Author: dongjoon):
Do you still see this issue with Apache Spark 3.4.2, [~dhuett]?

> Executor decommission causes stage failure
> --
>
> Key: SPARK-44478
> URL: https://issues.apache.org/jira/browse/SPARK-44478
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.4.0, 3.4.1
>Reporter: Dale Huettenmoser
>Priority: Minor
>
> During Spark execution, the save fails due to executor decommissioning. The issue is 
> not present in 3.3.0.
> Sample error:
>  
> {code:java}
> An error occurred while calling o8948.save.
> : org.apache.spark.SparkException: Job aborted due to stage failure: 
> Authorized committer (attemptNumber=0, stage=170, partition=233) failed; but 
> task commit success, data duplication may happen. 
> reason=ExecutorLostFailure(1,false,Some(Executor decommission: Executor 1 is 
> decommissioned.))
>     at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720)
>     at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>     at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>     at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleStageFailed$1(DAGScheduler.scala:1199)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleStageFailed$1$adapted(DAGScheduler.scala:1199)
>     at scala.Option.foreach(Option.scala:407)
>     at 
> org.apache.spark.scheduler.DAGScheduler.handleStageFailed(DAGScheduler.scala:1199)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2981)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>     at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:971)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2263)
>     at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeWrite$4(FileFormatWriter.scala:307)
>     at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.writeAndCommit(FileFormatWriter.scala:271)
>     at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeWrite(FileFormatWriter.scala:304)
>     at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190)
>     at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:190)
>     at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
>     at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
>     at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
>     at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
>     at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
>     at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
>     at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
>     at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
>     at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
>     at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
>     at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
>     at 
> 

[jira] [Commented] (SPARK-44478) Executor decommission causes stage failure

2024-02-27 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821370#comment-17821370
 ] 

Dongjoon Hyun commented on SPARK-44478:
---

Do you still see this issue with Apache Spark 3.4.2, [~dhuett]?

> Executor decommission causes stage failure
> --
>
> Key: SPARK-44478
> URL: https://issues.apache.org/jira/browse/SPARK-44478
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.4.0, 3.4.1
>Reporter: Dale Huettenmoser
>Priority: Minor
>
> During Spark execution, the save fails due to executor decommissioning. The issue is 
> not present in 3.3.0.
> Sample error:
>  
> {code:java}
> An error occurred while calling o8948.save.
> : org.apache.spark.SparkException: Job aborted due to stage failure: 
> Authorized committer (attemptNumber=0, stage=170, partition=233) failed; but 
> task commit success, data duplication may happen. 
> reason=ExecutorLostFailure(1,false,Some(Executor decommission: Executor 1 is 
> decommissioned.))
>     at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720)
>     at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>     at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>     at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleStageFailed$1(DAGScheduler.scala:1199)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleStageFailed$1$adapted(DAGScheduler.scala:1199)
>     at scala.Option.foreach(Option.scala:407)
>     at 
> org.apache.spark.scheduler.DAGScheduler.handleStageFailed(DAGScheduler.scala:1199)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2981)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>     at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:971)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2263)
>     at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeWrite$4(FileFormatWriter.scala:307)
>     at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.writeAndCommit(FileFormatWriter.scala:271)
>     at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeWrite(FileFormatWriter.scala:304)
>     at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190)
>     at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:190)
>     at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
>     at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
>     at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
>     at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
>     at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
>     at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
>     at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
>     at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
>     at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
>     at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
>     at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
>     at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
>     at 
> 

[jira] [Commented] (SPARK-47194) Upgrade log4j2 to 2.23.0

2024-02-27 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821367#comment-17821367
 ] 

Dongjoon Hyun commented on SPARK-47194:
---

Hi, [~LuciferYang]. Thank you for adding it here.

SPARK-47046 is inclusive of all dependency upgrades. Feel free to add it here.

This JIRA is an umbrella so that we don't miss any *notable* dependency changes.

> Upgrade log4j2 to 2.23.0
> 
>
> Key: SPARK-47194
> URL: https://issues.apache.org/jira/browse/SPARK-47194
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-47033) EXECUTE IMMEDIATE USING does not recognize session variable names

2024-02-27 Thread A G (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820680#comment-17820680
 ] 

A G edited comment on SPARK-47033 at 2/27/24 6:08 PM:
--

I want to work on this! PR: https://github.com/apache/spark/pull/45293


was (Author: JIRAUSER304341):
I want to work on this!

> EXECUTE IMMEDIATE USING does not recognize session variable names
> -
>
> Key: SPARK-47033
> URL: https://issues.apache.org/jira/browse/SPARK-47033
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Serge Rielau
>Priority: Major
>  Labels: pull-request-available
>
> {noformat}
> DECLARE parm = 'Hello';
> EXECUTE IMMEDIATE 'SELECT :parm' USING parm;
> [ALL_PARAMETERS_MUST_BE_NAMED] Using name parameterized queries requires all 
> parameters to be named. Parameters missing names: "parm". SQLSTATE: 07001
> EXECUTE IMMEDIATE 'SELECT :parm' USING parm AS parm;
> Hello
> {noformat}
> Variables are like column references: they act as their own aliases and thus 
> should not be required to be named in order to associate with a named parameter of 
> the same name.
> Note that unlike for PySpark this should be case insensitive (haven't 
> verified).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47033) EXECUTE IMMEDIATE USING does not recognize session variable names

2024-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47033:
---
Labels: pull-request-available  (was: )

> EXECUTE IMMEDIATE USING does not recognize session variable names
> -
>
> Key: SPARK-47033
> URL: https://issues.apache.org/jira/browse/SPARK-47033
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Serge Rielau
>Priority: Major
>  Labels: pull-request-available
>
> {noformat}
> DECLARE parm = 'Hello';
> EXECUTE IMMEDIATE 'SELECT :parm' USING parm;
> [ALL_PARAMETERS_MUST_BE_NAMED] Using name parameterized queries requires all 
> parameters to be named. Parameters missing names: "parm". SQLSTATE: 07001
> EXECUTE IMMEDIATE 'SELECT :parm' USING parm AS parm;
> Hello
> {noformat}
> Variables are like column references: they act as their own aliases and thus 
> should not be required to be named in order to associate with a named parameter of 
> the same name.
> Note that unlike for PySpark this should be case insensitive (haven't 
> verified).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47194) Upgrade log4j2 to 2.23.0

2024-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47194:
---
Labels: pull-request-available  (was: )

> Upgrade log4j2 to 2.23.0
> 
>
> Key: SPARK-47194
> URL: https://issues.apache.org/jira/browse/SPARK-47194
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47194) Upgrade log4j2 to 2.23.0

2024-02-27 Thread Yang Jie (Jira)
Yang Jie created SPARK-47194:


 Summary: Upgrade log4j2 to 2.23.0
 Key: SPARK-47194
 URL: https://issues.apache.org/jira/browse/SPARK-47194
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 4.0.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47192) Convert _LEGACY_ERROR_TEMP_0035 (unsupported hive feature)

2024-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47192:
---
Labels: pull-request-available  (was: )

> Convert _LEGACY_ERROR_TEMP_0035 (unsupported hive feature)
> --
>
> Key: SPARK-47192
> URL: https://issues.apache.org/jira/browse/SPARK-47192
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Serge Rielau
>Priority: Major
>  Labels: pull-request-available
>
> Old:
> > GRANT ROLE;
> _LEGACY_ERROR_TEMP_0035
> Operation not allowed: grant role. (line 1, pos 0)
>  
> New: 
> error class: HIVE_OPERATION_NOT_SUPPORTED
> The Hive operation  is not supported. (line 1, pos 0)
>  
> sqlstate: 0A000



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46834) Aggregate support for strings with collation

2024-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46834:
---
Labels: pull-request-available  (was: )

> Aggregate support for strings with collation
> 
>
> Key: SPARK-46834
> URL: https://issues.apache.org/jira/browse/SPARK-46834
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47193) Converting dataframe to rdd results in data loss

2024-02-27 Thread Ivan Bova (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bova updated SPARK-47193:
--
Attachment: device.csv
deviceClass.csv
deviceType.csv
language.csv
location.csv
location1.csv
timeZoneLookup.csv
user.csv
userLocation.csv
userProfile.csv

> Converting dataframe to rdd results in data loss
> 
>
> Key: SPARK-47193
> URL: https://issues.apache.org/jira/browse/SPARK-47193
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Ivan Bova
>Priority: Critical
>  Labels: correctness
> Attachments: device.csv, deviceClass.csv, deviceType.csv, 
> language.csv, location.csv, location1.csv, timeZoneLookup.csv, user.csv, 
> userLocation.csv, userProfile.csv
>
>
> I have 10 CSV files and need to create a mapping from them. After all of the 
> joins, the dataframe contains all expected rows, but the rdd from this dataframe 
> contains only half of them.
> {code:java}
> case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: 
> String, LastName: String, LanguageId: Option[Int])
> case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String)
> case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], 
> UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: 
> Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: 
> Option[Int])
> case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String)
> case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String)
> case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: 
> Option[Double], Longitude: Option[Double], Radius: Option[Double], 
> CreatedDate: Timestamp)
> case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String)
> case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: 
> String, Status: Int, CreatedDate: Timestamp)
> case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: 
> Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp])
> case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], 
> Address1: String, Address2: String, City: String, State: String, Country: 
> String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus: 
> Option[Int], Location1Enabled: Option[Boolean], LocationKey: String, 
> UpdatedDateTime: Timestamp, CreatedDate: Timestamp, Feature1Enabled: 
> Option[Boolean], Level: Option[Int], TimeZone: Option[Int])
> val userProfile = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyUserProfileMessage].schema).csv("userProfile.csv").as[MyUserProfileMessage]
> val language = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLanguageMessage].schema).csv("language.csv").as[MyLanguageMessage]
> val device = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceMessage].schema).csv("device.csv").as[MyDeviceMessage]
> val deviceClass = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceClassMessage].schema).csv("deviceClass.csv").as[MyDeviceClassMessage]
> val deviceType = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceTypeMessage].schema).csv("deviceType.csv").as[MyDeviceTypeMessage]
> val location1 = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLocation1].schema).csv("location1.csv").as[MyLocation1]
> val timeZoneLookup = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyTimeZoneLookupMessage].schema).csv("timeZoneLookup.csv").as[MyTimeZoneLookupMessage]
> val userLocation = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyUserLocationMessage].schema).csv("userLocation.csv").as[MyUserLocationMessage]
> val user = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyUserMessage].schema).csv("user.csv").as[MyUserMessage]
> val location = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLocationMessage].schema).csv("location.csv").as[MyLocationMessage]
> val result = user
>   .join(userProfile, 

[jira] [Created] (SPARK-47193) Converting dataframe to rdd results in data loss

2024-02-27 Thread Ivan Bova (Jira)
Ivan Bova created SPARK-47193:
-

 Summary: Converting dataframe to rdd results in data loss
 Key: SPARK-47193
 URL: https://issues.apache.org/jira/browse/SPARK-47193
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.5.1, 3.5.0
Reporter: Ivan Bova


I have 10 CSV files and need to create a mapping from them. After all of the 
joins, the dataframe contains all expected rows, but the rdd from this dataframe contains 
only half of them.
{code:java}
case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: String, 
LastName: String, LanguageId: Option[Int])
case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String)
case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], 
UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: 
Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: 
Option[Int])
case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String)
case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String)
case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: 
Option[Double], Longitude: Option[Double], Radius: Option[Double], CreatedDate: 
Timestamp)
case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String)
case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: 
String, Status: Int, CreatedDate: Timestamp)
case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: 
Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp])
case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], 
Address1: String, Address2: String, City: String, State: String, Country: 
String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus: 
Option[Int], Location1Enabled: Option[Boolean], LocationKey: String, 
UpdatedDateTime: Timestamp, CreatedDate: Timestamp, Feature1Enabled: 
Option[Boolean], Level: Option[Int], TimeZone: Option[Int])

val userProfile = spark.read.option("header", "true").option("comment", 
"#").option("nullValue", 
"null").schema(Encoders.product[MyUserProfileMessage].schema).csv("userProfile.csv").as[MyUserProfileMessage]
val language = spark.read.option("header", "true").option("comment", 
"#").option("nullValue", 
"null").schema(Encoders.product[MyLanguageMessage].schema).csv("language.csv").as[MyLanguageMessage]
val device = spark.read.option("header", "true").option("comment", 
"#").option("nullValue", 
"null").schema(Encoders.product[MyDeviceMessage].schema).csv("device.csv").as[MyDeviceMessage]
val deviceClass = spark.read.option("header", "true").option("comment", 
"#").option("nullValue", 
"null").schema(Encoders.product[MyDeviceClassMessage].schema).csv("deviceClass.csv").as[MyDeviceClassMessage]
val deviceType = spark.read.option("header", "true").option("comment", 
"#").option("nullValue", 
"null").schema(Encoders.product[MyDeviceTypeMessage].schema).csv("deviceType.csv").as[MyDeviceTypeMessage]
val location1 = spark.read.option("header", "true").option("comment", 
"#").option("nullValue", 
"null").schema(Encoders.product[MyLocation1].schema).csv("location1.csv").as[MyLocation1]
val timeZoneLookup = spark.read.option("header", "true").option("comment", 
"#").option("nullValue", 
"null").schema(Encoders.product[MyTimeZoneLookupMessage].schema).csv("timeZoneLookup.csv").as[MyTimeZoneLookupMessage]
val userLocation = spark.read.option("header", "true").option("comment", 
"#").option("nullValue", 
"null").schema(Encoders.product[MyUserLocationMessage].schema).csv("userLocation.csv").as[MyUserLocationMessage]
val user = spark.read.option("header", "true").option("comment", 
"#").option("nullValue", 
"null").schema(Encoders.product[MyUserMessage].schema).csv("user.csv").as[MyUserMessage]
val location = spark.read.option("header", "true").option("comment", 
"#").option("nullValue", 
"null").schema(Encoders.product[MyLocationMessage].schema).csv("location.csv").as[MyLocationMessage]


val result = user
  .join(userProfile, user("UserId") === userProfile("UserId"), "inner")
  .join(language, userProfile("LanguageId") === language("LanguageId"), "left")
  .join(userLocation, user("UserId") === userLocation("UserId"), "inner")
  .join(location, userLocation("LocationId") === location("LocationId"), 
"inner")
  .join(device, location("LocationId") === device("LocationId"), "inner")
  .join(deviceType, device("DeviceTypeId") === deviceType("DeviceTypeId"), 
"inner")
  .join(
deviceClass,
device("DeviceClassId") === deviceClass("DeviceClassId"),
"inner")
  .join(
timeZoneLookup,
timeZoneLookup("TimeZoneId") === location("TimeZone"),
"left")
  .join(location1, location("LocationId") === location1("LocationId"), "left")
  .where(
device("UserId1").isNull
  && (user("Active") === lit(true) || user("ActivatedDate").isNotNull)
  )
  

[jira] [Created] (SPARK-47192) Convert _LEGACY_ERROR_TEMP_0035 (unsupported hive feature)

2024-02-27 Thread Serge Rielau (Jira)
Serge Rielau created SPARK-47192:


 Summary: Convert _LEGACY_ERROR_TEMP_0035 (unsupported hive feature)
 Key: SPARK-47192
 URL: https://issues.apache.org/jira/browse/SPARK-47192
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Serge Rielau


Old:



> GRANT ROLE;

_LEGACY_ERROR_TEMP_0035

Operation not allowed: grant role. (line 1, pos 0)

 

New: 

error class: HIVE_OPERATION_NOT_SUPPORTED

The Hive operation  is not supported. (line 1, pos 0)
 
sqlstate: 0A000



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47185) Increase timeout between actions in KafkaContinuousTest

2024-02-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47185:
-

Assignee: Hyukjin Kwon

> Increase timeout between actions in KafkaContinuousTest
> ---
>
> Key: SPARK-47185
> URL: https://issues.apache.org/jira/browse/SPARK-47185
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
>
> It fails in the macOS build.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47185) Increase timeout between actions in KafkaContinuousTest

2024-02-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47185.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45283
[https://github.com/apache/spark/pull/45283]

> Increase timeout between actions in KafkaContinuousTest
> ---
>
> Key: SPARK-47185
> URL: https://issues.apache.org/jira/browse/SPARK-47185
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> It fails in the macOS build.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47191) avoid unnecessary relation lookup when uncaching table/view

2024-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47191:
---
Labels: pull-request-available  (was: )

> avoid unnecessary relation lookup when uncaching table/view
> ---
>
> Key: SPARK-47191
> URL: https://issues.apache.org/jira/browse/SPARK-47191
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47191) avoid unnecessary relation lookup when uncaching table/view

2024-02-27 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-47191:
---

 Summary: avoid unnecessary relation lookup when uncaching 
table/view
 Key: SPARK-47191
 URL: https://issues.apache.org/jira/browse/SPARK-47191
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47190) Add support for checkpointing to Spark Connect

2024-02-27 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821286#comment-17821286
 ] 

Nicholas Chammas commented on SPARK-47190:
--

[~gurwls223] - Is there some design reason we do _not_ want to support 
checkpointing in Spark Connect? Or is it just a matter of someone taking the 
time to implement support?

If the latter, do we do so via a new method directly on {{SparkSession}}, or 
shall we somehow expose a limited version of {{spark.sparkContext}} so users 
can call the existing {{setCheckpointDir()}} method?

> Add support for checkpointing to Spark Connect
> --
>
> Key: SPARK-47190
> URL: https://issues.apache.org/jira/browse/SPARK-47190
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The {{sparkContext}} that underlies a given {{SparkSession}} is not 
> accessible over Spark Connect. This means you cannot call 
> {{spark.sparkContext.setCheckpointDir(...)}}, which in turn means you cannot 
> checkpoint a DataFrame.
> We should add support for this somehow to Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47190) Add support for checkpointing to Spark Connect

2024-02-27 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47190:


 Summary: Add support for checkpointing to Spark Connect
 Key: SPARK-47190
 URL: https://issues.apache.org/jira/browse/SPARK-47190
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Nicholas Chammas


The {{sparkContext}} that underlies a given {{SparkSession}} is not accessible 
over Spark Connect. This means you cannot call 
{{spark.sparkContext.setCheckpointDir(...)}}, which in turn means you cannot 
checkpoint a DataFrame.

We should add support for this somehow to Spark Connect.
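
For context, a minimal sketch of the classic (non-Connect) usage that is blocked here; the checkpoint directory is a placeholder path:

{code:java}
// Works with a classic SparkSession but not over Spark Connect, because
// spark.sparkContext is not reachable from a Connect client.
// The checkpoint directory is a placeholder path.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

val df = spark.range(10).toDF("id")
val checkpointed = df.checkpoint()  // relies on the checkpoint dir set above
{code}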



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47189) Tweak column error names and text

2024-02-27 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-47189:


Assignee: Nicholas Chammas

> Tweak column error names and text
> -
>
> Key: SPARK-47189
> URL: https://issues.apache.org/jira/browse/SPARK-47189
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47189) Tweak column error names and text

2024-02-27 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-47189.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45276
[https://github.com/apache/spark/pull/45276]

> Tweak column error names and text
> -
>
> Key: SPARK-47189
> URL: https://issues.apache.org/jira/browse/SPARK-47189
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45527) Task fraction resource request is not expected

2024-02-27 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821279#comment-17821279
 ] 

Thomas Graves commented on SPARK-45527:
---

Note that this is related to SPARK-39853, which was supposed to implement stage-level 
scheduling with dynamic allocation disabled. That PR did not properly 
handle resources (GPU, FPGA, etc.).

> Task fraction resource request is not expected
> --
>
> Key: SPARK-45527
> URL: https://issues.apache.org/jira/browse/SPARK-45527
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1, 3.3.3, 3.4.1, 3.5.0
>Reporter: wuyi
>Assignee: Bobby Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
>  
> {code:java}
> test("SPARK-XXX") {
>   import org.apache.spark.resource.{ResourceProfileBuilder, 
> TaskResourceRequests}
>   withTempDir { dir =>
> val scriptPath = createTempScriptWithExpectedOutput(dir, 
> "gpuDiscoveryScript",
>   """{"name": "gpu","addresses":["0"]}""")
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local-cluster[1, 12, 1024]")
>   .set("spark.executor.cores", "12")
> conf.set(TASK_GPU_ID.amountConf, "0.08")
> conf.set(WORKER_GPU_ID.amountConf, "1")
> conf.set(WORKER_GPU_ID.discoveryScriptConf, scriptPath)
> conf.set(EXECUTOR_GPU_ID.amountConf, "1")
> sc = new SparkContext(conf)
> val rdd = sc.range(0, 100, 1, 4)
> var rdd1 = rdd.repartition(3)
> val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 1.0)
> val rp = new ResourceProfileBuilder().require(treqs).build
> rdd1 = rdd1.withResources(rp)
> assert(rdd1.collect().size === 100)
>   }
> } {code}
> In the above test, the 3 tasks generated by rdd1 are expected to be executed 
> in sequence, since we expect "new TaskResourceRequests().cpus(1).resource("gpu", 
> 1.0)" to override "conf.set(TASK_GPU_ID.amountConf, "0.08")". However, 
> those 3 tasks in fact run in parallel.
> The root cause is that ExecutorData#ExecutorResourceInfo#numParts is static. 
> In this case, "gpu.numParts" is initialized to 12 (1/0.08) and won't 
> change even if there's a new task resource request (e.g., resource("gpu", 
> 1.0) in this case). Thus, those 3 tasks are able to execute in parallel.
>  
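
Roughly, the arithmetic behind the observed behavior (an illustrative sketch only; the names below are not Spark internals):

{code:java}
// Illustrative arithmetic only; the names below are not Spark internals.
val defaultTaskGpuAmount = 0.08
val partsPerGpuAtInit    = (1.0 / defaultTaskGpuAmount).toInt  // 12 slots cached at executor registration
val newProfileGpuAmount  = 1.0
val expectedParts        = (1.0 / newProfileGpuAmount).toInt   // 1 slot, so the 3 tasks should run one at a time
// Bug described above: the cached value (12) is reused for the new resource
// profile, so all 3 tasks still get scheduled on the single GPU in parallel.
{code}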



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47063) CAST long to timestamp has different behavior for codegen vs interpreted

2024-02-27 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821273#comment-17821273
 ] 

Pablo Langa Blanco commented on SPARK-47063:


OK, personally I would prefer the truncation too. I'll make a PR with it and we 
can discuss it there.
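
As a rough illustration of the truncation-vs-overflow difference (plain JVM arithmetic, not necessarily the exact expressions Cast generates):

{code:java}
import java.util.concurrent.TimeUnit

val v = Long.MaxValue
// Interpreted path: SECONDS.toMicros saturates at Long.MaxValue on overflow.
val interpreted = TimeUnit.SECONDS.toMicros(v)  // 9223372036854775807
// Codegen path effectively multiplies by 1,000,000, which overflows and wraps.
val codegenLike = v * 1000000L                  // -1000000
{code}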

> CAST long to timestamp has different behavior for codegen vs interpreted
> 
>
> Key: SPARK-47063
> URL: https://issues.apache.org/jira/browse/SPARK-47063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2
>Reporter: Robert Joseph Evans
>Priority: Major
>
> It probably impacts a lot more versions of the code than this, but I verified 
> it on 3.4.2. This also appears to be related to 
> https://issues.apache.org/jira/browse/SPARK-39209
> {code:java}
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", 
> "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 
> 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as 
> ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-------------------+---------------+
> |v                   |ts                 |unix_micros(ts)|
> +--------------------+-------------------+---------------+
> |9223372036854775807 |1969-12-31 23:59:59|-1000000       |
> |-9223372036854775808|1970-01-01 00:00:00|0              |
> |0                   |1970-01-01 00:00:00|0              |
> |1990                |1970-01-01 00:33:10|1990000000     |
> +--------------------+-------------------+---------------+
> {code}
> It looks like InMemoryTableScanExec is not doing code generation for the 
> expressions, but the ProjectExec after the repartition is.
> If I disable code gen I get the same answer in both cases.
> {code:java}
> scala> spark.conf.set("spark.sql.codegen.wholeStage", false)
> scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN")
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", 
> "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 
> 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as 
> ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> {code}
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L1627]
> is the code used in codegen, but
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L687]
> is what is used outside of codegen.
> Apparently `SECONDS.toMicros` truncates (saturates) the value on overflow, but 
> the generated code does not.
> {code:java}
> scala> Long.MaxValue
> res11: Long = 9223372036854775807
> scala> 

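To illustrate the difference described in the report, a small self-contained sketch (plain Scala, outside Spark) comparing java.util.concurrent.TimeUnit's saturating conversion with a raw multiplication, which is roughly what the generated cast amounts to according to the report:

{code:java}
import java.util.concurrent.TimeUnit.SECONDS

// Interpreted path: TimeUnit saturates to Long.MaxValue instead of overflowing.
val interpreted = SECONDS.toMicros(Long.MaxValue)   // 9223372036854775807

// Codegen path (per the report): a plain multiplication silently wraps around.
val generated = Long.MaxValue * 1000000L            // -1000000

println(s"interpreted=$interpreted generated=$generated")
{code}

The wrapped value -1000000 corresponds exactly to the 1969-12-31 23:59:59 timestamp shown in the second table above.
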
[jira] [Updated] (SPARK-47189) Tweak column error names and text

2024-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47189:
---
Labels: pull-request-available  (was: )

> Tweak column error names and text
> -
>
> Key: SPARK-47189
> URL: https://issues.apache.org/jira/browse/SPARK-47189
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47189) Tweak column error names and text

2024-02-27 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47189:


 Summary: Tweak column error names and text
 Key: SPARK-47189
 URL: https://issues.apache.org/jira/browse/SPARK-47189
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47145) Provide table identifier to scan node when DS v2 strategy is applied

2024-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47145:
---
Labels: pull-request-available  (was: )

> Provide table identifier to scan node when DS v2 strategy is applied
> 
>
> Key: SPARK-47145
> URL: https://issues.apache.org/jira/browse/SPARK-47145
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Uros Stankovic
>Priority: Minor
>  Labels: pull-request-available
>
> Currently, the DataSourceScanExec node can accept a table identifier, and that 
> information can be useful for later logging, debugging, etc., but 
> DataSourceV2Strategy does not provide it to the scan node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47179) Improve error message from SparkThrowableSuite for better debuggability

2024-02-27 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-47179.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45273
[https://github.com/apache/spark/pull/45273]

> Improve error message from SparkThrowableSuite for better debuggability
> ---
>
> Key: SPARK-47179
> URL: https://issues.apache.org/jira/browse/SPARK-47179
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> The current error message is not very helpful when the error class 
> documentation is not up to date, so we should improve it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47179) Improve error message from SparkThrowableSuite for better debuggability

2024-02-27 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-47179:


Assignee: Haejoon Lee

> Improve error message from SparkThrowableSuite for better debuggability
> ---
>
> Key: SPARK-47179
> URL: https://issues.apache.org/jira/browse/SPARK-47179
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> The current error message is not very helpful when the error class 
> documentation is not up to date, so we should improve it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


