[jira] [Resolved] (SPARK-46809) Check error message parameter properly

2024-02-21 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee resolved SPARK-46809.
-
Resolution: Not A Bug

Seems to be working fine

> Check error message parameter properly
> --
>
> Key: SPARK-46809
> URL: https://issues.apache.org/jira/browse/SPARK-46809
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>
> If an error message parameter from the template is missing in actual usage, or 
> its name differs, an exception should be raised, but currently it is not. We 
> should handle this so the check works properly.
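
Below is a minimal sketch, in Scala and purely for illustration, of the kind of 
check described above: extract the placeholders from a message template and fail 
fast if the supplied parameters do not match. The `<name>` placeholder syntax and 
all identifiers here are assumptions, not the actual PySpark error framework.
{code:java}
// Illustrative sketch only; placeholder syntax (<name>) and names are assumed.
object MessageParamCheck {
  private val Placeholder = "<([A-Za-z0-9_]+)>".r

  /** Throws if the supplied parameters do not exactly match the template placeholders. */
  def validate(template: String, params: Map[String, String]): Unit = {
    val expected = Placeholder.findAllMatchIn(template).map(_.group(1)).toSet
    val missing = expected -- params.keySet
    val unknown = params.keySet -- expected
    if (missing.nonEmpty || unknown.nonEmpty) {
      throw new IllegalArgumentException(
        s"Message parameters do not match template: missing=$missing, unknown=$unknown")
    }
  }
}

// validate("Column <colName> not found", Map("colName" -> "age"))    // passes
// validate("Column <colName> not found", Map("columnName" -> "age")) // raises
{code}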






[jira] [Resolved] (SPARK-47127) Update `SKIP_SPARK_RELEASE_VERSIONS` in Maven CIs

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47127.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45212
[https://github.com/apache/spark/pull/45212]

> Update `SKIP_SPARK_RELEASE_VERSIONS` in Maven CIs
> -
>
> Key: SPARK-47127
> URL: https://issues.apache.org/jira/browse/SPARK-47127
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We need to skip the newly released Apache Spark 3.5.1 and drop 3.3.4, which has been removed.






[jira] [Updated] (SPARK-44914) Upgrade Ivy to 2.5.2

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44914:
--
Summary: Upgrade Ivy to 2.5.2  (was: Upgrade Apache Ivy to 2.5.2)

> Upgrade Ivy to 2.5.2
> 
>
> Key: SPARK-44914
> URL: https://issues.apache.org/jira/browse/SPARK-44914
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Bjørn Jørgensen
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> [CVE-2022-46751|https://www.cve.org/CVERecord?id=CVE-2022-46751]






[jira] [Updated] (SPARK-44914) Upgrade Apache Ivy to 2.5.2

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44914:
--
Summary: Upgrade Apache Ivy to 2.5.2  (was: Upgrade Apache ivy  to 2.5.2)

> Upgrade Apache Ivy to 2.5.2
> ---
>
> Key: SPARK-44914
> URL: https://issues.apache.org/jira/browse/SPARK-44914
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Bjørn Jørgensen
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> [CVE-2022-46751|https://www.cve.org/CVERecord?id=CVE-2022-46751]






[jira] [Created] (SPARK-47126) Re-enable Spark 3.4 test in HiveExternalCatalogVersionsSuite

2024-02-21 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47126:
-

 Summary: Re-enable Spark 3.4 test in 
HiveExternalCatalogVersionsSuite
 Key: SPARK-47126
 URL: https://issues.apache.org/jira/browse/SPARK-47126
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Tests
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun


HiveExternalCatalogVersionsSuite requires SPARK-46400






[jira] [Updated] (SPARK-47125) Return null if Univocity never triggers parsing

2024-02-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-47125:
-
Fix Version/s: 3.5.2
   (was: 3.5.3)

> Return null if Univocity never triggers parsing
> ---
>
> Key: SPARK-47125
> URL: https://issues.apache.org/jira/browse/SPARK-47125
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 4.0.0, 3.5.1
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.3
>
>
> See the linked PR






[jira] [Updated] (SPARK-47125) Return null if Univocity never triggers parsing

2024-02-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-47125:
-
Fix Version/s: 3.5.3
   3.4.3

> Return null if Univocity never triggers parsing
> ---
>
> Key: SPARK-47125
> URL: https://issues.apache.org/jira/browse/SPARK-47125
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 4.0.0, 3.5.1
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.4.3, 3.5.3
>
>
> See the linked PR






[jira] [Updated] (SPARK-47117) Format the whole code base

2024-02-21 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-47117:
-
Priority: Trivial  (was: Major)

> Format the whole code base
> --
>
> Key: SPARK-47117
> URL: https://issues.apache.org/jira/browse/SPARK-47117
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Jie Han
>Priority: Trivial
>
> I tried scalafmt and found that it produces a very large diff :(. Do we need to 
> format the whole code base in a PR?






[jira] [Assigned] (SPARK-47125) Return null if Univocity never triggers parsing

2024-02-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47125:


Assignee: Hyukjin Kwon

> Return null if Univocity never triggers parsing
> ---
>
> Key: SPARK-47125
> URL: https://issues.apache.org/jira/browse/SPARK-47125
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 4.0.0, 3.5.1
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
>
> See the linked PR






[jira] [Resolved] (SPARK-47125) Return null if Univocity never triggers parsing

2024-02-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47125.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45210
[https://github.com/apache/spark/pull/45210]

> Return null if Univocity never triggers parsing
> ---
>
> Key: SPARK-47125
> URL: https://issues.apache.org/jira/browse/SPARK-47125
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 4.0.0, 3.5.1
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> See the linked PR






[jira] [Updated] (SPARK-34631) Caught Hive MetaException when query by partition (partition col start with ‘$’)

2024-02-21 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34631:
-
Description: 
Create a table, set the location format as parquet, and run msck repair table to 
load the data.

But when querying with the partition column, some errors occur (adding backticks 
does not help):
{code:java}
select count from some_table where `$partition_date` = '2015-01-01'
{code}

 
{panel:title=error:}
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
 at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:679)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:677)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:677)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214)
 at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254)
 at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:962)
 at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:174)
 at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:166)
 at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:192)
 at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:192)
 at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2470)
 at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:191)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
 at 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
 at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
 at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
 at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
 at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
 at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
 at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
 at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
 at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
 at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
 at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
 at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
 at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
 at org.apache.spark.sql.Dataset.show(Dataset.scala:751)
 at org.apache.spark.sql.Dataset.show(Dataset.scala:710)
 at org.apache.spark.sql.Dataset.show(Dataset.scala:719)
 ... 49 elided
Caused by: java.lang.reflect.InvocationTargetException: 
org.apache.hadoop.hive.metastore.api.MetaException: Error parsing partition 
filter : line 1:0 no 
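
The error above already names a workaround; the sketch below just shows how one 
might apply it when building the session (configuration key taken verbatim from 
the message; the table and query are illustrative). Note the message itself warns 
that disabling file-source partition management degrades performance.
{code:java}
// Workaround suggested by the error message above: let Spark stop pushing the
// partition filter down to the Hive metastore. This sacrifices metastore-side
// partition pruning, so expect slower planning on heavily partitioned tables.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-filter-workaround")          // illustrative name
  .enableHiveSupport()
  .config("spark.sql.hive.manageFilesourcePartitions", "false")
  .getOrCreate()

// Illustrative query; `$partition_date` stands for the problematic column name.
spark.sql("select count(*) from some_table where `$partition_date` = '2015-01-01'").show()
{code}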

[jira] [Resolved] (SPARK-47101) HiveExternalCatalog.verifyDataSchema does not fully comply with hive column name rules

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47101.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45180
[https://github.com/apache/spark/pull/45180]

> HiveExternalCatalog.verifyDataSchema does not fully comply with hive column 
> name rules
> --
>
> Key: SPARK-47101
> URL: https://issues.apache.org/jira/browse/SPARK-47101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-47101) HiveExternalCatalog.verifyDataSchema does not fully comply with hive column name rules

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47101:
-

Assignee: Kent Yao

> HiveExternalCatalog.verifyDataSchema does not fully comply with hive column 
> name rules
> --
>
> Key: SPARK-47101
> URL: https://issues.apache.org/jira/browse/SPARK-47101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-47036) RocksDB versionID Mismatch in SST files with Compaction

2024-02-21 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-47036:
-
Fix Version/s: 3.5.2

> RocksDB versionID Mismatch in SST files with Compaction
> ---
>
> Key: SPARK-47036
> URL: https://issues.apache.org/jira/browse/SPARK-47036
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.5.0, 4.0.0, 3.5.1, 3.5.2
>Reporter: Bhuwan Sahni
>Assignee: Bhuwan Sahni
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>
> RocksDB compaction can result in version Id mismatch errors if the same 
> version is committed twice from the same executor. (Multiple commits can 
> happen due to Spark Stage/task retry).
> A particular scenario where this can happen is provided below: 
>  1. Version V1 is loaded on executor A. The RocksDB State Store has 195.sst, 
> 196.sst, 197.sst and 198.sst files. 
> 2. State changes are made, which result in creation of a new table file 
> 200.sst. 
> 3. The state store is committed as version V2. The SST file 200.sst (as 
> 000200-8c80161a-bc23-4e3b-b175-cffe38e427c7.sst) is uploaded to DFS, and the 
> previous 4 files are reused. A new metadata file is created to track the 
> exact SST files with unique IDs, and uploaded with the RocksDB Manifest as part 
> of V2.zip.
> 4. RocksDB compaction is triggered at the same time. The compaction creates 
> a new L1 file (201.sst), and deletes the existing 5 SST files.
> 5. The Spark Stage is retried. 
> 6. Version V1 is reloaded on the same executor. The local files are 
> inspected, and 201.sst is deleted. The 4 SST files in version V1 are 
> downloaded again to the local file system. 
> 7. Any local files which are deleted (as part of the version load) are also 
> removed from the local → DFS file upload tracking. **However, the files already 
> deleted as a result of compaction are not removed from tracking. This is the 
> bug which resulted in the failure.**
> 8. The state store is committed as version V2 again. However, the local mapping 
> of SST files to DFS file paths still has 200.sst in its tracking, so the SST 
> file is not re-uploaded. A new metadata file is created to track the exact SST 
> files with unique IDs, and uploaded with the new RocksDB Manifest as part of 
> V2.zip. (The V2.zip file is overwritten here atomically.)
> 9. A new executor tries to load version V2. However, the SST files from (3) are 
> now incompatible with the Manifest file from (8), resulting in the version ID 
> mismatch failure.
>  
> We need to ensure that any files deleted from local filesystem post 
> compaction are not tracked in uploadedDFSFiles mapping if the same version is 
> loaded again.
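
For illustration, here is a hedged sketch of that tracking rule. The map name 
`uploadedDFSFiles` comes from the description above; everything else (the class, 
method names, and types) is a simplified assumption, not the actual 
RocksDBFileManager code.
{code:java}
// Simplified sketch: after a version is (re)loaded, any SST file that is no
// longer present locally -- whether replaced by the load or already deleted by
// compaction -- must be dropped from the local -> DFS upload tracking, so a
// later commit re-uploads it instead of trusting the stale DFS copy.
import scala.collection.mutable

class UploadTrackerSketch {
  // local SST file name -> DFS file name it was uploaded as
  private val uploadedDFSFiles = mutable.Map[String, String]()

  def recordUpload(localFile: String, dfsFile: String): Unit =
    uploadedDFSFiles(localFile) = dfsFile

  /** Called after loading a version: keep tracking only for files still on disk. */
  def pruneAfterLoad(localFilesPresent: Set[String]): Unit = {
    val stale = uploadedDFSFiles.keys.filterNot(localFilesPresent.contains).toList
    stale.foreach(uploadedDFSFiles.remove)
  }
}
{code}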






[jira] [Assigned] (SPARK-47124) Skip scheduled SparkR on Windows in fork repositories by default

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47124:
-

Assignee: Hyukjin Kwon

> Skip scheduled SparkR on Windows in fork repositories by default
> 
>
> Key: SPARK-47124
> URL: https://issues.apache.org/jira/browse/SPARK-47124
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, R
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
>
> To be consistent with other scheduled builds.






[jira] [Resolved] (SPARK-47124) Skip scheduled SparkR on Windows in fork repositories by default

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47124.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45208
[https://github.com/apache/spark/pull/45208]

> Skip scheduled SparkR on Windows in fork repositories by default
> 
>
> Key: SPARK-47124
> URL: https://issues.apache.org/jira/browse/SPARK-47124
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, R
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> To be consistent with other scheduled builds.






[jira] [Updated] (SPARK-47125) Return null if Univocity never triggers parsing

2024-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47125:
---
Labels: pull-request-available  (was: )

> Return null if Univocity never triggers parsing
> ---
>
> Key: SPARK-47125
> URL: https://issues.apache.org/jira/browse/SPARK-47125
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 4.0.0, 3.5.1
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
>
> See the linked PR






[jira] [Created] (SPARK-47125) Return null if Univocity never triggers parsing

2024-02-21 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-47125:


 Summary: Return null if Univocity never triggers parsing
 Key: SPARK-47125
 URL: https://issues.apache.org/jira/browse/SPARK-47125
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.2, 4.0.0, 3.5.1
Reporter: Hyukjin Kwon


See the linked PR






[jira] [Updated] (SPARK-47123) JDBCRDD does not correctly handle errors in getQueryOutputSchema

2024-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47123:
---
Labels: pull-request-available  (was: )

> JDBCRDD does not correctly handle errors in getQueryOutputSchema
> 
>
> Key: SPARK-47123
> URL: https://issues.apache.org/jira/browse/SPARK-47123
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Pablo Langa Blanco
>Priority: Minor
>  Labels: pull-request-available
>
> If there is an error executing statement.executeQuery(), it's possible that 
> another error thrown in one of the finally blocks hides the main 
> error.
> {code:java}
> def getQueryOutputSchema(
>       query: String, options: JDBCOptions, dialect: JdbcDialect): StructType 
> = {
>     val conn: Connection = dialect.createConnectionFactory(options)(-1)
>     try {
>       val statement = conn.prepareStatement(query)
>       try {
>         statement.setQueryTimeout(options.queryTimeout)
>         val rs = statement.executeQuery()
>         try {
>           JdbcUtils.getSchema(rs, dialect, alwaysNullable = true,
>             isTimestampNTZ = options.preferTimestampNTZ)
>         } finally {
>           rs.close()
>         }
>       } finally {
>         statement.close()
>       }
>     } finally {
>       conn.close()
>     }
>   } {code}






[jira] [Updated] (SPARK-47124) Skip scheduled SparkR on Windows in fork repositories by default

2024-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47124:
---
Labels: pull-request-available  (was: )

> Skip scheduled SparkR on Windows in fork repositories by default
> 
>
> Key: SPARK-47124
> URL: https://issues.apache.org/jira/browse/SPARK-47124
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, R
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
>
> To be consistent with other scheduled builds.






[jira] [Updated] (SPARK-47124) Skip scheduled SparkR on Windows in fork repositories by default

2024-02-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-47124:
-
Priority: Minor  (was: Major)

> Skip scheduled SparkR on Windows in fork repositories by default
> 
>
> Key: SPARK-47124
> URL: https://issues.apache.org/jira/browse/SPARK-47124
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, R
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> To be consistent with other scheduled builds.






[jira] [Updated] (SPARK-47124) Skip scheduled SparkR on Windows in fork repositories by default

2024-02-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-47124:
-
Summary: Skip scheduled SparkR on Windows in fork repositories by default  
(was: Skip SparkR on Windows in fork repositories)

> Skip scheduled SparkR on Windows in fork repositories by default
> 
>
> Key: SPARK-47124
> URL: https://issues.apache.org/jira/browse/SPARK-47124
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, R
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> To be consistent with other scheduled builds.






[jira] [Created] (SPARK-47124) Skip SparkR on Windows in fork repositories

2024-02-21 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-47124:


 Summary: Skip SparkR on Windows in fork repositories
 Key: SPARK-47124
 URL: https://issues.apache.org/jira/browse/SPARK-47124
 Project: Spark
  Issue Type: Test
  Components: Project Infra, R
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon


To be consistent with other scheduled builds.






[jira] [Updated] (SPARK-31745) Enable Hive related test cases of SparkR on Windows

2024-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-31745:
---
Labels: pull-request-available  (was: )

> Enable Hive related test cases of SparkR on Windows
> ---
>
> Key: SPARK-31745
> URL: https://issues.apache.org/jira/browse/SPARK-31745
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> Hive related tests appear to be skipped in AppVeyor:
> {code}
> test_sparkSQL.R:307: skip: create DataFrame from RDD
> Reason: Hive is not build with SparkSQL, skipped
> test_sparkSQL.R:1341: skip: test HiveContext
> Reason: Hive is not build with SparkSQL, skipped
> test_sparkSQL.R:2813: skip: read/write ORC files
> Reason: Hive is not build with SparkSQL, skipped
> test_sparkSQL.R:2834: skip: read/write ORC files - compression option
> Reason: Hive is not build with SparkSQL, skipped
> test_sparkSQL.R:3727: skip: enableHiveSupport on SparkSession
> Reason: Hive is not build with SparkSQL, skipped
> {code}






[jira] [Updated] (SPARK-31745) Enable Hive related test cases of SparkR on Windows

2024-02-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31745:
-
Summary: Enable Hive related test cases of SparkR on Windows  (was: Enable 
Hive related test cases of SparkR in AppVeyor)

> Enable Hive related test cases of SparkR on Windows
> ---
>
> Key: SPARK-31745
> URL: https://issues.apache.org/jira/browse/SPARK-31745
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Hive related tests appear to be skipped in AppVeyor:
> {code}
> test_sparkSQL.R:307: skip: create DataFrame from RDD
> Reason: Hive is not build with SparkSQL, skipped
> test_sparkSQL.R:1341: skip: test HiveContext
> Reason: Hive is not build with SparkSQL, skipped
> test_sparkSQL.R:2813: skip: read/write ORC files
> Reason: Hive is not build with SparkSQL, skipped
> test_sparkSQL.R:2834: skip: read/write ORC files - compression option
> Reason: Hive is not build with SparkSQL, skipped
> test_sparkSQL.R:3727: skip: enableHiveSupport on SparkSession
> Reason: Hive is not build with SparkSQL, skipped
> {code}






[jira] [Created] (SPARK-47123) JDBCRDD does not correctly handle errors in getQueryOutputSchema

2024-02-21 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-47123:
--

 Summary: JDBCRDD does not correctly handle errors in 
getQueryOutputSchema
 Key: SPARK-47123
 URL: https://issues.apache.org/jira/browse/SPARK-47123
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.5.0, 4.0.0
Reporter: Pablo Langa Blanco


If there is an error executing statement.executeQuery(), it's possible that 
another error thrown in one of the finally blocks hides the main error.
{code:java}
def getQueryOutputSchema(
      query: String, options: JDBCOptions, dialect: JdbcDialect): StructType = {
    val conn: Connection = dialect.createConnectionFactory(options)(-1)
    try {
      val statement = conn.prepareStatement(query)
      try {
        statement.setQueryTimeout(options.queryTimeout)
        val rs = statement.executeQuery()
        try {
          JdbcUtils.getSchema(rs, dialect, alwaysNullable = true,
            isTimestampNTZ = options.preferTimestampNTZ)
        } finally {
          rs.close()
        }
      } finally {
        statement.close()
      }
    } finally {
      conn.close()
    }
  } {code}
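
As a hedged illustration of the masking problem (not the fix in the linked PR), 
one common pattern is shown below: if cleanup in a finally block also throws, 
attach that failure as a suppressed exception so the original error from 
executeQuery() remains the one reported. The helper name is made up for this sketch.
{code:java}
// Illustrative helper only, not Spark's actual change. If both the body and the
// cleanup fail, the cleanup failure is recorded as a suppressed exception
// instead of replacing the primary one.
def tryWithSafeClose[A](body: => A)(close: => Unit): A = {
  var primary: Throwable = null
  try {
    body
  } catch {
    case t: Throwable =>
      primary = t
      throw t
  } finally {
    try {
      close
    } catch {
      // Keep the main error visible; note the cleanup failure alongside it.
      case t: Throwable if primary != null => primary.addSuppressed(t)
      // With no primary error, a close() failure still propagates normally.
    }
  }
}

// Usage sketch: tryWithSafeClose(statement.executeQuery())(statement.close())
// surfaces the executeQuery() failure even when close() also throws.
{code}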






[jira] [Commented] (SPARK-47123) JDBCRDD does not correctly handle errors in getQueryOutputSchema

2024-02-21 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819413#comment-17819413
 ] 

Pablo Langa Blanco commented on SPARK-47123:


I'm working on it

> JDBCRDD does not correctly handle errors in getQueryOutputSchema
> 
>
> Key: SPARK-47123
> URL: https://issues.apache.org/jira/browse/SPARK-47123
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Pablo Langa Blanco
>Priority: Minor
>
> If there is an error executing statement.executeQuery(), it's possible that 
> another error thrown in one of the finally blocks hides the main 
> error.
> {code:java}
> def getQueryOutputSchema(
>       query: String, options: JDBCOptions, dialect: JdbcDialect): StructType 
> = {
>     val conn: Connection = dialect.createConnectionFactory(options)(-1)
>     try {
>       val statement = conn.prepareStatement(query)
>       try {
>         statement.setQueryTimeout(options.queryTimeout)
>         val rs = statement.executeQuery()
>         try {
>           JdbcUtils.getSchema(rs, dialect, alwaysNullable = true,
>             isTimestampNTZ = options.preferTimestampNTZ)
>         } finally {
>           rs.close()
>         }
>       } finally {
>         statement.close()
>       }
>     } finally {
>       conn.close()
>     }
>   } {code}






[jira] [Resolved] (SPARK-47121) Avoid noisy RejectedExecutionExceptions during StandaloneSchedulerBackend shutdown

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47121.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45203
[https://github.com/apache/spark/pull/45203]

> Avoid noisy RejectedExecutionExceptions during StandaloneSchedulerBackend 
> shutdown
> --
>
> Key: SPARK-47121
> URL: https://issues.apache.org/jira/browse/SPARK-47121
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 3.5.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> While it is in the process of shutting down, the StandaloneSchedulerBackend 
> might throw RejectedExecutionExceptions when RPC handler `onDisconnected` 
> methods attempt to submit new tasks to a stopped executorDelayRemoveThread 
> executor service. We can reduce log and uncaught exception noise by catching 
> and ignoring these exceptions if they occur during shutdown.
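
A hedged sketch of that mitigation follows. The field names below are assumptions 
taken from the description (e.g. `executorDelayRemoveThread`), not the actual 
StandaloneSchedulerBackend code: the idea is simply to swallow 
RejectedExecutionException only while the backend is stopping.
{code:java}
// Illustrative only; `stopping` and `executorDelayRemoveThread` stand in for the
// real backend state described above.
import java.util.concurrent.{Executors, RejectedExecutionException, ScheduledExecutorService}
import java.util.concurrent.atomic.AtomicBoolean

class DisconnectHandlerSketch {
  private val stopping = new AtomicBoolean(false)
  private val executorDelayRemoveThread: ScheduledExecutorService =
    Executors.newSingleThreadScheduledExecutor()

  def onDisconnected(executorId: String): Unit = {
    try {
      executorDelayRemoveThread.submit(new Runnable {
        override def run(): Unit = { /* schedule delayed executor removal */ }
      })
    } catch {
      // During shutdown the pool is already stopped; ignore the rejection rather
      // than surfacing a noisy uncaught exception in the logs.
      case _: RejectedExecutionException if stopping.get() => ()
    }
  }

  def stop(): Unit = {
    stopping.set(true)
    executorDelayRemoveThread.shutdownNow()
  }
}
{code}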






[jira] [Assigned] (SPARK-47122) Pin `buf-setup-action` to `v1.29.0`

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47122:
-

Assignee: Dongjoon Hyun

> Pin `buf-setup-action` to `v1.29.0`
> ---
>
> Key: SPARK-47122
> URL: https://issues.apache.org/jira/browse/SPARK-47122
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-47122) Pin `buf-setup-action` to `v1.29.0`

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47122.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45205
[https://github.com/apache/spark/pull/45205]

> Pin `buf-setup-action` to `v1.29.0`
> ---
>
> Key: SPARK-47122
> URL: https://issues.apache.org/jira/browse/SPARK-47122
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-47122) Pin `buf-setup-action` to `v1.29.0`

2024-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47122:
---
Labels: pull-request-available  (was: )

> Pin `buf-setup-action` to `v1.29.0`
> ---
>
> Key: SPARK-47122
> URL: https://issues.apache.org/jira/browse/SPARK-47122
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-47122) Pin `buf-setup-action` to `v1.29.0`

2024-02-21 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47122:
-

 Summary: Pin `buf-setup-action` to `v1.29.0`
 Key: SPARK-47122
 URL: https://issues.apache.org/jira/browse/SPARK-47122
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun









[jira] [Resolved] (SPARK-47104) Spark SQL query fails with NullPointerException

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47104.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45199
[https://github.com/apache/spark/pull/45199]

> Spark SQL query fails with NullPointerException
> ---
>
> Key: SPARK-47104
> URL: https://issues.apache.org/jira/browse/SPARK-47104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.3, 3.2.1, 3.4.2, 3.5.0
>Reporter: Chhavi Bansal
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> I am trying to run a very simple SQL query involving a join and an order by 
> clause, then using the UUID() function in the outermost select statement. The query fails:
> {code:java}
> val df = spark.read.format("csv").option("header", 
> "true").load("src/main/resources/titanic.csv")
> df.createOrReplaceTempView("titanic")
> val query = spark.sql(" select name, uuid() as _iid from (select s.name from 
> titanic s join titanic t on s.name = t.name order by name) ;") 
> query.show() // FAILS{code}
> Dataset is a normal csv file with the following columns
> {code:java}
> PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
>  {code}
> Below is the error
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> at scala.collection.TraversableLike.map(TraversableLike.scala:237)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
> at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
> at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
> at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:808)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:785)
> at 
> hyperspace2.sparkPlan$.delayedEndpoint$hyperspace2$sparkPlan$1(sparkPlan.scala:14)
> at hyperspace2.sparkPlan$delayedInit$body.apply(sparkPlan.scala:6)
> at scala.Function0.apply$mcV$sp(Function0.scala:39)
> at scala.Function0.apply$mcV$sp$(Function0.scala:39)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
> at scala.App.$anonfun$main$1$adapted(App.scala:80)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.App.main(App.scala:80)
> at scala.App.main$(App.scala:78)
> at hyperspace2.sparkPlan$.main(sparkPlan.scala:6)
> at hyperspace2.sparkPlan.main(sparkPlan.scala) {code}
> Note:
>  # If I remove the order by clause, it produces the correct output.
>  # This happens when I read the dataset from a csv file; it works fine if I make 
> the dataframe using Seq().toDF
>  # The query fails if I use spark.sql("query").show(), but succeeds when I 
> simply write it to a csv file
> [https://stackoverflow.com/questions/78020267/spark-sql-query-fails-with-nullpointerexception]
> 

[jira] [Assigned] (SPARK-47104) Spark SQL query fails with NullPointerException

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47104:
-

Assignee: Bruce Robbins

> Spark SQL query fails with NullPointerException
> ---
>
> Key: SPARK-47104
> URL: https://issues.apache.org/jira/browse/SPARK-47104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.3, 3.2.1, 3.4.2, 3.5.0
>Reporter: Chhavi Bansal
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: pull-request-available
>
> I am trying to run a very simple SQL query involving a join and an order by 
> clause, then using the UUID() function in the outermost select statement. The query fails:
> {code:java}
> val df = spark.read.format("csv").option("header", 
> "true").load("src/main/resources/titanic.csv")
> df.createOrReplaceTempView("titanic")
> val query = spark.sql(" select name, uuid() as _iid from (select s.name from 
> titanic s join titanic t on s.name = t.name order by name) ;") 
> query.show() // FAILS{code}
> Dataset is a normal csv file with the following columns
> {code:java}
> PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
>  {code}
> Below is the error
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> at scala.collection.TraversableLike.map(TraversableLike.scala:237)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
> at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
> at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
> at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:808)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:785)
> at 
> hyperspace2.sparkPlan$.delayedEndpoint$hyperspace2$sparkPlan$1(sparkPlan.scala:14)
> at hyperspace2.sparkPlan$delayedInit$body.apply(sparkPlan.scala:6)
> at scala.Function0.apply$mcV$sp(Function0.scala:39)
> at scala.Function0.apply$mcV$sp$(Function0.scala:39)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
> at scala.App.$anonfun$main$1$adapted(App.scala:80)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.App.main(App.scala:80)
> at scala.App.main$(App.scala:78)
> at hyperspace2.sparkPlan$.main(sparkPlan.scala:6)
> at hyperspace2.sparkPlan.main(sparkPlan.scala) {code}
> Note:
>  # If I remove the order by clause, it produces the correct output.
>  # This happens when I read the dataset from a csv file; it works fine if I make 
> the dataframe using Seq().toDF
>  # The query fails if I use spark.sql("query").show(), but succeeds when I 
> simply write it to a csv file
> [https://stackoverflow.com/questions/78020267/spark-sql-query-fails-with-nullpointerexception]
> Could someone please look into why this happens only when using `show()`, since 
> this is failing queries in production for me.




[jira] [Updated] (SPARK-47104) Spark SQL query fails with NullPointerException

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47104:
--
Affects Version/s: 3.0.3

> Spark SQL query fails with NullPointerException
> ---
>
> Key: SPARK-47104
> URL: https://issues.apache.org/jira/browse/SPARK-47104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.3, 3.2.1, 3.4.2, 3.5.0
>Reporter: Chhavi Bansal
>Priority: Major
>  Labels: pull-request-available
>
> I am trying to run a very simple SQL query involving a join and an order by 
> clause, then using the UUID() function in the outermost select statement. The query fails:
> {code:java}
> val df = spark.read.format("csv").option("header", 
> "true").load("src/main/resources/titanic.csv")
> df.createOrReplaceTempView("titanic")
> val query = spark.sql(" select name, uuid() as _iid from (select s.name from 
> titanic s join titanic t on s.name = t.name order by name) ;") 
> query.show() // FAILS{code}
> Dataset is a normal csv file with the following columns
> {code:java}
> PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
>  {code}
> Below is the error
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> at scala.collection.TraversableLike.map(TraversableLike.scala:237)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
> at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
> at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
> at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:808)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:785)
> at 
> hyperspace2.sparkPlan$.delayedEndpoint$hyperspace2$sparkPlan$1(sparkPlan.scala:14)
> at hyperspace2.sparkPlan$delayedInit$body.apply(sparkPlan.scala:6)
> at scala.Function0.apply$mcV$sp(Function0.scala:39)
> at scala.Function0.apply$mcV$sp$(Function0.scala:39)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
> at scala.App.$anonfun$main$1$adapted(App.scala:80)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.App.main(App.scala:80)
> at scala.App.main$(App.scala:78)
> at hyperspace2.sparkPlan$.main(sparkPlan.scala:6)
> at hyperspace2.sparkPlan.main(sparkPlan.scala) {code}
> Note:
>  # If I remove the order by clause, it produces the correct output.
>  # This happens when I read the dataset from a csv file; it works fine if I make 
> the dataframe using Seq().toDF
>  # The query fails if I use spark.sql("query").show(), but succeeds when I 
> simply write it to a csv file
> [https://stackoverflow.com/questions/78020267/spark-sql-query-fails-with-nullpointerexception]
> Could someone please look into why this happens only when using `show()`, since 
> this is failing queries in production for me.




[jira] [Updated] (SPARK-47104) Spark SQL query fails with NullPointerException

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47104:
--
Affects Version/s: 3.1.3

> Spark SQL query fails with NullPointerException
> ---
>
> Key: SPARK-47104
> URL: https://issues.apache.org/jira/browse/SPARK-47104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3, 3.2.1, 3.4.2, 3.5.0
>Reporter: Chhavi Bansal
>Priority: Major
>  Labels: pull-request-available
>
> I am trying to run a very simple SQL query involving a join and an order by 
> clause, then using the UUID() function in the outermost select statement. The query fails:
> {code:java}
> val df = spark.read.format("csv").option("header", 
> "true").load("src/main/resources/titanic.csv")
> df.createOrReplaceTempView("titanic")
> val query = spark.sql(" select name, uuid() as _iid from (select s.name from 
> titanic s join titanic t on s.name = t.name order by name) ;") 
> query.show() // FAILS{code}
> Dataset is a normal csv file with the following columns
> {code:java}
> PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
>  {code}
> Below is the error
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> at scala.collection.TraversableLike.map(TraversableLike.scala:237)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
> at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
> at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
> at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:808)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:785)
> at 
> hyperspace2.sparkPlan$.delayedEndpoint$hyperspace2$sparkPlan$1(sparkPlan.scala:14)
> at hyperspace2.sparkPlan$delayedInit$body.apply(sparkPlan.scala:6)
> at scala.Function0.apply$mcV$sp(Function0.scala:39)
> at scala.Function0.apply$mcV$sp$(Function0.scala:39)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
> at scala.App.$anonfun$main$1$adapted(App.scala:80)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.App.main(App.scala:80)
> at scala.App.main$(App.scala:78)
> at hyperspace2.sparkPlan$.main(sparkPlan.scala:6)
> at hyperspace2.sparkPlan.main(sparkPlan.scala) {code}
> Note:
>  # If I remove the order by clause then the query produces the correct output.
>  # This happens when I read the dataset from a csv file; it works fine if I make 
> the dataframe using Seq().toDF().
>  # The query fails if I use spark.sql("query").show() but succeeds when I 
> simply write the result to a csv file.
> [https://stackoverflow.com/questions/78020267/spark-sql-query-fails-with-nullpointerexception]
> Please can someone look into why this happens only when using `show()`, since 
> this is failing queries in production for me.
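A self-contained sketch of the reported repro, assuming Spark 3.5.x on a local master; the object name and the tiny temporary CSV written below are illustrative, not taken from the reporter's dataset:

{code:scala}
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

object Spark47104Repro extends App {
  val spark = SparkSession.builder().master("local[*]").appName("SPARK-47104").getOrCreate()

  // Back the view with an actual CSV scan, since the reporter only sees the NPE
  // when the data comes from a csv read (not from Seq(...).toDF()).
  val dir = Files.createTempDirectory("titanic").toString
  Files.write(Paths.get(dir, "titanic.csv"), "Name\nAlice\nBob\nCarol".getBytes)

  val df = spark.read.format("csv").option("header", "true").load(dir)
  df.createOrReplaceTempView("titanic")

  // Reported to fail with a NullPointerException inside the generated unsafe projection
  // when show() collects the TakeOrderedAndProject result.
  spark.sql(
    "select name, uuid() as _iid from (select s.name from titanic s join titanic t on s.name = t.name order by name)"
  ).show()
}
{code}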



--
This message was sent by Atlassian Jira

[jira] [Resolved] (SPARK-47119) Add `hive-jackson-provided` profile

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47119.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45201
[https://github.com/apache/spark/pull/45201]

> Add `hive-jackson-provided` profile
> ---
>
> Key: SPARK-47119
> URL: https://issues.apache.org/jira/browse/SPARK-47119
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47036) RocksDB versionID Mismatch in SST files with Compaction

2024-02-21 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-47036:


Assignee: Bhuwan Sahni

> RocksDB versionID Mismatch in SST files with Compaction
> ---
>
> Key: SPARK-47036
> URL: https://issues.apache.org/jira/browse/SPARK-47036
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.5.0, 4.0.0, 3.5.1, 3.5.2
>Reporter: Bhuwan Sahni
>Assignee: Bhuwan Sahni
>Priority: Major
>  Labels: pull-request-available
>
> RocksDB compaction can result in version Id mismatch errors if the same 
> version is committed twice from the same executor. (Multiple commits can 
> happen due to Spark Stage/task retry.)
> A particular scenario where this can happen is provided below: 
> 1. Version V1 is loaded on executor A; the RocksDB State Store has the files 
> 195.sst, 196.sst, 197.sst and 198.sst. 
> 2. State changes are made, which result in the creation of a new table file, 
> 200.sst. 
> 3. The state store is committed as version V2. The SST file 200.sst (as 
> 000200-8c80161a-bc23-4e3b-b175-cffe38e427c7.sst) is uploaded to DFS, and the 
> previous 4 files are reused. A new metadata file is created to track the 
> exact SST files with unique IDs, and uploaded with the RocksDB Manifest as part 
> of V2.zip.
> 4. RocksDB compaction is triggered at the same time. The compaction creates 
> a new L1 file (201.sst) and deletes the existing 5 SST files.
> 5. The Spark Stage is retried. 
> 6. Version V1 is reloaded on the same executor. The local files are 
> inspected, and 201.sst is deleted. The 4 SST files in version V1 are 
> downloaded again to the local file system. 
> 7. Any local files which are deleted (as part of the version load) are also 
> removed from the local → DFS file upload tracking. **However, the files already 
> deleted as a result of compaction are not removed from tracking. This is the 
> bug which resulted in the failure.**
> 8. The state store is committed as version V2 again. However, the local mapping of SST 
> files to DFS file paths still has 200.sst in its tracking, hence the SST file 
> is not re-uploaded. A new metadata file is created to track the exact SST 
> files with unique IDs, and uploaded with the new RocksDB Manifest as part of 
> V2.zip. (The V2.zip file is overwritten here atomically.)
> 9. A new executor tries to load version V2. However, the SST files in (1) are 
> now incompatible with the Manifest file in (6), resulting in the version Id 
> mismatch failure.
>  
> We need to ensure that any files deleted from the local filesystem post 
> compaction are not tracked in the uploadedDFSFiles mapping if the same version is 
> loaded again.
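A minimal sketch of the bookkeeping the last paragraph asks for, using an illustrative local-file-to-DFS-file map rather than the actual RocksDBFileManager API: on every version load, any tracked SST file that no longer exists locally (whether removed by the load itself or by a background compaction) is dropped from the upload tracking, so it will be re-uploaded on the next commit.

{code:scala}
import java.io.File
import scala.collection.mutable

// Illustrative tracking map: local SST file name -> DFS file name it was uploaded as.
val localToDfsFile = mutable.Map[String, String]()

// On (re)loading a version, forget uploads for any SST file that is no longer present
// locally, including files deleted by compaction, so the next commit re-uploads them.
def reconcileTrackingOnLoad(localDir: File): Unit = {
  val present = Option(localDir.listFiles()).getOrElse(Array.empty[File])
    .map(_.getName).filter(_.endsWith(".sst")).toSet
  val stale = localToDfsFile.keys.filterNot(present.contains).toList
  stale.foreach(localToDfsFile.remove)   // e.g. 200.sst from the scenario above
}
{code}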



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47036) RocksDB versionID Mismatch in SST files with Compaction

2024-02-21 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-47036.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45092
[https://github.com/apache/spark/pull/45092]

> RocksDB versionID Mismatch in SST files with Compaction
> ---
>
> Key: SPARK-47036
> URL: https://issues.apache.org/jira/browse/SPARK-47036
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.5.0, 4.0.0, 3.5.1, 3.5.2
>Reporter: Bhuwan Sahni
>Assignee: Bhuwan Sahni
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> RocksDB compaction can result in version Id mismatch errors if the same 
> version is committed twice from the same executor. (Multiple commits can 
> happen due to Spark Stage/task retry.)
> A particular scenario where this can happen is provided below: 
> 1. Version V1 is loaded on executor A; the RocksDB State Store has the files 
> 195.sst, 196.sst, 197.sst and 198.sst. 
> 2. State changes are made, which result in the creation of a new table file, 
> 200.sst. 
> 3. The state store is committed as version V2. The SST file 200.sst (as 
> 000200-8c80161a-bc23-4e3b-b175-cffe38e427c7.sst) is uploaded to DFS, and the 
> previous 4 files are reused. A new metadata file is created to track the 
> exact SST files with unique IDs, and uploaded with the RocksDB Manifest as part 
> of V2.zip.
> 4. RocksDB compaction is triggered at the same time. The compaction creates 
> a new L1 file (201.sst) and deletes the existing 5 SST files.
> 5. The Spark Stage is retried. 
> 6. Version V1 is reloaded on the same executor. The local files are 
> inspected, and 201.sst is deleted. The 4 SST files in version V1 are 
> downloaded again to the local file system. 
> 7. Any local files which are deleted (as part of the version load) are also 
> removed from the local → DFS file upload tracking. **However, the files already 
> deleted as a result of compaction are not removed from tracking. This is the 
> bug which resulted in the failure.**
> 8. The state store is committed as version V2 again. However, the local mapping of SST 
> files to DFS file paths still has 200.sst in its tracking, hence the SST file 
> is not re-uploaded. A new metadata file is created to track the exact SST 
> files with unique IDs, and uploaded with the new RocksDB Manifest as part of 
> V2.zip. (The V2.zip file is overwritten here atomically.)
> 9. A new executor tries to load version V2. However, the SST files in (1) are 
> now incompatible with the Manifest file in (6), resulting in the version Id 
> mismatch failure.
>  
> We need to ensure that any files deleted from the local filesystem post 
> compaction are not tracked in the uploadedDFSFiles mapping if the same version is 
> loaded again.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47121) Avoid noisy RejectedExecutionExceptions during StandaloneSchedulerBackend shutdown

2024-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47121:
---
Labels: pull-request-available  (was: )

> Avoid noisy RejectedExecutionExceptions during StandaloneSchedulerBackend 
> shutdown
> --
>
> Key: SPARK-47121
> URL: https://issues.apache.org/jira/browse/SPARK-47121
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 3.5.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>  Labels: pull-request-available
>
> While it is in the process of shutting down, the StandaloneSchedulerBackend 
> might throw RejectedExecutionExceptions when RPC handler `onDisconnected` 
> methods attempt to submit new tasks to a stopped executorDelayRemoveThread 
> executor service. We can reduce log and uncaught exception noise by catching 
> and ignoring these exceptions if they occur during shutdown.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47121) Avoid noisy RejectedExecutionExceptions during StandaloneSchedulerBackend shutdown

2024-02-21 Thread Josh Rosen (Jira)
Josh Rosen created SPARK-47121:
--

 Summary: Avoid noisy RejectedExecutionExceptions during 
StandaloneSchedulerBackend shutdown
 Key: SPARK-47121
 URL: https://issues.apache.org/jira/browse/SPARK-47121
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 3.5.0
Reporter: Josh Rosen
Assignee: Josh Rosen


While it is in the process of shutting down, the StandaloneSchedulerBackend 
might throw RejectedExecutionExceptions when RPC handler `onDisconnected` 
methods attempt to submit new tasks to a stopped executorDelayRemoveThread 
executor service. We can reduce log and uncaught exception noise by catching 
and ignoring these exceptions if they occur during shutdown.
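A minimal sketch of the mitigation, assuming a `stopping` flag set when shutdown begins (the flag, method and parameter names are illustrative; only `executorDelayRemoveThread` and the `onDisconnected` path come from the description): rejected submissions are swallowed once shutdown has started, instead of propagating as uncaught exceptions.

{code:scala}
import java.util.concurrent.{Executors, RejectedExecutionException, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

val stopping = new AtomicBoolean(false)
val executorDelayRemoveThread = Executors.newSingleThreadScheduledExecutor()

// Called from the RPC handler's onDisconnected path; ignores the rejection that can
// happen when the pool has already been shut down, and rethrows otherwise.
def scheduleExecutorRemoval(executorId: String, delayMs: Long)(remove: String => Unit): Unit = {
  try {
    executorDelayRemoveThread.schedule(new Runnable {
      override def run(): Unit = remove(executorId)
    }, delayMs, TimeUnit.MILLISECONDS)
  } catch {
    case _: RejectedExecutionException if stopping.get() =>
      // Expected race during StandaloneSchedulerBackend shutdown; drop it quietly.
  }
}
{code}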



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47120) Null comparison push down data filter from subquery produces in NPE in Parquet filter

2024-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47120:
---
Labels: pull-request-available  (was: )

> Null comparison push down data filter from subquery produces in NPE in 
> Parquet filter
> -
>
> Key: SPARK-47120
> URL: https://issues.apache.org/jira/browse/SPARK-47120
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Cosmin Dumitru
>Priority: Major
>  Labels: pull-request-available
>
> This issue has been introduced in 
> [https://github.com/apache/spark/pull/41088]  where we convert scalar 
> subqueries to literals and then convert the literals to 
> {{{}org.apache.spark.sql.sources.Filters{}}}. These filters are then pushed 
> down to parquet.
> If the literal is a comparison with {{null}} then the parquet filter 
> conversion code throws NPE. 
>  
> repro code which results in NPE
> {code:java}
> create table t1(d date) using parquet
> create table t2(d date) using parquet
> insert into t1 values date'2021-01-01'
> insert into t2 values (null)
> select * from t1 where 1=1 and d > (select d from t2){code}
> I'll provide a fix PR shortly
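A minimal sketch of the guard the fix needs, written against a simplified stand-in for the pushdown translation rather than the actual ParquetFilters code: a comparison whose literal value is null is simply not pushed down, instead of letting the null reach the Parquet filter builder.

{code:scala}
import org.apache.spark.sql.sources.{Filter, GreaterThan}

// Illustrative translation step: `d > NULL` (the scalar subquery in the repro returns NULL)
// yields no pushed-down filter instead of a predicate that NPEs during conversion.
def translateGreaterThan(attribute: String, value: Any): Option[Filter] =
  if (value == null) None else Some(GreaterThan(attribute, value))
{code}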



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47120) Null comparison push down data filter from subquery produces in NPE in Parquet filter

2024-02-21 Thread Cosmin Dumitru (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cosmin Dumitru updated SPARK-47120:
---
Description: 
This issue has been introduced in [https://github.com/apache/spark/pull/41088]  
where we convert scalar subqueries to literals and then convert the literals to 
{{{}org.apache.spark.sql.sources.Filters{}}}. These filters are then pushed 
down to parquet.

If the literal is a comparison with {{null}} then the parquet filter conversion 
code throws NPE. 

 

repro code which results in NPE
{code:java}
create table t1(d date) using parquet
create table t2(d date) using parquet
insert into t1 values date'2021-01-01'
insert into t2 values (null)
select * from t1 where 1=1 and d > (select d from t2){code}
[fix PR |https://github.com/apache/spark/pull/45202/files]

  was:
This issue has been introduced in [https://github.com/apache/spark/pull/41088]  
where we convert scalar subqueries to literals and then convert the literals to 
{{{}org.apache.spark.sql.sources.Filters{}}}. These filters are then pushed 
down to parquet.

If the literal is a comparison with {{null}} then the parquet filter conversion 
code throws NPE. 

 

repro code which results in NPE
{code:java}
create table t1(d date) using parquet
create table t2(d date) using parquet
insert into t1 values date'2021-01-01'
insert into t2 values (null)
select * from t1 where 1=1 and d > (select d from t2){code}
I'll provide a fix PR shortly


> Null comparison push down data filter from subquery produces in NPE in 
> Parquet filter
> -
>
> Key: SPARK-47120
> URL: https://issues.apache.org/jira/browse/SPARK-47120
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Cosmin Dumitru
>Priority: Major
>  Labels: pull-request-available
>
> This issue has been introduced in 
> [https://github.com/apache/spark/pull/41088]  where we convert scalar 
> subqueries to literals and then convert the literals to 
> {{{}org.apache.spark.sql.sources.Filters{}}}. These filters are then pushed 
> down to parquet.
> If the literal is a comparison with {{null}} then the parquet filter 
> conversion code throws NPE. 
>  
> repro code which results in NPE
> {code:java}
> create table t1(d date) using parquet
> create table t2(d date) using parquet
> insert into t1 values date'2021-01-01'
> insert into t2 values (null)
> select * from t1 where 1=1 and d > (select d from t2){code}
> [fix PR |https://github.com/apache/spark/pull/45202/files]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47120) Null comparison push down data filter from subquery produces in NPE in Parquet filter

2024-02-21 Thread Cosmin Dumitru (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cosmin Dumitru updated SPARK-47120:
---
Description: 
This issue has been introduced in [https://github.com/apache/spark/pull/41088]  
where we convert scalar subqueries to literals and then convert the literals to 
{{{}org.apache.spark.sql.sources.Filters{}}}. These filters are then pushed 
down to parquet.

If the literal is a comparison with {{null}} then the parquet filter conversion 
code throws NPE. 

 

repro code which results in NPE
{code:java}
create table t1(d date) using parquet
create table t2(d date) using parquet
insert into t1 values date'2021-01-01'
insert into t2 values (null)
select * from t1 where 1=1 and d > (select d from t2){code}
I'll provide a fix PR shortly

  was:
This issue has been introduced in [https://github.com/apache/spark/pull/41088]  
where we convert scalar subqueries to literals and then convert the literals to 
{{{}org.apache.spark.sql.sources.Filters{}}}. These filters are then pushed 
down to parquet.

If the literal is a comparison with {{null}} then the parquet filter conversion 
code throws NPE. 

 

repro code which results in NPE
{code:java}
create table t1(d date) using parquet
create table t2(d date) using parquet
insert into t1 values date'2021-01-01'
insert into t2 values (null)
select * from t1 where 1=1 and d > (select d from t2){code}


> Null comparison push down data filter from subquery produces in NPE in 
> Parquet filter
> -
>
> Key: SPARK-47120
> URL: https://issues.apache.org/jira/browse/SPARK-47120
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Cosmin Dumitru
>Priority: Major
>
> This issue has been introduced in 
> [https://github.com/apache/spark/pull/41088]  where we convert scalar 
> subqueries to literals and then convert the literals to 
> {{{}org.apache.spark.sql.sources.Filters{}}}. These filters are then pushed 
> down to parquet.
> If the literal is a comparison with {{null}} then the parquet filter 
> conversion code throws NPE. 
>  
> repro code which results in NPE
> {code:java}
> create table t1(d date) using parquet
> create table t2(d date) using parquet
> insert into t1 values date'2021-01-01'
> insert into t2 values (null)
> select * from t1 where 1=1 and d > (select d from t2){code}
> I'll provide a fix PR shortly



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47120) Null comparison push down data filter from subquery produces in NPE in Parquet filter

2024-02-21 Thread Cosmin Dumitru (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cosmin Dumitru updated SPARK-47120:
---
Description: 
This issue has been introduced in [https://github.com/apache/spark/pull/41088]  
where we convert scalar subqueries to literals and then convert the literals to 
{{{}org.apache.spark.sql.sources.Filters{}}}. These filters are then pushed 
down to parquet.

If the literal is a comparison with {{null}} then the parquet filter conversion 
code throws NPE. 

 

repro code which results in NPE
{code:java}
create table t1(d date) using parquet
create table t2(d date) using parquet
insert into t1 values date'2021-01-01'
insert into t2 values (null)
select * from t1 where 1=1 and d > (select d from t2){code}

> Null comparison push down data filter from subquery produces in NPE in 
> Parquet filter
> -
>
> Key: SPARK-47120
> URL: https://issues.apache.org/jira/browse/SPARK-47120
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Cosmin Dumitru
>Priority: Major
>
> This issue has been introduced in 
> [https://github.com/apache/spark/pull/41088]  where we convert scalar 
> subqueries to literals and then convert the literals to 
> {{{}org.apache.spark.sql.sources.Filters{}}}. These filters are then pushed 
> down to parquet.
> If the literal is a comparison with {{null}} then the parquet filter 
> conversion code throws NPE. 
>  
> repro code which results in NPE
> {code:java}
> create table t1(d date) using parquet
> create table t2(d date) using parquet
> insert into t1 values date'2021-01-01'
> insert into t2 values (null)
> select * from t1 where 1=1 and d > (select d from t2){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47119) Add `hive-jackson-provided` profile

2024-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47119:
---
Labels: pull-request-available  (was: )

> Add `hive-jackson-provided` profile
> ---
>
> Key: SPARK-47119
> URL: https://issues.apache.org/jira/browse/SPARK-47119
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47119) Add `hive-jackson-provided` profile

2024-02-21 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47119:
-

 Summary: Add `hive-jackson-provided` profile
 Key: SPARK-47119
 URL: https://issues.apache.org/jira/browse/SPARK-47119
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47119) Add `hive-jackson-provided` profile

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47119:
-

Assignee: Dongjoon Hyun

> Add `hive-jackson-provided` profile
> ---
>
> Key: SPARK-47119
> URL: https://issues.apache.org/jira/browse/SPARK-47119
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47099) The `start` value of `paramIndex` for the error class `UNEXPECTED_INPUT_TYPE` should be `1`1

2024-02-21 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-47099.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45177
[https://github.com/apache/spark/pull/45177]

> The `start` value of `paramIndex` for the error class `UNEXPECTED_INPUT_TYPE` 
> should be `1`1
> 
>
> Key: SPARK-47099
> URL: https://issues.apache.org/jira/browse/SPARK-47099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47104) Spark SQL query fails with NullPointerException

2024-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47104:
---
Labels: pull-request-available  (was: )

> Spark SQL query fails with NullPointerException
> ---
>
> Key: SPARK-47104
> URL: https://issues.apache.org/jira/browse/SPARK-47104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.4.2, 3.5.0
>Reporter: Chhavi Bansal
>Priority: Major
>  Labels: pull-request-available
>
> I am trying to run a very simple SQL query involving join and orderby clause 
> and then using UUID() function in the outermost select stmt. The query fails
> {code:java}
> val df = spark.read.format("csv").option("header", 
> "true").load("src/main/resources/titanic.csv")
> df.createOrReplaceTempView("titanic")
> val query = spark.sql(" select name, uuid() as _iid from (select s.name from 
> titanic s join titanic t on s.name = t.name order by name) ;") 
> query.show() // FAILS{code}
> Dataset is a normal csv file with the following columns
> {code:java}
> PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
>  {code}
> Below is the error
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> at scala.collection.TraversableLike.map(TraversableLike.scala:237)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
> at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
> at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
> at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:808)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:785)
> at 
> hyperspace2.sparkPlan$.delayedEndpoint$hyperspace2$sparkPlan$1(sparkPlan.scala:14)
> at hyperspace2.sparkPlan$delayedInit$body.apply(sparkPlan.scala:6)
> at scala.Function0.apply$mcV$sp(Function0.scala:39)
> at scala.Function0.apply$mcV$sp$(Function0.scala:39)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
> at scala.App.$anonfun$main$1$adapted(App.scala:80)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.App.main(App.scala:80)
> at scala.App.main$(App.scala:78)
> at hyperspace2.sparkPlan$.main(sparkPlan.scala:6)
> at hyperspace2.sparkPlan.main(sparkPlan.scala) {code}
> Note:
>  # If I remove the order by clause then the query produces the correct output.
>  # This happens when I read the dataset from a csv file; it works fine if I make 
> the dataframe using Seq().toDF().
>  # The query fails if I use spark.sql("query").show() but succeeds when I 
> simply write the result to a csv file.
> [https://stackoverflow.com/questions/78020267/spark-sql-query-fails-with-nullpointerexception]
> Please can someone look into why this happens only when using `show()`, since 
> this is failing queries in production for me.



--
This message was sent by Atlassian Jira

[jira] [Updated] (SPARK-44719) NoClassDefFoundError when using Hive UDF

2024-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44719:
---
Labels: pull-request-available  (was: )

> NoClassDefFoundError when using Hive UDF
> 
>
> Key: SPARK-44719
> URL: https://issues.apache.org/jira/browse/SPARK-44719
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0, 4.0.0
>
> Attachments: HiveUDFs-1.0-SNAPSHOT.jar
>
>
> How to reproduce:
> {noformat}
> spark-sql (default)> add jar 
> /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar;
> Time taken: 0.413 seconds
> spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 
> 'net.petrabarus.hiveudfs.LongToIP';
> Time taken: 0.038 seconds
> spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10);
> 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT 
> long_to_ip(2130706433L) FROM range(10)]
> java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory
>   at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43225) Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution

2024-02-21 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819341#comment-17819341
 ] 

Dongjoon Hyun commented on SPARK-43225:
---

For the record, this is logically reverted via SPARK-44719 at Spark 3.5.0 by 
the author, [~yumwang].

> Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution
> --
>
> Key: SPARK-43225
> URL: https://issues.apache.org/jira/browse/SPARK-43225
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Minor
> Fix For: 3.5.0
>
>
> To fix CVE issue: https://github.com/apache/spark/security/dependabot/50



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47085) Preformance issue on thrift API

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47085:
--
Fix Version/s: 3.4.3

> Preformance issue on thrift API
> ---
>
> Key: SPARK-47085
> URL: https://issues.apache.org/jira/browse/SPARK-47085
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Izek Greenfield
>Assignee: Izek Greenfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.3
>
>
> This new complexity was introduced in SPARK-39041.
> Before this PR the code was:
> {code:java}
> while (curRow < maxRows && iter.hasNext) {
>   val sparkRow = iter.next()
>   val row = ArrayBuffer[Any]()
>   var curCol = 0
>   while (curCol < sparkRow.length) {
>     if (sparkRow.isNullAt(curCol)) {
>       row += null
>     } else {
>       addNonNullColumnValue(sparkRow, row, curCol, timeFormatters)
>     }
>     curCol += 1
>   }
>   resultRowSet.addRow(row.toArray.asInstanceOf[Array[Object]])
>   curRow += 1
> }{code}
> This loop consumes the iterator directly, without the _*O(n^2)*_ complexity, so 
> this change just returns the code to what it was before.
>  
> In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
> {code:scala}
> ...
>  while (i < rowSize) {
>   val row = rows(i)
>   ...
> {code}
> It can be easily converted back into _*O(n)*_ complexity.
>  
>  
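A minimal sketch of the complexity difference, with a simplified row type rather than the real RowSetUtils signature: indexing `rows(i)` on a linear `Seq` (such as `List`) costs O(i) per access and O(n^2) overall, while a single iterator pass stays O(n).

{code:scala}
// O(n^2) when rows is a linear Seq such as List: rows(i) walks the prefix on every access.
def buildByIndex(rows: Seq[Array[Any]]): Unit = {
  val rowSize = rows.size
  var i = 0
  while (i < rowSize) {
    val row = rows(i)   // O(i) lookup
    // ... convert row ...
    i += 1
  }
}

// O(n): one pass over the iterator, as the pre-SPARK-39041 code did.
def buildByIterator(rows: Seq[Array[Any]]): Unit = {
  val iter = rows.iterator
  while (iter.hasNext) {
    val row = iter.next()
    // ... convert row ...
  }
}
{code}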



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46934) Read/write roundtrip for struct type with special characters with HMS

2024-02-21 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819309#comment-17819309
 ] 

Dongjoon Hyun commented on SPARK-46934:
---

You can track Apache Spark 4.0.0 activity in SPARK-44111, [~yutinglin].

> Read/write roundtrip for struct type with special characters with HMS 
> --
>
> Key: SPARK-46934
> URL: https://issues.apache.org/jira/browse/SPARK-46934
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
> Environment: Tested in Spark 3.3.0, 3.3.2.
>Reporter: Yu-Ting LIN
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We are trying to create a Hive View using the following SQL command "CREATE OR 
> REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810".
> Our table_2611810 has certain columns that contain special characters such as "/". 
> Here is the schema of this table.
> {code:java}
> contigName              string
> start                   bigint
> end                     bigint
> names                   array
> referenceAllele         string
> alternateAlleles        array
> qual                    double
> filters                 array
> splitFromMultiAllelic    boolean
> INFO_NCAMP              int
> INFO_ODDRATIO           double
> INFO_NM                 double
> INFO_DBSNP_CAF          array
> INFO_SPANPAIR           int
> INFO_TLAMP              int
> INFO_PSTD               double
> INFO_QSTD               double
> INFO_SBF                double
> INFO_AF                 array
> INFO_QUAL               double
> INFO_SHIFT3             int
> INFO_VARBIAS            string
> INFO_HICOV              int
> INFO_PMEAN              double
> INFO_MSI                double
> INFO_VD                 int
> INFO_DP                 int
> INFO_HICNT              int
> INFO_ADJAF              double
> INFO_SVLEN              int
> INFO_RSEQ               string
> INFO_MSigDb             array
> INFO_NMD                array
> INFO_ANN                
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>
> INFO_BIAS               string
> INFO_MQ                 double
> INFO_HIAF               double
> INFO_END                int
> INFO_SPLITREAD          int
> INFO_GDAMP              int
> INFO_LSEQ               string
> INFO_LOF                array
> INFO_SAMPLE             string
> INFO_AMPFLAG            int
> INFO_SN                 double
> INFO_SVTYPE             string
> INFO_TYPE               string
> INFO_MSILEN             double
> INFO_DUPRATE            double
> INFO_DBSNP_COMMON       int
> INFO_REFBIAS            string
> genotypes               
> array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>>
>  {code}
> You can see that column INFO_ANN is an array of struct and it contains columns 
> whose names have "/" inside, such as "cDNA_pos/cDNA_length", etc. 
> We believe this is the root cause of the following SparkException:
> {code:java}
> scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT 
> INFO_ANN FROM table_2611810")
> 24/01/31 07:50:02.658 [main] WARN  o.a.spark.sql.catalyst.util.package - 
> Truncated the string representation of a plan since it was too large. This 
> behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>,
>  column: INFO_ANN
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
>   at 
> 
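A compact repro of the shape described above, assuming a Hive-enabled SparkSession (the table and view names are illustrative; the field name comes from the reporter's schema): a struct field whose name contains "/" produces a Hive type string that the affected versions could not parse back when creating the view.

{code:scala}
val spark = org.apache.spark.sql.SparkSession.builder()
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE t_special (info STRUCT<`cDNA_pos/cDNA_length`: INT>) USING parquet")
// Reported to fail on the affected versions with
// "org.apache.spark.SparkException: Cannot recognize hive type string: ..."
spark.sql("CREATE OR REPLACE VIEW v_special AS SELECT info FROM t_special")
{code}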

[jira] [Commented] (SPARK-46934) Read/write roundtrip for struct type with special characters with HMS

2024-02-21 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819308#comment-17819308
 ] 

Dongjoon Hyun commented on SPARK-46934:
---

Thank you for confirming, [~yutinglin] and [~yao].

I revised this JIRA issue to an `Improvement` with `Affected Versions` 
4.0.0.

According to the Semantic Versioning policy, new features and improvements land 
only in feature releases and cannot be delivered in maintenance releases with PATCH version changes.
- https://spark.apache.org/versioning-policy.html
- https://semver.org

> Read/write roundtrip for struct type with special characters with HMS 
> --
>
> Key: SPARK-46934
> URL: https://issues.apache.org/jira/browse/SPARK-46934
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
> Environment: Tested in Spark 3.3.0, 3.3.2.
>Reporter: Yu-Ting LIN
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We are trying to create a Hive View using the following SQL command "CREATE OR 
> REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810".
> Our table_2611810 has certain columns that contain special characters such as "/". 
> Here is the schema of this table.
> {code:java}
> contigName              string
> start                   bigint
> end                     bigint
> names                   array
> referenceAllele         string
> alternateAlleles        array
> qual                    double
> filters                 array
> splitFromMultiAllelic    boolean
> INFO_NCAMP              int
> INFO_ODDRATIO           double
> INFO_NM                 double
> INFO_DBSNP_CAF          array
> INFO_SPANPAIR           int
> INFO_TLAMP              int
> INFO_PSTD               double
> INFO_QSTD               double
> INFO_SBF                double
> INFO_AF                 array
> INFO_QUAL               double
> INFO_SHIFT3             int
> INFO_VARBIAS            string
> INFO_HICOV              int
> INFO_PMEAN              double
> INFO_MSI                double
> INFO_VD                 int
> INFO_DP                 int
> INFO_HICNT              int
> INFO_ADJAF              double
> INFO_SVLEN              int
> INFO_RSEQ               string
> INFO_MSigDb             array
> INFO_NMD                array
> INFO_ANN                
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>
> INFO_BIAS               string
> INFO_MQ                 double
> INFO_HIAF               double
> INFO_END                int
> INFO_SPLITREAD          int
> INFO_GDAMP              int
> INFO_LSEQ               string
> INFO_LOF                array
> INFO_SAMPLE             string
> INFO_AMPFLAG            int
> INFO_SN                 double
> INFO_SVTYPE             string
> INFO_TYPE               string
> INFO_MSILEN             double
> INFO_DUPRATE            double
> INFO_DBSNP_COMMON       int
> INFO_REFBIAS            string
> genotypes               
> array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>>
>  {code}
> You can see that column INFO_ANN is an array of struct and it contains columns 
> whose names have "/" inside, such as "cDNA_pos/cDNA_length", etc. 
> We believe this is the root cause of the following SparkException:
> {code:java}
> scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT 
> INFO_ANN FROM table_2611810")
> 24/01/31 07:50:02.658 [main] WARN  o.a.spark.sql.catalyst.util.package - 
> Truncated the string representation of a plan since it was too large. This 
> behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>,
>  column: INFO_ANN
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at 

[jira] [Updated] (SPARK-47104) Spark SQL query fails with NullPointerException

2024-02-21 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-47104:
--
Affects Version/s: 3.5.0
   3.4.2

> Spark SQL query fails with NullPointerException
> ---
>
> Key: SPARK-47104
> URL: https://issues.apache.org/jira/browse/SPARK-47104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.4.2, 3.5.0
>Reporter: Chhavi Bansal
>Priority: Major
>
> I am trying to run a very simple SQL query involving join and orderby clause 
> and then using UUID() function in the outermost select stmt. The query fails
> {code:java}
> val df = spark.read.format("csv").option("header", 
> "true").load("src/main/resources/titanic.csv")
> df.createOrReplaceTempView("titanic")
> val query = spark.sql(" select name, uuid() as _iid from (select s.name from 
> titanic s join titanic t on s.name = t.name order by name) ;") 
> query.show() // FAILS{code}
> Dataset is a normal csv file with the following columns
> {code:java}
> PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
>  {code}
> Below is the error
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> at scala.collection.TraversableLike.map(TraversableLike.scala:237)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
> at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
> at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
> at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:808)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:785)
> at 
> hyperspace2.sparkPlan$.delayedEndpoint$hyperspace2$sparkPlan$1(sparkPlan.scala:14)
> at hyperspace2.sparkPlan$delayedInit$body.apply(sparkPlan.scala:6)
> at scala.Function0.apply$mcV$sp(Function0.scala:39)
> at scala.Function0.apply$mcV$sp$(Function0.scala:39)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
> at scala.App.$anonfun$main$1$adapted(App.scala:80)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.App.main(App.scala:80)
> at scala.App.main$(App.scala:78)
> at hyperspace2.sparkPlan$.main(sparkPlan.scala:6)
> at hyperspace2.sparkPlan.main(sparkPlan.scala) {code}
> Note:
>  # If I remove the order by clause then the query produces the correct output.
>  # This happens when I read the dataset from a csv file; it works fine if I make 
> the dataframe using Seq().toDF().
>  # The query fails if I use spark.sql("query").show() but succeeds when I 
> simply write the result to a csv file.
> [https://stackoverflow.com/questions/78020267/spark-sql-query-fails-with-nullpointerexception]
> Please can someone look into why this happens only when using `show()`, since 
> this is failing queries in production for me.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (SPARK-46934) Unable to create Hive View from certain Spark Dataframe StructType

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-46934:
--
Issue Type: Improvement  (was: Bug)

> Unable to create Hive View from certain Spark Dataframe StructType
> --
>
> Key: SPARK-46934
> URL: https://issues.apache.org/jira/browse/SPARK-46934
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
> Environment: Tested in Spark 3.3.0, 3.3.2.
>Reporter: Yu-Ting LIN
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We are trying to create a Hive View using the following SQL command "CREATE OR 
> REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810".
> Our table_2611810 has certain columns that contain special characters such as "/". 
> Here is the schema of this table.
> {code:java}
> contigName              string
> start                   bigint
> end                     bigint
> names                   array
> referenceAllele         string
> alternateAlleles        array
> qual                    double
> filters                 array
> splitFromMultiAllelic    boolean
> INFO_NCAMP              int
> INFO_ODDRATIO           double
> INFO_NM                 double
> INFO_DBSNP_CAF          array
> INFO_SPANPAIR           int
> INFO_TLAMP              int
> INFO_PSTD               double
> INFO_QSTD               double
> INFO_SBF                double
> INFO_AF                 array
> INFO_QUAL               double
> INFO_SHIFT3             int
> INFO_VARBIAS            string
> INFO_HICOV              int
> INFO_PMEAN              double
> INFO_MSI                double
> INFO_VD                 int
> INFO_DP                 int
> INFO_HICNT              int
> INFO_ADJAF              double
> INFO_SVLEN              int
> INFO_RSEQ               string
> INFO_MSigDb             array
> INFO_NMD                array
> INFO_ANN                
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>
> INFO_BIAS               string
> INFO_MQ                 double
> INFO_HIAF               double
> INFO_END                int
> INFO_SPLITREAD          int
> INFO_GDAMP              int
> INFO_LSEQ               string
> INFO_LOF                array
> INFO_SAMPLE             string
> INFO_AMPFLAG            int
> INFO_SN                 double
> INFO_SVTYPE             string
> INFO_TYPE               string
> INFO_MSILEN             double
> INFO_DUPRATE            double
> INFO_DBSNP_COMMON       int
> INFO_REFBIAS            string
> genotypes               
> array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>>
>  {code}
> You can see that column INFO_ANN is an array of struct and it contains columns 
> whose names have "/" inside, such as "cDNA_pos/cDNA_length", etc. 
> We believe this is the root cause of the following SparkException:
> {code:java}
> scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT 
> INFO_ANN FROM table_2611810")
> 24/01/31 07:50:02.658 [main] WARN  o.a.spark.sql.catalyst.util.package - 
> Truncated the string representation of a plan since it was too large. This 
> behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>,
>  column: INFO_ANN
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
>   at 
> 

[jira] [Updated] (SPARK-46934) Read/write roundtrip for struct type with special characters with HMS

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-46934:
--
Summary: Read/write roundtrip for struct type with special characters with 
HMS   (was: Unable to create Hive View from certain Spark Dataframe StructType)

> Read/write roundtrip for struct type with special characters with HMS 
> --
>
> Key: SPARK-46934
> URL: https://issues.apache.org/jira/browse/SPARK-46934
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
> Environment: Tested in Spark 3.3.0, 3.3.2.
>Reporter: Yu-Ting LIN
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We are trying to create a Hive View using the following SQL command "CREATE OR 
> REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810".
> Our table_2611810 has certain columns that contain special characters such as "/". 
> Here is the schema of this table.
> {code:java}
> contigName              string
> start                   bigint
> end                     bigint
> names                   array
> referenceAllele         string
> alternateAlleles        array
> qual                    double
> filters                 array
> splitFromMultiAllelic    boolean
> INFO_NCAMP              int
> INFO_ODDRATIO           double
> INFO_NM                 double
> INFO_DBSNP_CAF          array
> INFO_SPANPAIR           int
> INFO_TLAMP              int
> INFO_PSTD               double
> INFO_QSTD               double
> INFO_SBF                double
> INFO_AF                 array
> INFO_QUAL               double
> INFO_SHIFT3             int
> INFO_VARBIAS            string
> INFO_HICOV              int
> INFO_PMEAN              double
> INFO_MSI                double
> INFO_VD                 int
> INFO_DP                 int
> INFO_HICNT              int
> INFO_ADJAF              double
> INFO_SVLEN              int
> INFO_RSEQ               string
> INFO_MSigDb             array
> INFO_NMD                array
> INFO_ANN                
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>
> INFO_BIAS               string
> INFO_MQ                 double
> INFO_HIAF               double
> INFO_END                int
> INFO_SPLITREAD          int
> INFO_GDAMP              int
> INFO_LSEQ               string
> INFO_LOF                array
> INFO_SAMPLE             string
> INFO_AMPFLAG            int
> INFO_SN                 double
> INFO_SVTYPE             string
> INFO_TYPE               string
> INFO_MSILEN             double
> INFO_DUPRATE            double
> INFO_DBSNP_COMMON       int
> INFO_REFBIAS            string
> genotypes               
> array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>>
>  {code}
> You can see that column INFO_ANN is an array of struct and it contains columns 
> whose names have "/" inside, such as "cDNA_pos/cDNA_length", etc. 
> We believe this is the root cause of the following SparkException:
> {code:java}
> scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT 
> INFO_ANN FROM table_2611810")
> 24/01/31 07:50:02.658 [main] WARN  o.a.spark.sql.catalyst.util.package - 
> Truncated the string representation of a plan since it was too large. This 
> behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>,
>  column: INFO_ANN
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
>   at 
> 

[jira] [Updated] (SPARK-46934) Unable to create Hive View from certain Spark Dataframe StructType

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-46934:
--
Priority: Major  (was: Blocker)

> Unable to create Hive View from certain Spark Dataframe StructType
> --
>
> Key: SPARK-46934
> URL: https://issues.apache.org/jira/browse/SPARK-46934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.2, 3.3.4
> Environment: Tested in Spark 3.3.0, 3.3.2.
>Reporter: Yu-Ting LIN
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We are trying to create a Hive View using the following SQL command "CREATE OR 
> REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810".
> Our table_2611810 has certain columns that contain special characters such as "/". 
> Here is the schema of this table.
> {code:java}
> contigName              string
> start                   bigint
> end                     bigint
> names                   array
> referenceAllele         string
> alternateAlleles        array
> qual                    double
> filters                 array
> splitFromMultiAllelic    boolean
> INFO_NCAMP              int
> INFO_ODDRATIO           double
> INFO_NM                 double
> INFO_DBSNP_CAF          array
> INFO_SPANPAIR           int
> INFO_TLAMP              int
> INFO_PSTD               double
> INFO_QSTD               double
> INFO_SBF                double
> INFO_AF                 array
> INFO_QUAL               double
> INFO_SHIFT3             int
> INFO_VARBIAS            string
> INFO_HICOV              int
> INFO_PMEAN              double
> INFO_MSI                double
> INFO_VD                 int
> INFO_DP                 int
> INFO_HICNT              int
> INFO_ADJAF              double
> INFO_SVLEN              int
> INFO_RSEQ               string
> INFO_MSigDb             array
> INFO_NMD                array
> INFO_ANN                
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>
> INFO_BIAS               string
> INFO_MQ                 double
> INFO_HIAF               double
> INFO_END                int
> INFO_SPLITREAD          int
> INFO_GDAMP              int
> INFO_LSEQ               string
> INFO_LOF                array
> INFO_SAMPLE             string
> INFO_AMPFLAG            int
> INFO_SN                 double
> INFO_SVTYPE             string
> INFO_TYPE               string
> INFO_MSILEN             double
> INFO_DUPRATE            double
> INFO_DBSNP_COMMON       int
> INFO_REFBIAS            string
> genotypes               
> array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>>
>  {code}
> You can see that column INFO_ANN is an array of struct and that it contains 
> fields whose names include "/", such as "cDNA_pos/cDNA_length". 
> We believe this is the root cause of the following SparkException:
> {code:java}
> scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT 
> INFO_ANN FROM table_2611810")
> 24/01/31 07:50:02.658 [main] WARN  o.a.spark.sql.catalyst.util.package - 
> Truncated the string representation of a plan since it was too large. This 
> behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>,
>  column: INFO_ANN
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
>   at 
> 

[jira] [Updated] (SPARK-46934) Unable to create Hive View from certain Spark Dataframe StructType

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-46934:
--
Affects Version/s: 4.0.0
   (was: 3.3.0)
   (was: 3.3.2)
   (was: 3.3.4)

> Unable to create Hive View from certain Spark Dataframe StructType
> --
>
> Key: SPARK-46934
> URL: https://issues.apache.org/jira/browse/SPARK-46934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
> Environment: Tested in Spark 3.3.0, 3.3.2.
>Reporter: Yu-Ting LIN
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We are trying to create a Hive View using the following SQL command: "CREATE OR 
> REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810".
> Our table_2611810 has certain columns that contain special characters such as "/". 
> Here is the schema of this table.
> {code:java}
> contigName              string
> start                   bigint
> end                     bigint
> names                   array
> referenceAllele         string
> alternateAlleles        array
> qual                    double
> filters                 array
> splitFromMultiAllelic    boolean
> INFO_NCAMP              int
> INFO_ODDRATIO           double
> INFO_NM                 double
> INFO_DBSNP_CAF          array
> INFO_SPANPAIR           int
> INFO_TLAMP              int
> INFO_PSTD               double
> INFO_QSTD               double
> INFO_SBF                double
> INFO_AF                 array
> INFO_QUAL               double
> INFO_SHIFT3             int
> INFO_VARBIAS            string
> INFO_HICOV              int
> INFO_PMEAN              double
> INFO_MSI                double
> INFO_VD                 int
> INFO_DP                 int
> INFO_HICNT              int
> INFO_ADJAF              double
> INFO_SVLEN              int
> INFO_RSEQ               string
> INFO_MSigDb             array
> INFO_NMD                array
> INFO_ANN                
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>
> INFO_BIAS               string
> INFO_MQ                 double
> INFO_HIAF               double
> INFO_END                int
> INFO_SPLITREAD          int
> INFO_GDAMP              int
> INFO_LSEQ               string
> INFO_LOF                array
> INFO_SAMPLE             string
> INFO_AMPFLAG            int
> INFO_SN                 double
> INFO_SVTYPE             string
> INFO_TYPE               string
> INFO_MSILEN             double
> INFO_DUPRATE            double
> INFO_DBSNP_COMMON       int
> INFO_REFBIAS            string
> genotypes               
> array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>>
>  {code}
> You can see that column INFO_ANN is an array of struct and that it contains 
> fields whose names include "/", such as "cDNA_pos/cDNA_length". 
> We believe this is the root cause of the following SparkException:
> {code:java}
> scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT 
> INFO_ANN FROM table_2611810")
> 24/01/31 07:50:02.658 [main] WARN  o.a.spark.sql.catalyst.util.package - 
> Truncated the string representation of a plan since it was too large. This 
> behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>,
>  column: INFO_ANN
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
>   at 
> 

[jira] [Updated] (SPARK-46938) Remove javax-servlet-api exclusion rule for SBT

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-46938:
--
Summary: Remove javax-servlet-api exclusion rule for SBT  (was: Migrate 
jetty 10 to jetty 11)

> Remove javax-servlet-api exclusion rule for SBT
> ---
>
> Key: SPARK-46938
> URL: https://issues.apache.org/jira/browse/SPARK-46938
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: HiuFung Kwok
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46938) Migrate jetty 10 to jetty 11

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-46938.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45194
[https://github.com/apache/spark/pull/45194]

> Migrate jetty 10 to jetty 11
> 
>
> Key: SPARK-46938
> URL: https://issues.apache.org/jira/browse/SPARK-46938
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: HiuFung Kwok
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21595) introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 breaks existing workflow

2024-02-21 Thread Ramakrishna (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819249#comment-17819249
 ] 

Ramakrishna edited comment on SPARK-21595 at 2/21/24 1:34 PM:
--

[~Rakesh_Shah]  How did you manage to solve this? I am getting this in my 
streaming query; it does aggregations similar to the other streaming queries in 
the same job. However, it fails and I get:

{"timestamp":"21/02/2024 
07:11:35","logLevel":"ERROR","class":"MapOutputTracker","thread":"Executor task 
launch worker for task 25.0 in stage 2.1 (TID 75)","message":"Missing an output 
location for shuffle 5 partition 35"}

Can you please help?

[~tejasp]  Can you please help?

My Spark version is 3.4.0.


was (Author: hande):
[~Rakesh_Shah]  How did you manage to solve this ? I am getting this in my 
streaming query, it does aggregations similar to other streaming queries in 
same job. However it fails and I get

 

 

{"timestamp":"21/02/2024 
07:11:35","logLevel":"ERROR","class":"MapOutputTracker","thread":"Executor task 
launch worker for task 25.0 in stage 2.1 (TID 75)","message":"Missing an output 
location for shuffle 5 partition 35"}

 

Can you please help ?

 

[~tejasp]  Can you please help ?

> introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 
> breaks existing workflow
> -
>
> Key: SPARK-21595
> URL: https://issues.apache.org/jira/browse/SPARK-21595
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 2.2.0
> Environment: pyspark on linux
>Reporter: Stephan Reiling
>Assignee: Tejas Patil
>Priority: Minor
>  Labels: documentation, regression
> Fix For: 2.2.1, 2.3.0
>
>
> My pyspark code has the following statement:
> {code:java}
> # assign row key for tracking
> df = df.withColumn(
> 'association_idx',
> sqlf.row_number().over(
> Window.orderBy('uid1', 'uid2')
> )
> )
> {code}
> where df is a long, skinny (450M rows, 10 columns) dataframe. So this creates 
> one large window for the whole dataframe to sort over.
> In Spark 2.1 this works without problems; in Spark 2.2 it fails with either an 
> out-of-memory exception or a too-many-open-files exception, depending on memory 
> settings (which is what I tried adjusting first to fix this).
> Monitoring the blockmgr, I see that spark 2.1 creates 152 files, spark 2.2 
> creates >110,000 files.
> In the log I see the following messages (110,000 of these):
> {noformat}
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of 
> spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 
> 64.1 MB to disk (0  time so far)
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of 
> spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 
> 64.1 MB to disk (1  time so far)
> {noformat}
> So I started hunting for clues in UnsafeExternalSorter, without luck. What I 
> had missed was this one message:
> {noformat}
> 17/08/01 08:55:37 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill 
> threshold of 4096 rows, switching to 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
> {noformat}
> Which allowed me to track down the issue. 
> By changing the configuration to include:
> {code:java}
> spark.sql.windowExec.buffer.spill.threshold   2097152
> {code}
> I got it to work again, with the same performance as Spark 2.1.
> I have workflows where the windowing functions do not fail but take a 
> performance hit due to the excessive spilling when using the default of 4096.
> I think that, to make it easier to track down these issues, this config variable 
> should be included in the configuration documentation. 
> Maybe 4096 is too small of a default value?
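
For illustration, a minimal Scala sketch of applying the workaround described 
above; the threshold value 2097152 comes from the report, while the table and 
column names ("associations", "uid1", "uid2") are hypothetical stand-ins:
{code:java}
// Raise the row-count threshold at which the window operator spills to disk
// (default 4096). This is a runtime SQL conf, so it can be set on an existing
// SparkSession before running the window query.
spark.conf.set("spark.sql.windowExec.buffer.spill.threshold", "2097152")

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Hypothetical long, skinny DataFrame sorted over a single global window,
// mirroring the PySpark snippet in the description.
val df = spark.table("associations")
val withIdx = df.withColumn(
  "association_idx",
  row_number().over(Window.orderBy("uid1", "uid2")))
{code}
Whether raising the threshold is appropriate depends on available executor memory: 
a larger value keeps more rows buffered in memory and reduces the number of spill 
files.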



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21595) introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 breaks existing workflow

2024-02-21 Thread Ramakrishna (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819249#comment-17819249
 ] 

Ramakrishna commented on SPARK-21595:
-

[~Rakesh_Shah]  How did you manage to solve this? I am getting this in my 
streaming query; it does aggregations similar to the other streaming queries in 
the same job. However, it fails and I get:

{"timestamp":"21/02/2024 
07:11:35","logLevel":"ERROR","class":"MapOutputTracker","thread":"Executor task 
launch worker for task 25.0 in stage 2.1 (TID 75)","message":"Missing an output 
location for shuffle 5 partition 35"}

Can you please help?

[~tejasp]  Can you please help?

> introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 
> breaks existing workflow
> -
>
> Key: SPARK-21595
> URL: https://issues.apache.org/jira/browse/SPARK-21595
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 2.2.0
> Environment: pyspark on linux
>Reporter: Stephan Reiling
>Assignee: Tejas Patil
>Priority: Minor
>  Labels: documentation, regression
> Fix For: 2.2.1, 2.3.0
>
>
> My pyspark code has the following statement:
> {code:java}
> # assign row key for tracking
> df = df.withColumn(
> 'association_idx',
> sqlf.row_number().over(
> Window.orderBy('uid1', 'uid2')
> )
> )
> {code}
> where df is a long, skinny (450M rows, 10 columns) dataframe. So this creates 
> one large window for the whole dataframe to sort over.
> In Spark 2.1 this works without problems; in Spark 2.2 it fails with either an 
> out-of-memory exception or a too-many-open-files exception, depending on memory 
> settings (which is what I tried adjusting first to fix this).
> Monitoring the blockmgr, I see that spark 2.1 creates 152 files, spark 2.2 
> creates >110,000 files.
> In the log I see the following messages (110,000 of these):
> {noformat}
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of 
> spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 
> 64.1 MB to disk (0  time so far)
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of 
> spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 
> 64.1 MB to disk (1  time so far)
> {noformat}
> So I started hunting for clues in UnsafeExternalSorter, without luck. What I 
> had missed was this one message:
> {noformat}
> 17/08/01 08:55:37 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill 
> threshold of 4096 rows, switching to 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
> {noformat}
> Which allowed me to track down the issue. 
> By changing the configuration to include:
> {code:java}
> spark.sql.windowExec.buffer.spill.threshold   2097152
> {code}
> I got it to work again, with the same performance as Spark 2.1.
> I have workflows where the windowing functions do not fail but take a 
> performance hit due to the excessive spilling when using the default of 4096.
> I think that, to make it easier to track down these issues, this config variable 
> should be included in the configuration documentation. 
> Maybe 4096 is too small of a default value?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43256) Assign a name to the error class _LEGACY_ERROR_TEMP_2021

2024-02-21 Thread A G (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819237#comment-17819237
 ] 

A G commented on SPARK-43256:
-

I want to work on this. PR: [https://github.com/apache/spark/pull/45198]

> Assign a name to the error class _LEGACY_ERROR_TEMP_2021
> 
>
> Key: SPARK-43256
> URL: https://issues.apache.org/jira/browse/SPARK-43256
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: pull-request-available, starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2021* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't 
> already exist. Check the exception fields by using {*}checkError(){*}. That 
> function checks only the valuable error fields and avoids depending on the error 
> text message; this way, tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate other 
> tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is not 
> clear. Propose a solution so that users can avoid and fix such errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]
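
As a rough sketch of the testing pattern described above (the error class name, 
the triggering query, and the parameter map are placeholders rather than the 
actual values for _LEGACY_ERROR_TEMP_2021):
{code:java}
// Inside a suite that extends QueryTest with SharedSparkSession:
test("the renamed error class is raised with the expected parameters") {
  checkError(
    // Trigger the error from user code; this query is a placeholder.
    exception = intercept[org.apache.spark.SparkException] {
      sql("SELECT some_failing_expression(col) FROM tbl").collect()
    },
    // The new, descriptive name chosen for _LEGACY_ERROR_TEMP_2021.
    errorClass = "NEW_ERROR_CLASS_NAME",
    // Only the message parameters are verified, not the full message text.
    parameters = Map("param" -> "value"))
}
{code}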



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43256) Assign a name to the error class _LEGACY_ERROR_TEMP_2021

2024-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-43256:
---
Labels: pull-request-available starter  (was: starter)

> Assign a name to the error class _LEGACY_ERROR_TEMP_2021
> 
>
> Key: SPARK-43256
> URL: https://issues.apache.org/jira/browse/SPARK-43256
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: pull-request-available, starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2021* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't 
> already exist. Check the exception fields by using {*}checkError(){*}. That 
> function checks only the valuable error fields and avoids depending on the error 
> text message; this way, tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate other 
> tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is not 
> clear. Propose a solution so that users can avoid and fix such errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46934) Unable to create Hive View from certain Spark Dataframe StructType

2024-02-21 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819232#comment-17819232
 ] 

Kent Yao commented on SPARK-46934:
--

Hi [~dongjoon], this is not a regression.

As I commented above, I tried Hive 2.3.9, and I don't think Hive DDL supports 
'/' in struct field names.
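
For reference, a minimal sketch (untested) of how the report can be reproduced 
from a Scala shell; the struct content is simplified, and whether the initial 
saveAsTable succeeds may depend on the catalog configuration:
{code:java}
// A struct field whose name contains '/', similar to "cDNA_pos/cDNA_length".
val df = spark.range(1)
  .selectExpr("array(named_struct('cDNA_pos/cDNA_length', id)) AS INFO_ANN")

// Datasource tables can keep such a schema in table properties, so the write
// itself is expected to go through (assumption).
df.write.format("parquet").saveAsTable("table_2611810")

// A Hive view needs a Hive-compatible type string, which cannot represent '/'
// in a struct field name, so this should fail as reported above.
spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810")
{code}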

> Unable to create Hive View from certain Spark Dataframe StructType
> --
>
> Key: SPARK-46934
> URL: https://issues.apache.org/jira/browse/SPARK-46934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.2, 3.3.4
> Environment: Tested in Spark 3.3.0, 3.3.2.
>Reporter: Yu-Ting LIN
>Assignee: Kent Yao
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We are trying to create a Hive View using the following SQL command: "CREATE OR 
> REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810".
> Our table_2611810 has certain columns that contain special characters such as "/". 
> Here is the schema of this table.
> {code:java}
> contigName              string
> start                   bigint
> end                     bigint
> names                   array
> referenceAllele         string
> alternateAlleles        array
> qual                    double
> filters                 array
> splitFromMultiAllelic    boolean
> INFO_NCAMP              int
> INFO_ODDRATIO           double
> INFO_NM                 double
> INFO_DBSNP_CAF          array
> INFO_SPANPAIR           int
> INFO_TLAMP              int
> INFO_PSTD               double
> INFO_QSTD               double
> INFO_SBF                double
> INFO_AF                 array
> INFO_QUAL               double
> INFO_SHIFT3             int
> INFO_VARBIAS            string
> INFO_HICOV              int
> INFO_PMEAN              double
> INFO_MSI                double
> INFO_VD                 int
> INFO_DP                 int
> INFO_HICNT              int
> INFO_ADJAF              double
> INFO_SVLEN              int
> INFO_RSEQ               string
> INFO_MSigDb             array
> INFO_NMD                array
> INFO_ANN                
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>
> INFO_BIAS               string
> INFO_MQ                 double
> INFO_HIAF               double
> INFO_END                int
> INFO_SPLITREAD          int
> INFO_GDAMP              int
> INFO_LSEQ               string
> INFO_LOF                array
> INFO_SAMPLE             string
> INFO_AMPFLAG            int
> INFO_SN                 double
> INFO_SVTYPE             string
> INFO_TYPE               string
> INFO_MSILEN             double
> INFO_DUPRATE            double
> INFO_DBSNP_COMMON       int
> INFO_REFBIAS            string
> genotypes               
> array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>>
>  {code}
> You can see that column INFO_ANN is an array of struct and that it contains 
> fields whose names include "/", such as "cDNA_pos/cDNA_length". 
> We believe this is the root cause of the following SparkException:
> {code:java}
> scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT 
> INFO_ANN FROM table_2611810")
> 24/01/31 07:50:02.658 [main] WARN  o.a.spark.sql.catalyst.util.package - 
> Truncated the string representation of a plan since it was too large. This 
> behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>,
>  column: INFO_ANN
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at 

[jira] [Created] (SPARK-47117) Format the whole code base

2024-02-21 Thread Jie Han (Jira)
Jie Han created SPARK-47117:
---

 Summary: Format the whole code base
 Key: SPARK-47117
 URL: https://issues.apache.org/jira/browse/SPARK-47117
 Project: Spark
  Issue Type: Wish
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Jie Han


I tried scalafmt and found that it produces a very large diff :(. Do we need to 
format the whole code base in a single PR?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46938) Migrate jetty 10 to jetty 11

2024-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46938:
--

Assignee: Apache Spark  (was: HiuFung Kwok)

> Migrate jetty 10 to jetty 11
> 
>
> Key: SPARK-46938
> URL: https://issues.apache.org/jira/browse/SPARK-46938
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47101) HiveExternalCatalog.verifyDataSchema does not fully comply with hive column name rules

2024-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47101:
--

Assignee: (was: Apache Spark)

> HiveExternalCatalog.verifyDataSchema does not fully comply with hive column 
> name rules
> --
>
> Key: SPARK-47101
> URL: https://issues.apache.org/jira/browse/SPARK-47101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47101) HiveExternalCatalog.verifyDataSchema does not fully comply with hive column name rules

2024-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47101:
--

Assignee: Apache Spark

> HiveExternalCatalog.verifyDataSchema does not fully comply with hive column 
> name rules
> --
>
> Key: SPARK-47101
> URL: https://issues.apache.org/jira/browse/SPARK-47101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47116) Install proper Python version in SparkR Windows build to avoid warnings

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47116.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45196
[https://github.com/apache/spark/pull/45196]

> Install proper Python version in SparkR Windows build to avoid warnings
> ---
>
> Key: SPARK-47116
> URL: https://issues.apache.org/jira/browse/SPARK-47116
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> https://github.com/HyukjinKwon/spark/actions/runs/7985005685/job/21802732830
> {code}
> Traceback (most recent call last):
>   File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\runpy.py", line 183, 
> in _run_module_as_main
> mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
>   File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\runpy.py", line 109, 
> in _get_module_details
> __import__(pkg_name)
>   File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\__init__.py", line 
> [53](https://github.com/HyukjinKwon/spark/actions/runs/7985005685/job/21802732830#step:10:54),
>  in 
>   File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\rdd.py", line 
> [54](https://github.com/HyukjinKwon/spark/actions/runs/7985005685/job/21802732830#step:10:55),
>  in 
>   File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\java_gateway.py", 
> line 33, in 
>   File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 
> 69, in 
>   File 
> "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\cloudpickle\__init__.py", 
> line 1, in 
>   File 
> "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\cloudpickle\cloudpickle.py", 
> line 80, in 
> ImportError: cannot import name 'CellType' from 'types' 
> (C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\types.py)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47116) Install proper Python version in SparkR Windows build to avoid warnings

2024-02-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47116:
-

Assignee: Hyukjin Kwon

> Install proper Python version in SparkR Windows build to avoid warnings
> ---
>
> Key: SPARK-47116
> URL: https://issues.apache.org/jira/browse/SPARK-47116
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/HyukjinKwon/spark/actions/runs/7985005685/job/21802732830
> {code}
> Traceback (most recent call last):
>   File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\runpy.py", line 183, 
> in _run_module_as_main
> mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
>   File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\runpy.py", line 109, 
> in _get_module_details
> __import__(pkg_name)
>   File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\__init__.py", line 
> [53](https://github.com/HyukjinKwon/spark/actions/runs/7985005685/job/21802732830#step:10:54),
>  in 
>   File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\rdd.py", line 
> [54](https://github.com/HyukjinKwon/spark/actions/runs/7985005685/job/21802732830#step:10:55),
>  in 
>   File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\java_gateway.py", 
> line 33, in 
>   File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 
> 69, in 
>   File 
> "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\cloudpickle\__init__.py", 
> line 1, in 
>   File 
> "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\cloudpickle\cloudpickle.py", 
> line 80, in 
> ImportError: cannot import name 'CellType' from 'types' 
> (C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\types.py)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47114) In the spark driver pod. Failed to access the krb5 file

2024-02-21 Thread melin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

melin updated SPARK-47114:
--
Description: 
spark runs in kubernetes and accesses an external hdfs cluster (kerberos),pod 
error logs
{code:java}
Caused by: java.lang.IllegalArgumentException: KrbException: krb5.conf loading 
failed{code}
This error generally occurs when the krb5 file cannot be found

[~yao] [~Qin Yao] 
{code:java}
./bin/spark-submit \
    --master k8s://https://172.18.5.44:6443 \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=1 \
    --conf spark.kubernetes.submission.waitAppCompletion=true \
    --conf spark.kubernetes.driver.pod.name=spark-xxx \
    --conf spark.kubernetes.executor.podNamePrefix=spark-executor-xxx \
    --conf spark.kubernetes.driver.label.profile=production \
    --conf spark.kubernetes.executor.label.profile=production \
    --conf spark.kubernetes.namespace=superior \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf 
spark.kubernetes.container.image=registry.cn-hangzhou.aliyuncs.com/melin1204/spark-jobserver:3.4.0
 \
    --conf 
spark.kubernetes.file.upload.path=hdfs://cdh1:8020/user/superior/kubernetes/ \
    --conf spark.kubernetes.container.image.pullPolicy=Always \
    --conf spark.kubernetes.container.image.pullSecrets=docker-reg-demos \
    --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf  \
    --conf spark.kerberos.principal=superior/ad...@datacyber.com  \
    --conf spark.kerberos.keytab=/root/superior.keytab  \
    
file:///root/spark-3.4.2-bin-hadoop3/examples/jars/spark-examples_2.12-3.4.2.jar
  5{code}
{code:java}
(base) [root@cdh1 ~]# kubectl logs spark-xxx -n superior
Exception in thread "main" java.lang.IllegalArgumentException: Can't get 
Kerberos realm
        at 
org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:71)
        at 
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:315)
        at 
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:300)
        at 
org.apache.hadoop.security.UserGroupInformation.isAuthenticationMethodEnabled(UserGroupInformation.java:395)
        at 
org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:389)
        at 
org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:1119)
        at 
org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:385)
        at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
        at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
        at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IllegalArgumentException: KrbException: krb5.conf loading 
failed
        at 
java.security.jgss/javax.security.auth.kerberos.KerberosPrincipal.(Unknown
 Source)
        at 
org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:120)
        at 
org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:69)
        ... 13 more
(base) [root@cdh1 ~]# kubectl describe pod spark-xxx -n superior
Name:             spark-xxx
Namespace:        superior
Priority:         0
Service Account:  spark
Node:             cdh2/172.18.5.45
Start Time:       Wed, 21 Feb 2024 15:48:08 +0800
Labels:           profile=production
                  spark-app-name=spark-pi
                  spark-app-selector=spark-728e24e49f9040fa86b04c521463020b
                  spark-role=driver
                  spark-version=3.4.2
Annotations:      
Status:           Failed
IP:               10.244.1.4
IPs:
  IP:  10.244.1.4
Containers:
  spark-kubernetes-driver:
    Container ID:  
containerd://cceaf13b70cc5f21a639e71cb8663989ec73e122380844624d4bfac3946bae15
    Image:         spark:3.4.1
    Image ID:      
docker.io/library/spark@sha256:69fb485a0bcad88f9a2bf066e1b5d555f818126dc9df5a0b7e6a3b6d364bc694
    Ports:         7078/TCP, 7079/TCP, 4040/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      driver
      --properties-file
      /opt/spark/conf/spark.properties
      --class
      org.apache.spark.examples.SparkPi
      spark-internal
      5
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 21 Feb 2024 15:49:54 +0800
      Finished:     Wed, 21 Feb 2024 15:49:56