[jira] [Resolved] (SPARK-46809) Check error message parameter properly
[ https://issues.apache.org/jira/browse/SPARK-46809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee resolved SPARK-46809.
---------------------------------
    Resolution: Not A Bug

Seems to be working fine

> Check error message parameter properly
> ---------------------------------------
>
> Key: SPARK-46809
> URL: https://issues.apache.org/jira/browse/SPARK-46809
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Priority: Major
>
> If an error message parameter from the template is missing in actual usage,
> or its name is different, an exception should be raised, but currently it is
> not. We should handle this properly.
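The intended check is easy to sketch. Below is a hedged illustration, not PySpark's actual implementation: the `<name>` placeholder syntax and the helper name are assumptions, and the sketch is written in Scala to match the other code in this digest.

{code:java}
// Illustrative sketch only: compare the placeholders found in an error
// message template against the parameters actually supplied, and fail
// loudly on any mismatch instead of formatting silently.
object MessageParamCheck {
  private val Placeholder = "<([a-zA-Z0-9_]+)>".r

  def check(template: String, params: Map[String, String]): Unit = {
    val expected = Placeholder.findAllMatchIn(template).map(_.group(1)).toSet
    val supplied = params.keySet
    val missing = expected -- supplied      // declared in template, not supplied
    val unexpected = supplied -- expected   // supplied, but not in template
    require(missing.isEmpty && unexpected.isEmpty,
      s"Message parameter mismatch: missing=$missing, unexpected=$unexpected")
  }
}
{code}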
[jira] [Resolved] (SPARK-47127) Update `SKIP_SPARK_RELEASE_VERSIONS` in Maven CIs
[ https://issues.apache.org/jira/browse/SPARK-47127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47127.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45212
[https://github.com/apache/spark/pull/45212]

> Update `SKIP_SPARK_RELEASE_VERSIONS` in Maven CIs
> -------------------------------------------------
>
> Key: SPARK-47127
> URL: https://issues.apache.org/jira/browse/SPARK-47127
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> We need to skip the newly released Apache Spark 3.5.1 and drop 3.3.4, which
> has been removed.
[jira] [Updated] (SPARK-44914) Upgrade Ivy to 2.5.2
[ https://issues.apache.org/jira/browse/SPARK-44914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-44914:
----------------------------------
    Summary: Upgrade Ivy to 2.5.2  (was: Upgrade Apache Ivy to 2.5.2)

> Upgrade Ivy to 2.5.2
> --------------------
>
> Key: SPARK-44914
> URL: https://issues.apache.org/jira/browse/SPARK-44914
> Project: Spark
> Issue Type: Sub-task
> Components: Build
> Affects Versions: 3.5.0, 4.0.0
> Reporter: Bjørn Jørgensen
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
>
> [CVE-2022-46751|https://www.cve.org/CVERecord?id=CVE-2022-46751]
[jira] [Updated] (SPARK-44914) Upgrade Apache Ivy to 2.5.2
[ https://issues.apache.org/jira/browse/SPARK-44914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-44914:
----------------------------------
    Summary: Upgrade Apache Ivy to 2.5.2  (was: Upgrade Apache ivy to 2.5.2)

> Upgrade Apache Ivy to 2.5.2
> ---------------------------
>
> Key: SPARK-44914
> URL: https://issues.apache.org/jira/browse/SPARK-44914
> Project: Spark
> Issue Type: Sub-task
> Components: Build
> Affects Versions: 3.5.0, 4.0.0
> Reporter: Bjørn Jørgensen
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
>
> [CVE-2022-46751|https://www.cve.org/CVERecord?id=CVE-2022-46751]
[jira] [Created] (SPARK-47126) Re-enable Spark 3.4 test in HiveExternalCatalogVersionsSuite
Dongjoon Hyun created SPARK-47126:
----------------------------------

Summary: Re-enable Spark 3.4 test in HiveExternalCatalogVersionsSuite
Key: SPARK-47126
URL: https://issues.apache.org/jira/browse/SPARK-47126
Project: Spark
Issue Type: Sub-task
Components: SQL, Tests
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun

HiveExternalCatalogVersionsSuite requires SPARK-46400.
[jira] [Updated] (SPARK-47125) Return null if Univocity never triggers parsing
[ https://issues.apache.org/jira/browse/SPARK-47125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-47125:
---------------------------------
    Fix Version/s: 3.5.2
                   (was: 3.5.3)

> Return null if Univocity never triggers parsing
> ------------------------------------------------
>
> Key: SPARK-47125
> URL: https://issues.apache.org/jira/browse/SPARK-47125
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.2, 4.0.0, 3.5.1
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.3
>
> See the linked PR
[jira] [Updated] (SPARK-47125) Return null if Univocity never triggers parsing
[ https://issues.apache.org/jira/browse/SPARK-47125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-47125:
---------------------------------
    Fix Version/s: 3.5.3
                   3.4.3

> Return null if Univocity never triggers parsing
> ------------------------------------------------
>
> Key: SPARK-47125
> URL: https://issues.apache.org/jira/browse/SPARK-47125
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.2, 4.0.0, 3.5.1
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0, 3.4.3, 3.5.3
>
> See the linked PR
[jira] [Updated] (SPARK-47117) Format the whole code base
[ https://issues.apache.org/jira/browse/SPARK-47117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao updated SPARK-47117:
-----------------------------
    Priority: Trivial  (was: Major)

> Format the whole code base
> ---------------------------
>
> Key: SPARK-47117
> URL: https://issues.apache.org/jira/browse/SPARK-47117
> Project: Spark
> Issue Type: Wish
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Jie Han
> Priority: Trivial
>
> I tried scalafmt and found that it produces a very large diff :(. Do we need
> to format the whole code base in one PR?
[jira] [Assigned] (SPARK-47125) Return null if Univocity never triggers parsing
[ https://issues.apache.org/jira/browse/SPARK-47125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-47125:
------------------------------------
    Assignee: Hyukjin Kwon

> Return null if Univocity never triggers parsing
> ------------------------------------------------
>
> Key: SPARK-47125
> URL: https://issues.apache.org/jira/browse/SPARK-47125
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.2, 4.0.0, 3.5.1
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Minor
> Labels: pull-request-available
>
> See the linked PR
[jira] [Resolved] (SPARK-47125) Return null if Univocity never triggers parsing
[ https://issues.apache.org/jira/browse/SPARK-47125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-47125.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45210
[https://github.com/apache/spark/pull/45210]

> Return null if Univocity never triggers parsing
> ------------------------------------------------
>
> Key: SPARK-47125
> URL: https://issues.apache.org/jira/browse/SPARK-47125
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.2, 4.0.0, 3.5.1
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
>
> See the linked PR
[jira] [Updated] (SPARK-34631) Caught Hive MetaException when query by partition (partition col start with ‘$’)
[ https://issues.apache.org/jira/browse/SPARK-34631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao updated SPARK-34631:
-----------------------------
    Description:

Create a table, set its location to Parquet data, and run msck repair table to load the partitions. But when querying with the partition column (whose name starts with '$'), some errors occur (adding backticks does not help):

{code:java}
select count from some_table where `$partition_date` = '2015-01-01'
{code}

{panel:title=error:}
java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:679)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:677)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:677)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:962)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:174)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:166)
at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:192)
at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:192)
at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2470)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:191)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
at org.apache.spark.sql.Dataset.show(Dataset.scala:751)
at org.apache.spark.sql.Dataset.show(Dataset.scala:710)
at org.apache.spark.sql.Dataset.show(Dataset.scala:719)
... 49 elided
Caused by: java.lang.reflect.InvocationTargetException: org.apache.hadoop.hive.metastore.api.MetaException: Error parsing partition filter : line 1:0 no
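As the exception text itself suggests, one hedged workaround sketch is to disable metastore partition management so Spark does not push the `$`-prefixed partition filter down to Hive, at the documented cost of degraded performance. The table and column names below are the reporter's examples, and setting the config at session-build time is an assumption.

{code:java}
// Workaround sketch taken from the hint in the error message above; not a fix.
val spark = org.apache.spark.sql.SparkSession.builder()
  .enableHiveSupport()
  // Stop Spark from pushing partition filters to the Hive metastore.
  .config("spark.sql.hive.manageFilesourcePartitions", "false")
  .getOrCreate()

spark.sql("select count(*) from some_table where `$partition_date` = '2015-01-01'").show()
{code}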
[jira] [Resolved] (SPARK-47101) HiveExternalCatalog.verifyDataSchema does not fully comply with hive column name rules
[ https://issues.apache.org/jira/browse/SPARK-47101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47101.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45180
[https://github.com/apache/spark/pull/45180]

> HiveExternalCatalog.verifyDataSchema does not fully comply with hive column name rules
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-47101
> URL: https://issues.apache.org/jira/browse/SPARK-47101
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Assigned] (SPARK-47101) HiveExternalCatalog.verifyDataSchema does not fully comply with hive column name rules
[ https://issues.apache.org/jira/browse/SPARK-47101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-47101:
-------------------------------------
    Assignee: Kent Yao

> HiveExternalCatalog.verifyDataSchema does not fully comply with hive column name rules
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-47101
> URL: https://issues.apache.org/jira/browse/SPARK-47101
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (SPARK-47036) RocksDB versionID Mismatch in SST files with Compaction
[ https://issues.apache.org/jira/browse/SPARK-47036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim updated SPARK-47036:
---------------------------------
    Fix Version/s: 3.5.2

> RocksDB versionID Mismatch in SST files with Compaction
> --------------------------------------------------------
>
> Key: SPARK-47036
> URL: https://issues.apache.org/jira/browse/SPARK-47036
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.5.0, 4.0.0, 3.5.1, 3.5.2
> Reporter: Bhuwan Sahni
> Assignee: Bhuwan Sahni
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
> RocksDB compaction can result in version Id mismatch errors if the same
> version is committed twice from the same executor. (Multiple commits can
> happen due to Spark Stage/task retry.)
> A particular scenario where this can happen is provided below:
> 1. Version V1 is loaded on executor A; the RocksDB State Store has 195.sst,
> 196.sst, 197.sst and 198.sst files.
> 2. State changes are made, which result in creation of a new table file
> 200.sst.
> 3. The state store is committed as version V2. The SST file 200.sst (as
> 000200-8c80161a-bc23-4e3b-b175-cffe38e427c7.sst) is uploaded to DFS, and the
> previous 4 files are reused. A new metadata file is created to track the
> exact SST files with unique IDs, and uploaded with the RocksDB Manifest as
> part of V1.zip.
> 4. RocksDB compaction is triggered at the same time. The compaction creates
> a new L1 file (201.sst) and deletes the existing 5 SST files.
> 5. The Spark Stage is retried.
> 6. Version V1 is reloaded on the same executor. The local files are
> inspected, and 201.sst is deleted. The 4 SST files in version V1 are
> downloaded again to the local file system.
> 7. Any local files which are deleted (as part of the version load) are also
> removed from local → DFS file upload tracking. **However, the files already
> deleted as a result of compaction are not removed from tracking. This is the
> bug which resulted in the failure.**
> 8. The state store is committed as version V1. However, the local mapping of
> SST files to DFS file paths still has 200.sst in its tracking, hence the SST
> file is not re-uploaded. A new metadata file is created to track the exact
> SST files with unique IDs, and uploaded with the new RocksDB Manifest as
> part of V2.zip. (The V2.zip file is overwritten here atomically.)
> 9. A new executor tries to load version V2. However, the SST files in (1)
> are now incompatible with the Manifest file in (6), resulting in the version
> Id mismatch failure.
>
> We need to ensure that any files deleted from the local filesystem post
> compaction are not tracked in the uploadedDFSFiles mapping if the same
> version is loaded again.
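A hedged sketch of the invariant described in the last paragraph above. The names `reloadVersion` and `uploadedDfsFiles` are illustrative, not the actual RocksDB file-manager API in Spark:

{code:java}
import scala.collection.mutable

// Sketch: when a version is reloaded after a task retry, every SST file that
// no longer exists locally (whether deleted during the reload itself or
// earlier by a background compaction) must also be dropped from the
// local -> DFS upload-tracking map, so that a re-commit re-uploads it
// instead of pointing the new metadata at a stale DFS entry.
def reloadVersion(
    localFilesAfterLoad: Set[String],
    uploadedDfsFiles: mutable.Map[String, String]): Unit = {
  val stale = uploadedDfsFiles.keySet.toSeq.filterNot(localFilesAfterLoad.contains)
  stale.foreach(uploadedDfsFiles.remove) // e.g. 200.sst deleted by compaction
}
{code}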
[jira] [Assigned] (SPARK-47124) Skip scheduled SparkR on Windows in fork repositories by default
[ https://issues.apache.org/jira/browse/SPARK-47124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-47124:
-------------------------------------
    Assignee: Hyukjin Kwon

> Skip scheduled SparkR on Windows in fork repositories by default
> -----------------------------------------------------------------
>
> Key: SPARK-47124
> URL: https://issues.apache.org/jira/browse/SPARK-47124
> Project: Spark
> Issue Type: Test
> Components: Project Infra, R
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Minor
> Labels: pull-request-available
>
> To be consistent with other scheduled builds.
[jira] [Resolved] (SPARK-47124) Skip scheduled SparkR on Windows in fork repositories by default
[ https://issues.apache.org/jira/browse/SPARK-47124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47124.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45208
[https://github.com/apache/spark/pull/45208]

> Skip scheduled SparkR on Windows in fork repositories by default
> -----------------------------------------------------------------
>
> Key: SPARK-47124
> URL: https://issues.apache.org/jira/browse/SPARK-47124
> Project: Spark
> Issue Type: Test
> Components: Project Infra, R
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
>
> To be consistent with other scheduled builds.
[jira] [Updated] (SPARK-47125) Return null if Univocity never triggers parsing
[ https://issues.apache.org/jira/browse/SPARK-47125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47125:
-----------------------------------
    Labels: pull-request-available  (was: )

> Return null if Univocity never triggers parsing
> ------------------------------------------------
>
> Key: SPARK-47125
> URL: https://issues.apache.org/jira/browse/SPARK-47125
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.2, 4.0.0, 3.5.1
> Reporter: Hyukjin Kwon
> Priority: Minor
> Labels: pull-request-available
>
> See the linked PR
[jira] [Created] (SPARK-47125) Return null if Univocity never triggers parsing
Hyukjin Kwon created SPARK-47125:
---------------------------------

Summary: Return null if Univocity never triggers parsing
Key: SPARK-47125
URL: https://issues.apache.org/jira/browse/SPARK-47125
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.4.2, 4.0.0, 3.5.1
Reporter: Hyukjin Kwon

See the linked PR
[jira] [Updated] (SPARK-47123) JDBCRDD does not correctly handle errors in getQueryOutputSchema
[ https://issues.apache.org/jira/browse/SPARK-47123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47123:
-----------------------------------
    Labels: pull-request-available  (was: )

> JDBCRDD does not correctly handle errors in getQueryOutputSchema
> -----------------------------------------------------------------
>
> Key: SPARK-47123
> URL: https://issues.apache.org/jira/browse/SPARK-47123
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.0, 4.0.0
> Reporter: Pablo Langa Blanco
> Priority: Minor
> Labels: pull-request-available
>
> If there is an error executing statement.executeQuery(), it's possible that
> another error in one of the finally statements hides the main error.
> {code:java}
> def getQueryOutputSchema(
>     query: String, options: JDBCOptions, dialect: JdbcDialect): StructType = {
>   val conn: Connection = dialect.createConnectionFactory(options)(-1)
>   try {
>     val statement = conn.prepareStatement(query)
>     try {
>       statement.setQueryTimeout(options.queryTimeout)
>       val rs = statement.executeQuery()
>       try {
>         JdbcUtils.getSchema(rs, dialect, alwaysNullable = true,
>           isTimestampNTZ = options.preferTimestampNTZ)
>       } finally {
>         rs.close()
>       }
>     } finally {
>       statement.close()
>     }
>   } finally {
>     conn.close()
>   }
> }
> {code}
[jira] [Updated] (SPARK-47124) Skip scheduled SparkR on Windows in fork repositories by default
[ https://issues.apache.org/jira/browse/SPARK-47124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47124:
-----------------------------------
    Labels: pull-request-available  (was: )

> Skip scheduled SparkR on Windows in fork repositories by default
> -----------------------------------------------------------------
>
> Key: SPARK-47124
> URL: https://issues.apache.org/jira/browse/SPARK-47124
> Project: Spark
> Issue Type: Test
> Components: Project Infra, R
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Priority: Minor
> Labels: pull-request-available
>
> To be consistent with other scheduled builds.
[jira] [Updated] (SPARK-47124) Skip scheduled SparkR on Windows in fork repositories by default
[ https://issues.apache.org/jira/browse/SPARK-47124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-47124:
---------------------------------
    Priority: Minor  (was: Major)

> Skip scheduled SparkR on Windows in fork repositories by default
> -----------------------------------------------------------------
>
> Key: SPARK-47124
> URL: https://issues.apache.org/jira/browse/SPARK-47124
> Project: Spark
> Issue Type: Test
> Components: Project Infra, R
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Priority: Minor
>
> To be consistent with other scheduled builds.
[jira] [Updated] (SPARK-47124) Skip scheduled SparkR on Windows in fork repositories by default
[ https://issues.apache.org/jira/browse/SPARK-47124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-47124:
---------------------------------
    Summary: Skip scheduled SparkR on Windows in fork repositories by default  (was: Skip SparkR on Windows in fork repositories)

> Skip scheduled SparkR on Windows in fork repositories by default
> -----------------------------------------------------------------
>
> Key: SPARK-47124
> URL: https://issues.apache.org/jira/browse/SPARK-47124
> Project: Spark
> Issue Type: Test
> Components: Project Infra, R
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> To be consistent with other scheduled builds.
[jira] [Created] (SPARK-47124) Skip SparkR on Windows in fork repositories
Hyukjin Kwon created SPARK-47124:
---------------------------------

Summary: Skip SparkR on Windows in fork repositories
Key: SPARK-47124
URL: https://issues.apache.org/jira/browse/SPARK-47124
Project: Spark
Issue Type: Test
Components: Project Infra, R
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon

To be consistent with other scheduled builds.
[jira] [Updated] (SPARK-31745) Enable Hive related test cases of SparkR on Windows
[ https://issues.apache.org/jira/browse/SPARK-31745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-31745:
-----------------------------------
    Labels: pull-request-available  (was: )

> Enable Hive related test cases of SparkR on Windows
> ----------------------------------------------------
>
> Key: SPARK-31745
> URL: https://issues.apache.org/jira/browse/SPARK-31745
> Project: Spark
> Issue Type: Test
> Components: SparkR, Tests
> Affects Versions: 3.1.0
> Reporter: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
>
> Hive related tests appear to be skipped in AppVeyor:
> {code}
> test_sparkSQL.R:307: skip: create DataFrame from RDD
> Reason: Hive is not build with SparkSQL, skipped
> test_sparkSQL.R:1341: skip: test HiveContext
> Reason: Hive is not build with SparkSQL, skipped
> test_sparkSQL.R:2813: skip: read/write ORC files
> Reason: Hive is not build with SparkSQL, skipped
> test_sparkSQL.R:2834: skip: read/write ORC files - compression option
> Reason: Hive is not build with SparkSQL, skipped
> test_sparkSQL.R:3727: skip: enableHiveSupport on SparkSession
> Reason: Hive is not build with SparkSQL, skipped
> {code}
[jira] [Updated] (SPARK-31745) Enable Hive related test cases of SparkR on Windows
[ https://issues.apache.org/jira/browse/SPARK-31745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-31745:
---------------------------------
    Summary: Enable Hive related test cases of SparkR on Windows  (was: Enable Hive related test cases of SparkR in AppVeyor)

> Enable Hive related test cases of SparkR on Windows
> ----------------------------------------------------
>
> Key: SPARK-31745
> URL: https://issues.apache.org/jira/browse/SPARK-31745
> Project: Spark
> Issue Type: Test
> Components: SparkR, Tests
> Affects Versions: 3.1.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> Hive related tests appear to be skipped in AppVeyor:
> {code}
> test_sparkSQL.R:307: skip: create DataFrame from RDD
> Reason: Hive is not build with SparkSQL, skipped
> test_sparkSQL.R:1341: skip: test HiveContext
> Reason: Hive is not build with SparkSQL, skipped
> test_sparkSQL.R:2813: skip: read/write ORC files
> Reason: Hive is not build with SparkSQL, skipped
> test_sparkSQL.R:2834: skip: read/write ORC files - compression option
> Reason: Hive is not build with SparkSQL, skipped
> test_sparkSQL.R:3727: skip: enableHiveSupport on SparkSession
> Reason: Hive is not build with SparkSQL, skipped
> {code}
[jira] [Created] (SPARK-47123) JDBCRDD does not correctly handle errors in getQueryOutputSchema
Pablo Langa Blanco created SPARK-47123:
---------------------------------------

Summary: JDBCRDD does not correctly handle errors in getQueryOutputSchema
Key: SPARK-47123
URL: https://issues.apache.org/jira/browse/SPARK-47123
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.5.0, 4.0.0
Reporter: Pablo Langa Blanco

If there is an error executing statement.executeQuery(), it's possible that another error in one of the finally statements hides the main error.

{code:java}
def getQueryOutputSchema(
    query: String, options: JDBCOptions, dialect: JdbcDialect): StructType = {
  val conn: Connection = dialect.createConnectionFactory(options)(-1)
  try {
    val statement = conn.prepareStatement(query)
    try {
      statement.setQueryTimeout(options.queryTimeout)
      val rs = statement.executeQuery()
      try {
        JdbcUtils.getSchema(rs, dialect, alwaysNullable = true,
          isTimestampNTZ = options.preferTimestampNTZ)
      } finally {
        rs.close()
      }
    } finally {
      statement.close()
    }
  } finally {
    conn.close()
  }
}
{code}
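One common way to avoid this masking, sketched below as an illustration rather than the actual patch, is to attach any failure from close() to the primary exception as a suppressed exception instead of letting the finally block replace it:

{code:java}
// Illustrative helper: run `body` against a resource; if both `body` and
// `close()` fail, keep the primary error and attach the close() failure as
// a suppressed exception (still visible in the stack trace) rather than
// letting it mask the original executeQuery() error.
def withCloseable[R <: AutoCloseable, T](resource: R)(body: R => T): T = {
  var primary: Throwable = null
  try {
    body(resource)
  } catch {
    case t: Throwable =>
      primary = t
      throw t
  } finally {
    try {
      resource.close()
    } catch {
      case t: Throwable =>
        if (primary != null) primary.addSuppressed(t) else throw t
    }
  }
}
{code}

Nesting three such calls (connection, statement, result set) preserves the first failure end to end; java.sql.Connection, PreparedStatement and ResultSet are all AutoCloseable, so the sketch applies directly.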
[jira] [Commented] (SPARK-47123) JDBCRDD does not correctly handle errors in getQueryOutputSchema
[ https://issues.apache.org/jira/browse/SPARK-47123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819413#comment-17819413 ]

Pablo Langa Blanco commented on SPARK-47123:
--------------------------------------------

I'm working on it

> JDBCRDD does not correctly handle errors in getQueryOutputSchema
> -----------------------------------------------------------------
>
> Key: SPARK-47123
> URL: https://issues.apache.org/jira/browse/SPARK-47123
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.0, 4.0.0
> Reporter: Pablo Langa Blanco
> Priority: Minor
>
> If there is an error executing statement.executeQuery(), it's possible that
> another error in one of the finally statements hides the main error.
> {code:java}
> def getQueryOutputSchema(
>     query: String, options: JDBCOptions, dialect: JdbcDialect): StructType = {
>   val conn: Connection = dialect.createConnectionFactory(options)(-1)
>   try {
>     val statement = conn.prepareStatement(query)
>     try {
>       statement.setQueryTimeout(options.queryTimeout)
>       val rs = statement.executeQuery()
>       try {
>         JdbcUtils.getSchema(rs, dialect, alwaysNullable = true,
>           isTimestampNTZ = options.preferTimestampNTZ)
>       } finally {
>         rs.close()
>       }
>     } finally {
>       statement.close()
>     }
>   } finally {
>     conn.close()
>   }
> }
> {code}
[jira] [Resolved] (SPARK-47121) Avoid noisy RejectedExecutionExceptions during StandaloneSchedulerBackend shutdown
[ https://issues.apache.org/jira/browse/SPARK-47121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47121.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45203
[https://github.com/apache/spark/pull/45203]

> Avoid noisy RejectedExecutionExceptions during StandaloneSchedulerBackend shutdown
> -----------------------------------------------------------------------------------
>
> Key: SPARK-47121
> URL: https://issues.apache.org/jira/browse/SPARK-47121
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler
> Affects Versions: 3.5.0
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> While it is in the process of shutting down, the StandaloneSchedulerBackend
> might throw RejectedExecutionExceptions when RPC handler `onDisconnected`
> methods attempt to submit new tasks to a stopped executorDelayRemoveThread
> executor service. We can reduce log and uncaught-exception noise by catching
> and ignoring these exceptions if they occur during shutdown.
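A hedged sketch of the pattern the description suggests. The name `executorDelayRemoveThread` follows the issue text; the `stopping` flag and the wrapper method are illustrative assumptions, not the actual Spark patch:

{code:java}
import java.util.concurrent.{ExecutorService, RejectedExecutionException}
import java.util.concurrent.atomic.AtomicBoolean

// Illustration only: if the scheduler backend is already stopping, a task
// rejected by the stopped executor service is expected, so it is dropped
// silently instead of surfacing as a noisy uncaught exception.
def submitIgnoringShutdown(
    pool: ExecutorService,
    stopping: AtomicBoolean)(task: Runnable): Unit = {
  try {
    pool.submit(task)
  } catch {
    case _: RejectedExecutionException if stopping.get() =>
      // Shutdown raced with an onDisconnected callback; safe to ignore.
  }
}
{code}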
[jira] [Assigned] (SPARK-47122) Pin `buf-setup-action` to `v1.29.0`
[ https://issues.apache.org/jira/browse/SPARK-47122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-47122:
-------------------------------------
    Assignee: Dongjoon Hyun

> Pin `buf-setup-action` to `v1.29.0`
> ------------------------------------
>
> Key: SPARK-47122
> URL: https://issues.apache.org/jira/browse/SPARK-47122
> Project: Spark
> Issue Type: Bug
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Resolved] (SPARK-47122) Pin `buf-setup-action` to `v1.29.0`
[ https://issues.apache.org/jira/browse/SPARK-47122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47122.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45205
[https://github.com/apache/spark/pull/45205]

> Pin `buf-setup-action` to `v1.29.0`
> ------------------------------------
>
> Key: SPARK-47122
> URL: https://issues.apache.org/jira/browse/SPARK-47122
> Project: Spark
> Issue Type: Bug
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Updated] (SPARK-47122) Pin `buf-setup-action` to `v1.29.0`
[ https://issues.apache.org/jira/browse/SPARK-47122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47122:
-----------------------------------
    Labels: pull-request-available  (was: )

> Pin `buf-setup-action` to `v1.29.0`
> ------------------------------------
>
> Key: SPARK-47122
> URL: https://issues.apache.org/jira/browse/SPARK-47122
> Project: Spark
> Issue Type: Bug
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-47122) Pin `buf-setup-action` to `v1.29.0`
Dongjoon Hyun created SPARK-47122:
----------------------------------

Summary: Pin `buf-setup-action` to `v1.29.0`
Key: SPARK-47122
URL: https://issues.apache.org/jira/browse/SPARK-47122
Project: Spark
Issue Type: Bug
Components: Project Infra
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun
[jira] [Resolved] (SPARK-47104) Spark SQL query fails with NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47104.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45199
[https://github.com/apache/spark/pull/45199]

> Spark SQL query fails with NullPointerException
> ------------------------------------------------
>
> Key: SPARK-47104
> URL: https://issues.apache.org/jira/browse/SPARK-47104
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.3, 3.1.3, 3.2.1, 3.4.2, 3.5.0
> Reporter: Chhavi Bansal
> Assignee: Bruce Robbins
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> I am trying to run a very simple SQL query involving a join and an order by
> clause, and then using the UUID() function in the outermost select statement.
> The query fails:
> {code:java}
> val df = spark.read.format("csv").option("header", "true").load("src/main/resources/titanic.csv")
> df.createOrReplaceTempView("titanic")
> val query = spark.sql(" select name, uuid() as _iid from (select s.name from titanic s join titanic t on s.name = t.name order by name) ;")
> query.show() // FAILS
> {code}
> The dataset is a normal csv file with the following columns:
> {code:java}
> PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
> {code}
> Below is the error:
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
> at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> at scala.collection.TraversableLike.map(TraversableLike.scala:237)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
> at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207)
> at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366)
> at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
> at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
> at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
> at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
> at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
> at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:808)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:785)
> at hyperspace2.sparkPlan$.delayedEndpoint$hyperspace2$sparkPlan$1(sparkPlan.scala:14)
> at hyperspace2.sparkPlan$delayedInit$body.apply(sparkPlan.scala:6)
> at scala.Function0.apply$mcV$sp(Function0.scala:39)
> at scala.Function0.apply$mcV$sp$(Function0.scala:39)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
> at scala.App.$anonfun$main$1$adapted(App.scala:80)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.App.main(App.scala:80)
> at scala.App.main$(App.scala:78)
> at hyperspace2.sparkPlan$.main(sparkPlan.scala:6)
> at hyperspace2.sparkPlan.main(sparkPlan.scala)
> {code}
> Note:
> 1. If I remove the order by clause, it produces the correct output.
> 2. This happens when I read the dataset from a csv file; it works fine if I
> build the dataframe using Seq().toDF.
> 3. The query fails if I use spark.sql("query").show(), but succeeds when I
> simply write it to a csv file.
> [https://stackoverflow.com/questions/78020267/spark-sql-query-fails-with-nullpointerexception]
> Please can someone look into why this happens just when using `show()`,
> since this is failing queries in production for me.
[jira] [Assigned] (SPARK-47104) Spark SQL query fails with NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-47104:
-------------------------------------
    Assignee: Bruce Robbins

> Spark SQL query fails with NullPointerException
> ------------------------------------------------
>
> Key: SPARK-47104
> URL: https://issues.apache.org/jira/browse/SPARK-47104
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.3, 3.1.3, 3.2.1, 3.4.2, 3.5.0
> Reporter: Chhavi Bansal
> Assignee: Bruce Robbins
> Priority: Major
> Labels: pull-request-available
>
> I am trying to run a very simple SQL query involving a join and an order by
> clause, and then using the UUID() function in the outermost select statement.
> The query fails:
> {code:java}
> val df = spark.read.format("csv").option("header", "true").load("src/main/resources/titanic.csv")
> df.createOrReplaceTempView("titanic")
> val query = spark.sql(" select name, uuid() as _iid from (select s.name from titanic s join titanic t on s.name = t.name order by name) ;")
> query.show() // FAILS
> {code}
> The dataset is a normal csv file with the following columns:
> {code:java}
> PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
> {code}
> Below is the error:
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
> at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> at scala.collection.TraversableLike.map(TraversableLike.scala:237)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
> at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207)
> at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366)
> at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
> at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
> at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
> at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
> at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
> at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:808)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:785)
> at hyperspace2.sparkPlan$.delayedEndpoint$hyperspace2$sparkPlan$1(sparkPlan.scala:14)
> at hyperspace2.sparkPlan$delayedInit$body.apply(sparkPlan.scala:6)
> at scala.Function0.apply$mcV$sp(Function0.scala:39)
> at scala.Function0.apply$mcV$sp$(Function0.scala:39)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
> at scala.App.$anonfun$main$1$adapted(App.scala:80)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.App.main(App.scala:80)
> at scala.App.main$(App.scala:78)
> at hyperspace2.sparkPlan$.main(sparkPlan.scala:6)
> at hyperspace2.sparkPlan.main(sparkPlan.scala)
> {code}
> Note:
> 1. If I remove the order by clause, it produces the correct output.
> 2. This happens when I read the dataset from a csv file; it works fine if I
> build the dataframe using Seq().toDF.
> 3. The query fails if I use spark.sql("query").show(), but succeeds when I
> simply write it to a csv file.
> [https://stackoverflow.com/questions/78020267/spark-sql-query-fails-with-nullpointerexception]
> Please can someone look into why this happens just when using `show()`,
> since this is failing queries in production for me.
[jira] [Updated] (SPARK-47104) Spark SQL query fails with NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-47104:
----------------------------------
    Affects Version/s: 3.0.3

> Spark SQL query fails with NullPointerException
> ------------------------------------------------
>
> Key: SPARK-47104
> URL: https://issues.apache.org/jira/browse/SPARK-47104
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.3, 3.1.3, 3.2.1, 3.4.2, 3.5.0
> Reporter: Chhavi Bansal
> Priority: Major
> Labels: pull-request-available
>
> I am trying to run a very simple SQL query involving a join and an order by
> clause, and then using the UUID() function in the outermost select statement.
> The query fails:
> {code:java}
> val df = spark.read.format("csv").option("header", "true").load("src/main/resources/titanic.csv")
> df.createOrReplaceTempView("titanic")
> val query = spark.sql(" select name, uuid() as _iid from (select s.name from titanic s join titanic t on s.name = t.name order by name) ;")
> query.show() // FAILS
> {code}
> The dataset is a normal csv file with the following columns:
> {code:java}
> PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
> {code}
> Below is the error:
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
> at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> at scala.collection.TraversableLike.map(TraversableLike.scala:237)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
> at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207)
> at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366)
> at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
> at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
> at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
> at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
> at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
> at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:808)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:785)
> at hyperspace2.sparkPlan$.delayedEndpoint$hyperspace2$sparkPlan$1(sparkPlan.scala:14)
> at hyperspace2.sparkPlan$delayedInit$body.apply(sparkPlan.scala:6)
> at scala.Function0.apply$mcV$sp(Function0.scala:39)
> at scala.Function0.apply$mcV$sp$(Function0.scala:39)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
> at scala.App.$anonfun$main$1$adapted(App.scala:80)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.App.main(App.scala:80)
> at scala.App.main$(App.scala:78)
> at hyperspace2.sparkPlan$.main(sparkPlan.scala:6)
> at hyperspace2.sparkPlan.main(sparkPlan.scala)
> {code}
> Note:
> 1. If I remove the order by clause, it produces the correct output.
> 2. This happens when I read the dataset from a csv file; it works fine if I
> build the dataframe using Seq().toDF.
> 3. The query fails if I use spark.sql("query").show(), but succeeds when I
> simply write it to a csv file.
> [https://stackoverflow.com/questions/78020267/spark-sql-query-fails-with-nullpointerexception]
> Please can someone look into why this happens just when using `show()`,
> since this is failing queries in production for me.
[jira] [Updated] (SPARK-47104) Spark SQL query fails with NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-47104:
----------------------------------
    Affects Version/s: 3.1.3

> Spark SQL query fails with NullPointerException
> ------------------------------------------------
>
> Key: SPARK-47104
> URL: https://issues.apache.org/jira/browse/SPARK-47104
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.3, 3.2.1, 3.4.2, 3.5.0
> Reporter: Chhavi Bansal
> Priority: Major
> Labels: pull-request-available
>
> I am trying to run a very simple SQL query involving a join and an order by
> clause, and then using the UUID() function in the outermost select statement.
> The query fails:
> {code:java}
> val df = spark.read.format("csv").option("header", "true").load("src/main/resources/titanic.csv")
> df.createOrReplaceTempView("titanic")
> val query = spark.sql(" select name, uuid() as _iid from (select s.name from titanic s join titanic t on s.name = t.name order by name) ;")
> query.show() // FAILS
> {code}
> The dataset is a normal csv file with the following columns:
> {code:java}
> PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
> {code}
> Below is the error:
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
> at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> at scala.collection.TraversableLike.map(TraversableLike.scala:237)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
> at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207)
> at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366)
> at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
> at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
> at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
> at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
> at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
> at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:808)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:785)
> at hyperspace2.sparkPlan$.delayedEndpoint$hyperspace2$sparkPlan$1(sparkPlan.scala:14)
> at hyperspace2.sparkPlan$delayedInit$body.apply(sparkPlan.scala:6)
> at scala.Function0.apply$mcV$sp(Function0.scala:39)
> at scala.Function0.apply$mcV$sp$(Function0.scala:39)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
> at scala.App.$anonfun$main$1$adapted(App.scala:80)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.App.main(App.scala:80)
> at scala.App.main$(App.scala:78)
> at hyperspace2.sparkPlan$.main(sparkPlan.scala:6)
> at hyperspace2.sparkPlan.main(sparkPlan.scala)
> {code}
> Note:
> 1. If I remove the order by clause, it produces the correct output.
> 2. This happens when I read the dataset from a csv file; it works fine if I
> build the dataframe using Seq().toDF.
> 3. The query fails if I use spark.sql("query").show(), but succeeds when I
> simply write it to a csv file.
> [https://stackoverflow.com/questions/78020267/spark-sql-query-fails-with-nullpointerexception]
> Please can someone look into why this happens just when using `show()`,
> since this is failing queries in production for me.
[jira] [Resolved] (SPARK-47119) Add `hive-jackson-provided` profile
[ https://issues.apache.org/jira/browse/SPARK-47119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47119.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45201
[https://github.com/apache/spark/pull/45201]

> Add `hive-jackson-provided` profile
> ------------------------------------
>
> Key: SPARK-47119
> URL: https://issues.apache.org/jira/browse/SPARK-47119
> Project: Spark
> Issue Type: Sub-task
> Components: Build
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Assigned] (SPARK-47036) RocksDB versionID Mismatch in SST files with Compaction
[ https://issues.apache.org/jira/browse/SPARK-47036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim reassigned SPARK-47036:
------------------------------------
    Assignee: Bhuwan Sahni

> RocksDB versionID Mismatch in SST files with Compaction
> --------------------------------------------------------
>
> Key: SPARK-47036
> URL: https://issues.apache.org/jira/browse/SPARK-47036
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.5.0, 4.0.0, 3.5.1, 3.5.2
> Reporter: Bhuwan Sahni
> Assignee: Bhuwan Sahni
> Priority: Major
> Labels: pull-request-available
>
> RocksDB compaction can result in version Id mismatch errors if the same
> version is committed twice from the same executor. (Multiple commits can
> happen due to Spark Stage/task retry.)
> A particular scenario where this can happen is provided below:
> 1. Version V1 is loaded on executor A; the RocksDB State Store has 195.sst,
> 196.sst, 197.sst and 198.sst files.
> 2. State changes are made, which result in creation of a new table file
> 200.sst.
> 3. The state store is committed as version V2. The SST file 200.sst (as
> 000200-8c80161a-bc23-4e3b-b175-cffe38e427c7.sst) is uploaded to DFS, and the
> previous 4 files are reused. A new metadata file is created to track the
> exact SST files with unique IDs, and uploaded with the RocksDB Manifest as
> part of V1.zip.
> 4. RocksDB compaction is triggered at the same time. The compaction creates
> a new L1 file (201.sst) and deletes the existing 5 SST files.
> 5. The Spark Stage is retried.
> 6. Version V1 is reloaded on the same executor. The local files are
> inspected, and 201.sst is deleted. The 4 SST files in version V1 are
> downloaded again to the local file system.
> 7. Any local files which are deleted (as part of the version load) are also
> removed from local → DFS file upload tracking. **However, the files already
> deleted as a result of compaction are not removed from tracking. This is the
> bug which resulted in the failure.**
> 8. The state store is committed as version V1. However, the local mapping of
> SST files to DFS file paths still has 200.sst in its tracking, hence the SST
> file is not re-uploaded. A new metadata file is created to track the exact
> SST files with unique IDs, and uploaded with the new RocksDB Manifest as
> part of V2.zip. (The V2.zip file is overwritten here atomically.)
> 9. A new executor tries to load version V2. However, the SST files in (1)
> are now incompatible with the Manifest file in (6), resulting in the version
> Id mismatch failure.
>
> We need to ensure that any files deleted from the local filesystem post
> compaction are not tracked in the uploadedDFSFiles mapping if the same
> version is loaded again.
[jira] [Resolved] (SPARK-47036) RocksDB versionID Mismatch in SST files with Compaction
[ https://issues.apache.org/jira/browse/SPARK-47036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-47036. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45092 [https://github.com/apache/spark/pull/45092] > RocksDB versionID Mismatch in SST files with Compaction > --- > > Key: SPARK-47036 > URL: https://issues.apache.org/jira/browse/SPARK-47036 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.5.0, 4.0.0, 3.5.1, 3.5.2 >Reporter: Bhuwan Sahni >Assignee: Bhuwan Sahni >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > RocksDB compaction can result in version ID mismatch errors if the same > version is committed twice from the same executor. (Multiple commits can > happen due to Spark stage/task retries.) > A particular scenario where this can happen is provided below: > 1. Version V1 is loaded on executor A; the RocksDB State Store has 195.sst, > 196.sst, 197.sst and 198.sst files. > 2. State changes are made, which result in the creation of a new table file, > 200.sst. > 3. State store is committed as version V2. The SST file 200.sst (as > 000200-8c80161a-bc23-4e3b-b175-cffe38e427c7.sst) is uploaded to DFS, and the > previous 4 files are reused. A new metadata file is created to track the > exact SST files with unique IDs, and uploaded with the RocksDB Manifest as part > of V2.zip. > 4. RocksDB compaction is triggered at the same time. The compaction creates > a new L1 file (201.sst), and deletes the existing 5 SST files. > 5. The Spark Stage is retried. > 6. Version V1 is reloaded on the same executor. The local files are > inspected, and 201.sst is deleted. The 4 SST files in version V1 are > downloaded again to the local file system. > 7. Any local files which are deleted (as part of the version load) are also > removed from the local → DFS file upload tracking. **However, the files already > deleted as a result of compaction are not removed from tracking. This is the > bug which resulted in the failure.** > 8. State store is committed as version V2 again. However, the local mapping of SST > files to DFS file paths still has 200.sst in its tracking, hence the SST file > is not re-uploaded. A new metadata file is created to track the exact SST > files with unique IDs, and uploaded with the new RocksDB Manifest as part of > V2.zip. (The V2.zip file is overwritten here atomically.) > 9. A new executor tries to load version V2. However, the SST files in (1) are > now incompatible with the Manifest file in (6), resulting in the version ID > mismatch failure. > > We need to ensure that any files deleted from the local filesystem post > compaction are not tracked in the uploadedDFSFiles mapping if the same version is > loaded again. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
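The tracking cleanup the description calls for can be sketched in isolation. This is a minimal illustration only, not the actual RocksDB state-store file-manager code: the class name, map layout, and method names below are assumptions made for the sake of the example.
{code:scala}
import java.io.File
import scala.collection.mutable

// Hypothetical stand-in for the state store's local -> DFS upload tracking.
class UploadTracker(workingDir: String) {
  // local SST file name -> unique DFS file name (names are illustrative)
  private val uploadedDFSFiles = mutable.Map[String, String]()

  def recordUpload(localName: String, dfsName: String): Unit =
    uploadedDFSFiles(localName) = dfsName

  def isAlreadyUploaded(localName: String): Boolean =
    uploadedDFSFiles.contains(localName)

  // Called after a version is (re)loaded: drop every tracked entry whose
  // local file no longer exists on disk -- including files that compaction,
  // rather than the version load itself, deleted. Entries dropped here get
  // re-uploaded on the next commit instead of being silently skipped.
  def pruneStaleEntries(): Unit = {
    val stale = uploadedDFSFiles.keysIterator.filterNot { name =>
      new File(workingDir, name).exists()
    }.toList
    stale.foreach(uploadedDFSFiles.remove)
  }
}
{code}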
[jira] [Updated] (SPARK-47121) Avoid noisy RejectedExecutionExceptions during StandaloneSchedulerBackend shutdown
[ https://issues.apache.org/jira/browse/SPARK-47121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47121: --- Labels: pull-request-available (was: ) > Avoid noisy RejectedExecutionExceptions during StandaloneSchedulerBackend > shutdown > -- > > Key: SPARK-47121 > URL: https://issues.apache.org/jira/browse/SPARK-47121 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 3.5.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Labels: pull-request-available > > While it is in the process of shutting down, the StandaloneSchedulerBackend > might throw RejectedExecutionExceptions when RPC handler `onDisconnected` > methods attempt to submit new tasks to a stopped executorDelayRemoveThread > executor service. We can reduce log and uncaught exception noise by catching > and ignoring these exceptions if they occur during shutdown. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47121) Avoid noisy RejectedExecutionExceptions during StandaloneSchedulerBackend shutdown
Josh Rosen created SPARK-47121: -- Summary: Avoid noisy RejectedExecutionExceptions during StandaloneSchedulerBackend shutdown Key: SPARK-47121 URL: https://issues.apache.org/jira/browse/SPARK-47121 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 3.5.0 Reporter: Josh Rosen Assignee: Josh Rosen While it is in the process of shutting down, the StandaloneSchedulerBackend might throw RejectedExecutionExceptions when RPC handler `onDisconnected` methods attempt to submit new tasks to a stopped executorDelayRemoveThread executor service. We can reduce log and uncaught exception noise by catching and ignoring these exceptions if they occur during shutdown. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
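The mitigation described above amounts to a narrow catch around the submit call. A rough sketch under stated assumptions — the `stopping` flag and the single-thread pool here are placeholders, not Spark's actual fields:
{code:scala}
import java.util.concurrent.{Executors, RejectedExecutionException}
import java.util.concurrent.atomic.AtomicBoolean

object ShutdownSafeBackend {
  private val stopping = new AtomicBoolean(false)
  // Placeholder for the backend's executor-removal thread pool.
  private val executorDelayRemoveThread = Executors.newSingleThreadExecutor()

  // Called from RPC handlers such as onDisconnected.
  def scheduleExecutorRemoval(task: Runnable): Unit =
    try executorDelayRemoveThread.submit(task)
    catch {
      // Swallow the rejection only when it is caused by our own shutdown;
      // any other rejection still propagates as a real error.
      case _: RejectedExecutionException if stopping.get() => ()
    }

  def stop(): Unit = {
    stopping.set(true)
    executorDelayRemoveThread.shutdown()
  }
}
{code}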
[jira] [Updated] (SPARK-47120) Null comparison push down data filter from subquery produces NPE in Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-47120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47120: --- Labels: pull-request-available (was: ) > Null comparison push down data filter from subquery produces NPE in > Parquet filter > - > > Key: SPARK-47120 > URL: https://issues.apache.org/jira/browse/SPARK-47120 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Cosmin Dumitru >Priority: Major > Labels: pull-request-available > > This issue has been introduced in > [https://github.com/apache/spark/pull/41088] where we convert scalar > subqueries to literals and then convert the literals to > {{{}org.apache.spark.sql.sources.Filters{}}}. These filters are then pushed > down to parquet. > If the literal is a comparison with {{null}} then the parquet filter > conversion code throws NPE. > > repro code which results in NPE > {code:java} > create table t1(d date) using parquet > create table t2(d date) using parquet > insert into t1 values date'2021-01-01' > insert into t2 values (null) > select * from t1 where 1=1 and d > (select d from t2){code} > I'll provide a fix PR shortly -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47120) Null comparison push down data filter from subquery produces NPE in Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-47120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cosmin Dumitru updated SPARK-47120: --- Description: This issue has been introduced in [https://github.com/apache/spark/pull/41088] where we convert scalar subqueries to literals and then convert the literals to {{{}org.apache.spark.sql.sources.Filters{}}}. These filters are then pushed down to parquet. If the literal is a comparison with {{null}} then the parquet filter conversion code throws NPE. repro code which results in NPE {code:java} create table t1(d date) using parquet create table t2(d date) using parquet insert into t1 values date'2021-01-01' insert into t2 values (null) select * from t1 where 1=1 and d > (select d from t2){code} [fix PR |https://github.com/apache/spark/pull/45202/files] was: This issue has been introduced in [https://github.com/apache/spark/pull/41088] where we convert scalar subqueries to literals and then convert the literals to {{{}org.apache.spark.sql.sources.Filters{}}}. These filters are then pushed down to parquet. If the literal is a comparison with {{null}} then the parquet filter conversion code throws NPE. repro code which results in NPE {code:java} create table t1(d date) using parquet create table t2(d date) using parquet insert into t1 values date'2021-01-01' insert into t2 values (null) select * from t1 where 1=1 and d > (select d from t2){code} I'll provide a fix PR shortly > Null comparison push down data filter from subquery produces NPE in > Parquet filter > - > > Key: SPARK-47120 > URL: https://issues.apache.org/jira/browse/SPARK-47120 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Cosmin Dumitru >Priority: Major > Labels: pull-request-available > > This issue has been introduced in > [https://github.com/apache/spark/pull/41088] where we convert scalar > subqueries to literals and then convert the literals to > {{{}org.apache.spark.sql.sources.Filters{}}}. These filters are then pushed > down to parquet. > If the literal is a comparison with {{null}} then the parquet filter > conversion code throws NPE. > > repro code which results in NPE > {code:java} > create table t1(d date) using parquet > create table t2(d date) using parquet > insert into t1 values date'2021-01-01' > insert into t2 values (null) > select * from t1 where 1=1 and d > (select d from t2){code} > [fix PR |https://github.com/apache/spark/pull/45202/files] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47120) Null comparison push down data filter from subquery produces NPE in Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-47120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cosmin Dumitru updated SPARK-47120: --- Description: This issue has been introduced in [https://github.com/apache/spark/pull/41088] where we convert scalar subqueries to literals and then convert the literals to {{{}org.apache.spark.sql.sources.Filters{}}}. These filters are then pushed down to parquet. If the literal is a comparison with {{null}} then the parquet filter conversion code throws NPE. repro code which results in NPE {code:java} create table t1(d date) using parquet create table t2(d date) using parquet insert into t1 values date'2021-01-01' insert into t2 values (null) select * from t1 where 1=1 and d > (select d from t2){code} I'll provide a fix PR shortly was: This issue has been introduced in [https://github.com/apache/spark/pull/41088] where we convert scalar subqueries to literals and then convert the literals to {{{}org.apache.spark.sql.sources.Filters{}}}. These filters are then pushed down to parquet. If the literal is a comparison with {{null}} then the parquet filter conversion code throws NPE. repro code which results in NPE {code:java} create table t1(d date) using parquet create table t2(d date) using parquet insert into t1 values date'2021-01-01' insert into t2 values (null) select * from t1 where 1=1 and d > (select d from t2){code} > Null comparison push down data filter from subquery produces NPE in > Parquet filter > - > > Key: SPARK-47120 > URL: https://issues.apache.org/jira/browse/SPARK-47120 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Cosmin Dumitru >Priority: Major > > This issue has been introduced in > [https://github.com/apache/spark/pull/41088] where we convert scalar > subqueries to literals and then convert the literals to > {{{}org.apache.spark.sql.sources.Filters{}}}. These filters are then pushed > down to parquet. > If the literal is a comparison with {{null}} then the parquet filter > conversion code throws NPE. > > repro code which results in NPE > {code:java} > create table t1(d date) using parquet > create table t2(d date) using parquet > insert into t1 values date'2021-01-01' > insert into t2 values (null) > select * from t1 where 1=1 and d > (select d from t2){code} > I'll provide a fix PR shortly -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47120) Null comparison push down data filter from subquery produces NPE in Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-47120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cosmin Dumitru updated SPARK-47120: --- Description: This issue has been introduced in [https://github.com/apache/spark/pull/41088] where we convert scalar subqueries to literals and then convert the literals to {{{}org.apache.spark.sql.sources.Filters{}}}. These filters are then pushed down to parquet. If the literal is a comparison with {{null}} then the parquet filter conversion code throws NPE. repro code which results in NPE {code:java} create table t1(d date) using parquet create table t2(d date) using parquet insert into t1 values date'2021-01-01' insert into t2 values (null) select * from t1 where 1=1 and d > (select d from t2){code} > Null comparison push down data filter from subquery produces NPE in > Parquet filter > - > > Key: SPARK-47120 > URL: https://issues.apache.org/jira/browse/SPARK-47120 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Cosmin Dumitru >Priority: Major > > This issue has been introduced in > [https://github.com/apache/spark/pull/41088] where we convert scalar > subqueries to literals and then convert the literals to > {{{}org.apache.spark.sql.sources.Filters{}}}. These filters are then pushed > down to parquet. > If the literal is a comparison with {{null}} then the parquet filter > conversion code throws NPE. > > repro code which results in NPE > {code:java} > create table t1(d date) using parquet > create table t2(d date) using parquet > insert into t1 values date'2021-01-01' > insert into t2 values (null) > select * from t1 where 1=1 and d > (select d from t2){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
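The shape of the fix implied by the description is a null check before constructing the source filter. This is a hedged sketch using the public org.apache.spark.sql.sources classes; the converter function itself is hypothetical, not the actual Parquet filter-conversion code:
{code:scala}
import org.apache.spark.sql.sources.{Filter, GreaterThan}

object NullSafePushdown {
  // A literal that came from a scalar subquery can be null; `d > null`
  // matches no rows, so returning None (i.e. skipping pushdown) is safe
  // and avoids handing the Parquet filter builder a value it NPEs on.
  def toGreaterThanFilter(attribute: String, value: Any): Option[Filter] =
    Option(value).map(v => GreaterThan(attribute, v))
}
{code}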
[jira] [Updated] (SPARK-47119) Add `hive-jackson-provided` profile
[ https://issues.apache.org/jira/browse/SPARK-47119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47119: --- Labels: pull-request-available (was: ) > Add `hive-jackson-provided` profile > --- > > Key: SPARK-47119 > URL: https://issues.apache.org/jira/browse/SPARK-47119 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47119) Add `hive-jackson-provided` profile
Dongjoon Hyun created SPARK-47119: - Summary: Add `hive-jackson-provided` profile Key: SPARK-47119 URL: https://issues.apache.org/jira/browse/SPARK-47119 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47119) Add `hive-jackson-provided` profile
[ https://issues.apache.org/jira/browse/SPARK-47119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47119: - Assignee: Dongjoon Hyun > Add `hive-jackson-provided` profile > --- > > Key: SPARK-47119 > URL: https://issues.apache.org/jira/browse/SPARK-47119 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47099) The `start` value of `paramIndex` for the error class `UNEXPECTED_INPUT_TYPE` should be `1`
[ https://issues.apache.org/jira/browse/SPARK-47099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-47099. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45177 [https://github.com/apache/spark/pull/45177] > The `start` value of `paramIndex` for the error class `UNEXPECTED_INPUT_TYPE` > should be `1` > > > Key: SPARK-47099 > URL: https://issues.apache.org/jira/browse/SPARK-47099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47104) Spark SQL query fails with NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47104: --- Labels: pull-request-available (was: ) > Spark SQL query fails with NullPointerException > --- > > Key: SPARK-47104 > URL: https://issues.apache.org/jira/browse/SPARK-47104 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.4.2, 3.5.0 >Reporter: Chhavi Bansal >Priority: Major > Labels: pull-request-available > > I am trying to run a very simple SQL query involving join and orderby clause > and then using UUID() function in the outermost select stmt. The query fails > {code:java} > val df = spark.read.format("csv").option("header", > "true").load("src/main/resources/titanic.csv") > df.createOrReplaceTempView("titanic") > val query = spark.sql(" select name, uuid() as _iid from (select s.name from > titanic s join titanic t on s.name = t.name order by name) ;") > query.show() // FAILS{code} > Dataset is a normal csv file with the following columns > {code:java} > PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked > {code} > Below is the error > {code:java} > Exception in thread "main" java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207) > at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237) > at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at scala.collection.TraversableLike.map(TraversableLike.scala:237) > at scala.collection.TraversableLike.map$(TraversableLike.scala:230) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) > at > org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338) > at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715) > at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2728) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2935) > at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:326) > at org.apache.spark.sql.Dataset.show(Dataset.scala:808) > at 
org.apache.spark.sql.Dataset.show(Dataset.scala:785) > at > hyperspace2.sparkPlan$.delayedEndpoint$hyperspace2$sparkPlan$1(sparkPlan.scala:14) > at hyperspace2.sparkPlan$delayedInit$body.apply(sparkPlan.scala:6) > at scala.Function0.apply$mcV$sp(Function0.scala:39) > at scala.Function0.apply$mcV$sp$(Function0.scala:39) > at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17) > at scala.App.$anonfun$main$1$adapted(App.scala:80) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.App.main(App.scala:80) > at scala.App.main$(App.scala:78) > at hyperspace2.sparkPlan$.main(sparkPlan.scala:6) > at hyperspace2.sparkPlan.main(sparkPlan.scala) {code} > Note: > 1. If I remove the order by clause, then it produces the correct output. > 2. This happens when I read the dataset from a csv file; it works fine if I build > the dataframe using Seq().toDF. > 3. The query fails if I use spark.sql("query").show() but succeeds when I > simply write it to a csv file. > [https://stackoverflow.com/questions/78020267/spark-sql-query-fails-with-nullpointerexception] > Please can someone look into why this happens just when using `show()`, since > this is failing queries in production for me. -- This message was sent by Atlassian Jira
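For reference, the report condenses to a small self-contained program; the CSV path below is a placeholder, and whether the NullPointerException actually reproduces depends on the affected versions listed above:
{code:scala}
import org.apache.spark.sql.SparkSession

object Spark47104Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-47104").getOrCreate()
    // Any CSV with a header row and a string column named `name` will do.
    val df = spark.read.format("csv").option("header", "true").load("/tmp/titanic.csv")
    df.createOrReplaceTempView("titanic")
    // Fails on affected versions; dropping the ORDER BY, or writing the
    // result to a file instead of calling show(), avoids the failure.
    spark.sql(
      """select name, uuid() as _iid from
        |(select s.name from titanic s join titanic t on s.name = t.name order by name)""".stripMargin
    ).show()
  }
}
{code}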
[jira] [Updated] (SPARK-44719) NoClassDefFoundError when using Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-44719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44719: --- Labels: pull-request-available (was: ) > NoClassDefFoundError when using Hive UDF > > > Key: SPARK-44719 > URL: https://issues.apache.org/jira/browse/SPARK-44719 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0, 4.0.0 > > Attachments: HiveUDFs-1.0-SNAPSHOT.jar > > > How to reproduce: > {noformat} > spark-sql (default)> add jar > /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; > Time taken: 0.413 seconds > spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as > 'net.petrabarus.hiveudfs.LongToIP'; > Time taken: 0.038 seconds > spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); > 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT > long_to_ip(2130706433L) FROM range(10)] > java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory > at org.apache.hadoop.hive.ql.udf.UDFJson.<clinit>(UDFJson.java:64) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > ... > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43225) Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution
[ https://issues.apache.org/jira/browse/SPARK-43225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819341#comment-17819341 ] Dongjoon Hyun commented on SPARK-43225: --- For the record, this is logically reverted via SPARK-44719 at Spark 3.5.0 by the author, [~yumwang]. > Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution > -- > > Key: SPARK-43225 > URL: https://issues.apache.org/jira/browse/SPARK-43225 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Minor > Fix For: 3.5.0 > > > To fix CVE issue: https://github.com/apache/spark/security/dependabot/50 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47085) Performance issue on thrift API
[ https://issues.apache.org/jira/browse/SPARK-47085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47085: -- Fix Version/s: 3.4.3 > Performance issue on thrift API > --- > > Key: SPARK-47085 > URL: https://issues.apache.org/jira/browse/SPARK-47085 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 >Reporter: Izek Greenfield >Assignee: Izek Greenfield >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.3 > > > This new complexity was introduced in SPARK-39041. > Before this PR the code was: > {code:java} > while (curRow < maxRows && iter.hasNext) { > val sparkRow = iter.next() > val row = ArrayBuffer[Any]() > var curCol = 0 > while (curCol < sparkRow.length) { > if (sparkRow.isNullAt(curCol)) { > row += null > } else { > addNonNullColumnValue(sparkRow, row, curCol, timeFormatters) > } > curCol += 1 > } > resultRowSet.addRow(row.toArray.asInstanceOf[Array[Object]]) > curRow += 1 > }{code} > It consumed the iterator directly (foreach-style), without the _*O(n^2)*_ complexity, so this change just returns the > state to what it was before. > > In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity: > {code:scala} > ... > while (i < rowSize) { > val row = rows(i) > ... > {code} > It can be easily converted back into _*O(n)*_ complexity. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
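The complexity claim is easy to demonstrate outside Spark: indexing into a linked Seq inside a while loop pays O(n) per access, while a single traversal is O(n) overall. A generic Scala illustration of the point, not the RowSetUtils code itself:
{code:scala}
// Quadratic: List.apply(i) walks from the head on every call,
// and List.length is itself a full traversal.
def sumByIndex(rows: List[Int]): Long = {
  var total = 0L
  var i = 0
  while (i < rows.length) {
    total += rows(i) // O(i) lookup => O(n^2) overall
    i += 1
  }
  total
}

// Linear: one pass over the list, as the pre-SPARK-39041 loop
// effectively did by consuming the iterator directly.
def sumByForeach(rows: List[Int]): Long = {
  var total = 0L
  rows.foreach(total += _)
  total
}
{code}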
[jira] [Commented] (SPARK-46934) Read/write roundtrip for struct type with special characters with HMS
[ https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819309#comment-17819309 ] Dongjoon Hyun commented on SPARK-46934: --- You can track Apache Spark 4.0.0 activity in SPARK-44111, [~yutinglin]. > Read/write roundtrip for struct type with special characters with HMS > -- > > Key: SPARK-46934 > URL: https://issues.apache.org/jira/browse/SPARK-46934 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 > Environment: Tested in Spark 3.3.0, 3.3.2. >Reporter: Yu-Ting LIN >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > We are trying to create a Hive View using the following SQL command "CREATE OR > REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810". > Our table_2611810 has certain columns that contain special characters such as "/". > Here is the schema of this table. > {code:java} > contigName string > start bigint > end bigint > names array > referenceAllele string > alternateAlleles array > qual double > filters array > splitFromMultiAllelic boolean > INFO_NCAMP int > INFO_ODDRATIO double > INFO_NM double > INFO_DBSNP_CAF array > INFO_SPANPAIR int > INFO_TLAMP int > INFO_PSTD double > INFO_QSTD double > INFO_SBF double > INFO_AF array > INFO_QUAL double > INFO_SHIFT3 int > INFO_VARBIAS string > INFO_HICOV int > INFO_PMEAN double > INFO_MSI double > INFO_VD int > INFO_DP int > INFO_HICNT int > INFO_ADJAF double > INFO_SVLEN int > INFO_RSEQ string > INFO_MSigDb array > INFO_NMD array > INFO_ANN > array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>> > INFO_BIAS string > INFO_MQ double > INFO_HIAF double > INFO_END int > INFO_SPLITREAD int > INFO_GDAMP int > INFO_LSEQ string > INFO_LOF array > INFO_SAMPLE string > INFO_AMPFLAG int > INFO_SN double > INFO_SVTYPE string > INFO_TYPE string > INFO_MSILEN double > INFO_DUPRATE double > INFO_DBSNP_COMMON int > INFO_REFBIAS string > genotypes > array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>> > {code} > You can see that column INFO_ANN is an array of struct and it contains columns > which have "/" inside, such as "cDNA_pos/cDNA_length". > We believe this is the root cause of the following SparkException: > {code:java} > scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT > INFO_ANN FROM table_2611810") > 24/01/31 07:50:02.658 [main] WARN o.a.spark.sql.catalyst.util.package - > Truncated the string representation of a plan since it was too large. This > behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'. 
> org.apache.spark.SparkException: Cannot recognize hive type string: > array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>, > column: INFO_ANN > at > org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102) > at >
[jira] [Commented] (SPARK-46934) Read/write roundtrip for struct type with special characters with HMS
[ https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819308#comment-17819308 ] Dongjoon Hyun commented on SPARK-46934: --- Thank you for confirming, [~yutinglin] and [~yao]. I revised this JIRA issue to an `Improvement` JIRA with `Affects Versions` 4.0.0. According to the Semantic Versioning policy, new features and improvements are not delivered in maintenance releases with PATCH version changes. - https://spark.apache.org/versioning-policy.html - https://semver.org > Read/write roundtrip for struct type with special characters with HMS > -- > > Key: SPARK-46934 > URL: https://issues.apache.org/jira/browse/SPARK-46934 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 > Environment: Tested in Spark 3.3.0, 3.3.2. >Reporter: Yu-Ting LIN >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > We are trying to create a Hive View using the following SQL command "CREATE OR > REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810". > Our table_2611810 has certain columns that contain special characters such as "/". > Here is the schema of this table. > {code:java} > contigName string > start bigint > end bigint > names array > referenceAllele string > alternateAlleles array > qual double > filters array > splitFromMultiAllelic boolean > INFO_NCAMP int > INFO_ODDRATIO double > INFO_NM double > INFO_DBSNP_CAF array > INFO_SPANPAIR int > INFO_TLAMP int > INFO_PSTD double > INFO_QSTD double > INFO_SBF double > INFO_AF array > INFO_QUAL double > INFO_SHIFT3 int > INFO_VARBIAS string > INFO_HICOV int > INFO_PMEAN double > INFO_MSI double > INFO_VD int > INFO_DP int > INFO_HICNT int > INFO_ADJAF double > INFO_SVLEN int > INFO_RSEQ string > INFO_MSigDb array > INFO_NMD array > INFO_ANN > array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>> > INFO_BIAS string > INFO_MQ double > INFO_HIAF double > INFO_END int > INFO_SPLITREAD int > INFO_GDAMP int > INFO_LSEQ string > INFO_LOF array > INFO_SAMPLE string > INFO_AMPFLAG int > INFO_SN double > INFO_SVTYPE string > INFO_TYPE string > INFO_MSILEN double > INFO_DUPRATE double > INFO_DBSNP_COMMON int > INFO_REFBIAS string > genotypes > array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>> > {code} > You can see that column INFO_ANN is an array of struct and it contains columns > which have "/" inside, such as "cDNA_pos/cDNA_length". > We believe this is the root cause of the following SparkException: > {code:java} > scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT > INFO_ANN FROM table_2611810") > 24/01/31 07:50:02.658 [main] WARN o.a.spark.sql.catalyst.util.package - > Truncated the string representation of a plan since it was too large. This > behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'. 
> org.apache.spark.SparkException: Cannot recognize hive type string: > array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>, > column: INFO_ANN > at > org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at
[jira] [Updated] (SPARK-47104) Spark SQL query fails with NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-47104: -- Affects Version/s: 3.5.0 3.4.2 > Spark SQL query fails with NullPointerException > --- > > Key: SPARK-47104 > URL: https://issues.apache.org/jira/browse/SPARK-47104 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.4.2, 3.5.0 >Reporter: Chhavi Bansal >Priority: Major > > I am trying to run a very simple SQL query involving join and orderby clause > and then using UUID() function in the outermost select stmt. The query fails > {code:java} > val df = spark.read.format("csv").option("header", > "true").load("src/main/resources/titanic.csv") > df.createOrReplaceTempView("titanic") > val query = spark.sql(" select name, uuid() as _iid from (select s.name from > titanic s join titanic t on s.name = t.name order by name) ;") > query.show() // FAILS{code} > Dataset is a normal csv file with the following columns > {code:java} > PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked > {code} > Below is the error > {code:java} > Exception in thread "main" java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207) > at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237) > at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at scala.collection.TraversableLike.map(TraversableLike.scala:237) > at scala.collection.TraversableLike.map$(TraversableLike.scala:230) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) > at > org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338) > at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715) > at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2728) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2935) > at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:326) > at org.apache.spark.sql.Dataset.show(Dataset.scala:808) > at org.apache.spark.sql.Dataset.show(Dataset.scala:785) > at > 
hyperspace2.sparkPlan$.delayedEndpoint$hyperspace2$sparkPlan$1(sparkPlan.scala:14) > at hyperspace2.sparkPlan$delayedInit$body.apply(sparkPlan.scala:6) > at scala.Function0.apply$mcV$sp(Function0.scala:39) > at scala.Function0.apply$mcV$sp$(Function0.scala:39) > at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17) > at scala.App.$anonfun$main$1$adapted(App.scala:80) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.App.main(App.scala:80) > at scala.App.main$(App.scala:78) > at hyperspace2.sparkPlan$.main(sparkPlan.scala:6) > at hyperspace2.sparkPlan.main(sparkPlan.scala) {code} > Note: > 1. If I remove the order by clause, then it produces the correct output. > 2. This happens when I read the dataset from a csv file; it works fine if I build > the dataframe using Seq().toDF. > 3. The query fails if I use spark.sql("query").show() but succeeds when I > simply write it to a csv file. > [https://stackoverflow.com/questions/78020267/spark-sql-query-fails-with-nullpointerexception] > Please can someone look into why this happens just when using `show()`, since > this is failing queries in production for me. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (SPARK-46934) Unable to create Hive View from certain Spark Dataframe StructType
[ https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46934: -- Issue Type: Improvement (was: Bug) > Unable to create Hive View from certain Spark Dataframe StructType > -- > > Key: SPARK-46934 > URL: https://issues.apache.org/jira/browse/SPARK-46934 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 > Environment: Tested in Spark 3.3.0, 3.3.2. >Reporter: Yu-Ting LIN >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > We are trying to create a Hive View using the following SQL command "CREATE OR > REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810". > Our table_2611810 has certain columns that contain special characters such as "/". > Here is the schema of this table. > {code:java} > contigName string > start bigint > end bigint > names array > referenceAllele string > alternateAlleles array > qual double > filters array > splitFromMultiAllelic boolean > INFO_NCAMP int > INFO_ODDRATIO double > INFO_NM double > INFO_DBSNP_CAF array > INFO_SPANPAIR int > INFO_TLAMP int > INFO_PSTD double > INFO_QSTD double > INFO_SBF double > INFO_AF array > INFO_QUAL double > INFO_SHIFT3 int > INFO_VARBIAS string > INFO_HICOV int > INFO_PMEAN double > INFO_MSI double > INFO_VD int > INFO_DP int > INFO_HICNT int > INFO_ADJAF double > INFO_SVLEN int > INFO_RSEQ string > INFO_MSigDb array > INFO_NMD array > INFO_ANN > array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>> > INFO_BIAS string > INFO_MQ double > INFO_HIAF double > INFO_END int > INFO_SPLITREAD int > INFO_GDAMP int > INFO_LSEQ string > INFO_LOF array > INFO_SAMPLE string > INFO_AMPFLAG int > INFO_SN double > INFO_SVTYPE string > INFO_TYPE string > INFO_MSILEN double > INFO_DUPRATE double > INFO_DBSNP_COMMON int > INFO_REFBIAS string > genotypes > array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>> > {code} > You can see that column INFO_ANN is an array of struct and it contains columns > which have "/" inside, such as "cDNA_pos/cDNA_length". > We believe this is the root cause of the following SparkException: > {code:java} > scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT > INFO_ANN FROM table_2611810") > 24/01/31 07:50:02.658 [main] WARN o.a.spark.sql.catalyst.util.package - > Truncated the string representation of a plan since it was too large. This > behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'. 
> org.apache.spark.SparkException: Cannot recognize hive type string: > array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>, > column: INFO_ANN > at > org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102) > at >
[jira] [Updated] (SPARK-46934) Read/write roundtrip for struct type with special characters with HMS
[ https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46934: -- Summary: Read/write roundtrip for struct type with special characters with HMS (was: Unable to create Hive View from certain Spark Dataframe StructType) > Read/write roundtrip for struct type with special characters with HMS > -- > > Key: SPARK-46934 > URL: https://issues.apache.org/jira/browse/SPARK-46934 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 > Environment: Tested in Spark 3.3.0, 3.3.2. >Reporter: Yu-Ting LIN >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > We are trying to create a Hive View using the following SQL command "CREATE OR > REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810". > Our table_2611810 has certain columns that contain special characters such as "/". > Here is the schema of this table. > {code:java} > contigName string > start bigint > end bigint > names array > referenceAllele string > alternateAlleles array > qual double > filters array > splitFromMultiAllelic boolean > INFO_NCAMP int > INFO_ODDRATIO double > INFO_NM double > INFO_DBSNP_CAF array > INFO_SPANPAIR int > INFO_TLAMP int > INFO_PSTD double > INFO_QSTD double > INFO_SBF double > INFO_AF array > INFO_QUAL double > INFO_SHIFT3 int > INFO_VARBIAS string > INFO_HICOV int > INFO_PMEAN double > INFO_MSI double > INFO_VD int > INFO_DP int > INFO_HICNT int > INFO_ADJAF double > INFO_SVLEN int > INFO_RSEQ string > INFO_MSigDb array > INFO_NMD array > INFO_ANN > array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>> > INFO_BIAS string > INFO_MQ double > INFO_HIAF double > INFO_END int > INFO_SPLITREAD int > INFO_GDAMP int > INFO_LSEQ string > INFO_LOF array > INFO_SAMPLE string > INFO_AMPFLAG int > INFO_SN double > INFO_SVTYPE string > INFO_TYPE string > INFO_MSILEN double > INFO_DUPRATE double > INFO_DBSNP_COMMON int > INFO_REFBIAS string > genotypes > array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>> > {code} > You can see that column INFO_ANN is an array of struct and it contains columns > which have "/" inside, such as "cDNA_pos/cDNA_length". > We believe this is the root cause of the following SparkException: > {code:java} > scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT > INFO_ANN FROM table_2611810") > 24/01/31 07:50:02.658 [main] WARN o.a.spark.sql.catalyst.util.package - > Truncated the string representation of a plan since it was too large. This > behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'. 
> org.apache.spark.SparkException: Cannot recognize hive type string: > array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>, > column: INFO_ANN > at > org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102) > at >
[jira] [Updated] (SPARK-46934) Unable to create Hive View from certain Spark Dataframe StructType
[ https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46934: -- Priority: Major (was: Blocker) > Unable to create Hive View from certain Spark Dataframe StructType > -- > > Key: SPARK-46934 > URL: https://issues.apache.org/jira/browse/SPARK-46934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.2, 3.3.4 > Environment: Tested in Spark 3.3.0, 3.3.2. >Reporter: Yu-Ting LIN >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > We are trying to create a Hive View using the following SQL command "CREATE OR > REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810". > Our table_2611810 has certain columns that contain special characters such as "/". > Here is the schema of this table. > {code:java} > contigName string > start bigint > end bigint > names array > referenceAllele string > alternateAlleles array > qual double > filters array > splitFromMultiAllelic boolean > INFO_NCAMP int > INFO_ODDRATIO double > INFO_NM double > INFO_DBSNP_CAF array > INFO_SPANPAIR int > INFO_TLAMP int > INFO_PSTD double > INFO_QSTD double > INFO_SBF double > INFO_AF array > INFO_QUAL double > INFO_SHIFT3 int > INFO_VARBIAS string > INFO_HICOV int > INFO_PMEAN double > INFO_MSI double > INFO_VD int > INFO_DP int > INFO_HICNT int > INFO_ADJAF double > INFO_SVLEN int > INFO_RSEQ string > INFO_MSigDb array > INFO_NMD array > INFO_ANN > array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>> > INFO_BIAS string > INFO_MQ double > INFO_HIAF double > INFO_END int > INFO_SPLITREAD int > INFO_GDAMP int > INFO_LSEQ string > INFO_LOF array > INFO_SAMPLE string > INFO_AMPFLAG int > INFO_SN double > INFO_SVTYPE string > INFO_TYPE string > INFO_MSILEN double > INFO_DUPRATE double > INFO_DBSNP_COMMON int > INFO_REFBIAS string > genotypes > array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>> > {code} > You can see that column INFO_ANN is an array of struct and it contains columns > which have "/" inside, such as "cDNA_pos/cDNA_length". > We believe this is the root cause of the following SparkException: > {code:java} > scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT > INFO_ANN FROM table_2611810") > 24/01/31 07:50:02.658 [main] WARN o.a.spark.sql.catalyst.util.package - > Truncated the string representation of a plan since it was too large. This > behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'. 
> org.apache.spark.SparkException: Cannot recognize hive type string: > array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>, > column: INFO_ANN > at > org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102) > at >
[jira] [Updated] (SPARK-46934) Unable to create Hive View from certain Spark Dataframe StructType
[ https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46934: -- Affects Version/s: 4.0.0 (was: 3.3.0) (was: 3.3.2) (was: 3.3.4) > Unable to create Hive View from certain Spark Dataframe StructType > -- > > Key: SPARK-46934 > URL: https://issues.apache.org/jira/browse/SPARK-46934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 > Environment: Tested in Spark 3.3.0, 3.3.2. >Reporter: Yu-Ting LIN >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > We are trying to create a Hive View using the following SQL command "CREATE OR > REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810". > Our table_2611810 has certain columns that contain special characters such as "/". > Here is the schema of this table. > {code:java} > contigName string > start bigint > end bigint > names array > referenceAllele string > alternateAlleles array > qual double > filters array > splitFromMultiAllelic boolean > INFO_NCAMP int > INFO_ODDRATIO double > INFO_NM double > INFO_DBSNP_CAF array > INFO_SPANPAIR int > INFO_TLAMP int > INFO_PSTD double > INFO_QSTD double > INFO_SBF double > INFO_AF array > INFO_QUAL double > INFO_SHIFT3 int > INFO_VARBIAS string > INFO_HICOV int > INFO_PMEAN double > INFO_MSI double > INFO_VD int > INFO_DP int > INFO_HICNT int > INFO_ADJAF double > INFO_SVLEN int > INFO_RSEQ string > INFO_MSigDb array > INFO_NMD array > INFO_ANN > array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>> > INFO_BIAS string > INFO_MQ double > INFO_HIAF double > INFO_END int > INFO_SPLITREAD int > INFO_GDAMP int > INFO_LSEQ string > INFO_LOF array > INFO_SAMPLE string > INFO_AMPFLAG int > INFO_SN double > INFO_SVTYPE string > INFO_TYPE string > INFO_MSILEN double > INFO_DUPRATE double > INFO_DBSNP_COMMON int > INFO_REFBIAS string > genotypes > array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>> > {code} > You can see that column INFO_ANN is an array of struct and it contains columns > which have "/" inside, such as "cDNA_pos/cDNA_length". > We believe this is the root cause of the following SparkException: > {code:java} > scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT > INFO_ANN FROM table_2611810") > 24/01/31 07:50:02.658 [main] WARN o.a.spark.sql.catalyst.util.package - > Truncated the string representation of a plan since it was too large. This > behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'. 
> org.apache.spark.SparkException: Cannot recognize hive type string: > array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>, > column: INFO_ANN > at > org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102) > at >
[jira] [Updated] (SPARK-46938) Remove javax-servlet-api exclusion rule for SBT
[ https://issues.apache.org/jira/browse/SPARK-46938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46938: -- Summary: Remove javax-servlet-api exclusion rule for SBT (was: Migrate jetty 10 to jetty 11) > Remove javax-servlet-api exclusion rule for SBT > --- > > Key: SPARK-46938 > URL: https://issues.apache.org/jira/browse/SPARK-46938 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: HiuFung Kwok >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46938) Migrate jetty 10 to jetty 11
[ https://issues.apache.org/jira/browse/SPARK-46938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46938. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45194 [https://github.com/apache/spark/pull/45194] > Migrate jetty 10 to jetty 11 > > > Key: SPARK-46938 > URL: https://issues.apache.org/jira/browse/SPARK-46938 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: HiuFung Kwok >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21595) introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 breaks existing workflow
[ https://issues.apache.org/jira/browse/SPARK-21595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819249#comment-17819249 ] Ramakrishna edited comment on SPARK-21595 at 2/21/24 1:34 PM: -- [~Rakesh_Shah] How did you manage to solve this? I am getting this in my streaming query; it does aggregations similar to other streaming queries in the same job. However, it fails and I get {"timestamp":"21/02/2024 07:11:35","logLevel":"ERROR","class":"MapOutputTracker","thread":"Executor task launch worker for task 25.0 in stage 2.1 (TID 75)","message":"Missing an output location for shuffle 5 partition 35"} Can you please help? [~tejasp] Can you please help? My spark version is 3.4.0 was (Author: hande): [~Rakesh_Shah] How did you manage to solve this? I am getting this in my streaming query; it does aggregations similar to other streaming queries in the same job. However, it fails and I get {"timestamp":"21/02/2024 07:11:35","logLevel":"ERROR","class":"MapOutputTracker","thread":"Executor task launch worker for task 25.0 in stage 2.1 (TID 75)","message":"Missing an output location for shuffle 5 partition 35"} Can you please help? [~tejasp] Can you please help? > introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 > breaks existing workflow > - > > Key: SPARK-21595 > URL: https://issues.apache.org/jira/browse/SPARK-21595 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.2.0 > Environment: pyspark on linux >Reporter: Stephan Reiling >Assignee: Tejas Patil >Priority: Minor > Labels: documentation, regression > Fix For: 2.2.1, 2.3.0 > > > My pyspark code has the following statement: > {code:java} > # assign row key for tracking > df = df.withColumn( > 'association_idx', > sqlf.row_number().over( > Window.orderBy('uid1', 'uid2') > ) > ) > {code} > where df is a long, skinny (450M rows, 10 columns) dataframe. So this creates > one large window for the whole dataframe to sort over. > In spark 2.1 this works without problem, in spark 2.2 this fails either with > out of memory exception or too many open files exception, depending on memory > settings (which is what I tried first to fix this). > Monitoring the blockmgr, I see that spark 2.1 creates 152 files, spark 2.2 > creates >110,000 files. > In the log I see the following messages (110,000 of these): > {noformat} > 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of > spilledRecords crossed the threshold 4096 > 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of > 64.1 MB to disk (0 time so far) > 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of > spilledRecords crossed the threshold 4096 > 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of > 64.1 MB to disk (1 time so far) > {noformat} > So I started hunting for clues in UnsafeExternalSorter, without luck. What I > had missed was this one message: > {noformat} > 17/08/01 08:55:37 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill > threshold of 4096 rows, switching to > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter > {noformat} > Which allowed me to track down the issue. > By changing the configuration to include: > {code:java} > spark.sql.windowExec.buffer.spill.threshold 2097152 > {code} > I got it to work again and with the same performance as spark 2.1.
> I have workflows where I use windowing functions that do not fail, but take a > performance hit due to the excessive spilling when using the default of 4096. > I think that, to make it easier to track down these issues, this config variable > should be included in the configuration documentation. > Maybe 4096 is too small of a default value? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21595) introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 breaks existing workflow
[ https://issues.apache.org/jira/browse/SPARK-21595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819249#comment-17819249 ] Ramakrishna commented on SPARK-21595: - [~Rakesh_Shah] How did you manage to solve this? I am getting this in my streaming query; it does aggregations similar to other streaming queries in the same job. However, it fails and I get {"timestamp":"21/02/2024 07:11:35","logLevel":"ERROR","class":"MapOutputTracker","thread":"Executor task launch worker for task 25.0 in stage 2.1 (TID 75)","message":"Missing an output location for shuffle 5 partition 35"} Can you please help? [~tejasp] Can you please help? > introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 > breaks existing workflow > - > > Key: SPARK-21595 > URL: https://issues.apache.org/jira/browse/SPARK-21595 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.2.0 > Environment: pyspark on linux >Reporter: Stephan Reiling >Assignee: Tejas Patil >Priority: Minor > Labels: documentation, regression > Fix For: 2.2.1, 2.3.0 > > > My pyspark code has the following statement: > {code:java} > # assign row key for tracking > df = df.withColumn( > 'association_idx', > sqlf.row_number().over( > Window.orderBy('uid1', 'uid2') > ) > ) > {code} > where df is a long, skinny (450M rows, 10 columns) dataframe. So this creates > one large window for the whole dataframe to sort over. > In spark 2.1 this works without problem, in spark 2.2 this fails either with > out of memory exception or too many open files exception, depending on memory > settings (which is what I tried first to fix this). > Monitoring the blockmgr, I see that spark 2.1 creates 152 files, spark 2.2 > creates >110,000 files. > In the log I see the following messages (110,000 of these): > {noformat} > 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of > spilledRecords crossed the threshold 4096 > 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of > 64.1 MB to disk (0 time so far) > 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of > spilledRecords crossed the threshold 4096 > 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of > 64.1 MB to disk (1 time so far) > {noformat} > So I started hunting for clues in UnsafeExternalSorter, without luck. What I > had missed was this one message: > {noformat} > 17/08/01 08:55:37 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill > threshold of 4096 rows, switching to > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter > {noformat} > Which allowed me to track down the issue. > By changing the configuration to include: > {code:java} > spark.sql.windowExec.buffer.spill.threshold 2097152 > {code} > I got it to work again and with the same performance as spark 2.1. > I have workflows where I use windowing functions that do not fail, but take a > performance hit due to the excessive spilling when using the default of 4096. > I think that, to make it easier to track down these issues, this config variable > should be included in the configuration documentation. > Maybe 4096 is too small of a default value? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
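For illustration, a hedged PySpark sketch of the workaround described in this issue; the session setup and tiny DataFrame are stand-ins, and 2097152 is simply the threshold value the reporter found to restore Spark 2.1-like behavior.
{code:python}
# Sketch: raising the window-operator spill threshold (workaround from this issue).
from pyspark.sql import SparkSession, functions as sqlf
from pyspark.sql.window import Window

spark = (
    SparkSession.builder
    .appName("window-spill-threshold-sketch")
    .config("spark.sql.windowExec.buffer.spill.threshold", "2097152")
    .getOrCreate()
)

# Toy stand-in for the reporter's 450M-row, 10-column DataFrame.
df = spark.createDataFrame([(1, 10), (2, 20), (1, 30)], ["uid1", "uid2"])

# One global window over the whole DataFrame, as in the original report.
df = df.withColumn(
    "association_idx",
    sqlf.row_number().over(Window.orderBy("uid1", "uid2")),
)
df.show()
{code}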
[jira] [Commented] (SPARK-43256) Assign a name to the error class _LEGACY_ERROR_TEMP_2021
[ https://issues.apache.org/jira/browse/SPARK-43256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819237#comment-17819237 ] A G commented on SPARK-43256: - I want to work on this. PR: [https://github.com/apache/spark/pull/45198] > Assign a name to the error class _LEGACY_ERROR_TEMP_2021 > > > Key: SPARK-43256 > URL: https://issues.apache.org/jira/browse/SPARK-43256 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Max Gekk >Priority: Minor > Labels: pull-request-available, starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2021* defined in > {*}core/src/main/resources/error/error-classes.json{*}. The name should be > short but complete (look at the example in error-classes.json). > Add a test which triggers the error from user code if such a test still doesn't > exist. Check exception fields by using {*}checkError(){*}. The latter function > checks the valuable error fields only, and avoids depending on the error text > message. In this way, tech editors can modify the error format in > error-classes.json without worrying about Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using a SQL query), replace > the error with an internal error; see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current one is not > clear. Propose a solution to users for how to avoid and fix such errors. > Please look at the PRs below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43256) Assign a name to the error class _LEGACY_ERROR_TEMP_2021
[ https://issues.apache.org/jira/browse/SPARK-43256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-43256: --- Labels: pull-request-available starter (was: starter) > Assign a name to the error class _LEGACY_ERROR_TEMP_2021 > > > Key: SPARK-43256 > URL: https://issues.apache.org/jira/browse/SPARK-43256 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Max Gekk >Priority: Minor > Labels: pull-request-available, starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2021* defined in > {*}core/src/main/resources/error/error-classes.json{*}. The name should be > short but complete (look at the example in error-classes.json). > Add a test which triggers the error from user code if such a test still doesn't > exist. Check exception fields by using {*}checkError(){*}. The latter function > checks the valuable error fields only, and avoids depending on the error text > message. In this way, tech editors can modify the error format in > error-classes.json without worrying about Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using a SQL query), replace > the error with an internal error; see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current one is not > clear. Propose a solution to users for how to avoid and fix such errors. > Please look at the PRs below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
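For orientation, a small PySpark sketch of what inspecting an error class and its message parameters from user code looks like (PySpark 3.4+); the query is illustrative and triggers an already-named error class rather than _LEGACY_ERROR_TEMP_2021 itself, and the checkError() the ticket refers to lives in Spark's Scala test utilities.
{code:python}
# Illustrative only: reading the error class and message parameters of a
# Spark error from PySpark. The query triggers TABLE_OR_VIEW_NOT_FOUND,
# a named error class, not the legacy one this ticket is about.
from pyspark.sql import SparkSession
from pyspark.errors import AnalysisException

spark = SparkSession.builder.getOrCreate()

try:
    spark.sql("SELECT * FROM nonexistent_table")
except AnalysisException as e:
    print(e.getErrorClass())         # TABLE_OR_VIEW_NOT_FOUND
    print(e.getMessageParameters())  # e.g. {'relationName': '`nonexistent_table`'}
{code}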
[jira] [Commented] (SPARK-46934) Unable to create Hive View from certain Spark Dataframe StructType
[ https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819232#comment-17819232 ] Kent Yao commented on SPARK-46934: -- Hi [~dongjoon], this is not a regression. As I commented above, I tried Hive 2.3.9, and I don't think Hive DDL supports '/' in these field names of struct type. > Unable to create Hive View from certain Spark Dataframe StructType > -- > > Key: SPARK-46934 > URL: https://issues.apache.org/jira/browse/SPARK-46934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.2, 3.3.4 > Environment: Tested in Spark 3.3.0, 3.3.2. >Reporter: Yu-Ting LIN >Assignee: Kent Yao >Priority: Blocker > Labels: pull-request-available > Fix For: 4.0.0 > > > We are trying to create a Hive View using the following SQL command "CREATE OR > REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810". > Our table_2611810 has certain columns that contain special characters such as "/". > Here is the schema of this table. > {code:java} > contigName string > start bigint > end bigint > names array > referenceAllele string > alternateAlleles array > qual double > filters array > splitFromMultiAllelic boolean > INFO_NCAMP int > INFO_ODDRATIO double > INFO_NM double > INFO_DBSNP_CAF array > INFO_SPANPAIR int > INFO_TLAMP int > INFO_PSTD double > INFO_QSTD double > INFO_SBF double > INFO_AF array > INFO_QUAL double > INFO_SHIFT3 int > INFO_VARBIAS string > INFO_HICOV int > INFO_PMEAN double > INFO_MSI double > INFO_VD int > INFO_DP int > INFO_HICNT int > INFO_ADJAF double > INFO_SVLEN int > INFO_RSEQ string > INFO_MSigDb array > INFO_NMD array > INFO_ANN > array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>> > INFO_BIAS string > INFO_MQ double > INFO_HIAF double > INFO_END int > INFO_SPLITREAD int > INFO_GDAMP int > INFO_LSEQ string > INFO_LOF array > INFO_SAMPLE string > INFO_AMPFLAG int > INFO_SN double > INFO_SVTYPE string > INFO_TYPE string > INFO_MSILEN double > INFO_DUPRATE double > INFO_DBSNP_COMMON int > INFO_REFBIAS string > genotypes > array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>> > {code} > You can see that column INFO_ANN is an array of struct, and it contains fields > whose names have "/" inside, such as "cDNA_pos/cDNA_length", etc. > We believe this is the root cause of the following SparkException: > {code:java} > scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT > INFO_ANN FROM table_2611810") > 24/01/31 07:50:02.658 [main] WARN o.a.spark.sql.catalyst.util.package - > Truncated the string representation of a plan since it was too large. This > behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> org.apache.spark.SparkException: Cannot recognize hive type string: > array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>, > column: INFO_ANN > at > org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at
[jira] [Created] (SPARK-47117) Format the whole code base
Jie Han created SPARK-47117: --- Summary: Format the whole code base Key: SPARK-47117 URL: https://issues.apache.org/jira/browse/SPARK-47117 Project: Spark Issue Type: Wish Components: Spark Core Affects Versions: 4.0.0 Reporter: Jie Han I tried scalafmt and found that it produces a very large diff :(. Do we need to format the whole code base in a PR? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46938) Migrate jetty 10 to jetty 11
[ https://issues.apache.org/jira/browse/SPARK-46938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46938: -- Assignee: Apache Spark (was: HiuFung Kwok) > Migrate jetty 10 to jetty 11 > > > Key: SPARK-46938 > URL: https://issues.apache.org/jira/browse/SPARK-46938 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47101) HiveExternalCatalog.verifyDataSchema does not fully comply with hive column name rules
[ https://issues.apache.org/jira/browse/SPARK-47101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47101: -- Assignee: (was: Apache Spark) > HiveExternalCatalog.verifyDataSchema does not fully comply with hive column > name rules > -- > > Key: SPARK-47101 > URL: https://issues.apache.org/jira/browse/SPARK-47101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47101) HiveExternalCatalog.verifyDataSchema does not fully comply with hive column name rules
[ https://issues.apache.org/jira/browse/SPARK-47101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47101: -- Assignee: Apache Spark > HiveExternalCatalog.verifyDataSchema does not fully comply with hive column > name rules > -- > > Key: SPARK-47101 > URL: https://issues.apache.org/jira/browse/SPARK-47101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47116) Install proper Python version in SparkR Windows build to avoid warnings
[ https://issues.apache.org/jira/browse/SPARK-47116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47116. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45196 [https://github.com/apache/spark/pull/45196] > Install proper Python version in SparkR Windows build to avoid warnings > --- > > Key: SPARK-47116 > URL: https://issues.apache.org/jira/browse/SPARK-47116 > Project: Spark > Issue Type: Test > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > https://github.com/HyukjinKwon/spark/actions/runs/7985005685/job/21802732830 > {code} > Traceback (most recent call last): > File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\runpy.py", line 183, > in _run_module_as_main > mod_name, mod_spec, code = _get_module_details(mod_name, _Error) > File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\runpy.py", line 109, > in _get_module_details > __import__(pkg_name) > File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\__init__.py", line 53, > in > File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\rdd.py", line 54, > in > File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\java_gateway.py", > line 33, in > File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\serializers.py", line > 69, in > File > "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\cloudpickle\__init__.py", > line 1, in > File > "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\cloudpickle\cloudpickle.py", > line 80, in > ImportError: cannot import name 'CellType' from 'types' > (C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\types.py) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
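The traceback points at the root cause: types.CellType exists only on Python 3.8+, while the Windows runner picked up Python 3.7.9. A one-line check, for illustration:
{code:python}
# types.CellType was added in Python 3.8; pyspark's vendored cloudpickle
# imports it at module load (cloudpickle.py line 80 in the log above), so
# importing pyspark under Python 3.7 fails with exactly this ImportError.
import sys
from types import CellType  # ImportError on Python <= 3.7

print(sys.version_info)
print(CellType)
{code}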
[jira] [Assigned] (SPARK-47116) Install proper Python version in SparkR Windows build to avoid warnings
[ https://issues.apache.org/jira/browse/SPARK-47116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47116: - Assignee: Hyukjin Kwon > Install proper Python version in SparkR Windows build to avoid warnings > --- > > Key: SPARK-47116 > URL: https://issues.apache.org/jira/browse/SPARK-47116 > Project: Spark > Issue Type: Test > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > > https://github.com/HyukjinKwon/spark/actions/runs/7985005685/job/21802732830 > {code} > Traceback (most recent call last): > File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\runpy.py", line 183, > in _run_module_as_main > mod_name, mod_spec, code = _get_module_details(mod_name, _Error) > File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\runpy.py", line 109, > in _get_module_details > __import__(pkg_name) > File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\__init__.py", line 53, > in > File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\rdd.py", line 54, > in > File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\java_gateway.py", > line 33, in > File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\serializers.py", line > 69, in > File > "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\cloudpickle\__init__.py", > line 1, in > File > "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\cloudpickle\cloudpickle.py", > line 80, in > ImportError: cannot import name 'CellType' from 'types' > (C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\types.py) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47114) In the spark driver pod. Failed to access the krb5 file
[ https://issues.apache.org/jira/browse/SPARK-47114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] melin updated SPARK-47114: -- Description: Spark runs in Kubernetes and accesses an external HDFS cluster (Kerberos); pod error logs: {code:java} Caused by: java.lang.IllegalArgumentException: KrbException: krb5.conf loading failed{code} This error generally occurs when the krb5 file cannot be found. [~yao] [~Qin Yao] {code:java} ./bin/spark-submit \ --master k8s://https://172.18.5.44:6443 \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.submission.waitAppCompletion=true \ --conf spark.kubernetes.driver.pod.name=spark-xxx \ --conf spark.kubernetes.executor.podNamePrefix=spark-executor-xxx \ --conf spark.kubernetes.driver.label.profile=production \ --conf spark.kubernetes.executor.label.profile=production \ --conf spark.kubernetes.namespace=superior \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ --conf spark.kubernetes.container.image=registry.cn-hangzhou.aliyuncs.com/melin1204/spark-jobserver:3.4.0 \ --conf spark.kubernetes.file.upload.path=hdfs://cdh1:8020/user/superior/kubernetes/ \ --conf spark.kubernetes.container.image.pullPolicy=Always \ --conf spark.kubernetes.container.image.pullSecrets=docker-reg-demos \ --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \ --conf spark.kerberos.principal=superior/ad...@datacyber.com \ --conf spark.kerberos.keytab=/root/superior.keytab \ file:///root/spark-3.4.2-bin-hadoop3/examples/jars/spark-examples_2.12-3.4.2.jar 5{code} {code:java} (base) [root@cdh1 ~]# kubectl logs spark-xxx -n superior Exception in thread "main" java.lang.IllegalArgumentException: Can't get Kerberos realm at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:71) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:315) at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:300) at org.apache.hadoop.security.UserGroupInformation.isAuthenticationMethodEnabled(UserGroupInformation.java:395) at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:389) at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:1119) at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:385) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.IllegalArgumentException: KrbException: krb5.conf loading failed at java.security.jgss/javax.security.auth.kerberos.KerberosPrincipal.(Unknown Source) at org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:120) at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:69) ...
13 more (base) [root@cdh1 ~]# kubectl describe pod spark-xxx -n superior Name: spark-xxx Namespace: superior Priority: 0 Service Account: spark Node: cdh2/172.18.5.45 Start Time: Wed, 21 Feb 2024 15:48:08 +0800 Labels: profile=production spark-app-name=spark-pi spark-app-selector=spark-728e24e49f9040fa86b04c521463020b spark-role=driver spark-version=3.4.2 Annotations: Status: Failed IP: 10.244.1.4 IPs: IP: 10.244.1.4 Containers: spark-kubernetes-driver: Container ID: containerd://cceaf13b70cc5f21a639e71cb8663989ec73e122380844624d4bfac3946bae15 Image: spark:3.4.1 Image ID: docker.io/library/spark@sha256:69fb485a0bcad88f9a2bf066e1b5d555f818126dc9df5a0b7e6a3b6d364bc694 Ports: 7078/TCP, 7079/TCP, 4040/TCP Host Ports: 0/TCP, 0/TCP, 0/TCP Args: driver --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal 5 State: Terminated Reason: Error Exit Code: 1 Started: Wed, 21 Feb 2024 15:49:54 +0800 Finished: Wed, 21 Feb 2024 15:49:56