[jira] [Assigned] (SPARK-39838) Passing an empty Metadata object to Column.as() should clear the metadata
[ https://issues.apache.org/jira/browse/SPARK-39838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39838: Assignee: (was: Apache Spark)

> Passing an empty Metadata object to Column.as() should clear the metadata
> -------------------------------------------------------------------------
>
> Key: SPARK-39838
> URL: https://issues.apache.org/jira/browse/SPARK-39838
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Kaya Kupferschmidt
> Priority: Major
>
> h2. Description
> The Spark DataFrame API allows developers to attach arbitrary metadata to
> individual columns as key/value pairs. The attachment is performed via the
> method "Column.as(name, metadata)". This works as expected as long as the
> metadata object is not empty. But when passing an empty metadata object, the
> final column in the resulting DataFrame will still hold the metadata of the
> original incoming column, i.e. you cannot use this method to reset the
> metadata of a column.
> This is not the expected behaviour and has changed in Spark 3.3.0. In Spark
> 3.2.1 and earlier, passing an empty metadata object to the method
> "Column.as(name, metadata)" resets the column's metadata as expected.
> h2. Steps to Reproduce
> The following code snippet shows the issue in the Spark shell:
> {code:scala}
> import org.apache.spark.sql.types.MetadataBuilder
>
> // Create a DataFrame with one column with Metadata attached
> val df1 = spark.range(1, 10)
>   .withColumn("col_with_metadata", col("id").as("col_with_metadata",
>     new MetadataBuilder().putString("metadata", "value").build()))
>
> // Create a derived DataFrame which should reset the metadata of the column
> val df2 = df1.select(col("col_with_metadata").as("col_without_metadata",
>   new MetadataBuilder().build()))
>
> // Display the metadata of both DataFrames' columns
> println(s"df1 metadata: ${df1.schema("col_with_metadata").metadata}")
> println(s"df2 metadata: ${df2.schema("col_without_metadata").metadata}")
> {code}
> This code prints the following lines to the console:
> {code}
> df1 metadata: {"metadata":"value"}
> df2 metadata: {"metadata":"value"}
> {code}
> This result does not meet my expectations. I expect df1 to have non-empty
> metadata and df2 to have empty metadata. But this is not the case: df2
> still holds the same metadata as df1.
> h2. Analysis
> I think the problem stems from the changes to the method
> "trimNonTopLevelAliases" in the class AliasHelper:
> {code:scala}
> protected def trimNonTopLevelAliases[T <: Expression](e: T): T = {
>   val res = e match {
>     case a: Alias =>
>       val metadata = if (a.metadata == Metadata.empty) {
>         None
>       } else {
>         Some(a.metadata)
>       }
>       a.copy(child = trimAliases(a.child))(
>         exprId = a.exprId,
>         qualifier = a.qualifier,
>         explicitMetadata = metadata,
>         nonInheritableMetadataKeys = a.nonInheritableMetadataKeys)
>     case a: MultiAlias =>
>       a.copy(child = trimAliases(a.child))
>     case other => trimAliases(other)
>   }
>   res.asInstanceOf[T]
> }
> {code}
> The method removes any empty metadata object from an Alias, which in turn
> means that the Alias will inherit its child's metadata.
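The core of the analysis above is that `Some(Metadata.empty)` and `None` mean different things for an Alias but are collapsed into one. The following minimal, self-contained sketch (plain Scala with hypothetical toy classes, not Spark's actual internals) models why that collapse loses the reset semantics:

```scala
// Toy model of alias metadata resolution (illustrative only; these are not
// Spark's real classes). explicitMetadata = None means "inherit from the
// child", while Some(Metadata.empty) means "explicitly reset to empty".
case class Metadata(entries: Map[String, String])
object Metadata { val empty: Metadata = Metadata(Map.empty) }

case class Column(name: String, metadata: Metadata)

// Resolve the metadata an aliased column ends up with.
def resolve(child: Column, explicitMetadata: Option[Metadata]): Metadata =
  explicitMetadata.getOrElse(child.metadata)

// What the Spark 3.3.0 trimNonTopLevelAliases logic effectively does:
// an explicitly empty metadata object is collapsed to None.
def buggyTrim(explicit: Option[Metadata]): Option[Metadata] =
  explicit.filter(_ != Metadata.empty)

val child = Column("id", Metadata(Map("metadata" -> "value")))
val requested: Option[Metadata] = Some(Metadata.empty) // user asks for a reset

// Without the trim, the reset works as expected:
assert(resolve(child, requested) == Metadata.empty)

// After the buggy trim, the child's metadata is inherited instead:
assert(resolve(child, buggyTrim(requested)) == child.metadata)
```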
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39838) Passing an empty Metadata object to Column.as() should clear the metadata
[ https://issues.apache.org/jira/browse/SPARK-39838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569843#comment-17569843 ] Apache Spark commented on SPARK-39838:

User 'kupferk' has created a pull request for this issue: https://github.com/apache/spark/pull/37251
[jira] [Assigned] (SPARK-39838) Passing an empty Metadata object to Column.as() should clear the metadata
[ https://issues.apache.org/jira/browse/SPARK-39838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39838:

Assignee: Apache Spark
[jira] [Commented] (SPARK-39838) Passing an empty Metadata object to Column.as() should clear the metadata
[ https://issues.apache.org/jira/browse/SPARK-39838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569842#comment-17569842 ] Kaya Kupferschmidt commented on SPARK-39838:

Please find a PR at https://github.com/apache/spark/pull/37251
[jira] [Updated] (SPARK-39838) Passing an empty Metadata object to Column.as() should clear the metadata
[ https://issues.apache.org/jira/browse/SPARK-39838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kaya Kupferschmidt updated SPARK-39838:

Description updated.
[jira] [Created] (SPARK-39838) Passing an empty Metadata object to Column.as() should clear the metadata
Kaya Kupferschmidt created SPARK-39838:

Summary: Passing an empty Metadata object to Column.as() should clear the metadata
Key: SPARK-39838
URL: https://issues.apache.org/jira/browse/SPARK-39838
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.3.0
Reporter: Kaya Kupferschmidt
[jira] [Assigned] (SPARK-39835) Fix EliminateSorts remove global sort below the local sort
[ https://issues.apache.org/jira/browse/SPARK-39835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39835: Assignee: Apache Spark

> Fix EliminateSorts remove global sort below the local sort
> ----------------------------------------------------------
>
> Key: SPARK-39835
> URL: https://issues.apache.org/jira/browse/SPARK-39835
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: XiDuo You
> Assignee: Apache Spark
> Priority: Major
>
> If a global sort is below a local sort, we should not remove the global sort,
> because the output partitioning can be affected.
> This issue is going to get worse once we pull out the V1 Write sort to the
> logical side.
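The plan shape described above can be sketched with a short DataFrame snippet (an illustrative sketch assuming a running SparkSession named `spark`; this is not code from the actual PR):

```scala
// A global sort (orderBy) followed by a per-partition local sort
// (sortWithinPartitions). The global sort range-partitions the data, so
// eliminating it as "redundant" would change the output partitioning even
// though the local sort still runs.
import org.apache.spark.sql.functions.col

val df = spark.range(100)
  .withColumn("key", col("id") % 10)

val sorted = df
  .orderBy("key")               // global sort: establishes range partitioning
  .sortWithinPartitions("id")   // local sort within each partition

sorted.explain()                // the global Sort must not be eliminated
```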
[jira] [Commented] (SPARK-39835) Fix EliminateSorts remove global sort below the local sort
[ https://issues.apache.org/jira/browse/SPARK-39835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569773#comment-17569773 ] Apache Spark commented on SPARK-39835:

User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/37250
[jira] [Assigned] (SPARK-39835) Fix EliminateSorts remove global sort below the local sort
[ https://issues.apache.org/jira/browse/SPARK-39835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39835:

Assignee: (was: Apache Spark)
[jira] [Updated] (SPARK-39837) Filesystem leak when running `TPC-DS queries with SF=1`
[ https://issues.apache.org/jira/browse/SPARK-39837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-39837: Description:

The following log appears in the `TPC-DS queries with SF=1` GA logs:

{code:java}
2022-07-22T00:19:52.8539664Z 00:19:52.849 WARN org.apache.spark.DebugFilesystem: Leaked filesystem connection created at:
2022-07-22T00:19:52.8548926Z java.lang.Throwable
2022-07-22T00:19:52.8568135Z	at org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:35)
2022-07-22T00:19:52.8573547Z	at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:75)
2022-07-22T00:19:52.8574108Z	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
2022-07-22T00:19:52.8578427Z	at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
2022-07-22T00:19:52.8579211Z	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:774)
2022-07-22T00:19:52.8589698Z	at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:100)
2022-07-22T00:19:52.8590842Z	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:175)
2022-07-22T00:19:52.8594751Z	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$1(ParquetFileFormat.scala:340)
2022-07-22T00:19:52.8595634Z	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:211)
2022-07-22T00:19:52.8598975Z	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:272)
2022-07-22T00:19:52.8599639Z	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:118)
2022-07-22T00:19:52.8602839Z	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:583)
2022-07-22T00:19:52.8603625Z	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.columnartorow_nextBatch_0$(Unknown Source)
2022-07-22T00:19:52.8606618Z	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown Source)
2022-07-22T00:19:52.8609954Z	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
2022-07-22T00:19:52.8620028Z	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
2022-07-22T00:19:52.8623148Z	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
2022-07-22T00:19:52.8623812Z	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
2022-07-22T00:19:52.8627344Z	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
2022-07-22T00:19:52.8628031Z	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
2022-07-22T00:19:52.8637881Z	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
2022-07-22T00:19:52.8638603Z	at org.apache.spark.scheduler.Task.run(Task.scala:139)
2022-07-22T00:19:52.8644696Z	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
2022-07-22T00:19:52.8645352Z	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1490)
2022-07-22T00:19:52.8649598Z	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
2022-07-22T00:19:52.8650238Z	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2022-07-22T00:19:52.8657783Z	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2022-07-22T00:19:52.8658260Z	at java.lang.Thread.run(Thread.java:750)
{code}

Similar logs appear in the following Actions runs:
* https://github.com/apache/spark/runs/7460003953?check_suite_focus=true
* https://github.com/apache/spark/runs/7459868605?check_suite_focus=true
* https://github.com/apache/spark/runs/7460262731?check_suite_focus=true
[jira] [Created] (SPARK-39837) Filesystem leak when running `TPC-DS queries with SF=1`
Yang Jie created SPARK-39837: Summary: Filesystem leak when running `TPC-DS queries with SF=1` Key: SPARK-39837 URL: https://issues.apache.org/jira/browse/SPARK-39837 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 3.4.0 Reporter: Yang Jie The following log appears in the `TPC-DS queries with SF=1` GA logs: {code:java}
2022-07-22T00:48:19.8046575Z 00:48:19.800 WARN org.apache.spark.DebugFilesystem: Leaked filesystem connection created at:
2022-07-22T00:48:19.8183197Z java.lang.Throwable
2022-07-22T00:48:19.8209541Z at org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:35)
2022-07-22T00:48:19.8364870Z at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:75)
2022-07-22T00:48:19.8429477Z at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
2022-07-22T00:48:19.8440381Z at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
2022-07-22T00:48:19.8463114Z at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:774)
2022-07-22T00:48:19.8483110Z at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:100)
2022-07-22T00:48:19.8492740Z at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:175)
2022-07-22T00:48:19.8507149Z at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$1(ParquetFileFormat.scala:340)
2022-07-22T00:48:19.8525518Z at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:211)
2022-07-22T00:48:19.8536791Z at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:272)
2022-07-22T00:48:19.8542997Z at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:118)
2022-07-22T00:48:19.8548773Z at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:583)
2022-07-22T00:48:19.8552000Z at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.columnartorow_nextBatch_0$(Unknown Source)
2022-07-22T00:48:19.8561197Z at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown Source)
2022-07-22T00:48:19.8564920Z at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
2022-07-22T00:48:19.8570921Z at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
2022-07-22T00:48:19.8578211Z at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
2022-07-22T00:48:19.8581739Z at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
2022-07-22T00:48:19.8588053Z at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
2022-07-22T00:48:19.8591953Z at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
2022-07-22T00:48:19.8599896Z at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
2022-07-22T00:48:19.8605778Z at org.apache.spark.scheduler.Task.run(Task.scala:139)
2022-07-22T00:48:19.8609467Z at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
2022-07-22T00:48:19.8610083Z at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1490)
2022-07-22T00:48:19.8614645Z at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
2022-07-22T00:48:19.8616327Z at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2022-07-22T00:48:19.8620080Z at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2022-07-22T00:48:19.8620695Z at java.lang.Thread.run(Thread.java:750) {code}
The following GitHub Actions runs have similar logs: * [https://github.com/apache/spark/runs/7460003953?check_suite_focus=true] * [https://github.com/apache/spark/runs/7459868605?check_suite_focus=true] * https://github.com/apache/spark/runs/7460262731?check_suite_focus=true -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
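The leak warning above comes from Spark's DebugFilesystem, which records a `Throwable` at every `open()` and warns about streams that are never closed. The bookkeeping can be sketched as a minimal plain-Python analogue (names like `DebugOpener` are illustrative, not Spark's API):

```python
import traceback

class TrackedStream:
    """A stand-in stream that reports back to its opener when closed."""
    def __init__(self, opener, data):
        self._opener = opener
        self.data = data
        self.closed = False

    def close(self):
        if not self.closed:
            self.closed = True
            # Forget this stream so it is no longer considered leaked.
            self._opener._open_streams.pop(id(self), None)

class DebugOpener:
    """Track open streams plus the stack that opened them, loosely modelled
    on DebugFilesystem, which stores a Throwable at open() time."""
    def __init__(self):
        self._open_streams = {}  # id(stream) -> stack formatted at open()

    def open(self, data):
        stream = TrackedStream(self, data)
        # Capture where the stream was opened so a leak can be attributed.
        self._open_streams[id(stream)] = "".join(traceback.format_stack())
        return stream

    def leaked(self):
        """Opening stacks of streams that were opened but never closed."""
        return list(self._open_streams.values())

opener = DebugOpener()
s1 = opener.open(b"closed properly")
s1.close()
s2 = opener.open(b"leaked")   # never closed, so it shows up as a leak
print(len(opener.leaked()))   # 1
```

This is why the warning prints a full stack trace: it is the stack captured at `open()`, pointing at the reader that forgot to close its stream.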
[jira] [Commented] (SPARK-39836) Simplify V2ExpressionBuilder by extract common method.
[ https://issues.apache.org/jira/browse/SPARK-39836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569771#comment-17569771 ] Apache Spark commented on SPARK-39836: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/37249 > Simplify V2ExpressionBuilder by extract common method. > -- > > Key: SPARK-39836 > URL: https://issues.apache.org/jira/browse/SPARK-39836 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Currently, V2ExpressionBuilder has a lot of similar code; we can extract > it into one common method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39836) Simplify V2ExpressionBuilder by extract common method.
[ https://issues.apache.org/jira/browse/SPARK-39836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39836: Assignee: Apache Spark > Simplify V2ExpressionBuilder by extract common method. > -- > > Key: SPARK-39836 > URL: https://issues.apache.org/jira/browse/SPARK-39836 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > Currently, V2ExpressionBuilder has a lot of similar code; we can extract > it into one common method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39836) Simplify V2ExpressionBuilder by extract common method.
[ https://issues.apache.org/jira/browse/SPARK-39836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39836: Assignee: (was: Apache Spark) > Simplify V2ExpressionBuilder by extract common method. > -- > > Key: SPARK-39836 > URL: https://issues.apache.org/jira/browse/SPARK-39836 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Currently, V2ExpressionBuilder has a lot of similar code; we can extract > it into one common method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39836) Simplify V2ExpressionBuilder by extract common method.
[ https://issues.apache.org/jira/browse/SPARK-39836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569770#comment-17569770 ] Apache Spark commented on SPARK-39836: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/37249 > Simplify V2ExpressionBuilder by extract common method. > -- > > Key: SPARK-39836 > URL: https://issues.apache.org/jira/browse/SPARK-39836 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Currently, V2ExpressionBuilder has a lot of similar code; we can extract > it into one common method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39836) Simplify V2ExpressionBuilder by extract common method.
jiaan.geng created SPARK-39836: -- Summary: Simplify V2ExpressionBuilder by extract common method. Key: SPARK-39836 URL: https://issues.apache.org/jira/browse/SPARK-39836 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: jiaan.geng Currently, V2ExpressionBuilder has a lot of similar code; we can extract it into one common method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-39819) DS V2 aggregate push down can work with Top N or Paging (Sort with group column)
[ https://issues.apache.org/jira/browse/SPARK-39819 ] jiaan.geng deleted comment on SPARK-39819: was (Author: beliefer): I'm working on. > DS V2 aggregate push down can work with Top N or Paging (Sort with group > column) > > > Key: SPARK-39819 > URL: https://issues.apache.org/jira/browse/SPARK-39819 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Currently, DS V2 aggregate push-down cannot work with Top N (order by ... > limit ...) or Paging (order by ... limit ... offset ...). > If it can work with Top N or Paging, performance will be better. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39835) Fix EliminateSorts remove global sort below the local sort
XiDuo You created SPARK-39835: - Summary: Fix EliminateSorts remove global sort below the local sort Key: SPARK-39835 URL: https://issues.apache.org/jira/browse/SPARK-39835 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You If a global sort is below a local sort, we should not remove the global sort, because the output partitioning can be affected. This issue is going to get worse once we pull the V1 Write sort out to the logical side. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
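The report can be illustrated with a toy model (plain Python, not Spark code): a global sort range-partitions the rows, so the output partitioning depends on it, while a local sort only orders rows within each existing partition. Dropping the global sort under a local sort still leaves every partition locally sorted, but changes which rows land in which partition:

```python
def global_sort(rows, num_partitions):
    """Sort all rows, then split into contiguous ranges: this fixes both
    the total order and the output partitioning (like range partitioning)."""
    rows = sorted(rows)
    step = (len(rows) + num_partitions - 1) // num_partitions
    return [rows[i:i + step] for i in range(0, len(rows), step)]

def local_sort(partitions):
    """Sort within each partition only; partition membership is untouched."""
    return [sorted(p) for p in partitions]

data = [5, 1, 4, 2, 3, 6]
initial = [data[0:3], data[3:6]]      # some arbitrary input partitioning

with_global = local_sort(global_sort(data, 2))
without_global = local_sort(initial)  # as if the global sort were eliminated

print(with_global)     # [[1, 2, 3], [4, 5, 6]]
print(without_global)  # [[1, 4, 5], [2, 3, 6]]
```

Both results are locally sorted, so a rule that only checks per-partition ordering would consider the global sort redundant, yet the partitioning of the output differs.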
[jira] [Assigned] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released
[ https://issues.apache.org/jira/browse/SPARK-39831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39831: Assignee: Yikun Jiang > R dependencies installation start to fail after devtools_2.4.4 was released > --- > > Key: SPARK-39831 > URL: https://issues.apache.org/jira/browse/SPARK-39831 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.1.4, 3.4.0, 3.3.1, 3.2.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released
[ https://issues.apache.org/jira/browse/SPARK-39831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39831. -- Fix Version/s: 3.3.1 3.1.4 3.2.3 3.4.0 Resolution: Fixed Issue resolved by pull request 37247 [https://github.com/apache/spark/pull/37247] > R dependencies installation start to fail after devtools_2.4.4 was released > --- > > Key: SPARK-39831 > URL: https://issues.apache.org/jira/browse/SPARK-39831 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > Fix For: 3.3.1, 3.1.4, 3.2.3, 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39834) Include the origin stats and constraints for LogicalRDD if it comes from DataFrame
[ https://issues.apache.org/jira/browse/SPARK-39834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39834: Assignee: Apache Spark > Include the origin stats and constraints for LogicalRDD if it comes from > DataFrame > -- > > Key: SPARK-39834 > URL: https://issues.apache.org/jira/browse/SPARK-39834 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Major > > With SPARK-39748, Spark includes the origin logical plan for LogicalRDD if it > comes from a DataFrame, to carry over stats as well as to provide > information to possibly connect two disconnected logical plans into one. > After we introduced the change, we found several issues: > 1. One of the major use cases for DataFrame.checkpoint is ML, especially > "iterative algorithms", whose purpose is to "prune" the logical plan. That > works against the purpose of including the origin logical plan, and we risk > creating nested LogicalRDDs, which grows the size of the logical plan infinitely. > 2. We leverage the logical plan to carry over stats, but the correct stats > information is in the optimized plan. > 3. (Not an issue, but a missed spot) constraints are also something we can carry > over. > To address the above issues, it would be better to include the stats and > constraints in LogicalRDD rather than the logical plan. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39834) Include the origin stats and constraints for LogicalRDD if it comes from DataFrame
[ https://issues.apache.org/jira/browse/SPARK-39834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39834: Assignee: (was: Apache Spark) > Include the origin stats and constraints for LogicalRDD if it comes from > DataFrame > -- > > Key: SPARK-39834 > URL: https://issues.apache.org/jira/browse/SPARK-39834 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Priority: Major > > With SPARK-39748, Spark includes the origin logical plan for LogicalRDD if it > comes from a DataFrame, to carry over stats as well as to provide > information to possibly connect two disconnected logical plans into one. > After we introduced the change, we found several issues: > 1. One of the major use cases for DataFrame.checkpoint is ML, especially > "iterative algorithms", whose purpose is to "prune" the logical plan. That > works against the purpose of including the origin logical plan, and we risk > creating nested LogicalRDDs, which grows the size of the logical plan infinitely. > 2. We leverage the logical plan to carry over stats, but the correct stats > information is in the optimized plan. > 3. (Not an issue, but a missed spot) constraints are also something we can carry > over. > To address the above issues, it would be better to include the stats and > constraints in LogicalRDD rather than the logical plan. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39834) Include the origin stats and constraints for LogicalRDD if it comes from DataFrame
[ https://issues.apache.org/jira/browse/SPARK-39834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569762#comment-17569762 ] Apache Spark commented on SPARK-39834: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/37248 > Include the origin stats and constraints for LogicalRDD if it comes from > DataFrame > -- > > Key: SPARK-39834 > URL: https://issues.apache.org/jira/browse/SPARK-39834 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Priority: Major > > With SPARK-39748, Spark includes the origin logical plan for LogicalRDD if it > comes from a DataFrame, to carry over stats as well as to provide > information to possibly connect two disconnected logical plans into one. > After we introduced the change, we found several issues: > 1. One of the major use cases for DataFrame.checkpoint is ML, especially > "iterative algorithms", whose purpose is to "prune" the logical plan. That > works against the purpose of including the origin logical plan, and we risk > creating nested LogicalRDDs, which grows the size of the logical plan infinitely. > 2. We leverage the logical plan to carry over stats, but the correct stats > information is in the optimized plan. > 3. (Not an issue, but a missed spot) constraints are also something we can carry > over. > To address the above issues, it would be better to include the stats and > constraints in LogicalRDD rather than the logical plan. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39834) Include the origin stats and constraints for LogicalRDD if it comes from DataFrame
Jungtaek Lim created SPARK-39834: Summary: Include the origin stats and constraints for LogicalRDD if it comes from DataFrame Key: SPARK-39834 URL: https://issues.apache.org/jira/browse/SPARK-39834 Project: Spark Issue Type: Improvement Components: SQL, Structured Streaming Affects Versions: 3.4.0 Reporter: Jungtaek Lim With SPARK-39748, Spark includes the origin logical plan for LogicalRDD if it comes from a DataFrame, to carry over stats as well as to provide information to possibly connect two disconnected logical plans into one. After we introduced the change, we found several issues: 1. One of the major use cases for DataFrame.checkpoint is ML, especially "iterative algorithms", whose purpose is to "prune" the logical plan. That works against the purpose of including the origin logical plan, and we risk creating nested LogicalRDDs, which grows the size of the logical plan infinitely. 2. We leverage the logical plan to carry over stats, but the correct stats information is in the optimized plan. 3. (Not an issue, but a missed spot) constraints are also something we can carry over. To address the above issues, it would be better to include the stats and constraints in LogicalRDD rather than the logical plan. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released
[ https://issues.apache.org/jira/browse/SPARK-39831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569735#comment-17569735 ] Apache Spark commented on SPARK-39831: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37247 > R dependencies installation start to fail after devtools_2.4.4 was released > --- > > Key: SPARK-39831 > URL: https://issues.apache.org/jira/browse/SPARK-39831 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39826) Bump scalatest-maven-plugin to 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-39826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39826. -- Resolution: Fixed Issue resolved by pull request 37237 [https://github.com/apache/spark/pull/37237] > Bump scalatest-maven-plugin to 2.1.0 > > > Key: SPARK-39826 > URL: https://issues.apache.org/jira/browse/SPARK-39826 > Project: Spark > Issue Type: Test > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39826) Bump scalatest-maven-plugin to 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-39826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39826: Assignee: BingKun Pan > Bump scalatest-maven-plugin to 2.1.0 > > > Key: SPARK-39826 > URL: https://issues.apache.org/jira/browse/SPARK-39826 > Project: Spark > Issue Type: Test > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released
[ https://issues.apache.org/jira/browse/SPARK-39831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569725#comment-17569725 ] Hyukjin Kwon commented on SPARK-39831: -- Reverted in: https://github.com/apache/spark/commit/2bec66177de36d449dd6adebd8b6dd227ef40726 https://github.com/apache/spark/commit/248f34e46d591396f32bed79730a7b5b3141e7e9 https://github.com/apache/spark/commit/f344bf97265306b50ab79e465535054e2d582877 https://github.com/apache/spark/commit/b54d985223e07963db4b62a00dd29ebd012382ad > R dependencies installation start to fail after devtools_2.4.4 was released > --- > > Key: SPARK-39831 > URL: https://issues.apache.org/jira/browse/SPARK-39831 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released
[ https://issues.apache.org/jira/browse/SPARK-39831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39831: - Fix Version/s: (was: 3.1.4) (was: 3.4.0) (was: 3.3.1) (was: 3.2.3) > R dependencies installation start to fail after devtools_2.4.4 was released > --- > > Key: SPARK-39831 > URL: https://issues.apache.org/jira/browse/SPARK-39831 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39622) ParquetIOSuite fails intermittently on master branch
[ https://issues.apache.org/jira/browse/SPARK-39622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-39622. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37245 [https://github.com/apache/spark/pull/37245] > ParquetIOSuite fails intermittently on master branch > > > Key: SPARK-39622 > URL: https://issues.apache.org/jira/browse/SPARK-39622 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.0 > > > "SPARK-7837 Do not close output writer twice when commitTask() fails" in > ParquetIOSuite fails intermittently with master branch. > Assertion error follows: > {code} > "Job aborted due to stage failure: Authorized committer (attemptNumber=0, > stage=1, partition=0) failed; but task commit success, data duplication may > happen." did not contain "Intentional exception for testing purposes" > ScalaTestFailureLocation: > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite at > (ParquetIOSuite.scala:1216) > org.scalatest.exceptions.TestFailedException: "Job aborted due to stage > failure: Authorized committer (attemptNumber=0, stage=1, partition=0) failed; > but task commit success, data duplication may happen." 
did not contain > "Intentional exception for testing purposes" > at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259(ParquetIOSuite.scala:1216) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259$adapted(ParquetIOSuite.scala:1209) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66) > at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:33) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$256(ParquetIOSuite.scala:1209) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(ParquetIOSuite.scala:56) > at > org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:247) > at > org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:245) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.withSQLConf(ParquetIOSuite.scala:56) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$255(ParquetIOSuite.scala:1190) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at 
org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203) > at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188) > at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200) > at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64) > at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) > at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) > at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64) > at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233) > at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > at scala.collection.immuta
[jira] [Assigned] (SPARK-39622) ParquetIOSuite fails intermittently on master branch
[ https://issues.apache.org/jira/browse/SPARK-39622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-39622: - Assignee: Yang Jie > ParquetIOSuite fails intermittently on master branch > > > Key: SPARK-39622 > URL: https://issues.apache.org/jira/browse/SPARK-39622 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Assignee: Yang Jie >Priority: Major > > "SPARK-7837 Do not close output writer twice when commitTask() fails" in > ParquetIOSuite fails intermittently with master branch. > Assertion error follows: > {code} > "Job aborted due to stage failure: Authorized committer (attemptNumber=0, > stage=1, partition=0) failed; but task commit success, data duplication may > happen." did not contain "Intentional exception for testing purposes" > ScalaTestFailureLocation: > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite at > (ParquetIOSuite.scala:1216) > org.scalatest.exceptions.TestFailedException: "Job aborted due to stage > failure: Authorized committer (attemptNumber=0, stage=1, partition=0) failed; > but task commit success, data duplication may happen." 
did not contain > "Intentional exception for testing purposes" > at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259(ParquetIOSuite.scala:1216) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259$adapted(ParquetIOSuite.scala:1209) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66) > at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:33) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$256(ParquetIOSuite.scala:1209) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(ParquetIOSuite.scala:56) > at > org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:247) > at > org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:245) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.withSQLConf(ParquetIOSuite.scala:56) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$255(ParquetIOSuite.scala:1190) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at 
org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203) > at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188) > at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200) > at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64) > at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) > at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) > at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64) > at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233) > at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > at scala.collection.immutable.List.foreach(List.scala:431) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at org.scalatest.Supe
[jira] [Created] (SPARK-39833) Filtered parquet data frame count() and show() produce inconsistent results when spark.sql.parquet.filterPushdown is true
Michael Allman created SPARK-39833: -- Summary: Filtered parquet data frame count() and show() produce inconsistent results when spark.sql.parquet.filterPushdown is true Key: SPARK-39833 URL: https://issues.apache.org/jira/browse/SPARK-39833 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Reporter: Michael Allman One of our data scientists discovered a problem wherein a data frame `.show()` call printed non-empty results, but `.count()` printed 0. I've narrowed the issue to a small, reproducible test case which exhibits this aberrant behavior. In pyspark, run the following code: {code:python} from pyspark.sql.types import * parquet_pushdown_bug_df = spark.createDataFrame([{"COL0": int(0)}], schema=StructType(fields=[StructField("COL0",IntegerType(),True)])) parquet_pushdown_bug_df.repartition(1).write.mode("overwrite").parquet("parquet_pushdown_bug/col0=0/parquet_pushdown_bug.parquet") reread_parquet_pushdown_bug_df = spark.read.parquet("parquet_pushdown_bug") reread_parquet_pushdown_bug_df.filter("col0 = 0").show() print(reread_parquet_pushdown_bug_df.filter("col0 = 0").count()) {code} In my usage, this prints a data frame with 1 row and a count of 0. However, disabling `spark.sql.parquet.filterPushdown` produces consistent results: {code:python} spark.conf.set("spark.sql.parquet.filterPushdown", False) reread_parquet_pushdown_bug_df.filter("col0 = 0").show() reread_parquet_pushdown_bug_df.filter("col0 = 0").count() {code} This will print the same data frame, however it will print a count of 1. The key to triggering this bug is not just enabling `spark.sql.parquet.filterPushdown` (which is enabled by default). The case of the column in the data frame (before writing) must differ from the case of the partition column in the file path, i.e. COL0 versus col0 or col0 versus COL0. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39832) regexp_replace should support column arguments
Brian Schaefer created SPARK-39832: -- Summary: regexp_replace should support column arguments Key: SPARK-39832 URL: https://issues.apache.org/jira/browse/SPARK-39832 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Brian Schaefer {{F.regexp_replace}} in PySpark currently only supports strings for the second and third argument: [https://github.com/apache/spark/blob/1df6006ea977ae3b8c53fe33630e277e8c1bc49c/python/pyspark/sql/functions.py#L3265] In Scala, columns are also supported: [https://github.com/apache/spark/blob/1df6006ea977ae3b8c53fe33630e277e8c1bc49c/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2836] The desire to use columns as arguments for the function has been raised previously on Stack Overflow: [https://stackoverflow.com/questions/64613761/in-pyspark-using-regexp-replace-how-to-replace-a-group-with-value-from-another], where the suggested fix was to use {{F.expr}}. It should be relatively straightforward to support in PySpark the two function signatures supported in Scala. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
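The per-row semantics the request asks for (pattern and replacement drawn from other columns of the same row, rather than literal strings) can be sketched in plain Python with `re.sub`; the row layout and column names here are illustrative, and no Spark is involved:

```python
import re

def regexp_replace_columns(rows, str_col, pattern_col, replacement_col):
    """Per-row regexp_replace where the pattern and the replacement come
    from other columns of the same row (the column-argument overload)."""
    return [re.sub(row[pattern_col], row[replacement_col], row[str_col])
            for row in rows]

rows = [
    {"s": "abc123", "pat": r"\d+", "rep": "#"},
    {"s": "hello world", "pat": "world", "rep": "spark"},
]
print(regexp_replace_columns(rows, "s", "pat", "rep"))
# ['abc#', 'hello spark']
```

This is what the Scala overload taking three `Column` arguments computes per row, and what the Stack Overflow workaround reaches through `F.expr`, since the underlying SQL function already accepts column expressions.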
[jira] [Commented] (SPARK-39830) Reading ORC table that requires type promotion may throw AIOOBE
[ https://issues.apache.org/jira/browse/SPARK-39830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569605#comment-17569605 ] Dongjoon Hyun commented on SPARK-39830: --- Thank you, [~dzcxzl] . > Reading ORC table that requires type promotion may throw AIOOBE > --- > > Key: SPARK-39830 > URL: https://issues.apache.org/jira/browse/SPARK-39830 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: dzcxzl >Priority: Trivial > > We can add a UT to test the scenario after the ORC-1205 release. > > bin/spark-shell > {code:java} > spark.sql("set orc.stripe.size=10240") > spark.sql("set orc.rows.between.memory.checks=1") > spark.sql("set spark.sql.orc.columnarWriterBatchSize=1") > val df = spark.range(1, 1+512, 1, 1).map { i => > if( i == 1 ){ > (i, Array.fill[Byte](5 * 1024 * 1024)('X')) > } else { > (i,Array.fill[Byte](1)('X')) > } > }.toDF("c1","c2") > df.write.format("orc").save("file:///tmp/test_table_orc_t1") > spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) > location 'file:///tmp/test_table_orc_t1' stored as orc ") > spark.sql("select * from test_table_orc_t1").show() {code} > Querying this table will get the following exception > {code:java} > java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387) > at > org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740) > at > org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069) > at > org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77) > at > 
org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > {code} >
[jira] [Updated] (SPARK-39822) Provides a good error during create Index with different dtype elements
[ https://issues.apache.org/jira/browse/SPARK-39822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39822: - Parent: SPARK-39581 Issue Type: Sub-task (was: Bug) > Provides a good error during create Index with different dtype elements > --- > > Key: SPARK-39822 > URL: https://issues.apache.org/jira/browse/SPARK-39822 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.2 >Reporter: bo zhao >Priority: Minor > > PANDAS > > {code:java} > >>> import pandas as pd >>> pd.Index([1,2,'3',4]) Index([1, 2, '3', 4], > >>> dtype='object') >>> > {code} > PYSPARK > > > {code:java} > Using Python version 3.8.13 (default, Jun 29 2022 11:50:19) > Spark context Web UI available at http://172.25.179.45:4042 > Spark context available as 'sc' (master = local[*], app id = > local-1658301116572). > SparkSession available as 'spark'. > >>> from pyspark import pandas as ps > WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It > is required to set this environment variable to '1' in both driver and > executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you > but it does not work if there is a Spark context already launched. 
> >>> ps.Index([1,2,'3',4]) > Traceback (most recent call last): > File "", line 1, in > File "/home/spark/spark/python/pyspark/pandas/indexes/base.py", line 184, > in __new__ > ps.from_pandas( > File "/home/spark/spark/python/pyspark/pandas/namespace.py", line 155, in > from_pandas > return DataFrame(pd.DataFrame(index=pobj)).index > File "/home/spark/spark/python/pyspark/pandas/frame.py", line 463, in > __init__ > internal = InternalFrame.from_pandas(pdf) > File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1469, in > from_pandas > ) = InternalFrame.prepare_pandas_frame(pdf, > prefer_timestamp_ntz=prefer_timestamp_ntz) > File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1570, in > prepare_pandas_frame > spark_type = infer_pd_series_spark_type(reset_index[col], dtype, > prefer_timestamp_ntz) > File "/home/spark/spark/python/pyspark/pandas/typedef/typehints.py", line > 360, in infer_pd_series_spark_type > return from_arrow_type(pa.Array.from_pandas(pser).type, > prefer_timestamp_ntz) > File "pyarrow/array.pxi", line 1033, in pyarrow.lib.Array.from_pandas > File "pyarrow/array.pxi", line 312, in pyarrow.lib.array > File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array > File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: Could not convert '3' with type str: tried to > convert to int64 > {code} > I understand that pyspark pandas need the dtype to be the same, but we need a > good error msg or something to tell the user how to avoid. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
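The "good error" the reporter asks for amounts to a pre-check on element types before handing the data to Arrow. The helper below is a hypothetical pure-Python sketch of such a check and its message; it is not the actual pyspark.pandas implementation, and `check_uniform_dtype` is an invented name.

```python
def check_uniform_dtype(values):
    """Raise an actionable error when elements do not share a single type.

    Hypothetical pre-check sketching the error message SPARK-39822 requests.
    """
    kinds = {type(v).__name__ for v in values}
    if len(kinds) > 1:
        raise TypeError(
            f"Cannot infer a single dtype from mixed element types {sorted(kinds)}; "
            "cast all elements to one type first, e.g. [str(v) for v in values]."
        )

check_uniform_dtype([1, 2, 3, 4])       # uniform ints: passes silently
try:
    check_uniform_dtype([1, 2, '3', 4])  # mixed int/str, as in the report
except TypeError as e:
    print(e)
```

The message points the user at the fix (cast to one dtype), which is what the raw pyarrow `ArrowInvalid` in the traceback above fails to do.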
[jira] [Commented] (SPARK-39791) In Spark 3.0 standalone cluster mode, unable to customize driver JVM path
[ https://issues.apache.org/jira/browse/SPARK-39791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569428#comment-17569428 ] Hyukjin Kwon commented on SPARK-39791: -- Is this a regression? > In Spark 3.0 standalone cluster mode, unable to customize driver JVM path > - > > Key: SPARK-39791 > URL: https://issues.apache.org/jira/browse/SPARK-39791 > Project: Spark > Issue Type: Question > Components: Spark Submit >Affects Versions: 3.0.0 >Reporter: Obobj >Priority: Minor > Labels: spark-submit, standalone > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > In Spark 3.0 standalone mode, the driver JVM path cannot be customized; instead, the JAVA_HOME of the machine running spark-submit is used, but the JVM paths on my submission machine and the cluster machines differ. > {code:java} > launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java > List<String> buildJavaCommand(String extraClassPath) throws IOException { > List<String> cmd = new ArrayList<>(); > String firstJavaHome = firstNonEmpty(javaHome, > childEnv.get("JAVA_HOME"), > System.getenv("JAVA_HOME"), > System.getProperty("java.home")); {code}
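The quoted `buildJavaCommand` snippet picks the first non-empty candidate in order, which is why the submitter's JAVA_HOME wins over anything on the cluster machines. A minimal Python sketch of that resolution order follows; the paths are hypothetical and only illustrate the fallback chain.

```python
def first_non_empty(*candidates):
    """Mimic the launcher's firstNonEmpty: return the first truthy candidate."""
    for c in candidates:
        if c:  # skips None and ""
            return c
    return None

# Resolution order from AbstractCommandBuilder.buildJavaCommand:
java_home = first_non_empty(
    None,                    # javaHome set via the launcher API (unset here)
    None,                    # JAVA_HOME from the child environment (unset here)
    "/opt/jdk8-submitter",   # JAVA_HOME on the spark-submit machine (hypothetical)
    "/usr/lib/jvm/default",  # java.home system property (hypothetical)
)
assert java_home == "/opt/jdk8-submitter"
```

Because the environment of the spark-submit machine sits ahead of the JVM's own `java.home`, the driver launched in cluster mode inherits the submitter's path rather than a cluster-local one.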
[jira] [Commented] (SPARK-39815) ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
[ https://issues.apache.org/jira/browse/SPARK-39815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569426#comment-17569426 ] Hyukjin Kwon commented on SPARK-39815: -- Does this cause any actual issue? or just error log? > ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM > > > Key: SPARK-39815 > URL: https://issues.apache.org/jira/browse/SPARK-39815 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 > Environment: Ubuntu 20.04 > Python 3.8.10 > Java 8 > >Reporter: Marzieh >Priority: Major > > h3. I have a Spark SQL Program which run in Spark Cluster. Even though the > program is finished without any Error, after finishing State of the > Application becomes Killed. It shows this error on the log: > > 22/07/19 13:14:50 INFO CoarseGrainedExecutorBackend: Got assigned task 10686 > 22/07/19 13:14:50 INFO Executor: Running task 232.0 in stage 137.0 (TID 10686) > 22/07/19 13:14:50 INFO CoarseGrainedExecutorBackend: Got assigned task 10687 > 22/07/19 13:14:50 INFO Executor: Running task 233.0 in stage 137.0 (TID 10687) > 22/07/19 13:14:50 INFO PythonRunner: Times: total = 6, boot = 2, init = 4, > finish = 0 > 22/07/19 13:14:50 INFO Executor: Finished task 232.0 in stage 137.0 (TID > 10686). 1785 bytes result sent to driver > 22/07/19 13:14:50 INFO PythonRunner: Times: total = 10, boot = 9, init = 1, > finish = 0 > 22/07/19 13:14:50 INFO Executor: Finished task 233.0 in stage 137.0 (TID > 10687). 
1785 bytes result sent to driver > 22/07/19 13:14:50 INFO CoarseGrainedExecutorBackend: Got assigned task 10688 > 22/07/19 13:14:50 INFO Executor: Running task 234.0 in stage 137.0 (TID 10688) > 22/07/19 13:14:50 INFO CoarseGrainedExecutorBackend: Got assigned task 10689 > 22/07/19 13:14:50 INFO Executor: Running task 235.0 in stage 137.0 (TID 10689) > 22/07/19 13:14:50 INFO PythonRunner: Times: total = 1, boot = 1, init = 0, > finish = 0 > 22/07/19 13:14:50 INFO Executor: Finished task 235.0 in stage 137.0 (TID > 10689). 1785 bytes result sent to driver > 22/07/19 13:14:50 INFO CoarseGrainedExecutorBackend: Got assigned task 10690 > 22/07/19 13:14:50 INFO Executor: Running task 236.0 in stage 137.0 (TID 10690) > 22/07/19 13:14:50 WARN JdbcUtils: Requested isolation level 1 is not > supported; falling back to default isolation level > 2 22/07/19 13:14:50 INFO PythonRunner: Times: total = 42, boot = -13, init = > 55, finish = 0 > 22/07/19 13:14:50 INFO Executor: Finished task 231.0 in stage 137.0 (TID > 10685). 1785 bytes result sent to driver > 22/07/19 13:14:50 INFO CoarseGrainedExecutorBackend: Got assigned task 10691 > 22/07/19 13:14:50 INFO Executor: Running task 237.0 in stage 137.0 (TID 10691) > 22/07/19 13:14:50 INFO PythonRunner: Times: total = 43, boot = -4, init = 47, > finish = 0 > 22/07/19 13:14:50 INFO Executor: Finished task 234.0 in stage 137.0 (TID > 10688). 1785 bytes result sent to driver > 22/07/19 13:14:50 INFO CoarseGrainedExecutorBackend: Got assigned task 10692 > 22/07/19 13:14:50 INFO Executor: Running task 238.0 in stage 137.0 (TID 10692) > 22/07/19 13:14:50 INFO PythonRunner: Times: total = 43, boot = 2, init = 41, > finish = 0 > 22/07/19 13:14:50 INFO Executor: Finished task 236.0 in stage 137.0 (TID > 10690). 
1785 bytes result sent to driver > 22/07/19 13:14:50 INFO CoarseGrainedExecutorBackend: Got assigned task 10693 > 22/07/19 13:14:50 INFO Executor: Running task 239.0 in stage 137.0 (TID 10693) > 22/07/19 13:14:50 INFO JDBCRDD: closed connection 22/07/19 13:14:50 INFO > PythonRunner: Times: total = 44, boot = 3, init = 41, finish = 0 > 22/07/19 13:14:50 INFO Executor: Finished task 237.0 in stage 137.0 (TID > 10691). 1785 bytes result sent to driver > 22/07/19 13:14:50 INFO PythonRunner: Times: total = 44, boot = 2, init = 42, > finish = 0 > 22/07/19 13:14:50 INFO Executor: Finished task 238.0 in stage 137.0 (TID > 10692). 1785 bytes result sent to driver > 22/07/19 13:14:50 INFO Executor: Finished task 219.0 in stage 137.0 (TID > 10673). 1785 bytes result sent to driver > 22/07/19 13:14:50 INFO PythonRunner: Times: total = 42, boot = 2, init = 40, > finish = 0 > 22/07/19 13:14:50 INFO Executor: Finished task 239.0 in stage 137.0 (TID > 10693). 1785 bytes result sent to driver > 22/07/19 13:14:50 INFO CoarseGrainedExecutorBackend: Driver commanded a > shutdown > 22/07/19 13:14:50 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39817) Missing sbin scripts in PySpark packages
[ https://issues.apache.org/jira/browse/SPARK-39817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569425#comment-17569425 ] Hyukjin Kwon commented on SPARK-39817: -- pip is designed for use within Python. I would prefer to avoid having people create a Spark cluster via pip. > Missing sbin scripts in PySpark packages > > > Key: SPARK-39817 > URL: https://issues.apache.org/jira/browse/SPARK-39817 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: F. H. >Priority: Major > Labels: easyfix > Original Estimate: 5m > Remaining Estimate: 5m > > In the PySpark setup.py, only a subset of all scripts is included. > I am in particular missing the `submit-all.sh` script: > {code:python} > package_data={ > 'pyspark.jars': ['*.jar'], > 'pyspark.bin': ['*'], > 'pyspark.sbin': ['spark-config.sh', 'spark-daemon.sh', > 'start-history-server.sh', > 'stop-history-server.sh', ], > [...] > }, > {code} > > The solution is simple: just change 'pyspark.sbin' to: > {code:python} > 'pyspark.sbin': ['*'], > {code} > > I would happily submit a PR on GitHub, but I have no clue about the organizational details. > It would be great to get this backported to PySpark 3.2.x and 3.3.x soon.
[jira] [Assigned] (SPARK-39829) Upgrade log4j2 to 2.18.0
[ https://issues.apache.org/jira/browse/SPARK-39829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39829: Assignee: Dongjoon Hyun > Upgrade log4j2 to 2.18.0 > > > Key: SPARK-39829 > URL: https://issues.apache.org/jira/browse/SPARK-39829 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major >
[jira] [Resolved] (SPARK-39829) Upgrade log4j2 to 2.18.0
[ https://issues.apache.org/jira/browse/SPARK-39829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39829. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37242 [https://github.com/apache/spark/pull/37242] > Upgrade log4j2 to 2.18.0 > > > Key: SPARK-39829 > URL: https://issues.apache.org/jira/browse/SPARK-39829 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > >
[jira] [Updated] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released
[ https://issues.apache.org/jira/browse/SPARK-39831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39831: - Fix Version/s: 3.1.4 3.4.0 3.3.1 3.2.3 > R dependencies installation start to fail after devtools_2.4.4 was released > --- > > Key: SPARK-39831 > URL: https://issues.apache.org/jira/browse/SPARK-39831 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > Fix For: 3.1.4, 3.4.0, 3.3.1, 3.2.3 > >
[jira] [Commented] (SPARK-7837) NPE when save as parquet in speculative tasks
[ https://issues.apache.org/jira/browse/SPARK-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569399#comment-17569399 ] Apache Spark commented on SPARK-7837: - User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37245 > NPE when save as parquet in speculative tasks > - > > Key: SPARK-7837 > URL: https://issues.apache.org/jira/browse/SPARK-7837 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Yin Huai >Assignee: Cheng Lian >Priority: Critical > Fix For: 1.5.0 > > > The query is like {{df.orderBy(...).saveAsTable(...)}}. > When there is no partitioning columns and there is a skewed key, I found the > following exception in speculative tasks. After these failures, seems we > could not call {{SparkHadoopMapRedUtil.commitTask}} correctly. > {code} > java.lang.NullPointerException > at > parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146) > at > parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112) > at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73) > at > org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:115) > at > org.apache.spark.sql.sources.DefaultWriterContainer.abortTask(commands.scala:385) > at > org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:150) > at > org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122) > at > org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) > at org.apache.spark.scheduler.Task.run(Task.scala:70) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code}
[jira] [Commented] (SPARK-7837) NPE when save as parquet in speculative tasks
[ https://issues.apache.org/jira/browse/SPARK-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569397#comment-17569397 ] Apache Spark commented on SPARK-7837: - User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37245 > NPE when save as parquet in speculative tasks > - > > Key: SPARK-7837 > URL: https://issues.apache.org/jira/browse/SPARK-7837 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Yin Huai >Assignee: Cheng Lian >Priority: Critical > Fix For: 1.5.0 > > > The query is like {{df.orderBy(...).saveAsTable(...)}}. > When there is no partitioning columns and there is a skewed key, I found the > following exception in speculative tasks. After these failures, seems we > could not call {{SparkHadoopMapRedUtil.commitTask}} correctly. > {code} > java.lang.NullPointerException > at > parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146) > at > parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112) > at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73) > at > org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:115) > at > org.apache.spark.sql.sources.DefaultWriterContainer.abortTask(commands.scala:385) > at > org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:150) > at > org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122) > at > org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) > at org.apache.spark.scheduler.Task.run(Task.scala:70) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code}
[jira] [Assigned] (SPARK-39622) ParquetIOSuite fails intermittently on master branch
[ https://issues.apache.org/jira/browse/SPARK-39622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39622: Assignee: (was: Apache Spark) > ParquetIOSuite fails intermittently on master branch > > > Key: SPARK-39622 > URL: https://issues.apache.org/jira/browse/SPARK-39622 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Priority: Major > > "SPARK-7837 Do not close output writer twice when commitTask() fails" in > ParquetIOSuite fails intermittently with master branch. > Assertion error follows: > {code} > "Job aborted due to stage failure: Authorized committer (attemptNumber=0, > stage=1, partition=0) failed; but task commit success, data duplication may > happen." did not contain "Intentional exception for testing purposes" > ScalaTestFailureLocation: > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite at > (ParquetIOSuite.scala:1216) > org.scalatest.exceptions.TestFailedException: "Job aborted due to stage > failure: Authorized committer (attemptNumber=0, stage=1, partition=0) failed; > but task commit success, data duplication may happen." 
did not contain > "Intentional exception for testing purposes" > at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259(ParquetIOSuite.scala:1216) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259$adapted(ParquetIOSuite.scala:1209) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66) > at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:33) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$256(ParquetIOSuite.scala:1209) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(ParquetIOSuite.scala:56) > at > org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:247) > at > org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:245) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.withSQLConf(ParquetIOSuite.scala:56) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$255(ParquetIOSuite.scala:1190) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at 
org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203) > at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188) > at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200) > at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64) > at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) > at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) > at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64) > at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233) > at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > at scala.collection.immutable.List.foreach(List.scala:431) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at org.scalatest.SuperEngine.runTestsInB
[jira] [Commented] (SPARK-39622) ParquetIOSuite fails intermittently on master branch
[ https://issues.apache.org/jira/browse/SPARK-39622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569396#comment-17569396 ] Apache Spark commented on SPARK-39622: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37245 > ParquetIOSuite fails intermittently on master branch > > > Key: SPARK-39622 > URL: https://issues.apache.org/jira/browse/SPARK-39622 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Priority: Major > > "SPARK-7837 Do not close output writer twice when commitTask() fails" in > ParquetIOSuite fails intermittently with master branch. > Assertion error follows: > {code} > "Job aborted due to stage failure: Authorized committer (attemptNumber=0, > stage=1, partition=0) failed; but task commit success, data duplication may > happen." did not contain "Intentional exception for testing purposes" > ScalaTestFailureLocation: > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite at > (ParquetIOSuite.scala:1216) > org.scalatest.exceptions.TestFailedException: "Job aborted due to stage > failure: Authorized committer (attemptNumber=0, stage=1, partition=0) failed; > but task commit success, data duplication may happen." 
did not contain > "Intentional exception for testing purposes" > at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259(ParquetIOSuite.scala:1216) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259$adapted(ParquetIOSuite.scala:1209) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66) > at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:33) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$256(ParquetIOSuite.scala:1209) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(ParquetIOSuite.scala:56) > at > org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:247) > at > org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:245) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.withSQLConf(ParquetIOSuite.scala:56) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$255(ParquetIOSuite.scala:1190) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at 
org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203) > at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188) > at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200) > at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64) > at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) > at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) > at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64) > at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233) > at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > at scala.collection.immutable.List.foreach(List.scala:431) >
[jira] [Assigned] (SPARK-39622) ParquetIOSuite fails intermittently on master branch
[ https://issues.apache.org/jira/browse/SPARK-39622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39622: Assignee: Apache Spark > ParquetIOSuite fails intermittently on master branch > > > Key: SPARK-39622 > URL: https://issues.apache.org/jira/browse/SPARK-39622 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Major > > "SPARK-7837 Do not close output writer twice when commitTask() fails" in > ParquetIOSuite fails intermittently with master branch. > Assertion error follows: > {code} > "Job aborted due to stage failure: Authorized committer (attemptNumber=0, > stage=1, partition=0) failed; but task commit success, data duplication may > happen." did not contain "Intentional exception for testing purposes" > ScalaTestFailureLocation: > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite at > (ParquetIOSuite.scala:1216) > org.scalatest.exceptions.TestFailedException: "Job aborted due to stage > failure: Authorized committer (attemptNumber=0, stage=1, partition=0) failed; > but task commit success, data duplication may happen." 
did not contain > "Intentional exception for testing purposes" > at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259(ParquetIOSuite.scala:1216) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259$adapted(ParquetIOSuite.scala:1209) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66) > at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:33) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$256(ParquetIOSuite.scala:1209) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) > at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(ParquetIOSuite.scala:56) > at > org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:247) > at > org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:245) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.withSQLConf(ParquetIOSuite.scala:56) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$255(ParquetIOSuite.scala:1190) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at 
org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203) > at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188) > at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200) > at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64) > at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) > at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) > at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64) > at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233) > at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > at scala.collection.immutable.List.foreach(List.scala:431) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at org.scalates
[jira] [Commented] (SPARK-39622) ParquetIOSuite fails intermittently on master branch
[ https://issues.apache.org/jira/browse/SPARK-39622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569395#comment-17569395 ] Apache Spark commented on SPARK-39622: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37245 > ParquetIOSuite fails intermittently on master branch > > > Key: SPARK-39622 > URL: https://issues.apache.org/jira/browse/SPARK-39622 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Priority: Major > > "SPARK-7837 Do not close output writer twice when commitTask() fails" in > ParquetIOSuite fails intermittently with master branch. > Assertion error follows: > {code} > "Job aborted due to stage failure: Authorized committer (attemptNumber=0, > stage=1, partition=0) failed; but task commit success, data duplication may > happen." did not contain "Intentional exception for testing purposes" > ScalaTestFailureLocation: > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite at > (ParquetIOSuite.scala:1216) > org.scalatest.exceptions.TestFailedException: "Job aborted due to stage > failure: Authorized committer (attemptNumber=0, stage=1, partition=0) failed; > but task commit success, data duplication may happen." 
did not contain > "Intentional exception for testing purposes" > {code}
[jira] [Commented] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released
[ https://issues.apache.org/jira/browse/SPARK-39831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569393#comment-17569393 ] Apache Spark commented on SPARK-39831: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37243 > R dependencies installation start to fail after devtools_2.4.4 was released > --- > > Key: SPARK-39831 > URL: https://issues.apache.org/jira/browse/SPARK-39831 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released
[ https://issues.apache.org/jira/browse/SPARK-39831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39831: Assignee: Apache Spark > R dependencies installation start to fail after devtools_2.4.4 was released > --- > > Key: SPARK-39831 > URL: https://issues.apache.org/jira/browse/SPARK-39831 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released
[ https://issues.apache.org/jira/browse/SPARK-39831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39831: Assignee: (was: Apache Spark) > R dependencies installation start to fail after devtools_2.4.4 was released > --- > > Key: SPARK-39831 > URL: https://issues.apache.org/jira/browse/SPARK-39831 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major >
[jira] [Created] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released
Ruifeng Zheng created SPARK-39831: - Summary: R dependencies installation start to fail after devtools_2.4.4 was released Key: SPARK-39831 URL: https://issues.apache.org/jira/browse/SPARK-39831 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.4.0 Reporter: Ruifeng Zheng
[jira] [Commented] (SPARK-39830) Reading ORC table that requires type promotion may throw AIOOBE
[ https://issues.apache.org/jira/browse/SPARK-39830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569385#comment-17569385 ] dzcxzl commented on SPARK-39830: cc @[~dongjoon] > Reading ORC table that requires type promotion may throw AIOOBE > --- > > Key: SPARK-39830 > URL: https://issues.apache.org/jira/browse/SPARK-39830 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: dzcxzl >Priority: Trivial > > We can add a UT to test the scenario after the ORC-1205 release. > > bin/spark-shell > {code:java} > spark.sql("set orc.stripe.size=10240") > spark.sql("set orc.rows.between.memory.checks=1") > spark.sql("set spark.sql.orc.columnarWriterBatchSize=1") > val df = spark.range(1, 1+512, 1, 1).map { i => > if( i == 1 ){ > (i, Array.fill[Byte](5 * 1024 * 1024)('X')) > } else { > (i,Array.fill[Byte](1)('X')) > } > }.toDF("c1","c2") > df.write.format("orc").save("file:///tmp/test_table_orc_t1") > spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) > location 'file:///tmp/test_table_orc_t1' stored as orc ") > spark.sql("select * from test_table_orc_t1").show() {code} > Querying this table will get the following exception > {code:java} > java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387) > at > org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740) > at > org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069) > at > org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77) > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371) > 
at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > {code} >
[jira] [Updated] (SPARK-39830) Reading ORC table that requires type promotion may throw AIOOBE
[ https://issues.apache.org/jira/browse/SPARK-39830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-39830: --- Description: We can add a UT to test the scenario after the ORC-1205 release. bin/spark-shell {code:java} spark.sql("set orc.stripe.size=10240") spark.sql("set orc.rows.between.memory.checks=1") spark.sql("set spark.sql.orc.columnarWriterBatchSize=1") val df = spark.range(1, 1+512, 1, 1).map { i => if( i == 1 ){ (i, Array.fill[Byte](5 * 1024 * 1024)('X')) } else { (i,Array.fill[Byte](1)('X')) } }.toDF("c1","c2") df.write.format("orc").save("file:///tmp/test_table_orc_t1") spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) location 'file:///tmp/test_table_orc_t1' stored as orc ") spark.sql("select * from test_table_orc_t1").show() {code} Querying this table will get the following exception {code:java} java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387) at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740) at org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069) at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100) at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77) at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371) at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84) at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) {code} was: {code:java} spark.sql("set orc.stripe.size=10240") spark.sql("set 
orc.rows.between.memory.checks=1") spark.sql("set spark.sql.orc.columnarWriterBatchSize=1") val df = spark.range(1, 1+512, 1, 1).map { i => if( i == 1 ){ (i, Array.fill[Byte](5 * 1024 * 1024)('X')) } else { (i,Array.fill[Byte](1)('X')) } }.toDF("c1","c2") df.write.format("orc").save("file:///tmp/test_table_orc_t1") spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) location 'file:///tmp/test_table_orc_t1' stored as orc ") spark.sql("select * from test_table_orc_t1").show() {code} Querying this table will get the following exception {code:java} java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387) at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740) at org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069) at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100) at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77) at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371) at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84) at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) {code} We can add a UT to test the scenario after the [ORC-1205|https://issues.apache.org/jira/browse/ORC-1205] release > Reading ORC table that requires type promotion may throw AIOOBE > --- > > Key: SPARK-39830 > URL: https://issues.apache.org/jira/browse/SPARK-39830 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: dzcxzl >Priority: Trivial > > We can 
add a UT to test the scenario after the ORC-1205 release. > > bin/spark-shell > {code:java} > spark.sql("set orc.stripe.size=10240") > spark.sql("set orc.rows.between.memory.checks=1") > spark.sql("set spark.sql.orc.columnarWriterBatchSize=1") > val df = spark.range(1, 1+512, 1, 1).map { i => > if( i == 1 ){ > (i, Array.fill[Byte](5 * 1024 * 1024)('X')) > } else { > (i,Array.fill[Byte](1)('X')) > } > }.toDF("c1","c2") > df.write.format("orc").save("file:///tmp/test_table_orc_t1") > spark.sql("create external
[jira] [Created] (SPARK-39830) Reading ORC table that requires type promotion may throw AIOOBE
dzcxzl created SPARK-39830: -- Summary: Reading ORC table that requires type promotion may throw AIOOBE Key: SPARK-39830 URL: https://issues.apache.org/jira/browse/SPARK-39830 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: dzcxzl {code:java} spark.sql("set orc.stripe.size=10240") spark.sql("set orc.rows.between.memory.checks=1") spark.sql("set spark.sql.orc.columnarWriterBatchSize=1") val df = spark.range(1, 1+512, 1, 1).map { i => if( i == 1 ){ (i, Array.fill[Byte](5 * 1024 * 1024)('X')) } else { (i,Array.fill[Byte](1)('X')) } }.toDF("c1","c2") df.write.format("orc").save("file:///tmp/test_table_orc_t1") spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) location 'file:///tmp/test_table_orc_t1' stored as orc ") spark.sql("select * from test_table_orc_t1").show() {code} Querying this table will get the following exception {code:java} java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387) at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740) at org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069) at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100) at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77) at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371) at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84) at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) {code} We can add a UT to test the scenario after the 
[ORC-1205|https://issues.apache.org/jira/browse/ORC-1205] release
[jira] [Commented] (SPARK-38597) Enable resource limited spark k8s IT in GA
[ https://issues.apache.org/jira/browse/SPARK-38597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569382#comment-17569382 ] Apache Spark commented on SPARK-38597: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37244 > Enable resource limited spark k8s IT in GA > -- > > Key: SPARK-38597 > URL: https://issues.apache.org/jira/browse/SPARK-38597 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major >
[jira] [Commented] (SPARK-38691) Use error classes in the compilation errors of column/attr resolving
[ https://issues.apache.org/jira/browse/SPARK-38691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569352#comment-17569352 ] Goutam Ghosh commented on SPARK-38691: -- I am working on this > Use error classes in the compilation errors of column/attr resolving > > > Key: SPARK-38691 > URL: https://issues.apache.org/jira/browse/SPARK-38691 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryCompilationErrors: > * cannotResolveUserSpecifiedColumnsError > * cannotResolveStarExpandGivenInputColumnsError > * cannotResolveAttributeError > * cannotResolveColumnGivenInputColumnsError > * cannotResolveColumnNameAmongAttributesError > * cannotResolveColumnNameAmongFieldsError > to use error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryCompilationErrorsSuite.
[jira] [Resolved] (SPARK-39827) add_months() returns a java error on overflow
[ https://issues.apache.org/jira/browse/SPARK-39827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-39827. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37240 [https://github.com/apache/spark/pull/37240] > add_months() returns a java error on overflow > - > > Key: SPARK-39827 > URL: https://issues.apache.org/jira/browse/SPARK-39827 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > The code below throws a Java exception, see > {code:java} > spark.sql("SET spark.sql.ansi.enabled=true").show()spark.sql("SELECT > add_months('550-12-31', 1000)").show()java.lang.ArithmeticException: > integer overflow at java.base/java.lang.Math.toIntExact(Math.java:1074) at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.localDateToDays(DateTimeUtils.scala:550) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.dateAddMonths(DateTimeUtils.scala:736) > {code} > but it should throw Spark's exception w/ an error class.
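The quoted stack trace points at Math.toIntExact inside DateTimeUtils.localDateToDays, which narrows LocalDate.toEpochDay (a Long) to an Int. A minimal standalone sketch of that narrowing, using only the JDK time API (no Spark; the chosen year is just an illustrative value far beyond the Int epoch-day range):

```scala
import java.time.LocalDate

// Epoch-day counts only fit in an Int up to roughly the year 5,881,580;
// for dates beyond that, Math.toIntExact throws the raw
// java.lang.ArithmeticException seen in the report instead of a Spark error.
val farFuture = LocalDate.of(6000000, 12, 31)
val overflowed =
  try { Math.toIntExact(farFuture.toEpochDay); false }
  catch { case _: ArithmeticException => true }
println(overflowed)
```

The fix tracked here is about wrapping exactly this failure in a SparkThrowable with a proper error class rather than changing the arithmetic itself.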
[jira] [Commented] (SPARK-39622) ParquetIOSuite fails intermittently on master branch
[ https://issues.apache.org/jira/browse/SPARK-39622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569324#comment-17569324 ] Yang Jie commented on SPARK-39622: -- The test suite may have become flaky after SPARK-39195 was merged. I reverted it and ran "SPARK-7837 Do not close output writer twice when commitTask() fails" dozens of times without failure. Still investigating the root cause. [~kabhwan] [~hyukjin.kwon] [~cloud_fan] > ParquetIOSuite fails intermittently on master branch > > > Key: SPARK-39622 > URL: https://issues.apache.org/jira/browse/SPARK-39622 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Priority: Major > > "SPARK-7837 Do not close output writer twice when commitTask() fails" in > ParquetIOSuite fails intermittently with master branch. > Assertion error follows: > {code} > "Job aborted due to stage failure: Authorized committer (attemptNumber=0, > stage=1, partition=0) failed; but task commit success, data duplication may > happen." did not contain "Intentional exception for testing purposes" > ScalaTestFailureLocation: > org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite at > (ParquetIOSuite.scala:1216) > org.scalatest.exceptions.TestFailedException: "Job aborted due to stage > failure: Authorized committer (attemptNumber=0, stage=1, partition=0) failed; > but task commit success, data duplication may happen." 
did not contain > "Intentional exception for testing purposes" > {code}
[jira] [Resolved] (SPARK-39469) Infer date type for CSV schema inference
[ https://issues.apache.org/jira/browse/SPARK-39469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-39469. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36871 [https://github.com/apache/spark/pull/36871] > Infer date type for CSV schema inference > > > Key: SPARK-39469 > URL: https://issues.apache.org/jira/browse/SPARK-39469 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: Jonathan Cui >Priority: Major > Fix For: 3.4.0 > > > 1. If a column contains only dates, it should be of “date” type in the > inferred schema > * If the date format and the timestamp format are identical (e.g. both are > /mm/dd), entries will default to being interpreted as Date > 2. If a column contains dates and timestamps, it should be of “timestamp” > type in the inferred schema > > A similar issue was opened in the past but was reverted due to the lack of > strict pattern matching.
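The two numbered rules in the SPARK-39469 description amount to a column-wise fold over per-cell types. A hypothetical sketch of that merge rule (plain Scala, not Spark's actual inference code; the type names here are illustrative):

```scala
// Per-cell inferred types: a cell is either a date or a timestamp.
sealed trait CellType
case object DateCell extends CellType
case object TimestampCell extends CellType

// Rule 1: a column of only dates stays a date column.
// Rule 2: a single timestamp cell promotes the whole column to timestamp.
def mergeTypes(a: CellType, b: CellType): CellType =
  if (a == TimestampCell || b == TimestampCell) TimestampCell else DateCell

val onlyDates = Seq(DateCell, DateCell, DateCell)
val mixed     = Seq(DateCell, TimestampCell, DateCell)
println(onlyDates.reduce(mergeTypes))
println(mixed.reduce(mergeTypes))
```

The earlier, reverted attempt mentioned at the end of the description failed on lenient parsing, not on this merge step: without strict pattern matching, timestamp-like strings can misparse as dates before the fold ever runs.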
[jira] [Updated] (SPARK-38916) Tasks not killed caused by race conditions between killTask() and launchTask()
[ https://issues.apache.org/jira/browse/SPARK-38916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-38916: Fix Version/s: (was: 3.2.2) > Tasks not killed caused by race conditions between killTask() and launchTask() > -- > > Key: SPARK-38916 > URL: https://issues.apache.org/jira/browse/SPARK-38916 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Wei Xue >Assignee: Wei Xue >Priority: Minor > Fix For: 3.3.0 > > > Sometimes when the scheduler tries to cancel a task right after it launches > that task on the executor, the KillTask and LaunchTask events can come in a > reversed order, causing the task to escape the kill-task signal and finish > "secretly". Those tasks even show as un-launched tasks in the Spark UI.
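The reordering described in SPARK-38916 can be illustrated without any Spark machinery. This is a generic sketch, not Spark's implementation: a naive executor that only kills tasks it already sees as running drops an early KillTask, while remembering pending kills closes the window.

```scala
import scala.collection.mutable

// Generic illustration of the race: if KillTask is processed before
// LaunchTask, there is no running task to remove, so a naive executor
// would let the later launch proceed "secretly". Remembering early kills
// (pendingKills) lets launchTask suppress the launch instead.
class SketchExecutor {
  private val running = mutable.Set.empty[Long]
  private val pendingKills = mutable.Set.empty[Long]

  def killTask(taskId: Long): Unit =
    if (!running.remove(taskId)) pendingKills += taskId // kill arrived early

  def launchTask(taskId: Long): Boolean =
    if (pendingKills.remove(taskId)) false              // launch suppressed
    else { running += taskId; true }
}

val exec = new SketchExecutor
exec.killTask(42L)                  // events delivered in reversed order
val launched = exec.launchTask(42L)
println(launched)                   // false: the early kill is honored
```

Names like SketchExecutor and pendingKills are invented for this illustration; the actual fix lives in Spark's executor event handling.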
[jira] [Assigned] (SPARK-39469) Infer date type for CSV schema inference
[ https://issues.apache.org/jira/browse/SPARK-39469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-39469: --- Assignee: Jonathan Cui > Infer date type for CSV schema inference > > > Key: SPARK-39469 > URL: https://issues.apache.org/jira/browse/SPARK-39469 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: Jonathan Cui >Assignee: Jonathan Cui >Priority: Major > Fix For: 3.4.0 > > > 1. If a column contains only dates, it should be of “date” type in the > inferred schema > * If the date format and the timestamp format are identical (e.g. both are > /mm/dd), entries will default to being interpreted as Date > 2. If a column contains dates and timestamps, it should be of “timestamp” > type in the inferred schema > > A similar issue was opened in the past but was reverted due to the lack of > strict pattern matching.