[jira] [Assigned] (SPARK-39838) Passing an empty Metadata object to Column.as() should clear the metadata

2022-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39838:


Assignee: (was: Apache Spark)

> Passing an empty Metadata object to Column.as() should clear the metadata
> -
>
> Key: SPARK-39838
> URL: https://issues.apache.org/jira/browse/SPARK-39838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kaya Kupferschmidt
>Priority: Major
>
> h2. Description
> The Spark DataFrame API allows developers to attach arbitrary metadata to 
> individual columns as key/value pairs. The metadata is attached via the 
> method "Column.as(name, metadata)". This works as expected as long as the 
> metadata object is not empty. But when an empty metadata object is passed, the 
> final column in the resulting DataFrame still holds the metadata of the 
> original incoming column, i.e. this method cannot be used to 
> reset the metadata of a column.
> This is not the expected behaviour, and it changed in Spark 3.3.0. In Spark 
> 3.2.1 and earlier, passing an empty metadata object to the method 
> "Column.as(name, metadata)" resets the column's metadata as expected.
> h2. Steps to Reproduce
> The following code snippet will show the issue in Spark shell:
> {code:scala}
> import org.apache.spark.sql.types.MetadataBuilder
> // Create a DataFrame with one column with Metadata attached
> val df1 = spark.range(1,10)
> .withColumn("col_with_metadata", col("id").as("col_with_metadata", new 
> MetadataBuilder().putString("metadata", "value").build()))
> // Create a derived DataFrame which should reset the metadata of the column
> val df2 = df1.select(col("col_with_metadata").as("col_without_metadata", new 
> MetadataBuilder().build()))
> // Display metadata of both DataFrames columns
> println(s"df1 metadata: ${df1.schema("col_with_metadata").metadata}")
> println(s"df2 metadata: ${df2.schema("col_without_metadata").metadata}")
> {code} 
> This code results in the following lines being printed to the console:
> {code}
> df1 metadata: {"metadata":"value"}
> df2 metadata: {"metadata":"value"}
> {code}
> This result does not match my expectation: df1 should have non-empty 
> metadata and df2 should have empty metadata. Instead, df2 
> still holds the same metadata as df1.
> h2. Analysis
> I think the problem stems from changes to the method 
> "trimNonTopLevelAliases" in the class AliasHelper:
> {code:scala}
>   protected def trimNonTopLevelAliases[T <: Expression](e: T): T = {
> val res = e match {
>   case a: Alias =>
> val metadata = if (a.metadata == Metadata.empty) {
>   None
> } else {
>   Some(a.metadata)
> }
> a.copy(child = trimAliases(a.child))(
>   exprId = a.exprId,
>   qualifier = a.qualifier,
>   explicitMetadata = metadata,
>   nonInheritableMetadataKeys = a.nonInheritableMetadataKeys)
>   case a: MultiAlias =>
> a.copy(child = trimAliases(a.child))
>   case other => trimAliases(other)
> }
> res.asInstanceOf[T]
>   }
> {code}
> The method removes any empty metadata object from an Alias, which in turn 
> means that the Alias inherits its child's metadata.






[jira] [Commented] (SPARK-39838) Passing an empty Metadata object to Column.as() should clear the metadata

2022-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569843#comment-17569843
 ] 

Apache Spark commented on SPARK-39838:
--

User 'kupferk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37251

> Passing an empty Metadata object to Column.as() should clear the metadata
> -
>
> Key: SPARK-39838
> URL: https://issues.apache.org/jira/browse/SPARK-39838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kaya Kupferschmidt
>Priority: Major
>
> h2. Description
> The Spark DataFrame API allows developers to attach arbitrary metadata to 
> individual columns as key/value pairs. The metadata is attached via the 
> method "Column.as(name, metadata)". This works as expected as long as the 
> metadata object is not empty. But when an empty metadata object is passed, the 
> final column in the resulting DataFrame still holds the metadata of the 
> original incoming column, i.e. this method cannot be used to 
> reset the metadata of a column.
> This is not the expected behaviour, and it changed in Spark 3.3.0. In Spark 
> 3.2.1 and earlier, passing an empty metadata object to the method 
> "Column.as(name, metadata)" resets the column's metadata as expected.
> h2. Steps to Reproduce
> The following code snippet will show the issue in Spark shell:
> {code:scala}
> import org.apache.spark.sql.types.MetadataBuilder
> // Create a DataFrame with one column with Metadata attached
> val df1 = spark.range(1,10)
> .withColumn("col_with_metadata", col("id").as("col_with_metadata", new 
> MetadataBuilder().putString("metadata", "value").build()))
> // Create a derived DataFrame which should reset the metadata of the column
> val df2 = df1.select(col("col_with_metadata").as("col_without_metadata", new 
> MetadataBuilder().build()))
> // Display metadata of both DataFrames columns
> println(s"df1 metadata: ${df1.schema("col_with_metadata").metadata}")
> println(s"df2 metadata: ${df2.schema("col_without_metadata").metadata}")
> {code} 
> This code results in the following lines being printed to the console:
> {code}
> df1 metadata: {"metadata":"value"}
> df2 metadata: {"metadata":"value"}
> {code}
> This result does not match my expectation: df1 should have non-empty 
> metadata and df2 should have empty metadata. Instead, df2 
> still holds the same metadata as df1.
> h2. Analysis
> I think the problem stems from changes to the method 
> "trimNonTopLevelAliases" in the class AliasHelper:
> {code:scala}
>   protected def trimNonTopLevelAliases[T <: Expression](e: T): T = {
> val res = e match {
>   case a: Alias =>
> val metadata = if (a.metadata == Metadata.empty) {
>   None
> } else {
>   Some(a.metadata)
> }
> a.copy(child = trimAliases(a.child))(
>   exprId = a.exprId,
>   qualifier = a.qualifier,
>   explicitMetadata = metadata,
>   nonInheritableMetadataKeys = a.nonInheritableMetadataKeys)
>   case a: MultiAlias =>
> a.copy(child = trimAliases(a.child))
>   case other => trimAliases(other)
> }
> res.asInstanceOf[T]
>   }
> {code}
> The method removes any empty metadata object from an Alias, which in turn 
> means that the Alias inherits its child's metadata.






[jira] [Assigned] (SPARK-39838) Passing an empty Metadata object to Column.as() should clear the metadata

2022-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39838:


Assignee: Apache Spark

> Passing an empty Metadata object to Column.as() should clear the metadata
> -
>
> Key: SPARK-39838
> URL: https://issues.apache.org/jira/browse/SPARK-39838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kaya Kupferschmidt
>Assignee: Apache Spark
>Priority: Major
>
> h2. Description
> The Spark DataFrame API allows developers to attach arbitrary metadata to 
> individual columns as key/value pairs. The metadata is attached via the 
> method "Column.as(name, metadata)". This works as expected as long as the 
> metadata object is not empty. But when an empty metadata object is passed, the 
> final column in the resulting DataFrame still holds the metadata of the 
> original incoming column, i.e. this method cannot be used to 
> reset the metadata of a column.
> This is not the expected behaviour, and it changed in Spark 3.3.0. In Spark 
> 3.2.1 and earlier, passing an empty metadata object to the method 
> "Column.as(name, metadata)" resets the column's metadata as expected.
> h2. Steps to Reproduce
> The following code snippet will show the issue in Spark shell:
> {code:scala}
> import org.apache.spark.sql.types.MetadataBuilder
> // Create a DataFrame with one column with Metadata attached
> val df1 = spark.range(1,10)
> .withColumn("col_with_metadata", col("id").as("col_with_metadata", new 
> MetadataBuilder().putString("metadata", "value").build()))
> // Create a derived DataFrame which should reset the metadata of the column
> val df2 = df1.select(col("col_with_metadata").as("col_without_metadata", new 
> MetadataBuilder().build()))
> // Display metadata of both DataFrames columns
> println(s"df1 metadata: ${df1.schema("col_with_metadata").metadata}")
> println(s"df2 metadata: ${df2.schema("col_without_metadata").metadata}")
> {code} 
> This code results in the following lines being printed to the console:
> {code}
> df1 metadata: {"metadata":"value"}
> df2 metadata: {"metadata":"value"}
> {code}
> This result does not match my expectation: df1 should have non-empty 
> metadata and df2 should have empty metadata. Instead, df2 
> still holds the same metadata as df1.
> h2. Analysis
> I think the problem stems from changes to the method 
> "trimNonTopLevelAliases" in the class AliasHelper:
> {code:scala}
>   protected def trimNonTopLevelAliases[T <: Expression](e: T): T = {
> val res = e match {
>   case a: Alias =>
> val metadata = if (a.metadata == Metadata.empty) {
>   None
> } else {
>   Some(a.metadata)
> }
> a.copy(child = trimAliases(a.child))(
>   exprId = a.exprId,
>   qualifier = a.qualifier,
>   explicitMetadata = metadata,
>   nonInheritableMetadataKeys = a.nonInheritableMetadataKeys)
>   case a: MultiAlias =>
> a.copy(child = trimAliases(a.child))
>   case other => trimAliases(other)
> }
> res.asInstanceOf[T]
>   }
> {code}
> The method removes any empty metadata object from an Alias, which in turn 
> means that the Alias inherits its child's metadata.






[jira] [Commented] (SPARK-39838) Passing an empty Metadata object to Column.as() should clear the metadata

2022-07-21 Thread Kaya Kupferschmidt (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569842#comment-17569842
 ] 

Kaya Kupferschmidt commented on SPARK-39838:


Please find a PR at https://github.com/apache/spark/pull/37251

> Passing an empty Metadata object to Column.as() should clear the metadata
> -
>
> Key: SPARK-39838
> URL: https://issues.apache.org/jira/browse/SPARK-39838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kaya Kupferschmidt
>Priority: Major
>
> h2. Description
> The Spark DataFrame API allows developers to attach arbitrary metadata to 
> individual columns as key/value pairs. The metadata is attached via the 
> method "Column.as(name, metadata)". This works as expected as long as the 
> metadata object is not empty. But when an empty metadata object is passed, the 
> final column in the resulting DataFrame still holds the metadata of the 
> original incoming column, i.e. this method cannot be used to 
> reset the metadata of a column.
> This is not the expected behaviour, and it changed in Spark 3.3.0. In Spark 
> 3.2.1 and earlier, passing an empty metadata object to the method 
> "Column.as(name, metadata)" resets the column's metadata as expected.
> h2. Steps to Reproduce
> The following code snippet will show the issue in Spark shell:
> {code:scala}
> import org.apache.spark.sql.types.MetadataBuilder
> // Create a DataFrame with one column with Metadata attached
> val df1 = spark.range(1,10)
> .withColumn("col_with_metadata", col("id").as("col_with_metadata", new 
> MetadataBuilder().putString("metadata", "value").build()))
> // Create a derived DataFrame which should reset the metadata of the column
> val df2 = df1.select(col("col_with_metadata").as("col_without_metadata", new 
> MetadataBuilder().build()))
> // Display metadata of both DataFrames columns
> println(s"df1 metadata: ${df1.schema("col_with_metadata").metadata}")
> println(s"df2 metadata: ${df2.schema("col_without_metadata").metadata}")
> {code} 
> This code results in the following lines being printed to the console:
> {code}
> df1 metadata: {"metadata":"value"}
> df2 metadata: {"metadata":"value"}
> {code}
> This result does not match my expectation: df1 should have non-empty 
> metadata and df2 should have empty metadata. Instead, df2 
> still holds the same metadata as df1.
> h2. Analysis
> I think the problem stems from changes to the method 
> "trimNonTopLevelAliases" in the class AliasHelper:
> {code:scala}
>   protected def trimNonTopLevelAliases[T <: Expression](e: T): T = {
> val res = e match {
>   case a: Alias =>
> val metadata = if (a.metadata == Metadata.empty) {
>   None
> } else {
>   Some(a.metadata)
> }
> a.copy(child = trimAliases(a.child))(
>   exprId = a.exprId,
>   qualifier = a.qualifier,
>   explicitMetadata = metadata,
>   nonInheritableMetadataKeys = a.nonInheritableMetadataKeys)
>   case a: MultiAlias =>
> a.copy(child = trimAliases(a.child))
>   case other => trimAliases(other)
> }
> res.asInstanceOf[T]
>   }
> {code}
> The method removes any empty metadata object from an Alias, which in turn 
> means that the Alias inherits its child's metadata.






[jira] [Updated] (SPARK-39838) Passing an empty Metadata object to Column.as() should clear the metadata

2022-07-21 Thread Kaya Kupferschmidt (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaya Kupferschmidt updated SPARK-39838:
---
Description: 
h2. Description

The Spark DataFrame API allows developers to attach arbitrary metadata to 
individual columns as key/value pairs. The metadata is attached via the 
method "Column.as(name, metadata)". This works as expected as long as the 
metadata object is not empty. But when an empty metadata object is passed, the 
final column in the resulting DataFrame still holds the metadata of the 
original incoming column, i.e. this method cannot be used to reset 
the metadata of a column.

This is not the expected behaviour, and it changed in Spark 3.3.0. In Spark 
3.2.1 and earlier, passing an empty metadata object to the method 
"Column.as(name, metadata)" resets the column's metadata as expected.

h2. Steps to Reproduce

The following code snippet will show the issue in Spark shell:
{code:scala}
import org.apache.spark.sql.types.MetadataBuilder

// Create a DataFrame with one column with Metadata attached
val df1 = spark.range(1,10)
.withColumn("col_with_metadata", col("id").as("col_with_metadata", new 
MetadataBuilder().putString("metadata", "value").build()))

// Create a derived DataFrame which should reset the metadata of the column
val df2 = df1.select(col("col_with_metadata").as("col_without_metadata", new 
MetadataBuilder().build()))

// Display metadata of both DataFrames columns
println(s"df1 metadata: ${df1.schema("col_with_metadata").metadata}")
println(s"df2 metadata: ${df2.schema("col_without_metadata").metadata}")
{code} 

This code results in the following lines being printed to the console:
{code}
df1 metadata: {"metadata":"value"}
df2 metadata: {"metadata":"value"}
{code}

This result does not match my expectation: df1 should have non-empty 
metadata and df2 should have empty metadata. Instead, df2 
still holds the same metadata as df1.

h2. Analysis

I think the problem stems from changes to the method 
"trimNonTopLevelAliases" in the class AliasHelper:
{code:scala}
  protected def trimNonTopLevelAliases[T <: Expression](e: T): T = {
val res = e match {
  case a: Alias =>
val metadata = if (a.metadata == Metadata.empty) {
  None
} else {
  Some(a.metadata)
}
a.copy(child = trimAliases(a.child))(
  exprId = a.exprId,
  qualifier = a.qualifier,
  explicitMetadata = metadata,
  nonInheritableMetadataKeys = a.nonInheritableMetadataKeys)
  case a: MultiAlias =>
a.copy(child = trimAliases(a.child))
  case other => trimAliases(other)
}

res.asInstanceOf[T]
  }
{code}

The method removes any empty metadata object from an Alias, which in turn 
means that the Alias inherits its child's metadata.

  was:
h2. Description

The Spark DataFrame API allows developers to attach arbitrary metadata to 
individual columns as key/value pairs. The metadata is attached via the 
method "Column.as(name, metadata)". This works as expected as long as the 
metadata object is not empty. But when an empty metadata object is passed, the 
final column in the resulting DataFrame still holds the metadata of the 
original incoming column, i.e. this method cannot be used to reset 
the metadata of a column.

This is not the expected behaviour, and it changed in Spark 3.3.0. In Spark 
3.2.1 and earlier, passing an empty metadata object to the method 
"Column.as(name, metadata)" resets the column's metadata as expected.

h2. Steps to Reproduce

The following code snippet will show the issue in Spark shell:
{code:scala}
import org.apache.spark.sql.types.MetadataBuilder

// Create a DataFrame with one column with Metadata attached
val df1 = spark.range(1,10)
.withColumn("col_with_metadata", col("id").as("col_with_metadata", new 
MetadataBuilder().putString("metadata", "value").build()))

// Create a derived DataFrame which should reset the metadata of the column
val df2 = df1.select(col("col_with_metadata").as("col_without_metadata", new 
MetadataBuilder().build()))

// Display metadata of both DataFrames columns
println(s"df1 metadata: ${df1.schema("col_with_metadata").metadata}")
println(s"df2 metadata: ${df2.schema("col_without_metadata").metadata}")
{code} 

The expectation is that df1 has non-empty metadata while df2 has empty metadata. 
Instead, df2 still holds the same metadata as df1.

h2. Analysis

I think the problem stems from changes to the method 
"trimNonTopLevelAliases" in the class AliasHelper:
{code:scala}
  protected def trimNonTopLevelAliases[T <: Expression](e: T): T = {
val res = e match {
  case a: Alias =>
val metadata = if (a.metadata == Metadata.empty) {
  None
} else {
  Some(a.metadata)
}
a.copy(child = trimAliases(a.child))(

[jira] [Created] (SPARK-39838) Passing an empty Metadata object to Column.as() should clear the metadata

2022-07-21 Thread Kaya Kupferschmidt (Jira)
Kaya Kupferschmidt created SPARK-39838:
--

 Summary: Passing an empty Metadata object to Column.as() should 
clear the metadata
 Key: SPARK-39838
 URL: https://issues.apache.org/jira/browse/SPARK-39838
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Kaya Kupferschmidt


h2. Description

The Spark DataFrame API allows developers to attach arbitrary metadata to 
individual columns as key/value pairs. The metadata is attached via the 
method "Column.as(name, metadata)". This works as expected as long as the 
metadata object is not empty. But when an empty metadata object is passed, the 
final column in the resulting DataFrame still holds the metadata of the 
original incoming column, i.e. this method cannot be used to reset 
the metadata of a column.

This is not the expected behaviour, and it changed in Spark 3.3.0. In Spark 
3.2.1 and earlier, passing an empty metadata object to the method 
"Column.as(name, metadata)" resets the column's metadata as expected.

h2. Steps to Reproduce

The following code snippet will show the issue in Spark shell:
{code:scala}
import org.apache.spark.sql.types.MetadataBuilder

// Create a DataFrame with one column with Metadata attached
val df1 = spark.range(1,10)
.withColumn("col_with_metadata", col("id").as("col_with_metadata", new 
MetadataBuilder().putString("metadata", "value").build()))

// Create a derived DataFrame which should reset the metadata of the column
val df2 = df1.select(col("col_with_metadata").as("col_without_metadata", new 
MetadataBuilder().build()))

// Display metadata of both DataFrames columns
println(s"df1 metadata: ${df1.schema("col_with_metadata").metadata}")
println(s"df2 metadata: ${df2.schema("col_without_metadata").metadata}")
{code} 

The expectation is that df1 has non-empty metadata while df2 has empty metadata. 
Instead, df2 still holds the same metadata as df1.

h2. Analysis

I think the problem stems from changes to the method 
"trimNonTopLevelAliases" in the class AliasHelper:
{code:scala}
  protected def trimNonTopLevelAliases[T <: Expression](e: T): T = {
val res = e match {
  case a: Alias =>
val metadata = if (a.metadata == Metadata.empty) {
  None
} else {
  Some(a.metadata)
}
a.copy(child = trimAliases(a.child))(
  exprId = a.exprId,
  qualifier = a.qualifier,
  explicitMetadata = metadata,
  nonInheritableMetadataKeys = a.nonInheritableMetadataKeys)
  case a: MultiAlias =>
a.copy(child = trimAliases(a.child))
  case other => trimAliases(other)
}

res.asInstanceOf[T]
  }
{code}

The method removes any empty metadata object from an Alias, which in turn 
means that the Alias inherits its child's metadata.
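One possible direction for a fix is sketched below. This is only a hedged illustration based on the method shown above (not necessarily what the linked pull request does): instead of inferring "no metadata" from a.metadata being empty, the rule could consult the explicitly supplied metadata (the explicitMetadata argument of Alias) and preserve it even when it is empty.

{code:scala}
// Hedged sketch only (assumes Alias.explicitMetadata is accessible here):
// keep metadata the user set explicitly, even Metadata.empty, and only fall
// back to the emptiness check when no explicit metadata was supplied.
val metadata = a.explicitMetadata match {
  case Some(m) => Some(m)                             // explicitly set via Column.as(name, metadata)
  case None if a.metadata == Metadata.empty => None   // nothing worth keeping
  case None => Some(a.metadata)                       // inherited, non-empty metadata
}
{code}

With a change along these lines, df2 in the example above would end up with empty metadata, while aliases that never received explicit metadata would keep the current behaviour.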






[jira] [Assigned] (SPARK-39835) Fix EliminateSorts remove global sort below the local sort

2022-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39835:


Assignee: Apache Spark

> Fix EliminateSorts remove global sort below the local sort
> --
>
> Key: SPARK-39835
> URL: https://issues.apache.org/jira/browse/SPARK-39835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>
> If a global sort is below a local sort, we should not remove the global sort 
> because the output partitioning can be affected.
> This issue is going to get worse since we pull out the V1 Write sort to the 
> logical side.






[jira] [Commented] (SPARK-39835) Fix EliminateSorts remove global sort below the local sort

2022-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569773#comment-17569773
 ] 

Apache Spark commented on SPARK-39835:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/37250

> Fix EliminateSorts remove global sort below the local sort
> --
>
> Key: SPARK-39835
> URL: https://issues.apache.org/jira/browse/SPARK-39835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> If a global sort is below a local sort, we should not remove the global sort 
> because the output partitioning can be affected.
> This issue is going to get worse since we pull out the V1 Write sort to the 
> logical side.






[jira] [Assigned] (SPARK-39835) Fix EliminateSorts remove global sort below the local sort

2022-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39835:


Assignee: (was: Apache Spark)

> Fix EliminateSorts remove global sort below the local sort
> --
>
> Key: SPARK-39835
> URL: https://issues.apache.org/jira/browse/SPARK-39835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> If a global sort is below a local sort, we should not remove the global sort 
> because the output partitioning can be affected.
> This issue is going to get worse since we pull out the V1 Write sort to the 
> logical side.






[jira] [Updated] (SPARK-39837) Filesystem leak when running `TPC-DS queries with SF=1`

2022-07-21 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-39837:
-
Description: 
The following log appears in the `TPC-DS queries with SF=1` GA logs:

 
{code:java}
2022-07-22T00:19:52.8539664Z 00:19:52.849 WARN 
org.apache.spark.DebugFilesystem: Leaked filesystem connection created at:
2022-07-22T00:19:52.8548926Z java.lang.Throwable
2022-07-22T00:19:52.8568135Zat 
org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:35)
2022-07-22T00:19:52.8573547Zat 
org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:75)
2022-07-22T00:19:52.8574108Zat 
org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
2022-07-22T00:19:52.8578427Zat 
org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
2022-07-22T00:19:52.8579211Zat 
org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:774)
2022-07-22T00:19:52.8589698Zat 
org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:100)
2022-07-22T00:19:52.8590842Zat 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:175)
2022-07-22T00:19:52.8594751Zat 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$1(ParquetFileFormat.scala:340)
2022-07-22T00:19:52.8595634Zat 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:211)
2022-07-22T00:19:52.8598975Zat 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:272)
2022-07-22T00:19:52.8599639Zat 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:118)
2022-07-22T00:19:52.8602839Zat 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:583)
2022-07-22T00:19:52.8603625Zat 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.columnartorow_nextBatch_0$(Unknown
 Source)
2022-07-22T00:19:52.8606618Zat 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown
 Source)
2022-07-22T00:19:52.8609954Zat 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
2022-07-22T00:19:52.8620028Zat 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
2022-07-22T00:19:52.8623148Zat 
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
2022-07-22T00:19:52.8623812Zat 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
2022-07-22T00:19:52.8627344Zat 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
2022-07-22T00:19:52.8628031Zat 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
2022-07-22T00:19:52.8637881Zat 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
2022-07-22T00:19:52.8638603Zat 
org.apache.spark.scheduler.Task.run(Task.scala:139)
2022-07-22T00:19:52.8644696Zat 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
2022-07-22T00:19:52.8645352Zat 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1490)
2022-07-22T00:19:52.8649598Zat 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
2022-07-22T00:19:52.8650238Zat 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2022-07-22T00:19:52.8657783Zat 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2022-07-22T00:19:52.8658260Zat java.lang.Thread.run(Thread.java:750){code}
 

 

The following Actions runs have similar logs:
 * [https://github.com/apache/spark/runs/7460003953?check_suite_focus=true]
 * [https://github.com/apache/spark/runs/7459868605?check_suite_focus=true]
 * [https://github.com/apache/spark/runs/7460262731?check_suite_focus=true]

 

 

  was:
The following log appears in the `TPC-DS queries with SF=1` GA logs:

 
{code:java}
2022-07-22T00:48:19.8046575Z 00:48:19.800 WARN 
org.apache.spark.DebugFilesystem: Leaked filesystem connection created at:
2022-07-22T00:48:19.8183197Z java.lang.Throwable
2022-07-22T00:48:19.8209541Zat 
org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:35)
2022-07-22T00:48:19.8364870Zat 
org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:75)
2022-07-22T00:48:19.8429477Zat 
org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
2022-07-22T00:48:19.8440381Zat 
org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
2022-07-22T00:48:19.8463114Zat 
org.apache.parquet.hadoop.ParquetFil

[jira] [Created] (SPARK-39837) Filesystem leak when running `TPC-DS queries with SF=1`

2022-07-21 Thread Yang Jie (Jira)
Yang Jie created SPARK-39837:


 Summary: Filesystem leak when running `TPC-DS queries with SF=1`
 Key: SPARK-39837
 URL: https://issues.apache.org/jira/browse/SPARK-39837
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 3.4.0
Reporter: Yang Jie


The following log appears in the `TPC-DS queries with SF=1` GA logs:

 
{code:java}
2022-07-22T00:48:19.8046575Z 00:48:19.800 WARN 
org.apache.spark.DebugFilesystem: Leaked filesystem connection created at:
2022-07-22T00:48:19.8183197Z java.lang.Throwable
2022-07-22T00:48:19.8209541Zat 
org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:35)
2022-07-22T00:48:19.8364870Zat 
org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:75)
2022-07-22T00:48:19.8429477Zat 
org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
2022-07-22T00:48:19.8440381Zat 
org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
2022-07-22T00:48:19.8463114Zat 
org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:774)
2022-07-22T00:48:19.8483110Zat 
org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:100)
2022-07-22T00:48:19.8492740Zat 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:175)
2022-07-22T00:48:19.8507149Zat 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$1(ParquetFileFormat.scala:340)
2022-07-22T00:48:19.8525518Zat 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:211)
2022-07-22T00:48:19.8536791Zat 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:272)
2022-07-22T00:48:19.8542997Zat 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:118)
2022-07-22T00:48:19.8548773Zat 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:583)
2022-07-22T00:48:19.8552000Zat 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.columnartorow_nextBatch_0$(Unknown
 Source)
2022-07-22T00:48:19.8561197Zat 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown
 Source)
2022-07-22T00:48:19.8564920Zat 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
2022-07-22T00:48:19.8570921Zat 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
2022-07-22T00:48:19.8578211Zat 
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
2022-07-22T00:48:19.8581739Zat 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
2022-07-22T00:48:19.8588053Zat 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
2022-07-22T00:48:19.8591953Zat 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
2022-07-22T00:48:19.8599896Zat 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
2022-07-22T00:48:19.8605778Zat 
org.apache.spark.scheduler.Task.run(Task.scala:139)
2022-07-22T00:48:19.8609467Zat 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
2022-07-22T00:48:19.8610083Zat 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1490)
2022-07-22T00:48:19.8614645Zat 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
2022-07-22T00:48:19.8616327Zat 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2022-07-22T00:48:19.8620080Zat 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2022-07-22T00:48:19.8620695Zat java.lang.Thread.run(Thread.java:750) {code}
 

 

The following Actions runs have similar logs:
 * [https://github.com/apache/spark/runs/7460003953?check_suite_focus=true]
 * [https://github.com/apache/spark/runs/7459868605?check_suite_focus=true]
 * https://github.com/apache/spark/runs/7460262731?check_suite_focus=true

 

 






[jira] [Commented] (SPARK-39836) Simplify V2ExpressionBuilder by extract common method.

2022-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569771#comment-17569771
 ] 

Apache Spark commented on SPARK-39836:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/37249

> Simplify V2ExpressionBuilder by extract common method.
> --
>
> Key: SPARK-39836
> URL: https://issues.apache.org/jira/browse/SPARK-39836
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, V2ExpressionBuilder has a lot of similar code; we can extract 
> it into one common method.






[jira] [Assigned] (SPARK-39836) Simplify V2ExpressionBuilder by extract common method.

2022-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39836:


Assignee: Apache Spark

> Simplify V2ExpressionBuilder by extract common method.
> --
>
> Key: SPARK-39836
> URL: https://issues.apache.org/jira/browse/SPARK-39836
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> Currently, V2ExpressionBuilder has a lot of similar code; we can extract 
> it into one common method.






[jira] [Assigned] (SPARK-39836) Simplify V2ExpressionBuilder by extract common method.

2022-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39836:


Assignee: (was: Apache Spark)

> Simplify V2ExpressionBuilder by extract common method.
> --
>
> Key: SPARK-39836
> URL: https://issues.apache.org/jira/browse/SPARK-39836
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, V2ExpressionBuilder has a lot of similar code; we can extract 
> it into one common method.






[jira] [Commented] (SPARK-39836) Simplify V2ExpressionBuilder by extract common method.

2022-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569770#comment-17569770
 ] 

Apache Spark commented on SPARK-39836:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/37249

> Simplify V2ExpressionBuilder by extract common method.
> --
>
> Key: SPARK-39836
> URL: https://issues.apache.org/jira/browse/SPARK-39836
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, V2ExpressionBuilder has a lot of similar code; we can extract 
> it into one common method.






[jira] [Created] (SPARK-39836) Simplify V2ExpressionBuilder by extract common method.

2022-07-21 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-39836:
--

 Summary: Simplify V2ExpressionBuilder by extract common method.
 Key: SPARK-39836
 URL: https://issues.apache.org/jira/browse/SPARK-39836
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: jiaan.geng


Currently, V2ExpressionBuilder has a lot of similar code; we can extract it 
into one common method.






[jira] (SPARK-39819) DS V2 aggregate push down can work with Top N or Paging (Sort with group column)

2022-07-21 Thread jiaan.geng (Jira)


[ https://issues.apache.org/jira/browse/SPARK-39819 ]


jiaan.geng deleted comment on SPARK-39819:


was (Author: beliefer):
I'm working on.

> DS V2 aggregate push down can work with Top N or Paging (Sort with group 
> column)
> 
>
> Key: SPARK-39819
> URL: https://issues.apache.org/jira/browse/SPARK-39819
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, DS V2 aggregate push-down cannot work with Top N (order by ... 
> limit ...) or Paging (order by ... limit ... offset ...).
> If it could work with Top N or Paging, it would give better performance.






[jira] [Created] (SPARK-39835) Fix EliminateSorts remove global sort below the local sort

2022-07-21 Thread XiDuo You (Jira)
XiDuo You created SPARK-39835:
-

 Summary: Fix EliminateSorts remove global sort below the local sort
 Key: SPARK-39835
 URL: https://issues.apache.org/jira/browse/SPARK-39835
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


If a global sort is below a local sort, we should not remove the global sort because 
the output partitioning can be affected.

This issue is going to get worse since we pull out the V1 Write sort to the 
logical side.
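As an illustration (a hedged, assumed spark-shell example, not code taken from this ticket), the pattern in question is a local sort stacked directly on top of a global sort; removing the orderBy would silently change the range partitioning that the local sort operates on:

{code:scala}
import org.apache.spark.sql.functions.col

// Global sort followed by a local sort within each partition.
val df = spark.range(0, 100).toDF("id")
val sorted = df
  .orderBy(col("id").desc)          // global sort: introduces range partitioning
  .sortWithinPartitions(col("id"))  // local sort on top of the global sort

// Both Sort nodes should survive optimization; EliminateSorts must not drop the orderBy.
sorted.explain(true)
{code}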






[jira] [Assigned] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released

2022-07-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39831:


Assignee: Yikun Jiang

> R dependencies installation start to fail after devtools_2.4.4 was released
> ---
>
> Key: SPARK-39831
> URL: https://issues.apache.org/jira/browse/SPARK-39831
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.1.4, 3.4.0, 3.3.1, 3.2.3
>
>







[jira] [Resolved] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released

2022-07-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39831.
--
Fix Version/s: 3.3.1
   3.1.4
   3.2.3
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37247
[https://github.com/apache/spark/pull/37247]

> R dependencies installation start to fail after devtools_2.4.4 was released
> ---
>
> Key: SPARK-39831
> URL: https://issues.apache.org/jira/browse/SPARK-39831
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
> Fix For: 3.3.1, 3.1.4, 3.2.3, 3.4.0
>
>







[jira] [Assigned] (SPARK-39834) Include the origin stats and constraints for LogicalRDD if it comes from DataFrame

2022-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39834:


Assignee: Apache Spark

> Include the origin stats and constraints for LogicalRDD if it comes from 
> DataFrame
> --
>
> Key: SPARK-39834
> URL: https://issues.apache.org/jira/browse/SPARK-39834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> With SPARK-39748, Spark includes the origin logical plan for LogicalRDD if it 
> comes from DataFrame, to achieve carrying-over stats as well as providing 
> information to possibly connect two disconnected logical plans into one.
> After we introduced the change, we figured out several issues:
> 1. One of the major use cases for DataFrame.checkpoint is ML, especially 
> "iterative algorithms", whose purpose is to "prune" the logical plan. That is 
> at odds with the purpose of including the origin logical plan, and we risk 
> nesting LogicalRDDs, which grows the size of the logical plan without bound.
> 2. We leverage the logical plan to carry over stats, but the correct stats 
> information is in the optimized plan.
> 3. (Not an issue, but a missing spot) constraints are also something we can 
> carry over.
> To address the above issues, it would be better to include stats and 
> constraints in LogicalRDD rather than the logical plan.






[jira] [Assigned] (SPARK-39834) Include the origin stats and constraints for LogicalRDD if it comes from DataFrame

2022-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39834:


Assignee: (was: Apache Spark)

> Include the origin stats and constraints for LogicalRDD if it comes from 
> DataFrame
> --
>
> Key: SPARK-39834
> URL: https://issues.apache.org/jira/browse/SPARK-39834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> With SPARK-39748, Spark includes the origin logical plan for LogicalRDD if it 
> comes from DataFrame, to achieve carrying-over stats as well as providing 
> information to possibly connect two disconnected logical plans into one.
> After we introduced the change, we figured out several issues:
> 1. One of the major use cases for DataFrame.checkpoint is ML, especially 
> "iterative algorithms", whose purpose is to "prune" the logical plan. That is 
> at odds with the purpose of including the origin logical plan, and we risk 
> nesting LogicalRDDs, which grows the size of the logical plan without bound.
> 2. We leverage the logical plan to carry over stats, but the correct stats 
> information is in the optimized plan.
> 3. (Not an issue, but a missing spot) constraints are also something we can 
> carry over.
> To address the above issues, it would be better to include stats and 
> constraints in LogicalRDD rather than the logical plan.






[jira] [Commented] (SPARK-39834) Include the origin stats and constraints for LogicalRDD if it comes from DataFrame

2022-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569762#comment-17569762
 ] 

Apache Spark commented on SPARK-39834:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/37248

> Include the origin stats and constraints for LogicalRDD if it comes from 
> DataFrame
> --
>
> Key: SPARK-39834
> URL: https://issues.apache.org/jira/browse/SPARK-39834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> With SPARK-39748, Spark includes the origin logical plan for LogicalRDD if it 
> comes from DataFrame, to achieve carrying-over stats as well as providing 
> information to possibly connect two disconnected logical plans into one.
> After we introduced the change, we figured out several issues:
> 1. One of the major use cases for DataFrame.checkpoint is ML, especially 
> "iterative algorithms", whose purpose is to "prune" the logical plan. That is 
> at odds with the purpose of including the origin logical plan, and we risk 
> nesting LogicalRDDs, which grows the size of the logical plan without bound.
> 2. We leverage the logical plan to carry over stats, but the correct stats 
> information is in the optimized plan.
> 3. (Not an issue, but a missing spot) constraints are also something we can 
> carry over.
> To address the above issues, it would be better to include stats and 
> constraints in LogicalRDD rather than the logical plan.






[jira] [Created] (SPARK-39834) Include the origin stats and constraints for LogicalRDD if it comes from DataFrame

2022-07-21 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-39834:


 Summary: Include the origin stats and constraints for LogicalRDD 
if it comes from DataFrame
 Key: SPARK-39834
 URL: https://issues.apache.org/jira/browse/SPARK-39834
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Structured Streaming
Affects Versions: 3.4.0
Reporter: Jungtaek Lim


With SPARK-39748, Spark includes the origin logical plan for LogicalRDD if it 
comes from DataFrame, to achieve carrying-over stats as well as providing 
information to possibly connect two disconnected logical plans into one.

After we introduced the change, we figured out several issues:

1. One of the major use cases for DataFrame.checkpoint is ML, especially "iterative 
algorithms", whose purpose is to "prune" the logical plan. That is at odds with the 
purpose of including the origin logical plan, and we risk nesting 
LogicalRDDs, which grows the size of the logical plan without bound.

2. We leverage the logical plan to carry over stats, but the correct stats 
information is in the optimized plan.

3. (Not an issue, but a missing spot) constraints are also something we can carry 
over.

To address the above issues, it would be better to include stats and constraints 
in LogicalRDD rather than the logical plan.
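For context, here is a hedged sketch of the "iterative algorithm" pattern mentioned above (assumed usage, not code from this ticket): checkpointing inside a loop truncates the lineage each round, which is exactly why carrying the full origin logical plan inside LogicalRDD would work against that purpose.

{code:scala}
import org.apache.spark.sql.functions.col

// Assumed checkpoint directory; any reliable path works.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

var df = spark.range(0, 1000).toDF("v")
for (_ <- 1 to 10) {
  df = df.withColumn("v", col("v") * 2)
  // Eager checkpoint: replaces the growing plan with a LogicalRDD over the materialized data.
  df = df.checkpoint()
}
df.explain(true)
{code}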






[jira] [Commented] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released

2022-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569735#comment-17569735
 ] 

Apache Spark commented on SPARK-39831:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37247

> R dependencies installation start to fail after devtools_2.4.4 was released
> ---
>
> Key: SPARK-39831
> URL: https://issues.apache.org/jira/browse/SPARK-39831
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Resolved] (SPARK-39826) Bump scalatest-maven-plugin to 2.1.0

2022-07-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39826.
--
Resolution: Fixed

Issue resolved by pull request 37237
[https://github.com/apache/spark/pull/37237]

> Bump scalatest-maven-plugin to 2.1.0
> 
>
> Key: SPARK-39826
> URL: https://issues.apache.org/jira/browse/SPARK-39826
> Project: Spark
>  Issue Type: Test
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-39826) Bump scalatest-maven-plugin to 2.1.0

2022-07-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39826:


Assignee: BingKun Pan

> Bump scalatest-maven-plugin to 2.1.0
> 
>
> Key: SPARK-39826
> URL: https://issues.apache.org/jira/browse/SPARK-39826
> Project: Spark
>  Issue Type: Test
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>







[jira] [Commented] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released

2022-07-21 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569725#comment-17569725
 ] 

Hyukjin Kwon commented on SPARK-39831:
--

Reverted in:
https://github.com/apache/spark/commit/2bec66177de36d449dd6adebd8b6dd227ef40726
https://github.com/apache/spark/commit/248f34e46d591396f32bed79730a7b5b3141e7e9
https://github.com/apache/spark/commit/f344bf97265306b50ab79e465535054e2d582877
https://github.com/apache/spark/commit/b54d985223e07963db4b62a00dd29ebd012382ad

> R dependencies installation start to fail after devtools_2.4.4 was released
> ---
>
> Key: SPARK-39831
> URL: https://issues.apache.org/jira/browse/SPARK-39831
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Updated] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released

2022-07-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-39831:
-
Fix Version/s: (was: 3.1.4)
   (was: 3.4.0)
   (was: 3.3.1)
   (was: 3.2.3)

> R dependencies installation start to fail after devtools_2.4.4 was released
> ---
>
> Key: SPARK-39831
> URL: https://issues.apache.org/jira/browse/SPARK-39831
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Resolved] (SPARK-39622) ParquetIOSuite fails intermittently on master branch

2022-07-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-39622.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37245
[https://github.com/apache/spark/pull/37245]

> ParquetIOSuite fails intermittently on master branch
> 
>
> Key: SPARK-39622
> URL: https://issues.apache.org/jira/browse/SPARK-39622
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.4.0
>
>
> "SPARK-7837 Do not close output writer twice when commitTask() fails" in 
> ParquetIOSuite fails intermittently on the master branch. 
> Assertion error follows:
> {code}
> "Job aborted due to stage failure: Authorized committer (attemptNumber=0, 
> stage=1, partition=0) failed; but task commit success, data duplication may 
> happen." did not contain "Intentional exception for testing purposes"
> ScalaTestFailureLocation: 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite at 
> (ParquetIOSuite.scala:1216)
> org.scalatest.exceptions.TestFailedException: "Job aborted due to stage 
> failure: Authorized committer (attemptNumber=0, stage=1, partition=0) failed; 
> but task commit success, data duplication may happen." did not contain 
> "Intentional exception for testing purposes"
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259(ParquetIOSuite.scala:1216)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259$adapted(ParquetIOSuite.scala:1209)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66)
>   at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:33)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$256(ParquetIOSuite.scala:1209)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(ParquetIOSuite.scala:56)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:247)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:245)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.withSQLConf(ParquetIOSuite.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$255(ParquetIOSuite.scala:1190)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
>   at scala.collection.immuta

[jira] [Assigned] (SPARK-39622) ParquetIOSuite fails intermittently on master branch

2022-07-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-39622:
-

Assignee: Yang Jie

> ParquetIOSuite fails intermittently on master branch
> 
>
> Key: SPARK-39622
> URL: https://issues.apache.org/jira/browse/SPARK-39622
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Yang Jie
>Priority: Major
>
> "SPARK-7837 Do not close output writer twice when commitTask() fails" in 
> ParquetIOSuite fails intermittently with master branch. 
> Assertion error follows:
> {code}
> "Job aborted due to stage failure: Authorized committer (attemptNumber=0, 
> stage=1, partition=0) failed; but task commit success, data duplication may 
> happen." did not contain "Intentional exception for testing purposes"
> ScalaTestFailureLocation: 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite at 
> (ParquetIOSuite.scala:1216)
> org.scalatest.exceptions.TestFailedException: "Job aborted due to stage 
> failure: Authorized committer (attemptNumber=0, stage=1, partition=0) failed; 
> but task commit success, data duplication may happen." did not contain 
> "Intentional exception for testing purposes"
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259(ParquetIOSuite.scala:1216)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259$adapted(ParquetIOSuite.scala:1209)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66)
>   at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:33)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$256(ParquetIOSuite.scala:1209)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(ParquetIOSuite.scala:56)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:247)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:245)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.withSQLConf(ParquetIOSuite.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$255(ParquetIOSuite.scala:1190)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at org.scalatest.Supe

[jira] [Created] (SPARK-39833) Filtered parquet data frame count() and show() produce inconsistent results when spark.sql.parquet.filterPushdown is true

2022-07-21 Thread Michael Allman (Jira)
Michael Allman created SPARK-39833:
--

 Summary: Filtered parquet data frame count() and show() produce 
inconsistent results when spark.sql.parquet.filterPushdown is true
 Key: SPARK-39833
 URL: https://issues.apache.org/jira/browse/SPARK-39833
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: Michael Allman


One of our data scientists discovered a problem wherein a data frame `.show()` 
call printed non-empty results, but `.count()` printed 0. I've narrowed the 
issue to a small, reproducible test case which exhibits this aberrant behavior. 
In pyspark, run the following code:
{code:python}
from pyspark.sql.types import *
parquet_pushdown_bug_df = spark.createDataFrame([{"COL0": int(0)}], 
schema=StructType(fields=[StructField("COL0",IntegerType(),True)]))
parquet_pushdown_bug_df.repartition(1).write.mode("overwrite").parquet("parquet_pushdown_bug/col0=0/parquet_pushdown_bug.parquet")
reread_parquet_pushdown_bug_df = spark.read.parquet("parquet_pushdown_bug")
reread_parquet_pushdown_bug_df.filter("col0 = 0").show()
print(reread_parquet_pushdown_bug_df.filter("col0 = 0").count())
{code}
In my usage, this prints a data frame with 1 row and a count of 0. However, 
disabling `spark.sql.parquet.filterPushdown` produces consistent results:
{code:python}
spark.conf.set("spark.sql.parquet.filterPushdown", False)
reread_parquet_pushdown_bug_df.filter("col0 = 0").show()
reread_parquet_pushdown_bug_df.filter("col0 = 0").count()
{code}
This will print the same data frame; however, it will print a count of 1. The 
key to triggering this bug is not just enabling 
`spark.sql.parquet.filterPushdown` (which is enabled by default): the case of 
the column in the data frame (before writing) must also differ from the case of 
the partition column in the file path, i.e. COL0 versus col0 or col0 versus COL0.
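For anyone digging into this, a small diagnostic sketch (reusing `reread_parquet_pushdown_bug_df` from the snippet above) contrasts the two settings and inspects the physical plan; the `PushedFilters` and `PartitionFilters` entries of the scan node show how the mismatched-case predicate was applied. This only narrows down the behavior; it is not a fix:
{code:python}
# Inspect the scan node; look for the PushedFilters / PartitionFilters entries.
reread_parquet_pushdown_bug_df.filter("col0 = 0").explain()

# Compare counts with pushdown disabled vs. enabled.
spark.conf.set("spark.sql.parquet.filterPushdown", False)
count_without_pushdown = reread_parquet_pushdown_bug_df.filter("col0 = 0").count()
spark.conf.set("spark.sql.parquet.filterPushdown", True)
count_with_pushdown = reread_parquet_pushdown_bug_df.filter("col0 = 0").count()
# Expected: 1 and 1; observed on 3.2.1: 1 and 0.
print(count_without_pushdown, count_with_pushdown)
{code}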



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39832) regexp_replace should support column arguments

2022-07-21 Thread Brian Schaefer (Jira)
Brian Schaefer created SPARK-39832:
--

 Summary: regexp_replace should support column arguments
 Key: SPARK-39832
 URL: https://issues.apache.org/jira/browse/SPARK-39832
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Brian Schaefer


{{F.regexp_replace}} in PySpark currently only supports strings for the second 
and third arguments: 
[https://github.com/apache/spark/blob/1df6006ea977ae3b8c53fe33630e277e8c1bc49c/python/pyspark/sql/functions.py#L3265]

In Scala, columns are also supported: 
[https://github.com/apache/spark/blob/1df6006ea977ae3b8c53fe33630e277e8c1bc49c/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2836]

The desire to use columns as arguments for the function has been raised 
previously on StackExchange: 
[https://stackoverflow.com/questions/64613761/in-pyspark-using-regexp-replace-how-to-replace-a-group-with-value-from-another], 
where the suggested fix was to use {{F.expr}}.

It should be relatively straightforward to support in PySpark the two function 
signatures supported in Scala.
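Until that is added, a minimal sketch of the {{F.expr}} workaround mentioned above (the DataFrame and the column names {{text}}, {{pattern}} and {{replacement}} are made up for illustration):
{code:python}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("abc-123", "[0-9]+", "#")],
    ["text", "pattern", "replacement"],
)

# The SQL function accepts column references for all three arguments, so
# routing the call through expr() works around the Python-only limitation.
df = df.withColumn("replaced", F.expr("regexp_replace(text, pattern, replacement)"))
df.show()
{code}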



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39830) Reading ORC table that requires type promotion may throw AIOOBE

2022-07-21 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569605#comment-17569605
 ] 

Dongjoon Hyun commented on SPARK-39830:
---

Thank you, [~dzcxzl].

> Reading ORC table that requires type promotion may throw AIOOBE
> ---
>
> Key: SPARK-39830
> URL: https://issues.apache.org/jira/browse/SPARK-39830
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: dzcxzl
>Priority: Trivial
>
> We can add a UT to test the scenario after the ORC-1205 release.
>  
> bin/spark-shell
> {code:java}
> spark.sql("set orc.stripe.size=10240")
> spark.sql("set orc.rows.between.memory.checks=1")
> spark.sql("set spark.sql.orc.columnarWriterBatchSize=1")
> val df = spark.range(1, 1+512, 1, 1).map { i =>
>     if( i == 1 ){
>         (i, Array.fill[Byte](5 * 1024 * 1024)('X'))
>     } else {
>         (i,Array.fill[Byte](1)('X'))
>     }
>     }.toDF("c1","c2")
> df.write.format("orc").save("file:///tmp/test_table_orc_t1")
> spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) 
> location 'file:///tmp/test_table_orc_t1' stored as orc ")
> spark.sql("select * from test_table_orc_t1").show() {code}
> Querying this table throws the following exception:
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 1
>         at 
> org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387)
>         at 
> org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740)
>         at 
> org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069)
>         at 
> org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
>         at 
> org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
>         at 
> org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
>         at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371)
>         at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84)
>         at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102)
>         at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>  {code}
>  
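A possible workaround sketch until the ORC fix lands (untested, and assuming the data written by the snippet above): declare the external table with the column's on-disk type so no ORC-side type promotion is needed, and cast in Spark instead. The table name below is hypothetical.
{code:python}
# c1 was written as a long, so reading it as bigint avoids the ORC
# ConvertTreeReader path that throws the ArrayIndexOutOfBoundsException.
spark.sql("""
  create external table test_table_orc_t1_raw (c1 bigint, c2 binary)
  location 'file:///tmp/test_table_orc_t1' stored as orc
""")
spark.sql("select cast(c1 as string) as c1, c2 from test_table_orc_t1_raw").show()
{code}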



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39822) Provides a good error during create Index with different dtype elements

2022-07-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39822:
-
Parent: SPARK-39581
Issue Type: Sub-task  (was: Bug)

> Provides a good error during create Index with different dtype elements
> ---
>
> Key: SPARK-39822
> URL: https://issues.apache.org/jira/browse/SPARK-39822
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.2
>Reporter: bo zhao
>Priority: Minor
>
> PANDAS
>  
> {code:java}
> >>> import pandas as pd
> >>> pd.Index([1, 2, '3', 4])
> Index([1, 2, '3', 4], dtype='object')
>  {code}
> PYSPARK
>  
>  
> {code:java}
> Using Python version 3.8.13 (default, Jun 29 2022 11:50:19)
> Spark context Web UI available at http://172.25.179.45:4042
> Spark context available as 'sc' (master = local[*], app id = 
> local-1658301116572).
> SparkSession available as 'spark'.
> >>> from pyspark import pandas as ps
> WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It 
> is required to set this environment variable to '1' in both driver and 
> executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you 
> but it does not work if there is a Spark context already launched.
> >>> ps.Index([1,2,'3',4])
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/home/spark/spark/python/pyspark/pandas/indexes/base.py", line 184, 
> in __new__
>     ps.from_pandas(
>   File "/home/spark/spark/python/pyspark/pandas/namespace.py", line 155, in 
> from_pandas
>     return DataFrame(pd.DataFrame(index=pobj)).index
>   File "/home/spark/spark/python/pyspark/pandas/frame.py", line 463, in 
> __init__
>     internal = InternalFrame.from_pandas(pdf)
>   File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1469, in 
> from_pandas
>     ) = InternalFrame.prepare_pandas_frame(pdf, 
> prefer_timestamp_ntz=prefer_timestamp_ntz)
>   File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1570, in 
> prepare_pandas_frame
>     spark_type = infer_pd_series_spark_type(reset_index[col], dtype, 
> prefer_timestamp_ntz)
>   File "/home/spark/spark/python/pyspark/pandas/typedef/typehints.py", line 
> 360, in infer_pd_series_spark_type
>     return from_arrow_type(pa.Array.from_pandas(pser).type, 
> prefer_timestamp_ntz)
>   File "pyarrow/array.pxi", line 1033, in pyarrow.lib.Array.from_pandas
>   File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Could not convert '3' with type str: tried to 
> convert to int64
>  {code}
> I understand that pyspark pandas needs the dtype to be the same, but we need a 
> good error message or something that tells the user how to avoid this.
>  
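A minimal workaround sketch, assuming the user is willing to coerce the mixed elements to a single dtype (here strings) before building the Index:
{code:python}
from pyspark import pandas as ps

values = [1, 2, '3', 4]
# Coerce every element to str so the inferred Spark type is consistent,
# mirroring what pandas' object dtype tolerates implicitly.
idx = ps.Index([str(v) for v in values])
print(idx)
{code}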



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39791) In Spark 3.0 standalone cluster mode, unable to customize driver JVM path

2022-07-21 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569428#comment-17569428
 ] 

Hyukjin Kwon commented on SPARK-39791:
--

Is this a regression?

> In Spark 3.0 standalone cluster mode, unable to customize driver JVM path
> -
>
> Key: SPARK-39791
> URL: https://issues.apache.org/jira/browse/SPARK-39791
> Project: Spark
>  Issue Type: Question
>  Components: Spark Submit
>Affects Versions: 3.0.0
>Reporter: Obobj
>Priority: Minor
>  Labels: spark-submit, standalone
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> In Spark 3.0 standalone mode, the driver JVM path cannot be customized; instead, 
> the JAVA_HOME of the machine that runs spark-submit is used, but the JVM 
> paths on my submission machine and the cluster machines are different.
> {code:java}
> launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java
> List<String> buildJavaCommand(String extraClassPath) throws IOException {
>   List<String> cmd = new ArrayList<>();
>   String firstJavaHome = firstNonEmpty(javaHome,
>     childEnv.get("JAVA_HOME"),
>     System.getenv("JAVA_HOME"),
>     System.getProperty("java.home")); {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39815) ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

2022-07-21 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569426#comment-17569426
 ] 

Hyukjin Kwon commented on SPARK-39815:
--

Does this cause any actual issue, or is it just an error log?

> ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> 
>
> Key: SPARK-39815
> URL: https://issues.apache.org/jira/browse/SPARK-39815
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
> Environment: Ubuntu 20.04
> Python 3.8.10
> Java 8
>  
>Reporter: Marzieh
>Priority: Major
>
> h3. I have a Spark SQL program which runs on a Spark cluster. Even though the 
> program finishes without any error, the state of the application becomes 
> Killed after it finishes. The log shows this error:
>  
> 22/07/19 13:14:50 INFO CoarseGrainedExecutorBackend: Got assigned task 10686
> 22/07/19 13:14:50 INFO Executor: Running task 232.0 in stage 137.0 (TID 10686)
> 22/07/19 13:14:50 INFO CoarseGrainedExecutorBackend: Got assigned task 10687
> 22/07/19 13:14:50 INFO Executor: Running task 233.0 in stage 137.0 (TID 10687)
> 22/07/19 13:14:50 INFO PythonRunner: Times: total = 6, boot = 2, init = 4, 
> finish = 0
> 22/07/19 13:14:50 INFO Executor: Finished task 232.0 in stage 137.0 (TID 
> 10686). 1785 bytes result sent to driver
> 22/07/19 13:14:50 INFO PythonRunner: Times: total = 10, boot = 9, init = 1, 
> finish = 0
> 22/07/19 13:14:50 INFO Executor: Finished task 233.0 in stage 137.0 (TID 
> 10687). 1785 bytes result sent to driver
> 22/07/19 13:14:50 INFO CoarseGrainedExecutorBackend: Got assigned task 10688
> 22/07/19 13:14:50 INFO Executor: Running task 234.0 in stage 137.0 (TID 10688)
> 22/07/19 13:14:50 INFO CoarseGrainedExecutorBackend: Got assigned task 10689
> 22/07/19 13:14:50 INFO Executor: Running task 235.0 in stage 137.0 (TID 10689)
> 22/07/19 13:14:50 INFO PythonRunner: Times: total = 1, boot = 1, init = 0, 
> finish = 0
> 22/07/19 13:14:50 INFO Executor: Finished task 235.0 in stage 137.0 (TID 
> 10689). 1785 bytes result sent to driver
> 22/07/19 13:14:50 INFO CoarseGrainedExecutorBackend: Got assigned task 10690
> 22/07/19 13:14:50 INFO Executor: Running task 236.0 in stage 137.0 (TID 10690)
> 22/07/19 13:14:50 WARN JdbcUtils: Requested isolation level 1 is not 
> supported; falling back to default isolation level
> 2 22/07/19 13:14:50 INFO PythonRunner: Times: total = 42, boot = -13, init = 
> 55, finish = 0
> 22/07/19 13:14:50 INFO Executor: Finished task 231.0 in stage 137.0 (TID 
> 10685). 1785 bytes result sent to driver
> 22/07/19 13:14:50 INFO CoarseGrainedExecutorBackend: Got assigned task 10691
> 22/07/19 13:14:50 INFO Executor: Running task 237.0 in stage 137.0 (TID 10691)
> 22/07/19 13:14:50 INFO PythonRunner: Times: total = 43, boot = -4, init = 47, 
> finish = 0
> 22/07/19 13:14:50 INFO Executor: Finished task 234.0 in stage 137.0 (TID 
> 10688). 1785 bytes result sent to driver
> 22/07/19 13:14:50 INFO CoarseGrainedExecutorBackend: Got assigned task 10692
> 22/07/19 13:14:50 INFO Executor: Running task 238.0 in stage 137.0 (TID 10692)
> 22/07/19 13:14:50 INFO PythonRunner: Times: total = 43, boot = 2, init = 41, 
> finish = 0
> 22/07/19 13:14:50 INFO Executor: Finished task 236.0 in stage 137.0 (TID 
> 10690). 1785 bytes result sent to driver
> 22/07/19 13:14:50 INFO CoarseGrainedExecutorBackend: Got assigned task 10693
> 22/07/19 13:14:50 INFO Executor: Running task 239.0 in stage 137.0 (TID 10693)
> 22/07/19 13:14:50 INFO JDBCRDD: closed connection 22/07/19 13:14:50 INFO 
> PythonRunner: Times: total = 44, boot = 3, init = 41, finish = 0
> 22/07/19 13:14:50 INFO Executor: Finished task 237.0 in stage 137.0 (TID 
> 10691). 1785 bytes result sent to driver
> 22/07/19 13:14:50 INFO PythonRunner: Times: total = 44, boot = 2, init = 42, 
> finish = 0
> 22/07/19 13:14:50 INFO Executor: Finished task 238.0 in stage 137.0 (TID 
> 10692). 1785 bytes result sent to driver
> 22/07/19 13:14:50 INFO Executor: Finished task 219.0 in stage 137.0 (TID 
> 10673). 1785 bytes result sent to driver
> 22/07/19 13:14:50 INFO PythonRunner: Times: total = 42, boot = 2, init = 40, 
> finish = 0
> 22/07/19 13:14:50 INFO Executor: Finished task 239.0 in stage 137.0 (TID 
> 10693). 1785 bytes result sent to driver
> 22/07/19 13:14:50 INFO CoarseGrainedExecutorBackend: Driver commanded a 
> shutdown
> 22/07/19 13:14:50 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39817) Missing sbin scripts in PySpark packages

2022-07-21 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569425#comment-17569425
 ] 

Hyukjin Kwon commented on SPARK-39817:
--

pip is designed for use within Python. I would prefer to avoid having people 
create a Spark cluster by using pip.

> Missing sbin scripts in PySpark packages
> 
>
> Key: SPARK-39817
> URL: https://issues.apache.org/jira/browse/SPARK-39817
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: F. H.
>Priority: Major
>  Labels: easyfix
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> In the PySpark setup.py, only a subset of all scripts is included.
> In particular, I'm missing the `submit-all.sh` script:
> {code:python}
> package_data={
> 'pyspark.jars': ['*.jar'],
> 'pyspark.bin': ['*'],
> 'pyspark.sbin': ['spark-config.sh', 'spark-daemon.sh',
>  'start-history-server.sh',
>  'stop-history-server.sh', ],
> [...]
> },
> {code}
>  
> The solution is super simple, just change 'pyspark.sbin' to:
> {code:python}
> 'pyspark.sbin': ['*'],
> {code}
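For clarity, a sketch of what the {{package_data}} entry would look like after that change (the remaining keys stay as they are in the current setup.py):
{code:python}
package_data = {
    'pyspark.jars': ['*.jar'],
    'pyspark.bin': ['*'],
    # ship every sbin script instead of the current hand-picked subset
    'pyspark.sbin': ['*'],
}
{code}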
>  
> I would happily submit a PR to GitHub, but I have no clue about the 
> organizational details.
> It would be great to get this backported for pyspark 3.2.x as well as 3.3.x soon.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39829) Upgrade log4j2 to 2.18.0

2022-07-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39829:


Assignee: Dongjoon Hyun

> Upgrade log4j2 to 2.18.0
> 
>
> Key: SPARK-39829
> URL: https://issues.apache.org/jira/browse/SPARK-39829
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39829) Upgrade log4j2 to 2.18.0

2022-07-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39829.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37242
[https://github.com/apache/spark/pull/37242]

> Upgrade log4j2 to 2.18.0
> 
>
> Key: SPARK-39829
> URL: https://issues.apache.org/jira/browse/SPARK-39829
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released

2022-07-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-39831:
-
Fix Version/s: 3.1.4
   3.4.0
   3.3.1
   3.2.3

> R dependencies installation start to fail after devtools_2.4.4 was released
> ---
>
> Key: SPARK-39831
> URL: https://issues.apache.org/jira/browse/SPARK-39831
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
> Fix For: 3.1.4, 3.4.0, 3.3.1, 3.2.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7837) NPE when save as parquet in speculative tasks

2022-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569399#comment-17569399
 ] 

Apache Spark commented on SPARK-7837:
-

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37245

> NPE when save as parquet in speculative tasks
> -
>
> Key: SPARK-7837
> URL: https://issues.apache.org/jira/browse/SPARK-7837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Yin Huai
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.5.0
>
>
> The query is like {{df.orderBy(...).saveAsTable(...)}}.
> When there is no partitioning columns and there is a skewed key, I found the 
> following exception in speculative tasks. After these failures, seems we 
> could not call {{SparkHadoopMapRedUtil.commitTask}} correctly.
> {code}
> java.lang.NullPointerException
>   at 
> parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146)
>   at 
> parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
>   at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
>   at 
> org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:115)
>   at 
> org.apache.spark.sql.sources.DefaultWriterContainer.abortTask(commands.scala:385)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:150)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7837) NPE when save as parquet in speculative tasks

2022-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569397#comment-17569397
 ] 

Apache Spark commented on SPARK-7837:
-

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37245

> NPE when save as parquet in speculative tasks
> -
>
> Key: SPARK-7837
> URL: https://issues.apache.org/jira/browse/SPARK-7837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Yin Huai
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.5.0
>
>
> The query is like {{df.orderBy(...).saveAsTable(...)}}.
> When there is no partitioning columns and there is a skewed key, I found the 
> following exception in speculative tasks. After these failures, seems we 
> could not call {{SparkHadoopMapRedUtil.commitTask}} correctly.
> {code}
> java.lang.NullPointerException
>   at 
> parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146)
>   at 
> parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
>   at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
>   at 
> org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:115)
>   at 
> org.apache.spark.sql.sources.DefaultWriterContainer.abortTask(commands.scala:385)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:150)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39622) ParquetIOSuite fails intermittently on master branch

2022-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39622:


Assignee: (was: Apache Spark)

> ParquetIOSuite fails intermittently on master branch
> 
>
> Key: SPARK-39622
> URL: https://issues.apache.org/jira/browse/SPARK-39622
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> "SPARK-7837 Do not close output writer twice when commitTask() fails" in 
> ParquetIOSuite fails intermittently with master branch. 
> Assertion error follows:
> {code}
> "Job aborted due to stage failure: Authorized committer (attemptNumber=0, 
> stage=1, partition=0) failed; but task commit success, data duplication may 
> happen." did not contain "Intentional exception for testing purposes"
> ScalaTestFailureLocation: 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite at 
> (ParquetIOSuite.scala:1216)
> org.scalatest.exceptions.TestFailedException: "Job aborted due to stage 
> failure: Authorized committer (attemptNumber=0, stage=1, partition=0) failed; 
> but task commit success, data duplication may happen." did not contain 
> "Intentional exception for testing purposes"
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259(ParquetIOSuite.scala:1216)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259$adapted(ParquetIOSuite.scala:1209)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66)
>   at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:33)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$256(ParquetIOSuite.scala:1209)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(ParquetIOSuite.scala:56)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:247)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:245)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.withSQLConf(ParquetIOSuite.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$255(ParquetIOSuite.scala:1190)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at org.scalatest.SuperEngine.runTestsInB

[jira] [Commented] (SPARK-39622) ParquetIOSuite fails intermittently on master branch

2022-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569396#comment-17569396
 ] 

Apache Spark commented on SPARK-39622:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37245

> ParquetIOSuite fails intermittently on master branch
> 
>
> Key: SPARK-39622
> URL: https://issues.apache.org/jira/browse/SPARK-39622
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> "SPARK-7837 Do not close output writer twice when commitTask() fails" in 
> ParquetIOSuite fails intermittently with master branch. 
> Assertion error follows:
> {code}
> "Job aborted due to stage failure: Authorized committer (attemptNumber=0, 
> stage=1, partition=0) failed; but task commit success, data duplication may 
> happen." did not contain "Intentional exception for testing purposes"
> ScalaTestFailureLocation: 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite at 
> (ParquetIOSuite.scala:1216)
> org.scalatest.exceptions.TestFailedException: "Job aborted due to stage 
> failure: Authorized committer (attemptNumber=0, stage=1, partition=0) failed; 
> but task commit success, data duplication may happen." did not contain 
> "Intentional exception for testing purposes"
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259(ParquetIOSuite.scala:1216)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259$adapted(ParquetIOSuite.scala:1209)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66)
>   at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:33)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$256(ParquetIOSuite.scala:1209)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(ParquetIOSuite.scala:56)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:247)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:245)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.withSQLConf(ParquetIOSuite.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$255(ParquetIOSuite.scala:1190)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   

[jira] [Assigned] (SPARK-39622) ParquetIOSuite fails intermittently on master branch

2022-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39622:


Assignee: Apache Spark

> ParquetIOSuite fails intermittently on master branch
> 
>
> Key: SPARK-39622
> URL: https://issues.apache.org/jira/browse/SPARK-39622
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> "SPARK-7837 Do not close output writer twice when commitTask() fails" in 
> ParquetIOSuite fails intermittently with master branch. 
> Assertion error follows:
> {code}
> "Job aborted due to stage failure: Authorized committer (attemptNumber=0, 
> stage=1, partition=0) failed; but task commit success, data duplication may 
> happen." did not contain "Intentional exception for testing purposes"
> ScalaTestFailureLocation: 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite at 
> (ParquetIOSuite.scala:1216)
> org.scalatest.exceptions.TestFailedException: "Job aborted due to stage 
> failure: Authorized committer (attemptNumber=0, stage=1, partition=0) failed; 
> but task commit success, data duplication may happen." did not contain 
> "Intentional exception for testing purposes"
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259(ParquetIOSuite.scala:1216)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259$adapted(ParquetIOSuite.scala:1209)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66)
>   at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:33)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$256(ParquetIOSuite.scala:1209)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(ParquetIOSuite.scala:56)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:247)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:245)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.withSQLConf(ParquetIOSuite.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$255(ParquetIOSuite.scala:1190)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at org.scalates

[jira] [Commented] (SPARK-39622) ParquetIOSuite fails intermittently on master branch

2022-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569395#comment-17569395
 ] 

Apache Spark commented on SPARK-39622:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37245

> ParquetIOSuite fails intermittently on master branch
> 
>
> Key: SPARK-39622
> URL: https://issues.apache.org/jira/browse/SPARK-39622
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> "SPARK-7837 Do not close output writer twice when commitTask() fails" in 
> ParquetIOSuite fails intermittently with master branch. 
> Assertion error follows:
> {code}
> "Job aborted due to stage failure: Authorized committer (attemptNumber=0, 
> stage=1, partition=0) failed; but task commit success, data duplication may 
> happen." did not contain "Intentional exception for testing purposes"
> ScalaTestFailureLocation: 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite at 
> (ParquetIOSuite.scala:1216)
> org.scalatest.exceptions.TestFailedException: "Job aborted due to stage 
> failure: Authorized committer (attemptNumber=0, stage=1, partition=0) failed; 
> but task commit success, data duplication may happen." did not contain 
> "Intentional exception for testing purposes"
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259(ParquetIOSuite.scala:1216)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259$adapted(ParquetIOSuite.scala:1209)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66)
>   at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:33)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$256(ParquetIOSuite.scala:1209)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(ParquetIOSuite.scala:56)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:247)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:245)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.withSQLConf(ParquetIOSuite.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$255(ParquetIOSuite.scala:1190)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   

[jira] [Commented] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released

2022-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569393#comment-17569393
 ] 

Apache Spark commented on SPARK-39831:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37243

> R dependencies installation start to fail after devtools_2.4.4 was released
> ---
>
> Key: SPARK-39831
> URL: https://issues.apache.org/jira/browse/SPARK-39831
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released

2022-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39831:


Assignee: Apache Spark

> R dependencies installation start to fail after devtools_2.4.4 was released
> ---
>
> Key: SPARK-39831
> URL: https://issues.apache.org/jira/browse/SPARK-39831
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released

2022-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39831:


Assignee: (was: Apache Spark)

> R dependencies installation start to fail after devtools_2.4.4 was released
> ---
>
> Key: SPARK-39831
> URL: https://issues.apache.org/jira/browse/SPARK-39831
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39831) R dependencies installation start to fail after devtools_2.4.4 was released

2022-07-21 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-39831:
-

 Summary: R dependencies installation start to fail after 
devtools_2.4.4 was released
 Key: SPARK-39831
 URL: https://issues.apache.org/jira/browse/SPARK-39831
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39830) Reading ORC table that requires type promotion may throw AIOOBE

2022-07-21 Thread dzcxzl (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569385#comment-17569385
 ] 

dzcxzl commented on SPARK-39830:


cc @[~dongjoon]

> Reading ORC table that requires type promotion may throw AIOOBE
> ---
>
> Key: SPARK-39830
> URL: https://issues.apache.org/jira/browse/SPARK-39830
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: dzcxzl
>Priority: Trivial
>
> We can add a UT to test the scenario after the ORC-1205 release.
>  
> bin/spark-shell
> {code:java}
> spark.sql("set orc.stripe.size=10240")
> spark.sql("set orc.rows.between.memory.checks=1")
> spark.sql("set spark.sql.orc.columnarWriterBatchSize=1")
> val df = spark.range(1, 1+512, 1, 1).map { i =>
>     if( i == 1 ){
>         (i, Array.fill[Byte](5 * 1024 * 1024)('X'))
>     } else {
>         (i,Array.fill[Byte](1)('X'))
>     }
>     }.toDF("c1","c2")
> df.write.format("orc").save("file:///tmp/test_table_orc_t1")
> spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) 
> location 'file:///tmp/test_table_orc_t1' stored as orc ")
> spark.sql("select * from test_table_orc_t1").show() {code}
> Querying this table throws the following exception:
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 1
>         at 
> org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387)
>         at 
> org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740)
>         at 
> org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069)
>         at 
> org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
>         at 
> org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
>         at 
> org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
>         at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371)
>         at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84)
>         at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102)
>         at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39830) Reading ORC table that requires type promotion may throw AIOOBE

2022-07-21 Thread dzcxzl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-39830:
---
Description: 
We can add a UT to test the scenario after the ORC-1205 release.

 

bin/spark-shell
{code:java}
spark.sql("set orc.stripe.size=10240")
spark.sql("set orc.rows.between.memory.checks=1")
spark.sql("set spark.sql.orc.columnarWriterBatchSize=1")
val df = spark.range(1, 1+512, 1, 1).map { i =>
    if( i == 1 ){
        (i, Array.fill[Byte](5 * 1024 * 1024)('X'))
    } else {
        (i,Array.fill[Byte](1)('X'))
    }
    }.toDF("c1","c2")
df.write.format("orc").save("file:///tmp/test_table_orc_t1")
spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) 
location 'file:///tmp/test_table_orc_t1' stored as orc ")
spark.sql("select * from test_table_orc_t1").show() {code}
Querying this table results in the following exception:
{code:java}
java.lang.ArrayIndexOutOfBoundsException: 1
        at 
org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387)
        at 
org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740)
        at 
org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069)
        at 
org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
        at 
org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
        at 
org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
        at 
org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371)
        at 
org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84)
        at 
org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102)
        at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
 {code}
 

  was:
 
{code:java}
spark.sql("set orc.stripe.size=10240")
spark.sql("set orc.rows.between.memory.checks=1")
spark.sql("set spark.sql.orc.columnarWriterBatchSize=1")
val df = spark.range(1, 1+512, 1, 1).map { i =>
    if( i == 1 ){
        (i, Array.fill[Byte](5 * 1024 * 1024)('X'))
    } else {
        (i,Array.fill[Byte](1)('X'))
    }
    }.toDF("c1","c2")
df.write.format("orc").save("file:///tmp/test_table_orc_t1")
spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) 
location 'file:///tmp/test_table_orc_t1' stored as orc ")
spark.sql("select * from test_table_orc_t1").show() {code}
Querying this table results in the following exception:

 
{code:java}
java.lang.ArrayIndexOutOfBoundsException: 1
        at 
org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387)
        at 
org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740)
        at 
org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069)
        at 
org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
        at 
org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
        at 
org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
        at 
org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371)
        at 
org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84)
        at 
org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102)
        at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
 {code}
 

 

We can add a UT to test the scenario after the 
[ORC-1205|https://issues.apache.org/jira/browse/ORC-1205] release


> Reading ORC table that requires type promotion may throw AIOOBE
> ---
>
> Key: SPARK-39830
> URL: https://issues.apache.org/jira/browse/SPARK-39830
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: dzcxzl
>Priority: Trivial
>
> We can add a UT to test the scenario after the ORC-1205 release.
>  
> bin/spark-shell
> {code:java}
> spark.sql("set orc.stripe.size=10240")
> spark.sql("set orc.rows.between.memory.checks=1")
> spark.sql("set spark.sql.orc.columnarWriterBatchSize=1")
> val df = spark.range(1, 1+512, 1, 1).map { i =>
>     if( i == 1 ){
>         (i, Array.fill[Byte](5 * 1024 * 1024)('X'))
>     } else {
>         (i,Array.fill[Byte](1)('X'))
>     }
>     }.toDF("c1","c2")
> df.write.format("orc").save("file:///tmp/test_table_orc_t1")
> spark.sql("create external

[jira] [Created] (SPARK-39830) Reading ORC table that requires type promotion may throw AIOOBE

2022-07-21 Thread dzcxzl (Jira)
dzcxzl created SPARK-39830:
--

 Summary: Reading ORC table that requires type promotion may throw 
AIOOBE
 Key: SPARK-39830
 URL: https://issues.apache.org/jira/browse/SPARK-39830
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: dzcxzl


 
{code:java}
spark.sql("set orc.stripe.size=10240")
spark.sql("set orc.rows.between.memory.checks=1")
spark.sql("set spark.sql.orc.columnarWriterBatchSize=1")
val df = spark.range(1, 1+512, 1, 1).map { i =>
    if( i == 1 ){
        (i, Array.fill[Byte](5 * 1024 * 1024)('X'))
    } else {
        (i,Array.fill[Byte](1)('X'))
    }
    }.toDF("c1","c2")
df.write.format("orc").save("file:///tmp/test_table_orc_t1")
spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) 
location 'file:///tmp/test_table_orc_t1' stored as orc ")
spark.sql("select * from test_table_orc_t1").show() {code}
Querying this table results in the following exception:

 
{code:java}
java.lang.ArrayIndexOutOfBoundsException: 1
        at 
org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387)
        at 
org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740)
        at 
org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069)
        at 
org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
        at 
org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
        at 
org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
        at 
org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371)
        at 
org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84)
        at 
org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102)
        at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
 {code}
 

 

We can add a UT to test this scenario after the 
[ORC-1205|https://issues.apache.org/jira/browse/ORC-1205] release; a sketch of 
such a check follows below.
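
A minimal sketch of such a check, in the same spark-shell style as the repro above. It assumes the repro has already written the data and created the external table; the expected row count comes from the spark.range call, and the assertion is only illustrative:
{code:scala}
// Illustrative check only: run it after upgrading to an ORC release that
// contains ORC-1205, with the repro above already executed. The repro writes
// 512 rows (spark.range(1, 1 + 512)), so a full scan through the promoted
// bigint -> string column c1 should return all of them instead of failing
// with ArrayIndexOutOfBoundsException.
val rows = spark.sql("select * from test_table_orc_t1").count()
assert(rows == 512, s"expected 512 rows, got $rows")
{code}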



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38597) Enable resource limited spark k8s IT in GA

2022-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569382#comment-17569382
 ] 

Apache Spark commented on SPARK-38597:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37244

> Enable resource limited spark k8s IT in GA
> --
>
> Key: SPARK-38597
> URL: https://issues.apache.org/jira/browse/SPARK-38597
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38691) Use error classes in the compilation errors of column/attr resolving

2022-07-21 Thread Goutam Ghosh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569352#comment-17569352
 ] 

Goutam Ghosh commented on SPARK-38691:
--

I am working on this.

> Use error classes in the compilation errors of column/attr resolving
> 
>
> Key: SPARK-38691
> URL: https://issues.apache.org/jira/browse/SPARK-38691
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * cannotResolveUserSpecifiedColumnsError
> * cannotResolveStarExpandGivenInputColumnsError
> * cannotResolveAttributeError
> * cannotResolveColumnGivenInputColumnsError
> * cannotResolveColumnNameAmongAttributesError
> * cannotResolveColumnNameAmongFieldsError
> to use error classes. Throw an implementation of SparkThrowable, and write 
> a test for every error in QueryCompilationErrorsSuite (a sketch of the general 
> migration shape follows below).
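> A hedged sketch of the general shape such a migration takes, added here for
> illustration; the error class name and the exact AnalysisException constructor
> are assumptions (they have changed across Spark versions), not the actual code
> of this ticket:
> {code:scala}
> // Illustrative sketch, not Spark's real QueryCompilationErrors code.
> // An AnalysisException built with an error class implements SparkThrowable;
> // the constructor shape below (errorClass plus an Array of parameters)
> // follows the 3.3-era API and is an assumption.
> import org.apache.spark.sql.AnalysisException
>
> def cannotResolveColumnGivenInputColumnsError(colName: String, inputColumns: String): Throwable =
>   new AnalysisException(
>     errorClass = "MISSING_COLUMN",
>     messageParameters = Array(colName, inputColumns))
> {code}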



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39827) add_months() returns a java error on overflow

2022-07-21 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-39827.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37240
[https://github.com/apache/spark/pull/37240]

> add_months() returns a java error on overflow
> -
>
> Key: SPARK-39827
> URL: https://issues.apache.org/jira/browse/SPARK-39827
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> The code below throws a raw Java exception:
> {code:java}
> spark.sql("SET spark.sql.ansi.enabled=true").show()
> spark.sql("SELECT add_months('550-12-31', 1000)").show()
>
> java.lang.ArithmeticException: integer overflow
>         at java.base/java.lang.Math.toIntExact(Math.java:1074)
>         at org.apache.spark.sql.catalyst.util.DateTimeUtils$.localDateToDays(DateTimeUtils.scala:550)
>         at org.apache.spark.sql.catalyst.util.DateTimeUtils$.dateAddMonths(DateTimeUtils.scala:736)
>  {code}
> but it should throw Spark's exception with an error class.
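> A hedged sketch of the intended behaviour; the helper and the error-class name
> are illustrative assumptions, not the actual fix merged in pull request 37240:
> {code:scala}
> // Illustrative only: catch the raw Java ArithmeticException from the date
> // arithmetic and rethrow it as a Spark exception carrying an error-class
> // style message. All names here are hypothetical.
> import org.apache.spark.SparkException
>
> def withOverflowCheck[T](op: String)(body: => T): T =
>   try body
>   catch {
>     case _: ArithmeticException =>
>       throw new SparkException(s"[DATETIME_OVERFLOW] $op overflows the supported date range.")
>   }
>
> // Example: withOverflowCheck("add_months('550-12-31', 1000)") { /* date arithmetic */ 0 }
> {code}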



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39622) ParquetIOSuite fails intermittently on master branch

2022-07-21 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569324#comment-17569324
 ] 

Yang Jie commented on SPARK-39622:
--

Maybe the test suite became flaky after SPARK-39195 was merged. I reverted it 
and ran "SPARK-7837 Do not close output writer twice when commitTask() fails" 
dozens of times without a failure. I am still investigating the root cause.

[~kabhwan] [~hyukjin.kwon] [~cloud_fan] 

> ParquetIOSuite fails intermittently on master branch
> 
>
> Key: SPARK-39622
> URL: https://issues.apache.org/jira/browse/SPARK-39622
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> "SPARK-7837 Do not close output writer twice when commitTask() fails" in 
> ParquetIOSuite fails intermittently with master branch. 
> Assertion error follows:
> {code}
> "Job aborted due to stage failure: Authorized committer (attemptNumber=0, 
> stage=1, partition=0) failed; but task commit success, data duplication may 
> happen." did not contain "Intentional exception for testing purposes"
> ScalaTestFailureLocation: 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite at 
> (ParquetIOSuite.scala:1216)
> org.scalatest.exceptions.TestFailedException: "Job aborted due to stage 
> failure: Authorized committer (attemptNumber=0, stage=1, partition=0) failed; 
> but task commit success, data duplication may happen." did not contain 
> "Intentional exception for testing purposes"
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259(ParquetIOSuite.scala:1216)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$259$adapted(ParquetIOSuite.scala:1209)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66)
>   at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:33)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$256(ParquetIOSuite.scala:1209)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(ParquetIOSuite.scala:56)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:247)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:245)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.withSQLConf(ParquetIOSuite.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite.$anonfun$new$255(ParquetIOSuite.scala:1190)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64)
>   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233)
> 

[jira] [Resolved] (SPARK-39469) Infer date type for CSV schema inference

2022-07-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-39469.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36871
[https://github.com/apache/spark/pull/36871]

> Infer date type for CSV schema inference
> 
>
> Key: SPARK-39469
> URL: https://issues.apache.org/jira/browse/SPARK-39469
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Jonathan Cui
>Priority: Major
> Fix For: 3.4.0
>
>
> 1. If a column contains only dates, it should be of “date” type in the 
> inferred schema
>  * If the date format and the timestamp format are identical (e.g. both are 
> yyyy/mm/dd), entries default to being interpreted as dates
> 2. If a column contains both dates and timestamps, it should be of “timestamp” 
> type in the inferred schema (see the sketch below)
>  
> A similar issue was opened in the past but was reverted due to the lack of 
> strict pattern matching. 
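> A spark-shell sketch illustrating the two rules above. The option that actually
> enables date inference is not named in this issue, so only generic CSV options
> appear here; the inferred types are the expectation described above, not a
> guarantee on any particular Spark version:
> {code:scala}
> import java.nio.file.{Files, Paths}
>
> // Column `d` holds only dates (rule 1 -> DateType); column `mixed` holds a
> // date and a timestamp (rule 2 -> TimestampType).
> val dir = Files.createTempDirectory("csv-date-infer").toString
> Files.write(Paths.get(dir, "dates.csv"),
>   "d,mixed\n2022/07/21,2022/07/21\n2022/07/22,2022/07/22 11:30:00\n".getBytes)
>
> val df = spark.read
>   .option("header", "true")
>   .option("inferSchema", "true")
>   .option("dateFormat", "yyyy/MM/dd")
>   .option("timestampFormat", "yyyy/MM/dd HH:mm:ss")
>   .csv(dir)
>
> df.printSchema()
> {code}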



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38916) Tasks not killed caused by race conditions between killTask() and launchTask()

2022-07-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-38916:

Fix Version/s: (was: 3.2.2)

> Tasks not killed caused by race conditions between killTask() and launchTask()
> --
>
> Key: SPARK-38916
> URL: https://issues.apache.org/jira/browse/SPARK-38916
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Minor
> Fix For: 3.3.0
>
>
> Sometimes, when the scheduler tries to cancel a task right after launching it 
> on the executor, the KillTask and LaunchTask events can arrive in reversed 
> order, causing the task to escape the kill signal and finish "secretly". Such 
> tasks even show up as un-launched in the Spark UI.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39469) Infer date type for CSV schema inference

2022-07-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-39469:
---

Assignee: Jonathan Cui

> Infer date type for CSV schema inference
> 
>
> Key: SPARK-39469
> URL: https://issues.apache.org/jira/browse/SPARK-39469
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Jonathan Cui
>Assignee: Jonathan Cui
>Priority: Major
> Fix For: 3.4.0
>
>
> 1. If a column contains only dates, it should be of “date” type in the 
> inferred schema
>  * If the date format and the timestamp format are identical (e.g. both are 
> yyyy/mm/dd), entries default to being interpreted as dates
> 2. If a column contains both dates and timestamps, it should be of “timestamp” 
> type in the inferred schema
>  
> A similar issue was opened in the past but was reverted due to the lack of 
> strict pattern matching. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org