[jira] [Created] (SPARK-38840) Enable spark.sql.parquet.enableNestedColumnVectorizedReader on master branch by default
Chao Sun created SPARK-38840:

Summary: Enable spark.sql.parquet.enableNestedColumnVectorizedReader on master branch by default
Key: SPARK-38840
URL: https://issues.apache.org/jira/browse/SPARK-38840
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.4.0
Reporter: Chao Sun

We can enable {{spark.sql.parquet.enableNestedColumnVectorizedReader}} on the master branch by default, to make sure it is covered by more tests.

--
This message was sent by Atlassian Jira (v8.20.1#820001)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
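For anyone who wants to try the flag before the default changes, it can be flipped per session. A minimal sketch; the file path and column names below are placeholders, not from this issue:

{code:scala}
// Enable the vectorized Parquet reader for nested types (struct/array/map).
spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "true")

// Any Parquet file with a nested schema now goes through the vectorized
// path; "/tmp/nested.parquet" and "someStruct.field" are hypothetical.
val df = spark.read.parquet("/tmp/nested.parquet")
df.select("someStruct.field").show()
{code}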
[jira] [Resolved] (SPARK-38786) Test Bug in StatisticsSuite "change stats after add/drop partition command"
[ https://issues.apache.org/jira/browse/SPARK-38786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-38786.
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 36075
[https://github.com/apache/spark/pull/36075]

> Test Bug in StatisticsSuite "change stats after add/drop partition command"
> Key: SPARK-38786
> URL: https://issues.apache.org/jira/browse/SPARK-38786
> Project: Spark
> Issue Type: Test
> Components: SQL, Tests
> Affects Versions: 3.4.0
> Reporter: Kazuyuki Tanimura
> Assignee: Kazuyuki Tanimura
> Priority: Minor
> Fix For: 3.4.0
>
> [https://github.com/apache/spark/blob/cbffc12f90e45d33e651e38cf886d7ab4bcf96da/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala#L979]
> It should be `partDir2` instead of `partDir1`. Looks like it is a copy-paste bug.
[jira] [Assigned] (SPARK-38786) Test Bug in StatisticsSuite "change stats after add/drop partition command"
[ https://issues.apache.org/jira/browse/SPARK-38786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-38786:
Assignee: Kazuyuki Tanimura

> Test Bug in StatisticsSuite "change stats after add/drop partition command"
> Key: SPARK-38786
> URL: https://issues.apache.org/jira/browse/SPARK-38786
> Project: Spark
> Issue Type: Test
> Components: SQL, Tests
> Affects Versions: 3.4.0
> Reporter: Kazuyuki Tanimura
> Assignee: Kazuyuki Tanimura
> Priority: Minor
>
> [https://github.com/apache/spark/blob/cbffc12f90e45d33e651e38cf886d7ab4bcf96da/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala#L979]
> It should be `partDir2` instead of `partDir1`. Looks like it is a copy-paste bug.
[jira] [Assigned] (SPARK-34863) Support nested column in Spark Parquet vectorized readers
[ https://issues.apache.org/jira/browse/SPARK-34863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-34863:
Assignee: Chao Sun (was: Apache Spark)

> Support nested column in Spark Parquet vectorized readers
> Key: SPARK-34863
> URL: https://issues.apache.org/jira/browse/SPARK-34863
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Cheng Su
> Assignee: Chao Sun
> Priority: Minor
> Fix For: 3.3.0
>
> The task is to support nested column types in the Spark Parquet vectorized reader. Currently the Parquet vectorized reader does not support nested column types (struct, array and map). We implemented a nested column vectorized reader for FB-ORC in our internal fork of Spark. We are seeing performance improvement compared to the non-vectorized reader when reading nested columns. In addition, this can also help improve the non-nested column performance when reading non-nested and nested columns together in one query.
>
> Parquet:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L173]
[jira] [Updated] (SPARK-37378) Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog
[ https://issues.apache.org/jira/browse/SPARK-37378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated SPARK-37378:
Fix Version/s: 3.4.0

> Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog
> Key: SPARK-37378
> URL: https://issues.apache.org/jira/browse/SPARK-37378
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Chao Sun
> Priority: Major
> Fix For: 3.4.0
>
> We need to add logic to convert a V2 {{Transform}} expression into its catalyst expression counterpart, and also load its function definition from the V2 FunctionCatalog provided.
[jira] [Resolved] (SPARK-37378) Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog
[ https://issues.apache.org/jira/browse/SPARK-37378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-37378.
Resolution: Duplicate

This JIRA is covered as part of SPARK-37377.

> Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog
> Key: SPARK-37378
> URL: https://issues.apache.org/jira/browse/SPARK-37378
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Chao Sun
> Priority: Major
>
> We need to add logic to convert a V2 {{Transform}} expression into its catalyst expression counterpart, and also load its function definition from the V2 FunctionCatalog provided.
[jira] [Updated] (SPARK-37377) Initial implementation of Storage-Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated SPARK-37377:
Description: This Jira tracks the initial implementation of storage-partitioned join.
(was: Currently {{Partitioning}} is defined as follows:
{code:java}
@Evolving
public interface Partitioning {
  int numPartitions();
  boolean satisfy(Distribution distribution);
}
{code}
There are two issues with the interface: 1) it uses a deprecated {{Distribution}} interface, and should switch to {{org.apache.spark.sql.connector.distributions.Distribution}}. 2) currently there is no way to use this in a join where we want to compare the reported partitionings from both sides and decide whether they are "compatible" (thus allowing Spark to eliminate the shuffle).)

> Initial implementation of Storage-Partitioned Join
> Key: SPARK-37377
> URL: https://issues.apache.org/jira/browse/SPARK-37377
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Chao Sun
> Assignee: Chao Sun
> Priority: Major
> Fix For: 3.4.0
>
> This Jira tracks the initial implementation of storage-partitioned join.
[jira] [Updated] (SPARK-37377) Initial implementation of Storage-Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated SPARK-37377:
Summary: Initial implementation of Storage-Partitioned Join
(was: Refactor V2 Partitioning interface and remove deprecated usage of Distribution)

> Initial implementation of Storage-Partitioned Join
> Key: SPARK-37377
> URL: https://issues.apache.org/jira/browse/SPARK-37377
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Chao Sun
> Assignee: Chao Sun
> Priority: Major
> Fix For: 3.4.0
>
> Currently {{Partitioning}} is defined as follows:
> {code:java}
> @Evolving
> public interface Partitioning {
>   int numPartitions();
>   boolean satisfy(Distribution distribution);
> }
> {code}
> There are two issues with the interface: 1) it uses a deprecated {{Distribution}} interface, and should switch to {{org.apache.spark.sql.connector.distributions.Distribution}}. 2) currently there is no way to use this in a join where we want to compare the reported partitionings from both sides and decide whether they are "compatible" (thus allowing Spark to eliminate the shuffle).
[jira] [Updated] (SPARK-37974) Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
[ https://issues.apache.org/jira/browse/SPARK-37974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated SPARK-37974:
Fix Version/s: 3.3.0
(was: 3.4.0)

> Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
> Key: SPARK-37974
> URL: https://issues.apache.org/jira/browse/SPARK-37974
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Parth Chandra
> Assignee: Parth Chandra
> Priority: Major
> Fix For: 3.3.0
>
> SPARK-36879 implements the DELTA_BINARY_PACKED encoding, which is for integer values, but does not implement the DELTA_BYTE_ARRAY encoding, which is for string values. The DELTA_BYTE_ARRAY encoding also requires the DELTA_LENGTH_BYTE_ARRAY encoding. Both encodings need vectorized versions, as the current implementation simply calls the non-vectorized Parquet library methods.
[jira] [Resolved] (SPARK-37974) Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
[ https://issues.apache.org/jira/browse/SPARK-37974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-37974.
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 35262
[https://github.com/apache/spark/pull/35262]

> Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
> Key: SPARK-37974
> URL: https://issues.apache.org/jira/browse/SPARK-37974
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Parth Chandra
> Assignee: Parth Chandra
> Priority: Major
> Fix For: 3.4.0
>
> SPARK-36879 implements the DELTA_BINARY_PACKED encoding, which is for integer values, but does not implement the DELTA_BYTE_ARRAY encoding, which is for string values. The DELTA_BYTE_ARRAY encoding also requires the DELTA_LENGTH_BYTE_ARRAY encoding. Both encodings need vectorized versions, as the current implementation simply calls the non-vectorized Parquet library methods.
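As a way to exercise the new decoders end to end, Parquet V2 files can be produced with the parquet-mr writer property and read back. A hedged sketch; the paths are placeholders and "parquet.writer.version" is the parquet-mr property name, assumed to be honored when passed as a write option:

{code:scala}
// Writing with the Parquet V2 writer selects DELTA_BYTE_ARRAY /
// DELTA_LENGTH_BYTE_ARRAY for string columns; reading the result back
// should hit the new vectorized decoders.
spark.range(1000)
  .selectExpr("cast(id as string) as s")
  .write
  .option("parquet.writer.version", "v2") // parquet-mr writer property
  .parquet("/tmp/parquet-v2")             // placeholder path

spark.read.parquet("/tmp/parquet-v2").count()
{code}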
[jira] [Assigned] (SPARK-37974) Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
[ https://issues.apache.org/jira/browse/SPARK-37974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-37974:
Assignee: Parth Chandra

> Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
> Key: SPARK-37974
> URL: https://issues.apache.org/jira/browse/SPARK-37974
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Parth Chandra
> Assignee: Parth Chandra
> Priority: Major
>
> SPARK-36879 implements the DELTA_BINARY_PACKED encoding, which is for integer values, but does not implement the DELTA_BYTE_ARRAY encoding, which is for string values. The DELTA_BYTE_ARRAY encoding also requires the DELTA_LENGTH_BYTE_ARRAY encoding. Both encodings need vectorized versions, as the current implementation simply calls the non-vectorized Parquet library methods.
[jira] [Resolved] (SPARK-36679) Remove lz4 hadoop wrapper classes after Hadoop 3.3.2
[ https://issues.apache.org/jira/browse/SPARK-36679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-36679.
Fix Version/s: 3.3.0
Resolution: Duplicate

> Remove lz4 hadoop wrapper classes after Hadoop 3.3.2
> Key: SPARK-36679
> URL: https://issues.apache.org/jira/browse/SPARK-36679
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.3.0
> Reporter: L. C. Hsieh
> Priority: Major
> Fix For: 3.3.0
>
> Lz4-java as a provided dependency is not correctly excluded from relocation in the Hadoop shaded client libraries in Hadoop 3.3.1 (HADOOP-17891).
>
> In order to deal with the issue without reverting back to the non-shaded client libraries, we added a few Lz4 Hadoop wrapper classes `LZ4Factory`, `LZ4Compressor`, and `LZ4SafeDecompressor`, under the package `org.apache.hadoop.shaded.net.jpountz.lz4`.
>
> We should remove these wrapper classes after the Hadoop 3.3.2 release, which should include the fix.
[jira] [Resolved] (SPARK-38179) Improve WritableColumnVector to better support null struct
[ https://issues.apache.org/jira/browse/SPARK-38179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-38179.
Resolution: Won't Fix

> Improve WritableColumnVector to better support null struct
> Key: SPARK-38179
> URL: https://issues.apache.org/jira/browse/SPARK-38179
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Chao Sun
> Priority: Minor
>
> Currently a {{WritableColumnVector}} of struct type requires allocating space in all child vectors for null elements. This is not very space efficient. In addition, this model doesn't work well with Parquet vectorized scan for struct (in SPARK-34863).
[jira] [Assigned] (SPARK-38237) Introduce a new config to require all cluster keys on Aggregate
[ https://issues.apache.org/jira/browse/SPARK-38237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-38237:
Assignee: Cheng Su

> Introduce a new config to require all cluster keys on Aggregate
> Key: SPARK-38237
> URL: https://issues.apache.org/jira/browse/SPARK-38237
> Project: Spark
> Issue Type: Task
> Components: SQL, Structured Streaming
> Affects Versions: 3.3.0
> Reporter: Jungtaek Lim
> Assignee: Cheng Su
> Priority: Major
> Fix For: 3.3.0
>
> We still find HashClusteredDistribution to be useful for batch queries as well. For example, we had a case with lower parallelism than expected due to the fact that ClusteredDistribution is used for aggregation, which matches HashPartitioning with sub-key groups (note that the actual parallelism also depends on "cardinality" - picking sub-key groups means having less cardinality).
> We propose to introduce a new config to require all cluster keys on Aggregate, leveraging HashClusteredDistribution. That said, we propose to rename HashClusteredDistribution back, retaining the NOTE for the stateful operator. The distribution should still not be touched due to the requirement of the stateful operator, but it can be co-used with the batch case if needed.
[jira] [Resolved] (SPARK-38237) Introduce a new config to require all cluster keys on Aggregate
[ https://issues.apache.org/jira/browse/SPARK-38237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-38237.
Fix Version/s: 3.3.0
Resolution: Fixed

Issue resolved by pull request 35574
[https://github.com/apache/spark/pull/35574]

> Introduce a new config to require all cluster keys on Aggregate
> Key: SPARK-38237
> URL: https://issues.apache.org/jira/browse/SPARK-38237
> Project: Spark
> Issue Type: Task
> Components: SQL, Structured Streaming
> Affects Versions: 3.3.0
> Reporter: Jungtaek Lim
> Priority: Major
> Fix For: 3.3.0
>
> We still find HashClusteredDistribution to be useful for batch queries as well. For example, we had a case with lower parallelism than expected due to the fact that ClusteredDistribution is used for aggregation, which matches HashPartitioning with sub-key groups (note that the actual parallelism also depends on "cardinality" - picking sub-key groups means having less cardinality).
> We propose to introduce a new config to require all cluster keys on Aggregate, leveraging HashClusteredDistribution. That said, we propose to rename HashClusteredDistribution back, retaining the NOTE for the stateful operator. The distribution should still not be touched due to the requirement of the stateful operator, but it can be co-used with the batch case if needed.
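The resulting knob can then be turned on per session. A minimal sketch; the config name below is my reading of pull request 35574, so please verify it against your Spark build before relying on it:

{code:scala}
// Require aggregation (and similar operators) to use all cluster keys
// for distribution, avoiding the low-parallelism sub-key-group matching
// described above. Config name assumed from the linked PR.
spark.conf.set("spark.sql.requireAllClusterKeysForDistribution", "true")
{code}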
[jira] [Created] (SPARK-38179) Improve WritableColumnVector to better support null struct
Chao Sun created SPARK-38179: Summary: Improve WritableColumnVector to better support null struct Key: SPARK-38179 URL: https://issues.apache.org/jira/browse/SPARK-38179 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun Currently {{WritableColumnVector}} of struct type requires to allocate space in all child vectors for null elements. This is not very space efficient. In addition, this model doesn't work well with Parquet vectorized scan for struct (in SPARK-34863). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38077) Spark 3.2.1 breaks binary compatibility with Spark 3.2.0
[ https://issues.apache.org/jira/browse/SPARK-38077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484894#comment-17484894 ]

Chao Sun commented on SPARK-38077:

BTW [~thesamet] it seems Spark only guarantees API compatibility, not binary compatibility, across versions. See https://spark.apache.org/versioning-policy.html

> Spark 3.2.1 breaks binary compatibility with Spark 3.2.0
> Key: SPARK-38077
> URL: https://issues.apache.org/jira/browse/SPARK-38077
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.1
> Reporter: Nadav Samet
> Priority: Major
>
> [PR 35243|https://github.com/apache/spark/pull/35243] introduced a new parameter to class `Invoke` with a default value (`isDeterministic: Boolean = true`). Existing Spark libraries (such as [frameless|https://github.com/typelevel/frameless]) that invoke [Invoke|https://github.com/typelevel/frameless/blob/29961d549e332dddf5cd711ef699dde7460cc48a/dataset/src/main/scala/frameless/RecordEncoder.scala#L154] directly expect a method with 7 parameters, and the new version expects 8.
> If Frameless were recompiled with Spark 3.2.1, the updated library would NOT be binary compatible with Spark 3.2.0. Adding default parameters to existing methods [should be avoided|https://github.com/jatcwang/binary-compatibility-guide#dont-adding-parameters-with-default-values-to-methods].
> One way forward would be to revert the change in the constructor and introduce a second constructor or a companion method that takes all 8 parameters.
[jira] [Commented] (SPARK-38077) Spark 3.2.1 breaks binary compatibility with Spark 3.2.0
[ https://issues.apache.org/jira/browse/SPARK-38077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484873#comment-17484873 ]

Chao Sun commented on SPARK-38077:

Sorry for breaking the binary compatibility. I wasn't aware that `Invoke` is used by other libraries outside Spark and was merely following how other parameters are defined (namely `propagateNull` and `returnNullable`). Let me work on a PR to fix it.

> Spark 3.2.1 breaks binary compatibility with Spark 3.2.0
> Key: SPARK-38077
> URL: https://issues.apache.org/jira/browse/SPARK-38077
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.1
> Reporter: Nadav Samet
> Priority: Major
>
> [PR 35243|https://github.com/apache/spark/pull/35243] introduced a new parameter to class `Invoke` with a default value (`isDeterministic: Boolean = true`). Existing Spark libraries (such as [frameless|https://github.com/typelevel/frameless]) that invoke [Invoke|https://github.com/typelevel/frameless/blob/29961d549e332dddf5cd711ef699dde7460cc48a/dataset/src/main/scala/frameless/RecordEncoder.scala#L154] directly expect a method with 7 parameters, and the new version expects 8.
> If Frameless were recompiled with Spark 3.2.1, the updated library would NOT be binary compatible with Spark 3.2.0. Adding default parameters to existing methods [should be avoided|https://github.com/jatcwang/binary-compatibility-guide#dont-adding-parameters-with-default-values-to-methods].
> One way forward would be to revert the change in the constructor and introduce a second constructor or a companion method that takes all 8 parameters.
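To illustrate why a default parameter is not binary compatible: the default value is filled in at the *call site* at compile time, so code compiled against the old constructor still references the old JVM signature. A minimal sketch with made-up parameters, not the real `Invoke` signature:

{code:scala}
// Version 1, as a downstream library sees it at compile time:
//   class Invoke(a: Int, b: Int)                     // JVM: <init>(II)V
// Version 2 adds a defaulted parameter:
//   class Invoke(a: Int, b: Int, c: Boolean = true)  // JVM: <init>(IIZ)V
//
// A caller compiled against version 1 emitted a call to <init>(II)V,
// which no longer exists in version 2 => NoSuchMethodError at runtime,
// even though `new Invoke(1, 2)` still compiles fine against version 2.
//
// The fix proposed in this issue: keep the old arity as an explicit
// secondary constructor so the old JVM signature survives.
class Invoke(a: Int, b: Int, c: Boolean) {
  def this(a: Int, b: Int) = this(a, b, true) // preserves <init>(II)V
}
{code}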
[jira] [Commented] (SPARK-37994) Unable to build spark3.2 with -Dhadoop.version=3.1.4
[ https://issues.apache.org/jira/browse/SPARK-37994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483399#comment-17483399 ]

Chao Sun commented on SPARK-37994:

Glad it helped [~tanvu]!
{quote}We can omit the -Dcurator.version=2.13.0 -Dcommons-io.version=2.8.0 part, though{quote}
Yea, perhaps. I added them here just to keep the versions in sync with what is being used by Hadoop 3.x. It's annoying that we have to make it compile this way, though. Let me think about whether I should resume SPARK-35959 and add a Maven profile for this.

> Unable to build spark3.2 with -Dhadoop.version=3.1.4
> Key: SPARK-37994
> URL: https://issues.apache.org/jira/browse/SPARK-37994
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 3.2.0
> Reporter: Vu Tan
> Priority: Minor
>
> I downloaded the Spark 3.2 source code from [https://github.com/apache/spark/archive/refs/tags/v3.2.0.zip] and tried building with the below command:
> {code:java}
> ./dev/make-distribution.sh --name without-hadoop --pip --r --tgz -Psparkr -Phive -Phive-thriftserver -Phadoop-provided -Pyarn -Dhadoop.version=3.1.4 -Pkubernetes {code}
> Then it gives the below error:
> {code:java}
> [INFO] --- scala-maven-plugin:4.3.0:compile (scala-compile-first) @ spark-core_2.12 ---
> [INFO] Using incremental compilation using Mixed compile order
> [INFO] Compiler bridge file: /Users/JP28431/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.15__52.0-1.3.1_20191012T045515.jar
> [INFO] compiler plugin: BasicArtifact(com.github.ghik,silencer-plugin_2.12.15,1.7.6,null)
> [INFO] Compiling 567 Scala sources and 104 Java sources to /Users/JP28431/Downloads/spark-3.2.0-github/core/target/scala-2.12/classes ...
> [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:38: > object io is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:2778: > not found: type ArrayWritable > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:2777: > not found: type Writable > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SSLOptions.scala:24: > object conf is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SSLOptions.scala:174: > not found: type Configuration > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SecurityManager.scala:25: > object io is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SecurityManager.scala:26: > object security is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala:33: > object fs is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala:32: > object conf is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala:121: > not found: type Configuration > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SecurityManager.scala:284: > not found: value UserGroupInformation > [ERROR] [Error] > 
/Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:41: > object mapreduce is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:40: > object mapreduce is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:39: > object mapred is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:37: > object fs is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:36: > object conf is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SecurityManager.scala:348: > not found: type Credenti
[jira] [Commented] (SPARK-37994) Unable to build spark3.2 with -Dhadoop.version=3.1.4
[ https://issues.apache.org/jira/browse/SPARK-37994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482632#comment-17482632 ]

Chao Sun commented on SPARK-37994:

[~tanvu] Hmm, in that case maybe you can try:
{code}
./dev/make-distribution.sh --name without-hadoop --pip --tgz -Psparkr -Phive -Phive-thriftserver -Phadoop-provided -Pyarn \
  -Dhadoop.version=3.1.4 -Phadoop-2.7 -Dcurator.version=2.13.0 -Dcommons-io.version=2.8.0
{code}
I tried it and it seems to work.

{quote}-Dhadoop-client-runtime.artifact should be hadoop-client, not hadoop-yarn-api{quote}

That PR is outdated. We switched to using hadoop-yarn-api in order to avoid the exact issue around dependency-reduced-pom.xml you mentioned above.

> Unable to build spark3.2 with -Dhadoop.version=3.1.4
> Key: SPARK-37994
> URL: https://issues.apache.org/jira/browse/SPARK-37994
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 3.2.0
> Reporter: Vu Tan
> Priority: Minor
>
> I downloaded the Spark 3.2 source code from [https://github.com/apache/spark/archive/refs/tags/v3.2.0.zip] and tried building with the below command:
> {code:java}
> ./dev/make-distribution.sh --name without-hadoop --pip --r --tgz -Psparkr -Phive -Phive-thriftserver -Phadoop-provided -Pyarn -Dhadoop.version=3.1.4 -Pkubernetes {code}
> Then it gives the below error:
> {code:java}
> [INFO] --- scala-maven-plugin:4.3.0:compile (scala-compile-first) @ spark-core_2.12 ---
> [INFO] Using incremental compilation using Mixed compile order
> [INFO] Compiler bridge file: /Users/JP28431/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.15__52.0-1.3.1_20191012T045515.jar
> [INFO] compiler plugin: BasicArtifact(com.github.ghik,silencer-plugin_2.12.15,1.7.6,null)
> [INFO] Compiling 567 Scala sources and 104 Java sources to /Users/JP28431/Downloads/spark-3.2.0-github/core/target/scala-2.12/classes ...
> [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:38: > object io is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:2778: > not found: type ArrayWritable > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:2777: > not found: type Writable > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SSLOptions.scala:24: > object conf is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SSLOptions.scala:174: > not found: type Configuration > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SecurityManager.scala:25: > object io is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SecurityManager.scala:26: > object security is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala:33: > object fs is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala:32: > object conf is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala:121: > not found: type Configuration > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SecurityManager.scala:284: > not found: value UserGroupInformation > [ERROR] [Error] > 
/Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:41: > object mapreduce is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:40: > object mapreduce is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:39: > object mapred is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:37: > object fs is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:36: > object conf is not a member of package org.apache.
[jira] [Commented] (SPARK-37994) Unable to build spark3.2 with -Dhadoop.version=3.1.4
[ https://issues.apache.org/jira/browse/SPARK-37994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481327#comment-17481327 ] Chao Sun commented on SPARK-37994: -- I considered adding a new Maven profile for Hadoop versions <= 2.x (see SPARK-35959) but abandoned it due to lack of interest. I could pick it up again if people think it is a good idea. > Unable to build spark3.2 with -Dhadoop.version=3.1.4 > > > Key: SPARK-37994 > URL: https://issues.apache.org/jira/browse/SPARK-37994 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 >Reporter: Vu Tan >Priority: Minor > > I downloaded the Spark 3.2 source code from > [https://github.com/apache/spark/archive/refs/tags/v3.2.0.zip] > and tried building with the command below > {code:java} > ./dev/make-distribution.sh --name without-hadoop --pip --r --tgz -Psparkr > -Phive -Phive-thriftserver -Phadoop-provided -Pyarn -Dhadoop.version=3.1.4 > -Pkubernetes {code} > It then fails with the error below > {code:java} > [INFO] --- scala-maven-plugin:4.3.0:compile (scala-compile-first) @ > spark-core_2.12 --- > [INFO] Using incremental compilation using Mixed compile order > [INFO] Compiler bridge file: > /Users/JP28431/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.15__52.0-1.3.1_20191012T045515.jar > [INFO] compiler plugin: > BasicArtifact(com.github.ghik,silencer-plugin_2.12.15,1.7.6,null) > [INFO] Compiling 567 Scala sources and 104 Java sources to > /Users/JP28431/Downloads/spark-3.2.0-github/core/target/scala-2.12/classes ... 
[jira] [Commented] (SPARK-37994) Unable to build spark3.2 with -Dhadoop.version=3.1.4
[ https://issues.apache.org/jira/browse/SPARK-37994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481326#comment-17481326 ] Chao Sun commented on SPARK-37994: -- Yes, thanks [~xkrogen] for pinging me. [~tanvu]: can you try this command instead? {code} ./dev/make-distribution.sh --name without-hadoop --pip --tgz -Psparkr -Phive -Phive-thriftserver -Phadoop-provided -Pyarn \ -Dhadoop.version=3.1.4 -Pkubernetes \ -Dhadoop-client-api.artifact=hadoop-client \ -Dhadoop-client-runtime.artifact=hadoop-yarn-api \ -Dhadoop-client-minicluster.artifact=hadoop-client {code} > Unable to build spark3.2 with -Dhadoop.version=3.1.4 > > > Key: SPARK-37994 > URL: https://issues.apache.org/jira/browse/SPARK-37994 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 >Reporter: Vu Tan >Priority: Minor > > I downloaded the Spark 3.2 source code from > [https://github.com/apache/spark/archive/refs/tags/v3.2.0.zip] > and tried building with the command below > {code:java} > ./dev/make-distribution.sh --name without-hadoop --pip --r --tgz -Psparkr > -Phive -Phive-thriftserver -Phadoop-provided -Pyarn -Dhadoop.version=3.1.4 > -Pkubernetes {code} > It then fails with the error below > {code:java} > [INFO] --- scala-maven-plugin:4.3.0:compile (scala-compile-first) @ > spark-core_2.12 --- > [INFO] Using incremental compilation using Mixed compile order > [INFO] Compiler bridge file: > /Users/JP28431/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.15__52.0-1.3.1_20191012T045515.jar > [INFO] compiler plugin: > BasicArtifact(com.github.ghik,silencer-plugin_2.12.15,1.7.6,null) > [INFO] Compiling 567 Scala sources and 104 Java sources to > /Users/JP28431/Downloads/spark-3.2.0-github/core/target/scala-2.12/classes ... 
[jira] [Updated] (SPARK-37957) Deterministic flag is not handled for V2 functions
[ https://issues.apache.org/jira/browse/SPARK-37957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37957: - Fix Version/s: 3.2.1 > Deterministic flag is not handled for V2 functions > -- > > Key: SPARK-37957 > URL: https://issues.apache.org/jira/browse/SPARK-37957 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.2.1, 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37957) Deterministic flag is not handled for V2 functions
[ https://issues.apache.org/jira/browse/SPARK-37957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37957. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35243 [https://github.com/apache/spark/pull/35243] > Deterministic flag is not handled for V2 functions > -- > > Key: SPARK-37957 > URL: https://issues.apache.org/jira/browse/SPARK-37957 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37928) Add Parquet Data Page V2 bench scenario to DataSourceReadBenchmark
[ https://issues.apache.org/jira/browse/SPARK-37928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37928: Assignee: Yang Jie > Add Parquet Data Page V2 bench scenario to DataSourceReadBenchmark > -- > > Key: SPARK-37928 > URL: https://issues.apache.org/jira/browse/SPARK-37928 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37928) Add Parquet Data Page V2 bench scenario to DataSourceReadBenchmark
[ https://issues.apache.org/jira/browse/SPARK-37928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37928. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35226 [https://github.com/apache/spark/pull/35226] > Add Parquet Data Page V2 bench scenario to DataSourceReadBenchmark > -- > > Key: SPARK-37928 > URL: https://issues.apache.org/jira/browse/SPARK-37928 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37957) Deterministic flag is not handled for V2 functions
Chao Sun created SPARK-37957: Summary: Deterministic flag is not handled for V2 functions Key: SPARK-37957 URL: https://issues.apache.org/jira/browse/SPARK-37957 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun Assignee: Chao Sun -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
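The gist of the issue above is that the planner must consult a function's deterministic flag before pre-computing or reusing its result. As a loose illustration (plain Python with made-up names like `V2Function` and `try_constant_fold`; this is not Spark's actual FunctionCatalog API), a constant-folding pass that honors the flag might look like:

```python
import random

# Hypothetical sketch: an optimizer should only constant-fold a call with
# literal arguments when the function declares itself deterministic.
class V2Function:
    def __init__(self, fn, deterministic):
        self.fn = fn
        self.deterministic = deterministic

def try_constant_fold(func, args):
    """Fold the call into a literal if safe; otherwise keep the call node."""
    if func.deterministic:
        return ("literal", func.fn(*args))
    return ("call", func, args)

add = V2Function(lambda a, b: a + b, deterministic=True)
rand = V2Function(lambda: random.random(), deterministic=False)

print(try_constant_fold(add, (1, 2)))  # ('literal', 3)
print(try_constant_fold(rand, ())[0])  # 'call' -- must not be folded
```

Folding a non-deterministic call would freeze a single random value into the plan, which is exactly the class of bug a respected deterministic flag prevents.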
[jira] [Resolved] (SPARK-37864) Support Parquet v2 data page RLE encoding (for Boolean Values) for the vectorized path
[ https://issues.apache.org/jira/browse/SPARK-37864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37864. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35163 [https://github.com/apache/spark/pull/35163] > Support Parquet v2 data page RLE encoding (for Boolean Values) for the > vectorized path > -- > > Key: SPARK-37864 > URL: https://issues.apache.org/jira/browse/SPARK-37864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.3.0 > > > Parquet v2 data pages write Boolean values using RLE encoding; reading v2 > Boolean values currently throws the following exception: > > {code:java} > Caused by: java.lang.UnsupportedOperationException: Unsupported encoding: RLE > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.getValuesReader(VectorizedColumnReader.java:305) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:277) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV2(VectorizedColumnReader.java:344) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$100(VectorizedColumnReader.java:48) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:250) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:237) > ~[classes/:?] > at org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:192) > ~[parquet-column-1.12.2.jar:1.12.2] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:237) > ~[classes/:?] 
> at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:173) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:311) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:209) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:298) > ~[classes/:?] > ... 19 more {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
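For context, the unsupported encoding in the trace above is Parquet's RLE/bit-packed hybrid, which v2 data pages use for Boolean values. The toy encoder/decoder below illustrates only the run-length idea in plain Python; it is not the real Parquet wire format (which interleaves RLE and bit-packed runs and uses varint-encoded run headers):

```python
# Simplified run-length encoding for Boolean values: collapse consecutive
# repeats into (count, value) pairs, then expand them back.
def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, v])       # start a new run
    return [(n, v) for n, v in runs]

def rle_decode(runs):
    out = []
    for count, value in runs:
        out.extend([value] * count)
    return out

flags = [True, True, True, False, False, True]
runs = rle_encode(flags)
print(runs)                     # [(3, True), (2, False), (1, True)]
assert rle_decode(runs) == flags
```

Long runs of identical Booleans (very common in filter columns) compress to a handful of pairs, which is why the format favors this encoding.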
[jira] [Assigned] (SPARK-37864) Support Parquet v2 data page RLE encoding (for Boolean Values) for the vectorized path
[ https://issues.apache.org/jira/browse/SPARK-37864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37864: Assignee: Yang Jie > Support Parquet v2 data page RLE encoding (for Boolean Values) for the > vectorized path > -- > > Key: SPARK-37864 > URL: https://issues.apache.org/jira/browse/SPARK-37864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > Parquet v2 data pages write Boolean values using RLE encoding; reading v2 > Boolean values currently throws the following exception: > > {code:java} > Caused by: java.lang.UnsupportedOperationException: Unsupported encoding: RLE > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.getValuesReader(VectorizedColumnReader.java:305) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:277) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV2(VectorizedColumnReader.java:344) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$100(VectorizedColumnReader.java:48) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:250) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:237) > ~[classes/:?] > at org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:192) > ~[parquet-column-1.12.2.jar:1.12.2] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:237) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:173) > ~[classes/:?] 
> at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:311) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:209) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:298) > ~[classes/:?] > ... 19 more {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36879) Support Parquet v2 data page encodings for the vectorized path
[ https://issues.apache.org/jira/browse/SPARK-36879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-36879: Assignee: Parth Chandra > Support Parquet v2 data page encodings for the vectorized path > -- > > Key: SPARK-36879 > URL: https://issues.apache.org/jira/browse/SPARK-36879 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Parth Chandra >Priority: Major > Fix For: 3.3.0 > > > Currently Spark only supports Parquet V1 encodings (i.e., > PLAIN/DICTIONARY/RLE) in the vectorized path, and throws an exception otherwise: > {code} > java.lang.UnsupportedOperationException: Unsupported encoding: > DELTA_BYTE_ARRAY > {code} > It would be good to support v2 encodings too, including DELTA_BINARY_PACKED, > DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY as well as BYTE_STREAM_SPLIT as > listed in https://github.com/apache/parquet-format/blob/master/Encodings.md -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36879) Support Parquet v2 data page encodings for the vectorized path
[ https://issues.apache.org/jira/browse/SPARK-36879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-36879. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34471 [https://github.com/apache/spark/pull/34471] > Support Parquet v2 data page encodings for the vectorized path > -- > > Key: SPARK-36879 > URL: https://issues.apache.org/jira/browse/SPARK-36879 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > Fix For: 3.3.0 > > > Currently Spark only supports Parquet V1 encodings (i.e., > PLAIN/DICTIONARY/RLE) in the vectorized path, and throws an exception otherwise: > {code} > java.lang.UnsupportedOperationException: Unsupported encoding: > DELTA_BYTE_ARRAY > {code} > It would be good to support v2 encodings too, including DELTA_BINARY_PACKED, > DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY as well as BYTE_STREAM_SPLIT as > listed in https://github.com/apache/parquet-format/blob/master/Encodings.md -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled
[ https://issues.apache.org/jira/browse/SPARK-37633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37633: - Affects Version/s: (was: 3.0.3) > Unwrap cast should skip if downcast failed with ansi enabled > > > Key: SPARK-37633 > URL: https://issues.apache.org/jira/browse/SPARK-37633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > Fix For: 3.2.1, 3.3.0 > > > Currently, unwrap cast throws an ArithmeticException if the downcast fails with > ANSI mode enabled. Since UnwrapCastInBinaryComparison is an optimizer rule, we > should always skip on failure regardless of the ANSI config. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled
[ https://issues.apache.org/jira/browse/SPARK-37633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37633: Assignee: Manu Zhang > Unwrap cast should skip if downcast failed with ansi enabled > > > Key: SPARK-37633 > URL: https://issues.apache.org/jira/browse/SPARK-37633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > > Currently, unwrap cast throws an ArithmeticException if the downcast fails with > ANSI mode enabled. Since UnwrapCastInBinaryComparison is an optimizer rule, we > should always skip on failure regardless of the ANSI config. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled
[ https://issues.apache.org/jira/browse/SPARK-37633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37633. -- Fix Version/s: 3.3.0 3.2.1 Resolution: Fixed Issue resolved by pull request 34888 [https://github.com/apache/spark/pull/34888] > Unwrap cast should skip if downcast failed with ansi enabled > > > Key: SPARK-37633 > URL: https://issues.apache.org/jira/browse/SPARK-37633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > Fix For: 3.3.0, 3.2.1 > > > Currently, unwrap cast throws an ArithmeticException if the downcast fails with > ANSI mode enabled. Since UnwrapCastInBinaryComparison is an optimizer rule, we > should always skip on failure regardless of the ANSI config. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
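The skip-on-failure behavior described above can be pictured with a toy downcast helper (plain Python with a hypothetical name, `try_downcast_to_int`; this is not Spark's implementation): when a literal does not fit the narrower type, the rewrite is abandoned instead of raising, no matter what the ANSI setting says:

```python
# 32-bit signed integer range, the target type of the hypothetical downcast.
INT_MIN, INT_MAX = -(2**31), 2**31 - 1

def try_downcast_to_int(value):
    """Attempt the downcast an optimizer rule like
    UnwrapCastInBinaryComparison needs: return the narrowed value, or None
    to signal 'skip the rewrite' rather than raising an exception."""
    if INT_MIN <= value <= INT_MAX:
        return int(value)
    return None  # out of range: leave the original cast(col) = literal alone

# In range: the comparison can be rewritten against the narrower column.
assert try_downcast_to_int(42) == 42
# Out of range: the rewrite is skipped instead of failing the whole query.
assert try_downcast_to_int(2**40) is None
```

Returning a sentinel instead of throwing keeps an optimization rule purely optional, which is the point of the fix.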
[jira] [Updated] (SPARK-37217) The number of dynamic partitions should early check when writing to external tables
[ https://issues.apache.org/jira/browse/SPARK-37217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37217: - Fix Version/s: 3.2.1 > The number of dynamic partitions should early check when writing to external > tables > --- > > Key: SPARK-37217 > URL: https://issues.apache.org/jira/browse/SPARK-37217 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.2.1, 3.3.0 > > > [SPARK-29295|https://issues.apache.org/jira/browse/SPARK-29295] introduced a > mechanism where writes to external tables use dynamic partitioning, and > the data in the target partitions is deleted first. > Assuming that 1001 partitions are written, the data of those 1001 partitions will > be deleted first, but because hive.exec.max.dynamic.partitions is 1000 by > default, loadDynamicPartitions will then fail, even though the data of the 1001 > partitions has already been deleted. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37481) Disappearance of skipped stages mislead the bug hunting
[ https://issues.apache.org/jira/browse/SPARK-37481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37481: - Fix Version/s: 3.2.1 (was: 3.2.0) > Disappearance of skipped stages mislead the bug hunting > > > Key: SPARK-37481 > URL: https://issues.apache.org/jira/browse/SPARK-37481 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.2.1, 3.3.0 > > > ## With FetchFailedException and Map Stage Retries > When rerunning the spark-sql shell with the original SQL in > [https://gist.github.com/yaooqinn/6acb7b74b343a6a6dffe8401f6b7b45c#gistcomment-3977315] > !https://user-images.githubusercontent.com/8326978/143821530-ff498caa-abce-483d-a24b-315aacf7e0a0.png! > 1. stage 3 threw FetchFailedException and caused itself and its parent > stage (stage 2) to retry > 2. stage 2 was skipped before, but its attemptId was still 0, so when its > retry happened it got removed from `Skipped Stages` > The DAG of Job 2 no longer shows that stage 2 is skipped. > !https://user-images.githubusercontent.com/8326978/143824666-6390b64a-a45b-4bc8-b05d-c5abbb28cdef.png! > Besides, a retried stage usually has a subset of tasks from the original > stage. If we mark it as an original one, the metrics might lead us into > pitfalls. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37217) The number of dynamic partitions should early check when writing to external tables
[ https://issues.apache.org/jira/browse/SPARK-37217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37217: Assignee: dzcxzl > The number of dynamic partitions should early check when writing to external > tables > --- > > Key: SPARK-37217 > URL: https://issues.apache.org/jira/browse/SPARK-37217 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > > [SPARK-29295|https://issues.apache.org/jira/browse/SPARK-29295] introduced a > mechanism where writes to external tables use dynamic partitioning, and > the data in the target partitions is deleted first. > Assuming that 1001 partitions are written, the data of those 1001 partitions will > be deleted first, but because hive.exec.max.dynamic.partitions is 1000 by > default, loadDynamicPartitions will then fail, even though the data of the 1001 > partitions has already been deleted. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37217) The number of dynamic partitions should early check when writing to external tables
[ https://issues.apache.org/jira/browse/SPARK-37217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37217. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34493 [https://github.com/apache/spark/pull/34493] > The number of dynamic partitions should early check when writing to external > tables > --- > > Key: SPARK-37217 > URL: https://issues.apache.org/jira/browse/SPARK-37217 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.3.0 > > > [SPARK-29295|https://issues.apache.org/jira/browse/SPARK-29295] introduced a > mechanism where writes to external tables use dynamic partitioning, and > the data in the target partitions is deleted first. > Assuming that 1001 partitions are written, the data of those 1001 partitions will > be deleted first, but because hive.exec.max.dynamic.partitions is 1000 by > default, loadDynamicPartitions will then fail, even though the data of the 1001 > partitions has already been deleted. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
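The fix amounts to validating the dynamic-partition count before any destructive step. A rough Python sketch of that ordering (hypothetical helper name, not Hive's or Spark's actual code; the default of 1000 mirrors hive.exec.max.dynamic.partitions):

```python
def check_dynamic_partitions(partitions, max_dynamic_partitions=1000):
    """Early check: validate the number of dynamic partitions *before*
    deleting any target-partition data, so a limit violation cannot leave
    partitions half-deleted."""
    if len(partitions) > max_dynamic_partitions:
        raise ValueError(
            f"writing {len(partitions)} partitions exceeds the limit of "
            f"{max_dynamic_partitions}; aborting before any data is deleted")
    # ...only now is it safe to delete target partitions and load new data...

check_dynamic_partitions([f"dt={i}" for i in range(1000)])  # within the limit
try:
    check_dynamic_partitions([f"dt={i}" for i in range(1001)])
except ValueError as e:
    print("rejected early:", e)
```

Moving the check in front of the delete turns a data-loss bug into a clean, recoverable failure.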
[jira] [Resolved] (SPARK-37573) IsolatedClient fallbackVersion should be build in version, not always 2.7.4
[ https://issues.apache.org/jira/browse/SPARK-37573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37573. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34830 [https://github.com/apache/spark/pull/34830] > IsolatedClient fallbackVersion should be build in version, not always 2.7.4 > > > Key: SPARK-37573 > URL: https://issues.apache.org/jira/browse/SPARK-37573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0 > > > On Hadoop 3, falling back to 2.7.4 causes an error > {code} > [info] org.apache.spark.sql.hive.client.VersionsSuite *** ABORTED *** (31 > seconds, 320 milliseconds) > [info] java.lang.ClassFormatError: Truncated class file > [info] at java.lang.ClassLoader.defineClass1(Native Method) > [info] at java.lang.ClassLoader.defineClass(ClassLoader.java:756) > [info] at java.lang.ClassLoader.defineClass(ClassLoader.java:635) > [info] at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:266) > [info] at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:258) > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:405) > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > [info] at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:313) > [info] at > org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:50) > [info] at > org.apache.spark.sql.hive.client.VersionsSuite.$anonfun$new$2(VersionsSuite.scala:82) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at 
org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > [info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > [info] at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62) > [info] at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) > [info] at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) > [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) > [info] at > org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563) > [info] at org.scalatest.Suite.run(Suite.scala:1112) > [info] at org.scalatest.Suite.run$(Suite.scala:1094) > [ > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37573) IsolatedClient fallbackVersion should be build in version, not always 2.7.4
[ https://issues.apache.org/jira/browse/SPARK-37573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37573: Assignee: angerszhu > IsolatedClient fallbackVersion should be build in version, not always 2.7.4 > > > Key: SPARK-37573 > URL: https://issues.apache.org/jira/browse/SPARK-37573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > Hadoop 3 fallback to 2.7.4 cause error > {code} > [info] org.apache.spark.sql.hive.client.VersionsSuite *** ABORTED *** (31 > seconds, 320 milliseconds) > [info] java.lang.ClassFormatError: Truncated class file > [info] at java.lang.ClassLoader.defineClass1(Native Method) > [info] at java.lang.ClassLoader.defineClass(ClassLoader.java:756) > [info] at java.lang.ClassLoader.defineClass(ClassLoader.java:635) > [info] at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:266) > [info] at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:258) > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:405) > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > [info] at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:313) > [info] at > org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:50) > [info] at > org.apache.spark.sql.hive.client.VersionsSuite.$anonfun$new$2(VersionsSuite.scala:82) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > 
[info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > [info] at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62) > [info] at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) > [info] at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) > [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) > [info] at > org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563) > [info] at org.scalatest.Suite.run(Suite.scala:1112) > [info] at org.scalatest.Suite.run$(Suite.scala:1094) > [ > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37600) Upgrade to Hadoop 3.3.2
Chao Sun created SPARK-37600: Summary: Upgrade to Hadoop 3.3.2 Key: SPARK-37600 URL: https://issues.apache.org/jira/browse/SPARK-37600 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.3.0 Reporter: Chao Sun Upgrade Spark to use Hadoop 3.3.2 once it's released.
[jira] [Assigned] (SPARK-37561) Avoid loading all functions when obtaining hive's DelegationToken
[ https://issues.apache.org/jira/browse/SPARK-37561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37561: Assignee: dzcxzl > Avoid loading all functions when obtaining hive's DelegationToken > - > > Key: SPARK-37561 > URL: https://issues.apache.org/jira/browse/SPARK-37561 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Attachments: getDelegationToken_load_functions.png > > > At present, obtaining Hive's delegation token loads all functions. > This is unnecessary: loading the functions takes time and also increases the burden on the Hive metastore. > > !getDelegationToken_load_functions.png!
[jira] [Resolved] (SPARK-37561) Avoid loading all functions when obtaining hive's DelegationToken
[ https://issues.apache.org/jira/browse/SPARK-37561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37561. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34822 [https://github.com/apache/spark/pull/34822] > Avoid loading all functions when obtaining hive's DelegationToken > - > > Key: SPARK-37561 > URL: https://issues.apache.org/jira/browse/SPARK-37561 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.3.0 > > Attachments: getDelegationToken_load_functions.png > > > At present, obtaining Hive's delegation token loads all functions. > This is unnecessary: loading the functions takes time and also increases the burden on the Hive metastore. > > !getDelegationToken_load_functions.png!
[jira] [Resolved] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN
[ https://issues.apache.org/jira/browse/SPARK-37205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37205. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34635 [https://github.com/apache/spark/pull/34635] > Support mapreduce.job.send-token-conf when starting containers in YARN > -- > > Key: SPARK-37205 > URL: https://issues.apache.org/jira/browse/SPARK-37205 > Project: Spark > Issue Type: New Feature > Components: YARN >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0 > > > {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]) with which the RM is not required to statically have configs for all the secure HDFS clusters. > Currently it only works for MRv2, but it'd be nice if Spark could also use this feature. I think we only need to pass the config to {{ContainerLaunchContext}} in {{Client.createContainerLaunchContext}}.
[jira] [Assigned] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN
[ https://issues.apache.org/jira/browse/SPARK-37205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37205: Assignee: Chao Sun > Support mapreduce.job.send-token-conf when starting containers in YARN > -- > > Key: SPARK-37205 > URL: https://issues.apache.org/jira/browse/SPARK-37205 > Project: Spark > Issue Type: New Feature > Components: YARN >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]) with which the RM is not required to statically have configs for all the secure HDFS clusters. > Currently it only works for MRv2, but it'd be nice if Spark could also use this feature. I think we only need to pass the config to {{ContainerLaunchContext}} in {{Client.createContainerLaunchContext}}.
[jira] [Assigned] (SPARK-37445) Update hadoop-profile
[ https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37445: Assignee: angerszhu > Update hadoop-profile > - > > Key: SPARK-37445 > URL: https://issues.apache.org/jira/browse/SPARK-37445 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > The current Hadoop profile is hadoop-3.2; update it to hadoop-3.3.
[jira] [Resolved] (SPARK-37445) Update hadoop-profile
[ https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37445. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34715 [https://github.com/apache/spark/pull/34715] > Update hadoop-profile > - > > Key: SPARK-37445 > URL: https://issues.apache.org/jira/browse/SPARK-37445 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0 > > > The current Hadoop profile is hadoop-3.2; update it to hadoop-3.3.
[jira] [Updated] (SPARK-36529) Decouple CPU with IO work in vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-36529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36529: - Attachment: (was: image.png) > Decouple CPU with IO work in vectorized Parquet reader > -- > > Key: SPARK-36529 > URL: https://issues.apache.org/jira/browse/SPARK-36529 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Currently it seems the vectorized Parquet reader does almost everything in a sequential manner: > 1. read the row group using the file system API (perhaps from remote storage like S3) > 2. allocate buffers and store those row group bytes into them > 3. decompress the data pages > 4. in Spark, decode all the read columns one by one > 5. read the next row group and repeat from 1. > A lot of improvements can be done to decouple the IO- and CPU-intensive work. In addition, we could parallelize the row group loading and column decoding, utilizing all the cores available to a Spark task.
[jira] [Updated] (SPARK-36529) Decouple CPU with IO work in vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-36529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36529: - Attachment: image.png > Decouple CPU with IO work in vectorized Parquet reader > -- > > Key: SPARK-36529 > URL: https://issues.apache.org/jira/browse/SPARK-36529 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Currently it seems the vectorized Parquet reader does almost everything in a sequential manner: > 1. read the row group using the file system API (perhaps from remote storage like S3) > 2. allocate buffers and store those row group bytes into them > 3. decompress the data pages > 4. in Spark, decode all the read columns one by one > 5. read the next row group and repeat from 1. > A lot of improvements can be done to decouple the IO- and CPU-intensive work. In addition, we could parallelize the row group loading and column decoding, utilizing all the cores available to a Spark task.
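The sequential steps listed in SPARK-36529 suggest an obvious pipelining opportunity. The following is a minimal, hypothetical sketch of the idea, not Spark's actual reader: a bounded producer/consumer queue lets the IO for row group N+1 overlap with the CPU-bound decoding of row group N. The class name `PipelinedReaderSketch` and the simulated IO/decode work are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedReaderSketch {
  private static final byte[] POISON = new byte[0]; // end-of-stream marker

  public static List<Integer> readAll(List<byte[]> rowGroups) throws InterruptedException {
    // Bounded queue: caps how many row groups are buffered ahead of the decoder.
    BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(2);
    Thread io = new Thread(() -> {
      try {
        for (byte[] rg : rowGroups) queue.put(rg); // stand-in for remote IO (step 1)
        queue.put(POISON);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    io.start();
    List<Integer> decoded = new ArrayList<>();
    while (true) {
      byte[] rg = queue.take();
      if (rg == POISON) break;
      int sum = 0; // stand-in for CPU-bound column decoding (steps 3-4)
      for (byte b : rg) sum += b;
      decoded.add(sum);
    }
    io.join();
    return decoded;
  }

  public static void main(String[] args) throws InterruptedException {
    List<byte[]> groups = List.of(new byte[]{1, 2}, new byte[]{3, 4, 5});
    System.out.println(readAll(groups)); // [3, 12]
  }
}
```

A real implementation would additionally parallelize the decode side across columns; the queue bound then becomes the knob trading memory for IO/CPU overlap.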
[jira] [Resolved] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans
[ https://issues.apache.org/jira/browse/SPARK-35867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-35867. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34611 [https://github.com/apache/spark/pull/34611] > Enable vectorized read for VectorizedPlainValuesReader.readBooleans > --- > > Key: SPARK-35867 > URL: https://issues.apache.org/jira/browse/SPARK-35867 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Kazuyuki Tanimura >Priority: Minor > Fix For: 3.3.0 > > > Currently we decode PLAIN encoded booleans as follows: > {code:java} > public final void readBooleans(int total, WritableColumnVector c, int > rowId) { > // TODO: properly vectorize this > for (int i = 0; i < total; i++) { > c.putBoolean(rowId + i, readBoolean()); > } > } > {code} > Ideally we should vectorize this.
[jira] [Assigned] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans
[ https://issues.apache.org/jira/browse/SPARK-35867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-35867: Assignee: Kazuyuki Tanimura > Enable vectorized read for VectorizedPlainValuesReader.readBooleans > --- > > Key: SPARK-35867 > URL: https://issues.apache.org/jira/browse/SPARK-35867 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Kazuyuki Tanimura >Priority: Minor > > Currently we decode PLAIN encoded booleans as follows: > {code:java} > public final void readBooleans(int total, WritableColumnVector c, int > rowId) { > // TODO: properly vectorize this > for (int i = 0; i < total; i++) { > c.putBoolean(rowId + i, readBoolean()); > } > } > {code} > Ideally we should vectorize this.
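For context on the `readBooleans` TODO above: Parquet's PLAIN encoding bit-packs booleans eight per byte, least-significant bit first, so the per-value `readBoolean()` call can be replaced by direct shifts on the packed buffer. The sketch below illustrates that idea only; it is not the merged patch, and `UnpackBooleansSketch` is a hypothetical standalone name.

```java
public class UnpackBooleansSketch {
  // Unpack `total` bit-packed booleans (LSB-first within each byte).
  public static boolean[] unpack(byte[] packed, int total) {
    boolean[] out = new boolean[total];
    for (int i = 0; i < total; i++) {
      // One array access plus a shift per value; a real reader would hoist
      // the byte load out of the inner loop and handle buffer boundaries.
      out[i] = ((packed[i >> 3] >> (i & 7)) & 1) != 0;
    }
    return out;
  }

  public static void main(String[] args) {
    boolean[] v = unpack(new byte[]{0b00000101}, 3);
    System.out.println(v[0] + " " + v[1] + " " + v[2]); // true false true
  }
}
```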
[jira] [Created] (SPARK-37378) Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog
Chao Sun created SPARK-37378: Summary: Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog Key: SPARK-37378 URL: https://issues.apache.org/jira/browse/SPARK-37378 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun We need to add logic to convert a V2 {{Transform}} expression into its catalyst expression counterpart, and also load its function definition from the V2 FunctionCatalog provided.
[jira] [Created] (SPARK-37377) Refactor V2 Partitioning interface and remove deprecated usage of Distribution
Chao Sun created SPARK-37377: Summary: Refactor V2 Partitioning interface and remove deprecated usage of Distribution Key: SPARK-37377 URL: https://issues.apache.org/jira/browse/SPARK-37377 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun Currently {{Partitioning}} is defined as follows: {code:java} @Evolving public interface Partitioning { int numPartitions(); boolean satisfy(Distribution distribution); } {code} There are two issues with the interface: 1) it uses the deprecated {{Distribution}} interface, and should switch to {{org.apache.spark.sql.connector.distributions.Distribution}}. 2) currently there is no way to use this in a join where we want to compare the reported partitionings from both sides and decide whether they are "compatible" (and thus allow Spark to eliminate the shuffle).
[jira] [Created] (SPARK-37376) Introduce a new DataSource V2 interface HasPartitionKey
Chao Sun created SPARK-37376: Summary: Introduce a new DataSource V2 interface HasPartitionKey Key: SPARK-37376 URL: https://issues.apache.org/jira/browse/SPARK-37376 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun One of the prerequisites for the feature is to allow V2 input partitions to report their partition values to Spark, which can use them to check whether both sides of a join are co-partitioned, and also optionally group input partitions together.
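The idea behind SPARK-37376 can be illustrated with a simplified sketch. This is a toy, not Spark's actual interface: the real `HasPartitionKey` reports an `InternalRow`, while here a plain `String` key and the `FilePartition`/`PartitionKeyGrouping` names are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartitionKeyGrouping {
  // Simplified stand-in for an input partition that reports its partition key.
  interface HasPartitionKey {
    String partitionKey();
  }

  static class FilePartition implements HasPartitionKey {
    final String file;
    final String key;
    FilePartition(String file, String key) { this.file = file; this.key = key; }
    public String partitionKey() { return key; }
  }

  // Group partitions sharing a key; a planner could compare the resulting key
  // sets from both sides of a join to detect co-partitioning.
  static Map<String, List<HasPartitionKey>> groupByKey(List<? extends HasPartitionKey> parts) {
    Map<String, List<HasPartitionKey>> groups = new LinkedHashMap<>();
    for (HasPartitionKey p : parts) {
      groups.computeIfAbsent(p.partitionKey(), k -> new ArrayList<>()).add(p);
    }
    return groups;
  }

  public static void main(String[] args) {
    List<FilePartition> parts = List.of(
        new FilePartition("a.parquet", "ds=1"),
        new FilePartition("b.parquet", "ds=1"),
        new FilePartition("c.parquet", "ds=2"));
    System.out.println(groupByKey(parts).keySet()); // [ds=1, ds=2]
  }
}
```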
[jira] [Updated] (SPARK-37166) SPIP: Storage Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37166: - Parent: SPARK-37375 Issue Type: Sub-task (was: New Feature) > SPIP: Storage Partitioned Join > -- > > Key: SPARK-37166 > URL: https://issues.apache.org/jira/browse/SPARK-37166 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0 > > > This JIRA tracks the SPIP for storage partitioned join.
[jira] [Created] (SPARK-37375) Umbrella: Storage Partitioned Join
Chao Sun created SPARK-37375: Summary: Umbrella: Storage Partitioned Join Key: SPARK-37375 URL: https://issues.apache.org/jira/browse/SPARK-37375 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun This umbrella JIRA tracks the progress of implementing the Storage Partitioned Join feature for Spark.
[jira] [Resolved] (SPARK-37166) SPIP: Storage Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37166. -- Fix Version/s: 3.3.0 Assignee: Chao Sun Resolution: Fixed > SPIP: Storage Partitioned Join > -- > > Key: SPARK-37166 > URL: https://issues.apache.org/jira/browse/SPARK-37166 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0 > > > This JIRA tracks the SPIP for storage partitioned join.
[jira] [Updated] (SPARK-37342) Upgrade Apache Arrow to 6.0.0
[ https://issues.apache.org/jira/browse/SPARK-37342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37342: - Component/s: Build (was: Spark Core) > Upgrade Apache Arrow to 6.0.0 > - > > Key: SPARK-37342 > URL: https://issues.apache.org/jira/browse/SPARK-37342 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last > month.
[jira] [Created] (SPARK-37342) Upgrade Apache Arrow to 6.0.0
Chao Sun created SPARK-37342: Summary: Upgrade Apache Arrow to 6.0.0 Key: SPARK-37342 URL: https://issues.apache.org/jira/browse/SPARK-37342 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0 Reporter: Chao Sun Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last month.
[jira] [Resolved] (SPARK-37239) Avoid unnecessary `setReplication` in Yarn mode
[ https://issues.apache.org/jira/browse/SPARK-37239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37239. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34520 [https://github.com/apache/spark/pull/34520] > Avoid unnecessary `setReplication` in Yarn mode > --- > > Key: SPARK-37239 > URL: https://issues.apache.org/jira/browse/SPARK-37239 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.1.2 >Reporter: wang-zhun >Assignee: Yang Jie >Priority: Major > Fix For: 3.3.0 > > > We found a large number of replication logs in hdfs server > ``` > 2021-11-04,17:22:13,065 INFO > org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains > unchanged at 3 for > xxx/.sparkStaging/application_1635470728320_1144379/__spark_libs__303253482044663796.zip > 2021-11-04,17:22:13,069 INFO > org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains > unchanged at 3 for > xxx/.sparkStaging/application_1635470728320_1144383/__spark_libs__4747402134564993861.zip > 2021-11-04,17:22:13,070 INFO > org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains > unchanged at 3 for > xxx/.sparkStaging/application_1635470728320_1144373/__spark_libs__4377509773730188331.zip > ``` > https://github.com/apache/hadoop/blob/6f7b965808f71f44e2617c50d366a6375fdfbbfa/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java#L2439 > > `setReplication` needs to acquire write lock, we should reduce this > unnecessary operation -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37239) Avoid unnecessary `setReplication` in Yarn mode
[ https://issues.apache.org/jira/browse/SPARK-37239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37239: Assignee: Yang Jie > Avoid unnecessary `setReplication` in Yarn mode > --- > > Key: SPARK-37239 > URL: https://issues.apache.org/jira/browse/SPARK-37239 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.1.2 >Reporter: wang-zhun >Assignee: Yang Jie >Priority: Major > > We found a large number of replication logs in hdfs server > ``` > 2021-11-04,17:22:13,065 INFO > org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains > unchanged at 3 for > xxx/.sparkStaging/application_1635470728320_1144379/__spark_libs__303253482044663796.zip > 2021-11-04,17:22:13,069 INFO > org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains > unchanged at 3 for > xxx/.sparkStaging/application_1635470728320_1144383/__spark_libs__4747402134564993861.zip > 2021-11-04,17:22:13,070 INFO > org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains > unchanged at 3 for > xxx/.sparkStaging/application_1635470728320_1144373/__spark_libs__4377509773730188331.zip > ``` > https://github.com/apache/hadoop/blob/6f7b965808f71f44e2617c50d366a6375fdfbbfa/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java#L2439 > > `setReplication` needs to acquire write lock, we should reduce this > unnecessary operation -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
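The optimization described in SPARK-37239 amounts to checking a file's current replication before issuing `setReplication`, since the RPC takes a namenode write lock even when the factor is unchanged. Below is a hedged sketch of that idea using a toy in-memory "namenode"; the class and helper names are hypothetical, and this is not the actual patch (which would call Hadoop's `FileSystem.getFileStatus`/`setReplication`).

```java
import java.util.HashMap;
import java.util.Map;

public class SkipRedundantSetReplication {
  // Toy stand-in for namenode state: path -> replication factor.
  static Map<String, Short> replication = new HashMap<>();
  static int rpcCount = 0; // counts write-lock-taking setReplication calls

  static void setReplication(String path, short factor) {
    rpcCount++; // in HDFS this acquires the namenode write lock
    replication.put(path, factor);
  }

  // Only pay for the RPC when the replication factor actually changes.
  static void ensureReplication(String path, short desired) {
    Short current = replication.get(path);
    if (current == null || current != desired) {
      setReplication(path, desired);
    }
  }

  public static void main(String[] args) {
    replication.put("/user/.sparkStaging/app_1/__spark_libs__.zip", (short) 3);
    ensureReplication("/user/.sparkStaging/app_1/__spark_libs__.zip", (short) 3); // no-op
    ensureReplication("/user/.sparkStaging/app_1/__spark_libs__.zip", (short) 5); // real change
    System.out.println(rpcCount); // 1
  }
}
```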
[jira] [Updated] (SPARK-35437) Use expressions to filter Hive partitions at client side
[ https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35437: - Priority: Major (was: Minor) > Use expressions to filter Hive partitions at client side > > > Key: SPARK-35437 > URL: https://issues.apache.org/jira/browse/SPARK-35437 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Major > Fix For: 3.3.0 > > > When we have a table with a lot of partitions and there is no way to filter > it on the MetaStore Server, we will get all the partition details and filter > it on the client side. This is slow and puts a lot of pressure on the > MetaStore Server. > We can first pull all the partition names, filter by expressions, and then > obtain detailed information about the corresponding partitions from the > MetaStore Server.
[jira] [Resolved] (SPARK-35437) Use expressions to filter Hive partitions at client side
[ https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-35437. -- Resolution: Fixed Issue resolved by pull request 34431 [https://github.com/apache/spark/pull/34431] > Use expressions to filter Hive partitions at client side > > > Key: SPARK-35437 > URL: https://issues.apache.org/jira/browse/SPARK-35437 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > Fix For: 3.3.0 > > > When we have a table with a lot of partitions and there is no way to filter > it on the MetaStore Server, we will get all the partition details and filter > it on the client side. This is slow and puts a lot of pressure on the > MetaStore Server. > We can first pull all the partition names, filter by expressions, and then > obtain detailed information about the corresponding partitions from the > MetaStore Server.
[jira] [Assigned] (SPARK-35437) Use expressions to filter Hive partitions at client side
[ https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-35437: Assignee: dzcxzl > Use expressions to filter Hive partitions at client side > > > Key: SPARK-35437 > URL: https://issues.apache.org/jira/browse/SPARK-35437 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > Fix For: 3.3.0 > > > When we have a table with a lot of partitions and there is no way to filter > it on the MetaStore Server, we will get all the partition details and filter > it on the client side. This is slow and puts a lot of pressure on the > MetaStore Server. > We can first pull all the partition names, filter by expressions, and then > obtain detailed information about the corresponding partitions from the > MetaStore Server.
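The two-step approach described in SPARK-35437 (pull cheap partition names first, filter on the client, then fetch full metadata only for the survivors) can be sketched as follows. The helper names here are hypothetical, not the actual Spark/Hive client code, and this toy ignores the URL-encoding Hive applies to partition values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class ClientSidePartitionFilter {
  // Parse a partition name like "ds=2021-11-01/hr=10" into column -> value.
  static Map<String, String> parse(String partitionName) {
    Map<String, String> values = new HashMap<>();
    for (String kv : partitionName.split("/")) {
      String[] parts = kv.split("=", 2);
      values.put(parts[0], parts[1]);
    }
    return values;
  }

  // Evaluate the filter against names only; the kept names are all that would
  // be sent back to the metastore (e.g. via getPartitionsByNames) for details.
  static List<String> filterNames(List<String> names, Predicate<Map<String, String>> pred) {
    List<String> kept = new ArrayList<>();
    for (String n : names) {
      if (pred.test(parse(n))) kept.add(n);
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> names = List.of("ds=2021-11-01/hr=10", "ds=2021-11-02/hr=10");
    List<String> kept = filterNames(names, p -> p.get("ds").equals("2021-11-02"));
    System.out.println(kept); // [ds=2021-11-02/hr=10]
  }
}
```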
[jira] [Commented] (SPARK-36998) Handle concurrent eviction of same application in SHS
[ https://issues.apache.org/jira/browse/SPARK-36998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440066#comment-17440066 ] Chao Sun commented on SPARK-36998: -- Fixed > Handle concurrent eviction of same application in SHS > - > > Key: SPARK-36998 > URL: https://issues.apache.org/jira/browse/SPARK-36998 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Thejdeep Gudivada >Assignee: Thejdeep Gudivada >Priority: Minor > Fix For: 3.2.1, 3.3.0 > > > SHS throws this exception when trying to make room for parsing of a log file. > Reason for this is - there is a race condition to make space for processing > of two log files and the deleteDirectory method is overlapping. > {code:java} > 21/10/13 09:13:54 INFO HistoryServerDiskManager: Lease of 49.0 KiB may cause > usage to exceed max (101.7 GiB > 100.0 GiB) 21/10/13 09:13:54 WARN > HttpChannel: handleException > /api/v1/applications/application_1632281309592_2767775/1/jobs > java.io.IOException : Unable to delete directory > /grid/spark/sparkhistory-leveldb/apps/application_1631288241341_3657151_1.ldb. > 21/10/13 09:13:54 WARN HttpChannelState: unhandled due to prior sendError > javax.servlet.ServletException: > org.glassfish.jersey.server.ContainerException: java.io.IOException: Unable > to delete directory /grid > /spark/sparkhistory-leveldb/apps/application_1631288241341_3657151_1.ldb. 
at > org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:410) > at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:366) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:319) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205) > at > org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:791) > at > org.sparkproject.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1626) > at > org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) > at > org.sparkproject.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193) > at > org.sparkproject.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601) > at > org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233) > at > org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188) > at > org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186) > at > org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350) > at > org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:763) > at > org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:234) > at > org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) > at org.sparkproject.jetty.server.Server.handle(Server.java:516) at > 
org.sparkproject.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388) > at org.sparkproject.jetty.server.HttpChannel.dispatch(HttpChannel.java:633) > at org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:380) at > org.sparkproject.jetty.server.HttpConnection.onFillable(HttpConnection.java:279) > at > org.sparkproject.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) > at org.sparkproject.jetty.io.FillInterest.fillable(FillInterest.java:105) at > org.sparkproject.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) at > org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336) > at > org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
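The race described in SPARK-36998 (two requests concurrently evicting the same application directory, so one `deleteDirectory` call fails under the other) can be avoided by serializing eviction per application key. The sketch below is a minimal illustration of that guard, assuming a per-key lock registry; it is not the actual SHS patch.

```python
import threading
from collections import defaultdict

class EvictionManager:
    """Serializes evictions of the same app so deletes never overlap."""

    def __init__(self):
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()
        self.deleted = []

    def _lock_for(self, app_id):
        # defaultdict mutation is not thread-safe, so guard the lookup.
        with self._guard:
            return self._locks[app_id]

    def evict(self, app_id, exists):
        with self._lock_for(app_id):
            # Re-check under the lock: a racing request may have
            # already deleted this application's directory.
            if exists(app_id):
                self.deleted.append(app_id)
```

With the per-key lock, the second racing request observes that the directory is already gone and skips the delete instead of throwing `IOException: Unable to delete directory`.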
[jira] [Assigned] (SPARK-36998) Handle concurrent eviction of same application in SHS
[ https://issues.apache.org/jira/browse/SPARK-36998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-36998: Assignee: Thejdeep Gudivada (was: Thejdeep) > Handle concurrent eviction of same application in SHS > - > > Key: SPARK-36998 > URL: https://issues.apache.org/jira/browse/SPARK-36998 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Thejdeep Gudivada >Assignee: Thejdeep Gudivada >Priority: Minor > Fix For: 3.2.1, 3.3.0 > > > SHS throws this exception when trying to make room for parsing of a log file. > Reason for this is - there is a race condition to make space for processing > of two log files and the deleteDirectory method is overlapping. > {code:java} > 21/10/13 09:13:54 INFO HistoryServerDiskManager: Lease of 49.0 KiB may cause > usage to exceed max (101.7 GiB > 100.0 GiB) 21/10/13 09:13:54 WARN > HttpChannel: handleException > /api/v1/applications/application_1632281309592_2767775/1/jobs > java.io.IOException : Unable to delete directory > /grid/spark/sparkhistory-leveldb/apps/application_1631288241341_3657151_1.ldb. > 21/10/13 09:13:54 WARN HttpChannelState: unhandled due to prior sendError > javax.servlet.ServletException: > org.glassfish.jersey.server.ContainerException: java.io.IOException: Unable > to delete directory /grid > /spark/sparkhistory-leveldb/apps/application_1631288241341_3657151_1.ldb. 
at > org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:410) > at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:366) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:319) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205) > at > org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:791) > at > org.sparkproject.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1626) > at > org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) > at > org.sparkproject.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193) > at > org.sparkproject.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601) > at > org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233) > at > org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188) > at > org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186) > at > org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350) > at > org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:763) > at > org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:234) > at > org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) > at org.sparkproject.jetty.server.Server.handle(Server.java:516) at > 
org.sparkproject.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388) > at org.sparkproject.jetty.server.HttpChannel.dispatch(HttpChannel.java:633) > at org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:380) at > org.sparkproject.jetty.server.HttpConnection.onFillable(HttpConnection.java:279) > at > org.sparkproject.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) > at org.sparkproject.jetty.io.FillInterest.fillable(FillInterest.java:105) at > org.sparkproject.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) at > org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336) > at > org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37220) Do not split input file for Parquet reader with aggregate push down
[ https://issues.apache.org/jira/browse/SPARK-37220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440042#comment-17440042 ] Chao Sun commented on SPARK-37220: -- Thanks [~hyukjin.kwon]! > Do not split input file for Parquet reader with aggregate push down > --- > > Key: SPARK-37220 > URL: https://issues.apache.org/jira/browse/SPARK-37220 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Minor > Fix For: 3.3.0 > > > As a follow-up of > [https://github.com/apache/spark/pull/34298/files#r734795801], similar to ORC > aggregate push down, we can disallow splitting input files for the Parquet reader as > well. See the original comment for motivation. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37220) Do not split input file for Parquet reader with aggregate push down
[ https://issues.apache.org/jira/browse/SPARK-37220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37220. -- Fix Version/s: 3.3.0 Resolution: Fixed > Do not split input file for Parquet reader with aggregate push down > --- > > Key: SPARK-37220 > URL: https://issues.apache.org/jira/browse/SPARK-37220 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Priority: Minor > Fix For: 3.3.0 > > > As a follow-up of > [https://github.com/apache/spark/pull/34298/files#r734795801], similar to ORC > aggregate push down, we can disallow splitting input files for the Parquet reader as > well. See the original comment for motivation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
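The reason SPARK-37220 disallows splitting is that a pushed-down aggregate is answered from per-file footer statistics, so a file must be handled by exactly one task. A hedged sketch of that scan-planning rule (`ParquetScan` and `is_splitable` here only mirror the Spark-side concepts; this is not the real API):

```python
class ParquetScan:
    """Minimal model of a file scan that may carry a pushed-down aggregate."""

    def __init__(self, pushed_aggregate=None):
        # e.g. "COUNT(*)" or "MIN(col)" answered from footer statistics.
        self.pushed_aggregate = pushed_aggregate

    def is_splitable(self, path):
        # Normally a large Parquet file is split into several input
        # partitions. With a pushed-down aggregate, the whole file must
        # stay in one partition so its footer is read exactly once and
        # rows are neither dropped nor double-counted.
        return self.pushed_aggregate is None
```

This mirrors what the ORC aggregate push down referenced above already does.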
[jira] [Commented] (SPARK-37218) Parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark
[ https://issues.apache.org/jira/browse/SPARK-37218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17439554#comment-17439554 ] Chao Sun commented on SPARK-37218: -- [~dongjoon] please assign this to yourself - somehow I can't do it. > Parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark > -- > > Key: SPARK-37218 > URL: https://issues.apache.org/jira/browse/SPARK-37218 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Minor > Fix For: 3.2.1, 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37218) Parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark
[ https://issues.apache.org/jira/browse/SPARK-37218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37218. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34496 [https://github.com/apache/spark/pull/34496] > Parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark > -- > > Key: SPARK-37218 > URL: https://issues.apache.org/jira/browse/SPARK-37218 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Minor > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN
[ https://issues.apache.org/jira/browse/SPARK-37205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37205: - Description: {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]), with which the RM is not required to statically have config for all the secure HDFS clusters. Currently it only works for MRv2 but it'd be nice if Spark could also use this feature. I think we only need to pass the config to {{LaunchContainerContext}} in {{Client.createContainerLaunchContext}}. (was: {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]), with which the RM is not required to statically have config for all the secure HDFS clusters. Currently it only works for MRv2 but it'd be nice if Spark could also use this feature. I think we only need to pass the config to {{LaunchContainerContext}} before invoking {{NMClient.startContainer}}.) > Support mapreduce.job.send-token-conf when starting containers in YARN > -- > > Key: SPARK-37205 > URL: https://issues.apache.org/jira/browse/SPARK-37205 > Project: Spark > Issue Type: New Feature > Components: YARN >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see > [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]), with which the RM is > not required to statically have config for all the secure HDFS clusters. > Currently it only works for MRv2 but it'd be nice if Spark could also use this > feature. I think we only need to pass the config to > {{LaunchContainerContext}} in {{Client.createContainerLaunchContext}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN
Chao Sun created SPARK-37205: Summary: Support mapreduce.job.send-token-conf when starting containers in YARN Key: SPARK-37205 URL: https://issues.apache.org/jira/browse/SPARK-37205 Project: Spark Issue Type: New Feature Components: YARN Affects Versions: 3.3.0 Reporter: Chao Sun {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]), with which the RM is not required to statically have config for all the secure HDFS clusters. Currently it only works for MRv2 but it'd be nice if Spark could also use this feature. I think we only need to pass the config to {{LaunchContainerContext}} before invoking {{NMClient.startContainer}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
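The plumbing SPARK-37205 suggests (copy the token-related config entries into the container launch context so the RM can renew tokens against clusters it has no static config for) can be sketched like this. All names here are illustrative, not Spark's actual `Client` code; in Hadoop, `mapreduce.job.send-token-conf` holds a pattern selecting which config keys to ship.

```python
import re

def create_container_launch_context(conf):
    """conf: plain dict of Hadoop/Spark config key -> value.

    Hypothetical sketch: a real launch context carries commands,
    environment, tokens, etc.; only the tokens_conf plumbing is shown.
    """
    ctx = {"commands": ["./launch_executor.sh"], "tokens_conf": {}}
    pattern = conf.get("mapreduce.job.send-token-conf")
    if pattern is not None:
        # Ship only the config entries matching the pattern, e.g. the
        # dfs.* settings of a remote secure HDFS cluster.
        ctx["tokens_conf"] = {k: v for k, v in conf.items()
                              if k != "mapreduce.job.send-token-conf"
                              and re.search(pattern, k)}
    return ctx
```

When the config is absent, the context carries nothing extra and behavior is unchanged, which keeps the feature opt-in.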
[jira] [Commented] (SPARK-37166) SPIP: Storage Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436963#comment-17436963 ] Chao Sun commented on SPARK-37166: -- [~xkrogen] sure just linked. > SPIP: Storage Partitioned Join > -- > > Key: SPARK-37166 > URL: https://issues.apache.org/jira/browse/SPARK-37166 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > This JIRA tracks the SPIP for storage partitioned join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37166) SPIP: Storage Partitioned Join
Chao Sun created SPARK-37166: Summary: SPIP: Storage Partitioned Join Key: SPARK-37166 URL: https://issues.apache.org/jira/browse/SPARK-37166 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun This JIRA tracks the SPIP for storage partitioned join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37113) Upgrade Parquet to 1.12.2
Chao Sun created SPARK-37113: Summary: Upgrade Parquet to 1.12.2 Key: SPARK-37113 URL: https://issues.apache.org/jira/browse/SPARK-37113 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun Upgrade Parquet version to 1.12.2 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35703) Relax constraint for Spark bucket join and remove HashClusteredDistribution
[ https://issues.apache.org/jira/browse/SPARK-35703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35703: - Summary: Relax constraint for Spark bucket join and remove HashClusteredDistribution (was: Remove HashClusteredDistribution) > Relax constraint for Spark bucket join and remove HashClusteredDistribution > --- > > Key: SPARK-35703 > URL: https://issues.apache.org/jira/browse/SPARK-35703 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > Currently Spark has {{HashClusteredDistribution}} and > {{ClusteredDistribution}}. The only difference between the two is that the > former is more strict when deciding whether bucket join is allowed to avoid > shuffle: comparing to the latter, it requires *exact* match between the > clustering keys from the output partitioning (i.e., {{HashPartitioning}}) and > the join keys. However, this is unnecessary, as we should be able to avoid > shuffle when the set of clustering keys is a subset of join keys, just like > {{ClusteredDistribution}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
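The relaxation SPARK-35703 describes boils down to a subset check: if the data is hash-partitioned on a subset of the join keys, two rows that agree on all join keys necessarily agree on that subset, so they already sit in the same partition and no shuffle is needed. A minimal sketch of the two checks (illustrative only; the real logic lives in Catalyst's partitioning/distribution classes):

```python
def satisfies(partitioning_keys, join_keys, exact_match):
    """Can a HashPartitioning on partitioning_keys serve a join on join_keys?

    exact_match=True models the strict HashClusteredDistribution check;
    exact_match=False models the relaxed, ClusteredDistribution-style check.
    """
    if exact_match:
        # Old behavior: clustering keys must match the join keys exactly.
        return list(partitioning_keys) == list(join_keys)
    # Relaxed behavior: any non-empty subset of the join keys suffices,
    # since equal join keys imply equal subset keys, hence same partition.
    return len(partitioning_keys) > 0 and set(partitioning_keys).issubset(join_keys)
```

For example, a table bucketed on `a` joined on `(a, b)` passes the relaxed check but fails the strict one, which is exactly the unnecessary shuffle the issue targets.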
[jira] [Commented] (SPARK-37069) HiveClientImpl throws NoSuchMethodError: org.apache.hadoop.hive.ql.metadata.Hive.getWithoutRegisterFns
[ https://issues.apache.org/jira/browse/SPARK-37069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432624#comment-17432624 ] Chao Sun commented on SPARK-37069: -- Thanks for the ping [~zhouyifan279]! yes this is a bug, and let me see how to fix it. > HiveClientImpl throws NoSuchMethodError: > org.apache.hadoop.hive.ql.metadata.Hive.getWithoutRegisterFns > -- > > Key: SPARK-37069 > URL: https://issues.apache.org/jira/browse/SPARK-37069 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Zhou Yifan >Priority: Major > > If we run Spark SQL with external Hive 2.3.x (before 2.3.9) jars, following > error will be thrown: > {code:java} > Exception in thread "main" java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.metadata.Hive.getWithoutRegisterFns(Lorg/apache/hadoop/hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/metadata/Hive;Exception > in thread "main" java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.metadata.Hive.getWithoutRegisterFns(Lorg/apache/hadoop/hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/metadata/Hive; > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getHive$1(HiveClientImpl.scala:205) > at scala.Option.map(Option.scala:230) at > org.apache.spark.sql.hive.client.HiveClientImpl.getHive(HiveClientImpl.scala:204) > at > org.apache.spark.sql.hive.client.HiveClientImpl.client(HiveClientImpl.scala:267) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:292) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283) > at > org.apache.spark.sql.hive.client.HiveClientImpl.databaseExists(HiveClientImpl.scala:394) > at > 
org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:224) > at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102) > at > org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:224) > at > org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150) > at > org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:140) > at > org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:170) > at > org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:168) > at > org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$2(HiveSessionStateBuilder.scala:61) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:119) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:119) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:1004) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:990) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:982) > at > org.apache.spark.sql.execution.command.ShowTablesCommand.$anonfun$run$42(tables.scala:828) > at scala.Option.getOrElse(Option.scala:189) at > org.apache.spark.sql.execution.command.ShowTablesCommand.run(tables.scala:828) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) > at > 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(Q
[jira] [Commented] (SPARK-35640) Refactor Parquet vectorized reader to remove duplicated code paths
[ https://issues.apache.org/jira/browse/SPARK-35640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428522#comment-17428522 ] Chao Sun commented on SPARK-35640: -- [~catalinii] this change seems unrelated since it's only in Spark 3.2.0, but you mentioned the issue also happens in Spark 3.1.2. The issue seems to be also well-known, see SPARK-16544. > Refactor Parquet vectorized reader to remove duplicated code paths > -- > > Key: SPARK-35640 > URL: https://issues.apache.org/jira/browse/SPARK-35640 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.2.0 > > > Currently in Parquet vectorized code path, there are many code duplications > such as the following: > {code:java} > public void readIntegers( > int total, > WritableColumnVector c, > int rowId, > int level, > VectorizedValuesReader data) throws IOException { > int left = total; > while (left > 0) { > if (this.currentCount == 0) this.readNextGroup(); > int n = Math.min(left, this.currentCount); > switch (mode) { > case RLE: > if (currentValue == level) { > data.readIntegers(n, c, rowId); > } else { > c.putNulls(rowId, n); > } > break; > case PACKED: > for (int i = 0; i < n; ++i) { > if (currentBuffer[currentBufferIdx++] == level) { > c.putInt(rowId + i, data.readInteger()); > } else { > c.putNull(rowId + i); > } > } > break; > } > rowId += n; > left -= n; > currentCount -= n; > } > } > {code} > This makes it hard to maintain as any change on this will need to be > replicated in 20+ places. The issue becomes more serious when we are going to > implement column index and complex type support for the vectorized path. > The original intention is for performance. However now days JIT compilers > tend to be smart on this and will inline virtual calls as much as possible. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
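The duplication in the quoted `readIntegers()` comes from restating the same RLE/PACKED control flow once per physical type. The refactoring idea in SPARK-35640 is to keep the control flow in one place and inject the type-specific operations. A Python rendering of that shape (the actual Spark fix is Java, using a small updater interface; `read_many`/`read_one` stand in for the per-type bulk and single reads):

```python
def read_batch(runs, level, read_many, read_one):
    """Decode definition-level runs into values, with None marking nulls.

    runs: list of ('RLE', count, value) or ('PACKED', [levels]) runs.
    level: the max definition level, i.e. the level meaning "non-null".
    read_many(n): read n non-null values; read_one(): read one value.
    """
    out = []
    for run in runs:
        if run[0] == "RLE":
            # One definition level repeated n times: either all values
            # are present (bulk read) or all are null.
            _, n, value = run
            out.extend(read_many(n) if value == level else [None] * n)
        else:
            # PACKED: one definition level per value, decided one by one.
            for lvl in run[1]:
                out.append(read_one() if lvl == level else None)
    return out
```

The per-type readers become one-line closures over the underlying data stream, so adding column index or complex type support touches this loop once instead of the 20+ copies mentioned above.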
[jira] [Commented] (SPARK-36936) spark-hadoop-cloud broken on release and only published via 3rd party repositories
[ https://issues.apache.org/jira/browse/SPARK-36936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426255#comment-17426255 ] Chao Sun commented on SPARK-36936: -- [~colin.williams] Spark 3.2.0 is not released yet - it will be there soon. > spark-hadoop-cloud broken on release and only published via 3rd party > repositories > -- > > Key: SPARK-36936 > URL: https://issues.apache.org/jira/browse/SPARK-36936 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.1.1, 3.1.2 > Environment: name:=spark-demo > version := "0.0.1" > scalaVersion := "2.12.12" > lazy val app = (project in file("app")).settings( > assemblyPackageScala / assembleArtifact := false, > assembly / assemblyJarName := "uber.jar", > assembly / mainClass := Some("com.example.Main"), > // more settings here ... > ) > resolvers += "Cloudera" at > "https://repository.cloudera.com/artifactory/cloudera-repos/"; > libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2" % > "provided" > libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % > "3.1.1.3.1.7270.0-253" > libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % > "3.1.1.7.2.7.0-184" > libraryDependencies += "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.901" > libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test" > // test suite settings > fork in Test := true > javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", > "-XX:+CMSClassUnloadingEnabled") > // Show runtime of tests > testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-oD") > ___ > > import org.apache.spark.sql.SparkSession > object SparkApp { > def main(args: Array[String]){ > val spark = SparkSession.builder().master("local") > //.config("spark.jars.repositories", > "https://repository.cloudera.com/artifactory/cloudera-repos/";) > //.config("spark.jars.packages", > "org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253") > 
.appName("spark session").getOrCreate > val jsonDF = spark.read.json("s3a://path-to-bucket/compact.json") > val csvDF = spark.read.format("csv").load("s3a://path-to-bucket/some.csv") > jsonDF.show() > csvDF.show() > } > } >Reporter: Colin Williams >Priority: Major > > The spark docmentation suggests using `spark-hadoop-cloud` to read / write > from S3 in [https://spark.apache.org/docs/latest/cloud-integration.html] . > However artifacts are currently published via only 3rd party resolvers in > [https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud] > including Cloudera and Palantir. > > Then apache spark documentation is providing a 3rd party solution for object > stores including S3. Furthermore, if you follow the instructions and include > one of the 3rd party jars IE the Cloudera jar with the spark 3.1.2 release > and try to access object store, the following exception is returned. > > ``` > Exception in thread "main" java.lang.NoSuchMethodError: 'void > com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, > java.lang.Object, java.lang.Object)' > at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:894) > at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:870) > at > org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1605) > at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:363) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124) > at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361) > at > org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46) > at > 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:428) > ``` > It looks like there are classpath conflicts using the cloudera published > `spark-hadoop-cloud` with spark 3.1.2, again contradicting the documentation. > Then the
[jira] [Commented] (SPARK-36936) spark-hadoop-cloud broken on release and only published via 3rd party repositories
[ https://issues.apache.org/jira/browse/SPARK-36936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425162#comment-17425162 ] Chao Sun commented on SPARK-36936: -- [~colin.williams] which version of {{spark-hadoop-cloud}} you were using? I think the above error shouldn't happen if the version is the same as the Spark's version. We've already started to publish {{spark-hadoop-cloud}} as part of the Spark release procedure, see SPARK-35844. > spark-hadoop-cloud broken on release and only published via 3rd party > repositories > -- > > Key: SPARK-36936 > URL: https://issues.apache.org/jira/browse/SPARK-36936 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.1.1, 3.1.2 > Environment: name:=spark-demo > version := "0.0.1" > scalaVersion := "2.12.12" > lazy val app = (project in file("app")).settings( > assemblyPackageScala / assembleArtifact := false, > assembly / assemblyJarName := "uber.jar", > assembly / mainClass := Some("com.example.Main"), > // more settings here ... 
> ) > resolvers += "Cloudera" at > "https://repository.cloudera.com/artifactory/cloudera-repos/"; > libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2" % > "provided" > libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % > "3.1.1.3.1.7270.0-253" > libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % > "3.1.1.7.2.7.0-184" > libraryDependencies += "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.901" > libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test" > // test suite settings > fork in Test := true > javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", > "-XX:+CMSClassUnloadingEnabled") > // Show runtime of tests > testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-oD") > ___ > > import org.apache.spark.sql.SparkSession > object SparkApp { > def main(args: Array[String]){ > val spark = SparkSession.builder().master("local") > //.config("spark.jars.repositories", > "https://repository.cloudera.com/artifactory/cloudera-repos/";) > //.config("spark.jars.packages", > "org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253") > .appName("spark session").getOrCreate > val jsonDF = spark.read.json("s3a://path-to-bucket/compact.json") > val csvDF = spark.read.format("csv").load("s3a://path-to-bucket/some.csv") > jsonDF.show() > csvDF.show() > } > } >Reporter: Colin Williams >Priority: Major > > The spark docmentation suggests using `spark-hadoop-cloud` to read / write > from S3 in [https://spark.apache.org/docs/latest/cloud-integration.html] . > However artifacts are currently published via only 3rd party resolvers in > [https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud] > including Cloudera and Palantir. > > Then apache spark documentation is providing a 3rd party solution for object > stores including S3. 
Furthermore, if you follow the instructions and include > one of the 3rd party jars IE the Cloudera jar with the spark 3.1.2 release > and try to access object store, the following exception is returned. > > ``` > Exception in thread "main" java.lang.NoSuchMethodError: 'void > com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, > java.lang.Object, java.lang.Object)' > at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:894) > at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:870) > at > org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1605) > at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:363) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124) > at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361) > at > org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519) > at org.apache.spark.sql.DataFrameRead
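Per the comment above, the usual fix is to keep spark-hadoop-cloud at exactly the same version as the Spark artifacts and to resolve it from Maven Central rather than a vendor repository. A minimal build.sbt sketch, where the 3.1.2 / 3.2.0 versions are illustrative assumptions (hadoop-aws must match the Hadoop line your Spark release was built against):

```scala
// build.sbt -- keep spark-hadoop-cloud in lockstep with spark-sql
val sparkVersion = "3.1.2"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"          % sparkVersion % "provided",
  // Same version as Spark itself; published to Maven Central since SPARK-35844
  "org.apache.spark" %% "spark-hadoop-cloud" % sparkVersion,
  // hadoop-aws should match the Hadoop version bundled with this Spark release
  "org.apache.hadoop" % "hadoop-aws"         % "3.2.0"
)
```

Mixing a vendor build (e.g. 3.1.1.3.1.7270.0-253) with an Apache 3.1.2 release is what typically produces the Guava NoSuchMethodError shown above, since the two builds shade and relocate dependencies differently.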
[jira] [Updated] (SPARK-36891) Refactor SpecificParquetRecordReaderBase and add more coverage on vectorized Parquet decoding
[ https://issues.apache.org/jira/browse/SPARK-36891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36891: - Parent: SPARK-35743 Issue Type: Sub-task (was: Test) > Refactor SpecificParquetRecordReaderBase and add more coverage on vectorized > Parquet decoding > - > > Key: SPARK-36891 > URL: https://issues.apache.org/jira/browse/SPARK-36891 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0 > > > Add a new test suite to add more coverage for Parquet vectorized decoding, > focusing on different combinations of Parquet column index, dictionary, batch > size, page size, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36935) Enhance ParquetSchemaConverter to capture Parquet repetition & definition level
Chao Sun created SPARK-36935: Summary: Enhance ParquetSchemaConverter to capture Parquet repetition & definition level Key: SPARK-36935 URL: https://issues.apache.org/jira/browse/SPARK-36935 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun In order to support complex types for the Parquet vectorized reader, we'll need to capture the repetition & definition level information associated with the Catalyst type converted from the Parquet {{MessageType}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36891) Add new test suite to cover Parquet decoding
Chao Sun created SPARK-36891: Summary: Add new test suite to cover Parquet decoding Key: SPARK-36891 URL: https://issues.apache.org/jira/browse/SPARK-36891 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun Add a new test suite to add more coverage for Parquet vectorized decoding, focusing on different combinations of Parquet column index, dictionary, batch size, page size, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36879) Support Parquet v2 data page encodings for the vectorized path
Chao Sun created SPARK-36879: Summary: Support Parquet v2 data page encodings for the vectorized path Key: SPARK-36879 URL: https://issues.apache.org/jira/browse/SPARK-36879 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun Currently Spark only supports Parquet V1 encodings (i.e., PLAIN/DICTIONARY/RLE) in the vectorized path, and throws an exception otherwise: {code} java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY {code} It would be good to support v2 encodings too, including DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY as well as BYTE_STREAM_SPLIT as listed in https://github.com/apache/parquet-format/blob/master/Encodings.md -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36873) Add provided Guava dependency for network-yarn module
[ https://issues.apache.org/jira/browse/SPARK-36873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36873: - Issue Type: Bug (was: Improvement) > Add provided Guava dependency for network-yarn module > - > > Key: SPARK-36873 > URL: https://issues.apache.org/jira/browse/SPARK-36873 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > In Spark 3.1 and earlier the network-yarn module implicitly relies on guava > from hadoop-client dependency, which was changed by SPARK-33212 where we > moved to shaded Hadoop client which no longer expose the transitive guava > dependency. This was fine for a while since we were not using > {{createDependencyReducedPom}} so the module picks up the transitive > dependency from {{spark-network-common}}. However, this got changed by > SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no > longer able to find guava classes: > {code} > mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver > -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 > -Pspark-ganglia-lgpl -Pyarn > ... > [INFO] Compiling 1 Java source to > /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... 
> [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 > [ERROR] [Error] > /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: > package com.google.common.annotations does not exist > [ERROR] [Error] > /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: > package com.google.common.base does not exist > [ERROR] [Error] > /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: > package com.google.common.collect does not exist > [ERROR] [Error] > /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: > cannot find symbol > symbol: class VisibleForTesting > location: class org.apache.spark.network.yarn.YarnShuffleService > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36873) Add provided Guava dependency for network-yarn module
[ https://issues.apache.org/jira/browse/SPARK-36873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36873: - Description: In Spark 3.1 and earlier the network-yarn module implicitly relies on guava from hadoop-client dependency, which was changed by SPARK-33212 where we moved to shaded Hadoop client which no longer expose the transitive guava dependency. This was fine for a while since we were not using {{createDependencyReducedPom}} so the module picks up the transitive dependency from {{spark-network-common}}. However, this got changed by SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no longer able to find guava classes: {code} mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn ... [INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol symbol: class VisibleForTesting location: class org.apache.spark.network.yarn.YarnShuffleService {code} was: In Spark 3.1 and earlier the network-yarn module implicitly relies on guava from hadoop-client dependency, which got changed by 
SPARK-33212 where we moved to shaded Hadoop client which no longer expose the transitive guava dependency. This was fine for a while since we were not using {{createDependencyReducedPom}} so the module picks up the transitive dependency from {{spark-network-common}}. However, this got changed by SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no longer able to find guava classes: {code} mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn ... [INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol symbol: class VisibleForTesting location: class org.apache.spark.network.yarn.YarnShuffleService {code} > Add provided Guava dependency for network-yarn module > - > > Key: SPARK-36873 > URL: https://issues.apache.org/jira/browse/SPARK-36873 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > In Spark 3.1 and earlier the network-yarn module implicitly relies on guava > from hadoop-client dependency, which was changed by SPARK-33212 where we > moved to 
shaded Hadoop client which no longer expose the transitive guava > dependency. This was fine for a while since we were not using > {{createDependencyReducedPom}} so the module picks up the transitive > dependency from {{spark-network-common}}. However, this got changed by > SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no > longer able to find guava classes: > {code} > mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver > -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 > -Pspark-ganglia-lgpl -Pyarn > ... > [INFO] Compiling 1 Java source to > /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... > [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 > [ER
[jira] [Updated] (SPARK-36873) Add provided Guava dependency for network-yarn module
[ https://issues.apache.org/jira/browse/SPARK-36873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36873: - Description: In Spark 3.1 and earlier the network-yarn module implicitly relies on guava from hadoop-client dependency, which got changed by SPARK-33212 where we moved to shaded Hadoop client which no longer expose the transitive guava dependency. This was fine for a while since we were not using {{createDependencyReducedPom}} so the module picks up the transitive dependency from {{spark-network-common}}. However, this got changed by SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no longer able to find guava classes: {code} mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn ... [INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol symbol: class VisibleForTesting location: class org.apache.spark.network.yarn.YarnShuffleService {code} was: In Spark 3.1 and earlier the network-yarn module implicitly relies on guava from hadoop-client dependency, which got changed by 
SPARK-33212 where we have moved to shaded Hadoop client which no longer expose the transitive guava dependency. This was fine for a while since we were not using {{createDependencyReducedPom}} so the module picks up the transitive dependency from {{spark-network-common}}. However, this got changed by SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no longer able to find guava classes: {code} mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn ... [INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol symbol: class VisibleForTesting location: class org.apache.spark.network.yarn.YarnShuffleService {code} > Add provided Guava dependency for network-yarn module > - > > Key: SPARK-36873 > URL: https://issues.apache.org/jira/browse/SPARK-36873 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > In Spark 3.1 and earlier the network-yarn module implicitly relies on guava > from hadoop-client dependency, which got changed by SPARK-33212 where we > moved 
to shaded Hadoop client which no longer expose the transitive guava > dependency. This was fine for a while since we were not using > {{createDependencyReducedPom}} so the module picks up the transitive > dependency from {{spark-network-common}}. However, this got changed by > SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no > longer able to find guava classes: > {code} > mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver > -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 > -Pspark-ganglia-lgpl -Pyarn > ... > [INFO] Compiling 1 Java source to > /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... > [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
[jira] [Created] (SPARK-36873) Add provided Guava dependency for network-yarn module
Chao Sun created SPARK-36873: Summary: Add provided Guava dependency for network-yarn module Key: SPARK-36873 URL: https://issues.apache.org/jira/browse/SPARK-36873 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.2.0 Reporter: Chao Sun In Spark 3.1 and earlier the network-yarn module implicitly relies on guava from hadoop-client dependency, which got changed by SPARK-33212 where we have moved to shaded Hadoop client which no longer expose the transitive guava dependency. This was fine for a while since we were not using {{createDependencyReducedPom}} so the module picks up the transitive dependency from {{spark-network-common}}. However, this got changed by SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no longer able to find guava classes: {code} mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn ... [INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... 
[WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol symbol: class VisibleForTesting location: class org.apache.spark.network.yarn.YarnShuffleService {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
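The fix described in this ticket's summary amounts to declaring Guava explicitly with provided scope in the network-yarn pom instead of relying on a transitive path that SPARK-33212/SPARK-36835 removed. A sketch of the dependency block (the ${guava.version} property name follows the convention of Spark's root pom and is an assumption here):

```xml
<!-- common/network-yarn/pom.xml: make the compile-time Guava dependency explicit -->
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>${guava.version}</version>
  <scope>provided</scope>
</dependency>
```

With provided scope, com.google.common classes are on the compile classpath (fixing the "package does not exist" errors above) but are not added to the module's runtime dependencies, matching how the module previously obtained Guava transitively.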
[jira] [Created] (SPARK-36863) Update dependency manifests for all released artifacts
Chao Sun created SPARK-36863: Summary: Update dependency manifests for all released artifacts Key: SPARK-36863 URL: https://issues.apache.org/jira/browse/SPARK-36863 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.3.0 Reporter: Chao Sun We should update dependency manifests for all released artifacts. Currently we don't do so for modules such as {{hadoop-cloud}}, {{kinesis-asl}}, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36835) Spark 3.2.0 POMs are no longer "dependency reduced"
[ https://issues.apache.org/jira/browse/SPARK-36835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419499#comment-17419499 ] Chao Sun commented on SPARK-36835: -- Sorry for the regression [~joshrosen]. I forgot exactly why I added that but let me see if we can safely revert it. > Spark 3.2.0 POMs are no longer "dependency reduced" > --- > > Key: SPARK-36835 > URL: https://issues.apache.org/jira/browse/SPARK-36835 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 >Reporter: Josh Rosen >Priority: Blocker > > It looks like Spark 3.2.0's POMs are no longer "dependency reduced". As a > result, applications may pull in additional unnecessary dependencies when > depending on Spark. > Spark uses the Maven Shade plugin to create effective POMs and to bundle > shaded versions of certain libraries with Spark (namely, Jetty, Guava, and > JPPML). [By > default|https://maven.apache.org/plugins/maven-shade-plugin/shade-mojo.html#createDependencyReducedPom], > the Maven Shade plugin generates simplified POMs which remove dependencies > on artifacts that have been shaded. > SPARK-33212 / > [b6f46ca29742029efea2790af7fdefbc2fcf52de|https://github.com/apache/spark/commit/b6f46ca29742029efea2790af7fdefbc2fcf52de] > changed the configuration of the Maven Shade plugin, setting > {{createDependencyReducedPom}} to {{false}}. > As a result, the generated POMs now include compile-scope dependencies on the > shaded libraries. 
For example, compare the {{org.eclipse.jetty}} dependencies > in: > * Spark 3.1.2: > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/3.1.2/spark-core_2.12-3.1.2.pom] > * Spark 3.2.0 RC2: > [https://repository.apache.org/content/repositories/orgapachespark-1390/org/apache/spark/spark-core_2.12/3.2.0/spark-core_2.12-3.2.0.pom] > I think we should revert back to generating "dependency reduced" POMs to > ensure that Spark declares a proper set of dependencies and to avoid "unknown > unknown" consequences of changing our generated POM format. > /cc [~csun] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
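The regression described above comes down to a single flag on the Maven Shade plugin. Restoring dependency-reduced POMs is a configuration sketch like the following (plugin coordinates only; version and the rest of Spark's shade configuration omitted):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <!-- true (the plugin default) rewrites the installed POM to drop
         dependencies whose classes were shaded into the jar.
         SPARK-33212 set this to false, which leaked Jetty/Guava as
         compile-scope dependencies in the published 3.2.0 RC POMs. -->
    <createDependencyReducedPom>true</createDependencyReducedPom>
  </configuration>
</plugin>
```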
[jira] [Updated] (SPARK-36828) Remove Guava from Spark binary distribution
[ https://issues.apache.org/jira/browse/SPARK-36828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36828: - Issue Type: Improvement (was: Bug) > Remove Guava from Spark binary distribution > --- > > Key: SPARK-36828 > URL: https://issues.apache.org/jira/browse/SPARK-36828 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > After SPARK-36676, we should consider removing Guava from Spark's binary > distribution. It is currently only required by a few libraries such as > curator-client. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36828) Remove Guava from Spark binary distribution
Chao Sun created SPARK-36828: Summary: Remove Guava from Spark binary distribution Key: SPARK-36828 URL: https://issues.apache.org/jira/browse/SPARK-36828 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.3.0 Reporter: Chao Sun After SPARK-36676, we should consider removing Guava from Spark's binary distribution. It is currently only required by a few libraries such as curator-client. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36820) Disable LZ4 test for Hadoop 2.7 profile
Chao Sun created SPARK-36820: Summary: Disable LZ4 test for Hadoop 2.7 profile Key: SPARK-36820 URL: https://issues.apache.org/jira/browse/SPARK-36820 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun Hadoop 2.7 doesn't support lz4-java yet, so we should disable the test in {{FileSourceCodecSuite}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36820) Disable LZ4 test for Hadoop 2.7 profile
[ https://issues.apache.org/jira/browse/SPARK-36820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36820: - Issue Type: Test (was: Bug) > Disable LZ4 test for Hadoop 2.7 profile > --- > > Key: SPARK-36820 > URL: https://issues.apache.org/jira/browse/SPARK-36820 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Minor > > Hadoop 2.7 doesn't support lz4-java yet, so we should disable the test in > {{FileSourceCodecSuite}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36726) Upgrade Parquet to 1.12.1
[ https://issues.apache.org/jira/browse/SPARK-36726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36726: - Priority: Blocker (was: Major) > Upgrade Parquet to 1.12.1 > - > > Key: SPARK-36726 > URL: https://issues.apache.org/jira/browse/SPARK-36726 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Blocker > > Upgrade Apache Parquet to 1.12.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36726) Upgrade Parquet to 1.12.1
Chao Sun created SPARK-36726: Summary: Upgrade Parquet to 1.12.1 Key: SPARK-36726 URL: https://issues.apache.org/jira/browse/SPARK-36726 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun Upgrade Apache Parquet to 1.12.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35959) Add a new Maven profile "no-shaded-client" for older Hadoop 3.x versions
[ https://issues.apache.org/jira/browse/SPARK-35959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412897#comment-17412897 ] Chao Sun commented on SPARK-35959: -- [~hyukjin.kwon] No, I don't think it qualifies as a blocker anymore. In fact I'm thinking of abandoning the PR since it is not too useful. > Add a new Maven profile "no-shaded-client" for older Hadoop 3.x versions > - > > Key: SPARK-35959 > URL: https://issues.apache.org/jira/browse/SPARK-35959 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > Currently Spark uses the Hadoop shaded client by default. However, if Spark users > want to build Spark with an older version of Hadoop, such as 3.1.x, the shaded > client cannot be used (currently it only supports Hadoop 3.2.2+ and 3.3.1+). > Therefore, this proposes to offer a new Maven profile "no-shaded-client" for > this use case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35959) Add a new Maven profile "no-shaded-client" for older Hadoop 3.x versions
[ https://issues.apache.org/jira/browse/SPARK-35959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35959: - Priority: Major (was: Blocker) > Add a new Maven profile "no-shaded-client" for older Hadoop 3.x versions > - > > Key: SPARK-35959 > URL: https://issues.apache.org/jira/browse/SPARK-35959 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > Currently Spark uses Hadoop shaded client by default. However, if Spark users > want to build Spark with older version of Hadoop, such as 3.1.x, the shaded > client cannot be used (currently it only support Hadoop 3.2.2+ and 3.3.1+). > Therefore, this proposes to offer a new Maven profile "no-shaded-client" for > this use case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36696) spark.read.parquet loads empty dataset
[ https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412167#comment-17412167 ] Chao Sun commented on SPARK-36696: -- [This|https://github.com/apache/arrow/blob/master/cpp/src/parquet/metadata.cc#L1331] looks suspicious: why column chunk file offset = dictionary/data page offset + compressed size of the column chunk? > spark.read.parquet loads empty dataset > -- > > Key: SPARK-36696 > URL: https://issues.apache.org/jira/browse/SPARK-36696 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Blocker > Attachments: example.parquet > > > Here's a parquet file Spark 3.2/master can't read properly. > The file was stored by pandas and must contain 3650 rows, but Spark > 3.2/master returns an empty dataset. > {code:python} > >>> import pandas as pd > >>> len(pd.read_parquet('/path/to/example.parquet')) > 3650 > >>> spark.read.parquet('/path/to/example.parquet').count() > 0 > {code} > I guess it's caused by the parquet 1.12.0. > When I reverted two commits related to the parquet 1.12.0 from branch-3.2: > - > [https://github.com/apache/spark/commit/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa] > - > [https://github.com/apache/spark/commit/cbffc12f90e45d33e651e38cf886d7ab4bcf96da] > it reads the data successfully. > We need to add some workaround, or revert the commits. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36696) spark.read.parquet loads empty dataset
[ https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412164#comment-17412164 ] Chao Sun commented on SPARK-36696: -- This looks like the same issue as in PARQUET-2078. The file offset for the first row group is set to 31173 which causes {{filterFileMetaDataByMidpoint}} to filter out the only row group (range filter is [0, 37968], while startIndex is 31173 and total size is 35820). Seems there is a bug in Apache Arrow which writes incorrect file offset. cc [~gershinsky] to see if you know any info there. > spark.read.parquet loads empty dataset > -- > > Key: SPARK-36696 > URL: https://issues.apache.org/jira/browse/SPARK-36696 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Blocker > Attachments: example.parquet > > > Here's a parquet file Spark 3.2/master can't read properly. > The file was stored by pandas and must contain 3650 rows, but Spark > 3.2/master returns an empty dataset. > {code:python} > >>> import pandas as pd > >>> len(pd.read_parquet('/path/to/example.parquet')) > 3650 > >>> spark.read.parquet('/path/to/example.parquet').count() > 0 > {code} > I guess it's caused by the parquet 1.12.0. > When I reverted two commits related to the parquet 1.12.0 from branch-3.2: > - > [https://github.com/apache/spark/commit/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa] > - > [https://github.com/apache/spark/commit/cbffc12f90e45d33e651e38cf886d7ab4bcf96da] > it reads the data successfully. > We need to add some workaround, or revert the commits. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org