[jira] [Created] (SPARK-38840) Enable spark.sql.parquet.enableNestedColumnVectorizedReader on master branch by default
Chao Sun created SPARK-38840:

Summary: Enable spark.sql.parquet.enableNestedColumnVectorizedReader on master branch by default
Key: SPARK-38840
URL: https://issues.apache.org/jira/browse/SPARK-38840
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.4.0
Reporter: Chao Sun

We can enable {{spark.sql.parquet.enableNestedColumnVectorizedReader}} on the master branch by default, to make sure it is covered by more tests.

--
This message was sent by Atlassian Jira (v8.20.1#820001)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
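For anyone who wants to try the flag before the default changes, it can be flipped per session. A minimal sketch; the file path and column names below are placeholders, not from this issue:

{code:scala}
// Enable the vectorized Parquet reader for nested types (struct/array/map).
spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "true")

// Any Parquet file with a nested schema now goes through the vectorized
// path; "/tmp/nested.parquet" and "someStruct.field" are hypothetical.
val df = spark.read.parquet("/tmp/nested.parquet")
df.select("someStruct.field").show()
{code}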
[jira] [Resolved] (SPARK-38786) Test Bug in StatisticsSuite "change stats after add/drop partition command"
[ https://issues.apache.org/jira/browse/SPARK-38786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-38786.
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 36075
[https://github.com/apache/spark/pull/36075]

> Test Bug in StatisticsSuite "change stats after add/drop partition command"
> Key: SPARK-38786
> URL: https://issues.apache.org/jira/browse/SPARK-38786
> Project: Spark
> Issue Type: Test
> Components: SQL, Tests
> Affects Versions: 3.4.0
> Reporter: Kazuyuki Tanimura
> Assignee: Kazuyuki Tanimura
> Priority: Minor
> Fix For: 3.4.0
>
> [https://github.com/apache/spark/blob/cbffc12f90e45d33e651e38cf886d7ab4bcf96da/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala#L979]
> It should be `partDir2` instead of `partDir1`. Looks like it is a copy-paste bug.
[jira] [Assigned] (SPARK-38786) Test Bug in StatisticsSuite "change stats after add/drop partition command"
[ https://issues.apache.org/jira/browse/SPARK-38786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-38786:
Assignee: Kazuyuki Tanimura

> Test Bug in StatisticsSuite "change stats after add/drop partition command"
> Key: SPARK-38786
> URL: https://issues.apache.org/jira/browse/SPARK-38786
> Project: Spark
> Issue Type: Test
> Components: SQL, Tests
> Affects Versions: 3.4.0
> Reporter: Kazuyuki Tanimura
> Assignee: Kazuyuki Tanimura
> Priority: Minor
>
> [https://github.com/apache/spark/blob/cbffc12f90e45d33e651e38cf886d7ab4bcf96da/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala#L979]
> It should be `partDir2` instead of `partDir1`. Looks like it is a copy-paste bug.
[jira] [Assigned] (SPARK-34863) Support nested column in Spark Parquet vectorized readers
[ https://issues.apache.org/jira/browse/SPARK-34863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-34863:
Assignee: Chao Sun (was: Apache Spark)

> Support nested column in Spark Parquet vectorized readers
> Key: SPARK-34863
> URL: https://issues.apache.org/jira/browse/SPARK-34863
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Cheng Su
> Assignee: Chao Sun
> Priority: Minor
> Fix For: 3.3.0
>
> The task is to support nested column types in the Spark Parquet vectorized reader. Currently the Parquet vectorized reader does not support nested column types (struct, array and map). We implemented a nested column vectorized reader for FB-ORC in our internal fork of Spark. We are seeing performance improvement compared to the non-vectorized reader when reading nested columns. In addition, this can also help improve the non-nested column performance when reading non-nested and nested columns together in one query.
>
> Parquet:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L173]
[jira] [Updated] (SPARK-37378) Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog
[ https://issues.apache.org/jira/browse/SPARK-37378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated SPARK-37378:
Fix Version/s: 3.4.0

> Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog
> Key: SPARK-37378
> URL: https://issues.apache.org/jira/browse/SPARK-37378
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Chao Sun
> Priority: Major
> Fix For: 3.4.0
>
> We need to add logic to convert a V2 {{Transform}} expression into its catalyst expression counterpart, and also load its function definition from the V2 FunctionCatalog provided.
[jira] [Resolved] (SPARK-37378) Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog
[ https://issues.apache.org/jira/browse/SPARK-37378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-37378.
Resolution: Duplicate

This JIRA is covered as part of SPARK-37377.

> Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog
> Key: SPARK-37378
> URL: https://issues.apache.org/jira/browse/SPARK-37378
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Chao Sun
> Priority: Major
>
> We need to add logic to convert a V2 {{Transform}} expression into its catalyst expression counterpart, and also load its function definition from the V2 FunctionCatalog provided.
[jira] [Updated] (SPARK-37377) Initial implementation of Storage-Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated SPARK-37377:
Description: This Jira tracks the initial implementation of storage-partitioned join.
(was: Currently {{Partitioning}} is defined as follows:
{code:java}
@Evolving
public interface Partitioning {
  int numPartitions();
  boolean satisfy(Distribution distribution);
}
{code}
There are two issues with the interface: 1) it uses a deprecated {{Distribution}} interface, and should switch to {{org.apache.spark.sql.connector.distributions.Distribution}}. 2) currently there is no way to use this in a join where we want to compare the reported partitionings from both sides and decide whether they are "compatible" (thus allowing Spark to eliminate the shuffle).)

> Initial implementation of Storage-Partitioned Join
> Key: SPARK-37377
> URL: https://issues.apache.org/jira/browse/SPARK-37377
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Chao Sun
> Assignee: Chao Sun
> Priority: Major
> Fix For: 3.4.0
>
> This Jira tracks the initial implementation of storage-partitioned join.
[jira] [Updated] (SPARK-37377) Initial implementation of Storage-Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated SPARK-37377:
Summary: Initial implementation of Storage-Partitioned Join
(was: Refactor V2 Partitioning interface and remove deprecated usage of Distribution)

> Initial implementation of Storage-Partitioned Join
> Key: SPARK-37377
> URL: https://issues.apache.org/jira/browse/SPARK-37377
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Chao Sun
> Assignee: Chao Sun
> Priority: Major
> Fix For: 3.4.0
>
> Currently {{Partitioning}} is defined as follows:
> {code:java}
> @Evolving
> public interface Partitioning {
>   int numPartitions();
>   boolean satisfy(Distribution distribution);
> }
> {code}
> There are two issues with the interface: 1) it uses a deprecated {{Distribution}} interface, and should switch to {{org.apache.spark.sql.connector.distributions.Distribution}}. 2) currently there is no way to use this in a join where we want to compare the reported partitionings from both sides and decide whether they are "compatible" (thus allowing Spark to eliminate the shuffle).
[jira] [Updated] (SPARK-37974) Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
[ https://issues.apache.org/jira/browse/SPARK-37974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated SPARK-37974:
Fix Version/s: 3.3.0
(was: 3.4.0)

> Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
> Key: SPARK-37974
> URL: https://issues.apache.org/jira/browse/SPARK-37974
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Parth Chandra
> Assignee: Parth Chandra
> Priority: Major
> Fix For: 3.3.0
>
> SPARK-36879 implements the DELTA_BINARY_PACKED encoding, which is for integer values, but does not implement the DELTA_BYTE_ARRAY encoding, which is for string values. The DELTA_BYTE_ARRAY encoding also requires the DELTA_LENGTH_BYTE_ARRAY encoding. Both encodings need vectorized versions, as the current implementation simply calls the non-vectorized Parquet library methods.
[jira] [Resolved] (SPARK-37974) Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
[ https://issues.apache.org/jira/browse/SPARK-37974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-37974.
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 35262
[https://github.com/apache/spark/pull/35262]

> Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
> Key: SPARK-37974
> URL: https://issues.apache.org/jira/browse/SPARK-37974
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Parth Chandra
> Assignee: Parth Chandra
> Priority: Major
> Fix For: 3.4.0
>
> SPARK-36879 implements the DELTA_BINARY_PACKED encoding, which is for integer values, but does not implement the DELTA_BYTE_ARRAY encoding, which is for string values. The DELTA_BYTE_ARRAY encoding also requires the DELTA_LENGTH_BYTE_ARRAY encoding. Both encodings need vectorized versions, as the current implementation simply calls the non-vectorized Parquet library methods.
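As a way to exercise the new decoders end to end, Parquet V2 files can be produced with the parquet-mr writer property and read back. A hedged sketch; the paths are placeholders and "parquet.writer.version" is the parquet-mr property name, assumed to be honored when passed as a write option:

{code:scala}
// Writing with the Parquet V2 writer selects DELTA_BYTE_ARRAY /
// DELTA_LENGTH_BYTE_ARRAY for string columns; reading the result back
// should hit the new vectorized decoders.
spark.range(1000)
  .selectExpr("cast(id as string) as s")
  .write
  .option("parquet.writer.version", "v2") // parquet-mr writer property
  .parquet("/tmp/parquet-v2")             // placeholder path

spark.read.parquet("/tmp/parquet-v2").count()
{code}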
[jira] [Assigned] (SPARK-37974) Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
[ https://issues.apache.org/jira/browse/SPARK-37974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-37974:
Assignee: Parth Chandra

> Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
> Key: SPARK-37974
> URL: https://issues.apache.org/jira/browse/SPARK-37974
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Parth Chandra
> Assignee: Parth Chandra
> Priority: Major
>
> SPARK-36879 implements the DELTA_BINARY_PACKED encoding, which is for integer values, but does not implement the DELTA_BYTE_ARRAY encoding, which is for string values. The DELTA_BYTE_ARRAY encoding also requires the DELTA_LENGTH_BYTE_ARRAY encoding. Both encodings need vectorized versions, as the current implementation simply calls the non-vectorized Parquet library methods.
[jira] [Resolved] (SPARK-36679) Remove lz4 hadoop wrapper classes after Hadoop 3.3.2
[ https://issues.apache.org/jira/browse/SPARK-36679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-36679.
Fix Version/s: 3.3.0
Resolution: Duplicate

> Remove lz4 hadoop wrapper classes after Hadoop 3.3.2
> Key: SPARK-36679
> URL: https://issues.apache.org/jira/browse/SPARK-36679
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.3.0
> Reporter: L. C. Hsieh
> Priority: Major
> Fix For: 3.3.0
>
> Lz4-java as a provided dependency is not correctly excluded from relocation in the Hadoop shaded client libraries in Hadoop 3.3.1 (HADOOP-17891).
>
> In order to deal with the issue without reverting back to the non-shaded client libraries, we added a few Lz4 Hadoop wrapper classes `LZ4Factory`, `LZ4Compressor`, and `LZ4SafeDecompressor`, under the package `org.apache.hadoop.shaded.net.jpountz.lz4`.
>
> We should remove these wrapper classes after the Hadoop 3.3.2 release, which should include the fix.
[jira] [Resolved] (SPARK-38179) Improve WritableColumnVector to better support null struct
[ https://issues.apache.org/jira/browse/SPARK-38179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-38179.
Resolution: Won't Fix

> Improve WritableColumnVector to better support null struct
> Key: SPARK-38179
> URL: https://issues.apache.org/jira/browse/SPARK-38179
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Chao Sun
> Priority: Minor
>
> Currently a {{WritableColumnVector}} of struct type requires allocating space in all child vectors for null elements. This is not very space efficient. In addition, this model doesn't work well with Parquet vectorized scan for struct (in SPARK-34863).
[jira] [Assigned] (SPARK-38237) Introduce a new config to require all cluster keys on Aggregate
[ https://issues.apache.org/jira/browse/SPARK-38237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-38237:
Assignee: Cheng Su

> Introduce a new config to require all cluster keys on Aggregate
> Key: SPARK-38237
> URL: https://issues.apache.org/jira/browse/SPARK-38237
> Project: Spark
> Issue Type: Task
> Components: SQL, Structured Streaming
> Affects Versions: 3.3.0
> Reporter: Jungtaek Lim
> Assignee: Cheng Su
> Priority: Major
> Fix For: 3.3.0
>
> We still find HashClusteredDistribution to be useful for batch queries as well. For example, we had a case with lower parallelism than expected due to the fact that ClusteredDistribution is used for aggregation, which matches HashPartitioning with sub-key groups (note that the actual parallelism also depends on "cardinality" - picking sub-key groups means having less cardinality).
> We propose to introduce a new config to require all cluster keys on Aggregate, leveraging HashClusteredDistribution. That said, we propose to rename HashClusteredDistribution back, retaining the NOTE for the stateful operator. The distribution should still not be touched due to the requirement of the stateful operator, but it can be co-used with the batch case if needed.
[jira] [Resolved] (SPARK-38237) Introduce a new config to require all cluster keys on Aggregate
[ https://issues.apache.org/jira/browse/SPARK-38237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-38237.
Fix Version/s: 3.3.0
Resolution: Fixed

Issue resolved by pull request 35574
[https://github.com/apache/spark/pull/35574]

> Introduce a new config to require all cluster keys on Aggregate
> Key: SPARK-38237
> URL: https://issues.apache.org/jira/browse/SPARK-38237
> Project: Spark
> Issue Type: Task
> Components: SQL, Structured Streaming
> Affects Versions: 3.3.0
> Reporter: Jungtaek Lim
> Priority: Major
> Fix For: 3.3.0
>
> We still find HashClusteredDistribution to be useful for batch queries as well. For example, we had a case with lower parallelism than expected due to the fact that ClusteredDistribution is used for aggregation, which matches HashPartitioning with sub-key groups (note that the actual parallelism also depends on "cardinality" - picking sub-key groups means having less cardinality).
> We propose to introduce a new config to require all cluster keys on Aggregate, leveraging HashClusteredDistribution. That said, we propose to rename HashClusteredDistribution back, retaining the NOTE for the stateful operator. The distribution should still not be touched due to the requirement of the stateful operator, but it can be co-used with the batch case if needed.
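The resulting knob can then be turned on per session. A minimal sketch; the config name below is my reading of pull request 35574, so please verify it against your Spark build before relying on it:

{code:scala}
// Require aggregation (and similar operators) to use all cluster keys
// for distribution, avoiding the low-parallelism sub-key-group matching
// described above. Config name assumed from the linked PR.
spark.conf.set("spark.sql.requireAllClusterKeysForDistribution", "true")
{code}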
[jira] [Created] (SPARK-38179) Improve WritableColumnVector to better support null struct
Chao Sun created SPARK-38179: Summary: Improve WritableColumnVector to better support null struct Key: SPARK-38179 URL: https://issues.apache.org/jira/browse/SPARK-38179 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun Currently {{WritableColumnVector}} of struct type requires to allocate space in all child vectors for null elements. This is not very space efficient. In addition, this model doesn't work well with Parquet vectorized scan for struct (in SPARK-34863). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38077) Spark 3.2.1 breaks binary compatibility with Spark 3.2.0
[ https://issues.apache.org/jira/browse/SPARK-38077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484894#comment-17484894 ]

Chao Sun commented on SPARK-38077:

BTW [~thesamet] it seems Spark only guarantees API compatibility, not binary compatibility, across versions. See https://spark.apache.org/versioning-policy.html

> Spark 3.2.1 breaks binary compatibility with Spark 3.2.0
> Key: SPARK-38077
> URL: https://issues.apache.org/jira/browse/SPARK-38077
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.1
> Reporter: Nadav Samet
> Priority: Major
>
> [PR 35243|https://github.com/apache/spark/pull/35243] introduced a new parameter to class `Invoke` with a default value (`isDeterministic: Boolean = true`). Existing Spark libraries (such as [frameless|https://github.com/typelevel/frameless]) that invoke [Invoke|https://github.com/typelevel/frameless/blob/29961d549e332dddf5cd711ef699dde7460cc48a/dataset/src/main/scala/frameless/RecordEncoder.scala#L154] directly expect a method with 7 parameters, and the new version expects 8.
> If Frameless were recompiled with Spark 3.2.1, the updated library would NOT be binary compatible with Spark 3.2.0. Adding default parameters to existing methods [should be avoided|https://github.com/jatcwang/binary-compatibility-guide#dont-adding-parameters-with-default-values-to-methods].
> One way forward would be to revert the change in the constructor and introduce a second constructor or a companion method that takes all 8 parameters.
[jira] [Commented] (SPARK-38077) Spark 3.2.1 breaks binary compatibility with Spark 3.2.0
[ https://issues.apache.org/jira/browse/SPARK-38077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484873#comment-17484873 ]

Chao Sun commented on SPARK-38077:

Sorry for breaking the binary compatibility. I wasn't aware that `Invoke` is used by other libraries outside Spark and was merely following how other parameters are defined (namely `propagateNull` and `returnNullable`). Let me work on a PR to fix it.

> Spark 3.2.1 breaks binary compatibility with Spark 3.2.0
> Key: SPARK-38077
> URL: https://issues.apache.org/jira/browse/SPARK-38077
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.1
> Reporter: Nadav Samet
> Priority: Major
>
> [PR 35243|https://github.com/apache/spark/pull/35243] introduced a new parameter to class `Invoke` with a default value (`isDeterministic: Boolean = true`). Existing Spark libraries (such as [frameless|https://github.com/typelevel/frameless]) that invoke [Invoke|https://github.com/typelevel/frameless/blob/29961d549e332dddf5cd711ef699dde7460cc48a/dataset/src/main/scala/frameless/RecordEncoder.scala#L154] directly expect a method with 7 parameters, and the new version expects 8.
> If Frameless were recompiled with Spark 3.2.1, the updated library would NOT be binary compatible with Spark 3.2.0. Adding default parameters to existing methods [should be avoided|https://github.com/jatcwang/binary-compatibility-guide#dont-adding-parameters-with-default-values-to-methods].
> One way forward would be to revert the change in the constructor and introduce a second constructor or a companion method that takes all 8 parameters.
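To illustrate why a default parameter is not binary compatible: the default value is filled in at the *call site* at compile time, so code compiled against the old constructor still references the old JVM signature. A minimal sketch with made-up parameters, not the real `Invoke` signature:

{code:scala}
// Version 1, as a downstream library sees it at compile time:
//   class Invoke(a: Int, b: Int)                     // JVM: <init>(II)V
// Version 2 adds a defaulted parameter:
//   class Invoke(a: Int, b: Int, c: Boolean = true)  // JVM: <init>(IIZ)V
//
// A caller compiled against version 1 emitted a call to <init>(II)V,
// which no longer exists in version 2 => NoSuchMethodError at runtime,
// even though `new Invoke(1, 2)` still compiles fine against version 2.
//
// The fix proposed in this issue: keep the old arity as an explicit
// secondary constructor so the old JVM signature survives.
class Invoke(a: Int, b: Int, c: Boolean) {
  def this(a: Int, b: Int) = this(a, b, true) // preserves <init>(II)V
}
{code}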
[jira] [Commented] (SPARK-37994) Unable to build spark3.2 with -Dhadoop.version=3.1.4
[ https://issues.apache.org/jira/browse/SPARK-37994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483399#comment-17483399 ]

Chao Sun commented on SPARK-37994:

Glad it helped [~tanvu]!
{quote}We can omit the -Dcurator.version=2.13.0 -Dcommons-io.version=2.8.0 part, though{quote}
Yea, perhaps. I added them here just to keep the versions in sync with what is being used by Hadoop 3.x. It's annoying that we have to make it compile this way, though. Let me think about whether I should resume SPARK-35959 and add a Maven profile for this.

> Unable to build spark3.2 with -Dhadoop.version=3.1.4
> Key: SPARK-37994
> URL: https://issues.apache.org/jira/browse/SPARK-37994
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 3.2.0
> Reporter: Vu Tan
> Priority: Minor
>
> I downloaded the Spark 3.2 source code from [https://github.com/apache/spark/archive/refs/tags/v3.2.0.zip] and tried building with the below command:
> {code:java}
> ./dev/make-distribution.sh --name without-hadoop --pip --r --tgz -Psparkr -Phive -Phive-thriftserver -Phadoop-provided -Pyarn -Dhadoop.version=3.1.4 -Pkubernetes {code}
> Then it gives the below error:
> {code:java}
> [INFO] --- scala-maven-plugin:4.3.0:compile (scala-compile-first) @ spark-core_2.12 ---
> [INFO] Using incremental compilation using Mixed compile order
> [INFO] Compiler bridge file: /Users/JP28431/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.15__52.0-1.3.1_20191012T045515.jar
> [INFO] compiler plugin: BasicArtifact(com.github.ghik,silencer-plugin_2.12.15,1.7.6,null)
> [INFO] Compiling 567 Scala sources and 104 Java sources to /Users/JP28431/Downloads/spark-3.2.0-github/core/target/scala-2.12/classes ...
> [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:38: > object io is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:2778: > not found: type ArrayWritable > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:2777: > not found: type Writable > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SSLOptions.scala:24: > object conf is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SSLOptions.scala:174: > not found: type Configuration > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SecurityManager.scala:25: > object io is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SecurityManager.scala:26: > object security is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala:33: > object fs is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala:32: > object conf is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala:121: > not found: type Configuration > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SecurityManager.scala:284: > not found: value UserGroupInformation > [ERROR] [Error] > 
/Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:41: > object mapreduce is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:40: > object mapreduce is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:39: > object mapred is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:37: > object fs is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:36: > object conf is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SecurityManager.scala:348: > not found: type Credenti
[jira] [Commented] (SPARK-37994) Unable to build spark3.2 with -Dhadoop.version=3.1.4
[ https://issues.apache.org/jira/browse/SPARK-37994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482632#comment-17482632 ]

Chao Sun commented on SPARK-37994:

[~tanvu] Hmm, in that case maybe you can try:
{code}
./dev/make-distribution.sh --name without-hadoop --pip --tgz -Psparkr -Phive -Phive-thriftserver -Phadoop-provided -Pyarn \
  -Dhadoop.version=3.1.4 -Phadoop-2.7 -Dcurator.version=2.13.0 -Dcommons-io.version=2.8.0
{code}
I tried it and it seems to work.

{quote}-Dhadoop-client-runtime.artifact should be hadoop-client, not hadoop-yarn-api{quote}

That PR is outdated. We switched to using hadoop-yarn-api in order to avoid the exact issue around dependency-reduced-pom.xml you mentioned above.

> Unable to build spark3.2 with -Dhadoop.version=3.1.4
> Key: SPARK-37994
> URL: https://issues.apache.org/jira/browse/SPARK-37994
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 3.2.0
> Reporter: Vu Tan
> Priority: Minor
>
> I downloaded the Spark 3.2 source code from [https://github.com/apache/spark/archive/refs/tags/v3.2.0.zip] and tried building with the below command:
> {code:java}
> ./dev/make-distribution.sh --name without-hadoop --pip --r --tgz -Psparkr -Phive -Phive-thriftserver -Phadoop-provided -Pyarn -Dhadoop.version=3.1.4 -Pkubernetes {code}
> Then it gives the below error:
> {code:java}
> [INFO] --- scala-maven-plugin:4.3.0:compile (scala-compile-first) @ spark-core_2.12 ---
> [INFO] Using incremental compilation using Mixed compile order
> [INFO] Compiler bridge file: /Users/JP28431/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.15__52.0-1.3.1_20191012T045515.jar
> [INFO] compiler plugin: BasicArtifact(com.github.ghik,silencer-plugin_2.12.15,1.7.6,null)
> [INFO] Compiling 567 Scala sources and 104 Java sources to /Users/JP28431/Downloads/spark-3.2.0-github/core/target/scala-2.12/classes ...
> [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:38: > object io is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:2778: > not found: type ArrayWritable > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:2777: > not found: type Writable > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SSLOptions.scala:24: > object conf is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SSLOptions.scala:174: > not found: type Configuration > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SecurityManager.scala:25: > object io is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SecurityManager.scala:26: > object security is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala:33: > object fs is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala:32: > object conf is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala:121: > not found: type Configuration > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SecurityManager.scala:284: > not found: value UserGroupInformation > [ERROR] [Error] > 
/Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:41: > object mapreduce is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:40: > object mapreduce is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:39: > object mapred is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:37: > object fs is not a member of package org.apache.hadoop > [ERROR] [Error] > /Users/JP28431/Downloads/spark-3.2.0-github/core/src/main/scala/org/apache/spark/SparkContext.scala:36: > object conf is not a member of package org.apache.
[jira] [Commented] (SPARK-37994) Unable to build spark3.2 with -Dhadoop.version=3.1.4
[ https://issues.apache.org/jira/browse/SPARK-37994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481327#comment-17481327 ] Chao Sun commented on SPARK-37994: -- I considered adding a new Maven profile for Hadoop versions <= 2.x (see SPARK-35959) but abandoned it due to lack of interest. I could pick it up again if people think it is a good idea. > Unable to build spark3.2 with -Dhadoop.version=3.1.4 > > > Key: SPARK-37994 > URL: https://issues.apache.org/jira/browse/SPARK-37994 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 >Reporter: Vu Tan >Priority: Minor > > I downloaded the Spark 3.2 source code from > [https://github.com/apache/spark/archive/refs/tags/v3.2.0.zip] > and tried building with the command below > {code:java} > ./dev/make-distribution.sh --name without-hadoop --pip --r --tgz -Psparkr > -Phive -Phive-thriftserver -Phadoop-provided -Pyarn -Dhadoop.version=3.1.4 > -Pkubernetes {code} > It then fails with the error below > {code:java} > [INFO] --- scala-maven-plugin:4.3.0:compile (scala-compile-first) @ > spark-core_2.12 --- > [INFO] Using incremental compilation using Mixed compile order > [INFO] Compiler bridge file: > /Users/JP28431/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.15__52.0-1.3.1_20191012T045515.jar > [INFO] compiler plugin: > BasicArtifact(com.github.ghik,silencer-plugin_2.12.15,1.7.6,null) > [INFO] Compiling 567 Scala sources and 104 Java sources to > /Users/JP28431/Downloads/spark-3.2.0-github/core/target/scala-2.12/classes ... 
[jira] [Commented] (SPARK-37994) Unable to build spark3.2 with -Dhadoop.version=3.1.4
[ https://issues.apache.org/jira/browse/SPARK-37994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481326#comment-17481326 ] Chao Sun commented on SPARK-37994: -- Yes, thanks [~xkrogen] for pinging me. [~tanvu]: can you try this command instead? {code} ./dev/make-distribution.sh --name without-hadoop --pip --tgz -Psparkr -Phive -Phive-thriftserver -Phadoop-provided -Pyarn \ -Dhadoop.version=3.1.4 -Pkubernetes \ -Dhadoop-client-api.artifact=hadoop-client \ -Dhadoop-client-runtime.artifact=hadoop-yarn-api \ -Dhadoop-client-minicluster.artifact=hadoop-client {code} > Unable to build spark3.2 with -Dhadoop.version=3.1.4 > > > Key: SPARK-37994 > URL: https://issues.apache.org/jira/browse/SPARK-37994 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 >Reporter: Vu Tan >Priority: Minor > > I downloaded the Spark 3.2 source code from > [https://github.com/apache/spark/archive/refs/tags/v3.2.0.zip] > and tried building with the command below > {code:java} > ./dev/make-distribution.sh --name without-hadoop --pip --r --tgz -Psparkr > -Phive -Phive-thriftserver -Phadoop-provided -Pyarn -Dhadoop.version=3.1.4 > -Pkubernetes {code} > It then fails with the error below > {code:java} > [INFO] --- scala-maven-plugin:4.3.0:compile (scala-compile-first) @ > spark-core_2.12 --- > [INFO] Using incremental compilation using Mixed compile order > [INFO] Compiler bridge file: > /Users/JP28431/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.15__52.0-1.3.1_20191012T045515.jar > [INFO] compiler plugin: > BasicArtifact(com.github.ghik,silencer-plugin_2.12.15,1.7.6,null) > [INFO] Compiling 567 Scala sources and 104 Java sources to > /Users/JP28431/Downloads/spark-3.2.0-github/core/target/scala-2.12/classes ... 
[jira] [Updated] (SPARK-37957) Deterministic flag is not handled for V2 functions
[ https://issues.apache.org/jira/browse/SPARK-37957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37957: - Fix Version/s: 3.2.1 > Deterministic flag is not handled for V2 functions > -- > > Key: SPARK-37957 > URL: https://issues.apache.org/jira/browse/SPARK-37957 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.2.1, 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37957) Deterministic flag is not handled for V2 functions
[ https://issues.apache.org/jira/browse/SPARK-37957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37957. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35243 [https://github.com/apache/spark/pull/35243] > Deterministic flag is not handled for V2 functions > -- > > Key: SPARK-37957 > URL: https://issues.apache.org/jira/browse/SPARK-37957 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37928) Add Parquet Data Page V2 bench scenario to DataSourceReadBenchmark
[ https://issues.apache.org/jira/browse/SPARK-37928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37928: Assignee: Yang Jie > Add Parquet Data Page V2 bench scenario to DataSourceReadBenchmark > -- > > Key: SPARK-37928 > URL: https://issues.apache.org/jira/browse/SPARK-37928 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37928) Add Parquet Data Page V2 bench scenario to DataSourceReadBenchmark
[ https://issues.apache.org/jira/browse/SPARK-37928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37928. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35226 [https://github.com/apache/spark/pull/35226] > Add Parquet Data Page V2 bench scenario to DataSourceReadBenchmark > -- > > Key: SPARK-37928 > URL: https://issues.apache.org/jira/browse/SPARK-37928 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37957) Deterministic flag is not handled for V2 functions
Chao Sun created SPARK-37957: Summary: Deterministic flag is not handled for V2 functions Key: SPARK-37957 URL: https://issues.apache.org/jira/browse/SPARK-37957 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun Assignee: Chao Sun -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
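The gist of the issue above is that the planner must consult a function's deterministic flag before pre-computing or reusing its result. As a loose illustration (plain Python with made-up names like `V2Function` and `try_constant_fold`; this is not Spark's actual FunctionCatalog API), a constant-folding pass that honors the flag might look like:

```python
import random

# Hypothetical sketch: an optimizer should only constant-fold a call with
# literal arguments when the function declares itself deterministic.
class V2Function:
    def __init__(self, fn, deterministic):
        self.fn = fn
        self.deterministic = deterministic

def try_constant_fold(func, args):
    """Fold the call into a literal if safe; otherwise keep the call node."""
    if func.deterministic:
        return ("literal", func.fn(*args))
    return ("call", func, args)

add = V2Function(lambda a, b: a + b, deterministic=True)
rand = V2Function(lambda: random.random(), deterministic=False)

print(try_constant_fold(add, (1, 2)))  # ('literal', 3)
print(try_constant_fold(rand, ())[0])  # 'call' -- must not be folded
```

Folding a non-deterministic call would freeze a single random value into the plan, which is exactly the class of bug a respected deterministic flag prevents.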
[jira] [Resolved] (SPARK-37864) Support Parquet v2 data page RLE encoding (for Boolean Values) for the vectorized path
[ https://issues.apache.org/jira/browse/SPARK-37864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37864. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35163 [https://github.com/apache/spark/pull/35163] > Support Parquet v2 data page RLE encoding (for Boolean Values) for the > vectorized path > -- > > Key: SPARK-37864 > URL: https://issues.apache.org/jira/browse/SPARK-37864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.3.0 > > > Parquet v2 data pages write Boolean values using RLE encoding; reading v2 > Boolean values currently throws the following exception: > > {code:java} > Caused by: java.lang.UnsupportedOperationException: Unsupported encoding: RLE > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.getValuesReader(VectorizedColumnReader.java:305) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:277) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV2(VectorizedColumnReader.java:344) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$100(VectorizedColumnReader.java:48) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:250) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:237) > ~[classes/:?] > at org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:192) > ~[parquet-column-1.12.2.jar:1.12.2] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:237) > ~[classes/:?] 
> at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:173) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:311) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:209) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:298) > ~[classes/:?] > ... 19 more {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
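For context, the unsupported encoding in the trace above is Parquet's RLE/bit-packed hybrid, which v2 data pages use for Boolean values. The toy encoder/decoder below illustrates only the run-length idea in plain Python; it is not the real Parquet wire format (which interleaves RLE and bit-packed runs and uses varint-encoded run headers):

```python
# Simplified run-length encoding for Boolean values: collapse consecutive
# repeats into (count, value) pairs, then expand them back.
def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, v])       # start a new run
    return [(n, v) for n, v in runs]

def rle_decode(runs):
    out = []
    for count, value in runs:
        out.extend([value] * count)
    return out

flags = [True, True, True, False, False, True]
runs = rle_encode(flags)
print(runs)                     # [(3, True), (2, False), (1, True)]
assert rle_decode(runs) == flags
```

Long runs of identical Booleans (very common in filter columns) compress to a handful of pairs, which is why the format favors this encoding.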
[jira] [Assigned] (SPARK-37864) Support Parquet v2 data page RLE encoding (for Boolean Values) for the vectorized path
[ https://issues.apache.org/jira/browse/SPARK-37864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37864: Assignee: Yang Jie > Support Parquet v2 data page RLE encoding (for Boolean Values) for the > vectorized path > -- > > Key: SPARK-37864 > URL: https://issues.apache.org/jira/browse/SPARK-37864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > Parquet v2 data pages write Boolean values using RLE encoding; reading v2 > Boolean values currently throws the following exception: > > {code:java} > Caused by: java.lang.UnsupportedOperationException: Unsupported encoding: RLE > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.getValuesReader(VectorizedColumnReader.java:305) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:277) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV2(VectorizedColumnReader.java:344) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$100(VectorizedColumnReader.java:48) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:250) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:237) > ~[classes/:?] > at org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:192) > ~[parquet-column-1.12.2.jar:1.12.2] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:237) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:173) > ~[classes/:?] 
> at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:311) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:209) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:298) > ~[classes/:?] > ... 19 more {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36879) Support Parquet v2 data page encodings for the vectorized path
[ https://issues.apache.org/jira/browse/SPARK-36879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-36879: Assignee: Parth Chandra > Support Parquet v2 data page encodings for the vectorized path > -- > > Key: SPARK-36879 > URL: https://issues.apache.org/jira/browse/SPARK-36879 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Parth Chandra >Priority: Major > Fix For: 3.3.0 > > > Currently Spark only supports Parquet V1 encodings (i.e., > PLAIN/DICTIONARY/RLE) in the vectorized path, and throws an exception otherwise: > {code} > java.lang.UnsupportedOperationException: Unsupported encoding: > DELTA_BYTE_ARRAY > {code} > It would be good to support v2 encodings too, including DELTA_BINARY_PACKED, > DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY as well as BYTE_STREAM_SPLIT as > listed in https://github.com/apache/parquet-format/blob/master/Encodings.md -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36879) Support Parquet v2 data page encodings for the vectorized path
[ https://issues.apache.org/jira/browse/SPARK-36879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-36879. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34471 [https://github.com/apache/spark/pull/34471] > Support Parquet v2 data page encodings for the vectorized path > -- > > Key: SPARK-36879 > URL: https://issues.apache.org/jira/browse/SPARK-36879 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > Fix For: 3.3.0 > > > Currently Spark only supports Parquet V1 encodings (i.e., > PLAIN/DICTIONARY/RLE) in the vectorized path, and throws an exception otherwise: > {code} > java.lang.UnsupportedOperationException: Unsupported encoding: > DELTA_BYTE_ARRAY > {code} > It would be good to support v2 encodings too, including DELTA_BINARY_PACKED, > DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY as well as BYTE_STREAM_SPLIT as > listed in https://github.com/apache/parquet-format/blob/master/Encodings.md -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled
[ https://issues.apache.org/jira/browse/SPARK-37633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37633: - Affects Version/s: (was: 3.0.3) > Unwrap cast should skip if downcast failed with ansi enabled > > > Key: SPARK-37633 > URL: https://issues.apache.org/jira/browse/SPARK-37633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > Fix For: 3.2.1, 3.3.0 > > > Currently, unwrap cast throws an ArithmeticException if the downcast fails with > ANSI mode enabled. Since UnwrapCastInBinaryComparison is an optimizer rule, we > should always skip on failure regardless of the ANSI config. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled
[ https://issues.apache.org/jira/browse/SPARK-37633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37633: Assignee: Manu Zhang > Unwrap cast should skip if downcast failed with ansi enabled > > > Key: SPARK-37633 > URL: https://issues.apache.org/jira/browse/SPARK-37633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > > Currently, unwrap cast throws an ArithmeticException if the downcast fails with > ANSI mode enabled. Since UnwrapCastInBinaryComparison is an optimizer rule, we > should always skip on failure regardless of the ANSI config. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled
[ https://issues.apache.org/jira/browse/SPARK-37633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37633. -- Fix Version/s: 3.3.0 3.2.1 Resolution: Fixed Issue resolved by pull request 34888 [https://github.com/apache/spark/pull/34888] > Unwrap cast should skip if downcast failed with ansi enabled > > > Key: SPARK-37633 > URL: https://issues.apache.org/jira/browse/SPARK-37633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > Fix For: 3.3.0, 3.2.1 > > > Currently, unwrap cast throws an ArithmeticException if the downcast fails with > ANSI mode enabled. Since UnwrapCastInBinaryComparison is an optimizer rule, we > should always skip on failure regardless of the ANSI config. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
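The skip-on-failure behavior described above can be pictured with a toy downcast helper (plain Python with a hypothetical name, `try_downcast_to_int`; this is not Spark's implementation): when a literal does not fit the narrower type, the rewrite is abandoned instead of raising, no matter what the ANSI setting says:

```python
# 32-bit signed integer range, the target type of the hypothetical downcast.
INT_MIN, INT_MAX = -(2**31), 2**31 - 1

def try_downcast_to_int(value):
    """Attempt the downcast an optimizer rule like
    UnwrapCastInBinaryComparison needs: return the narrowed value, or None
    to signal 'skip the rewrite' rather than raising an exception."""
    if INT_MIN <= value <= INT_MAX:
        return int(value)
    return None  # out of range: leave the original cast(col) = literal alone

# In range: the comparison can be rewritten against the narrower column.
assert try_downcast_to_int(42) == 42
# Out of range: the rewrite is skipped instead of failing the whole query.
assert try_downcast_to_int(2**40) is None
```

Returning a sentinel instead of throwing keeps an optimization rule purely optional, which is the point of the fix.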
[jira] [Updated] (SPARK-37217) The number of dynamic partitions should early check when writing to external tables
[ https://issues.apache.org/jira/browse/SPARK-37217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37217: - Fix Version/s: 3.2.1 > The number of dynamic partitions should early check when writing to external > tables > --- > > Key: SPARK-37217 > URL: https://issues.apache.org/jira/browse/SPARK-37217 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.2.1, 3.3.0 > > > [SPARK-29295|https://issues.apache.org/jira/browse/SPARK-29295] introduced a > mechanism where writes to external tables use dynamic partitioning, and > the data in the target partitions is deleted first. > Assuming that 1001 partitions are written, the data of those 1001 partitions will > be deleted first, but because hive.exec.max.dynamic.partitions is 1000 by > default, loadDynamicPartitions will then fail, even though the data of the 1001 > partitions has already been deleted. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37481) Disappearance of skipped stages mislead the bug hunting
[ https://issues.apache.org/jira/browse/SPARK-37481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37481: - Fix Version/s: 3.2.1 (was: 3.2.0) > Disappearance of skipped stages mislead the bug hunting > > > Key: SPARK-37481 > URL: https://issues.apache.org/jira/browse/SPARK-37481 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.2.1, 3.3.0 > > > ## With FetchFailedException and Map Stage Retries > When rerunning the spark-sql shell with the original SQL in > [https://gist.github.com/yaooqinn/6acb7b74b343a6a6dffe8401f6b7b45c#gistcomment-3977315] > !https://user-images.githubusercontent.com/8326978/143821530-ff498caa-abce-483d-a24b-315aacf7e0a0.png! > 1. stage 3 threw FetchFailedException and caused itself and its parent > stage (stage 2) to retry > 2. stage 2 was skipped before, but its attemptId was still 0, so when its > retry happened it got removed from `Skipped Stages` > The DAG of Job 2 no longer shows that stage 2 is skipped. > !https://user-images.githubusercontent.com/8326978/143824666-6390b64a-a45b-4bc8-b05d-c5abbb28cdef.png! > Besides, a retried stage usually has a subset of tasks from the original > stage. If we mark it as an original one, the metrics might lead us into > pitfalls. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37217) The number of dynamic partitions should early check when writing to external tables
[ https://issues.apache.org/jira/browse/SPARK-37217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37217: Assignee: dzcxzl > The number of dynamic partitions should early check when writing to external > tables > --- > > Key: SPARK-37217 > URL: https://issues.apache.org/jira/browse/SPARK-37217 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > > [SPARK-29295|https://issues.apache.org/jira/browse/SPARK-29295] introduced a > mechanism where writes to external tables use dynamic partitioning, and > the data in the target partitions is deleted first. > Assuming that 1001 partitions are written, the data of those 1001 partitions will > be deleted first, but because hive.exec.max.dynamic.partitions is 1000 by > default, loadDynamicPartitions will then fail, even though the data of the 1001 > partitions has already been deleted. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37217) The number of dynamic partitions should early check when writing to external tables
[ https://issues.apache.org/jira/browse/SPARK-37217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37217. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34493 [https://github.com/apache/spark/pull/34493] > The number of dynamic partitions should early check when writing to external > tables > --- > > Key: SPARK-37217 > URL: https://issues.apache.org/jira/browse/SPARK-37217 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.3.0 > > > [SPARK-29295|https://issues.apache.org/jira/browse/SPARK-29295] introduced a > mechanism where writes to external tables use dynamic partitioning, and > the data in the target partitions is deleted first. > Assuming that 1001 partitions are written, the data of those 1001 partitions will > be deleted first, but because hive.exec.max.dynamic.partitions is 1000 by > default, loadDynamicPartitions will then fail, even though the data of the 1001 > partitions has already been deleted. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
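The fix amounts to validating the dynamic-partition count before any destructive step. A rough Python sketch of that ordering (hypothetical helper name, not Hive's or Spark's actual code; the default of 1000 mirrors hive.exec.max.dynamic.partitions):

```python
def check_dynamic_partitions(partitions, max_dynamic_partitions=1000):
    """Early check: validate the number of dynamic partitions *before*
    deleting any target-partition data, so a limit violation cannot leave
    partitions half-deleted."""
    if len(partitions) > max_dynamic_partitions:
        raise ValueError(
            f"writing {len(partitions)} partitions exceeds the limit of "
            f"{max_dynamic_partitions}; aborting before any data is deleted")
    # ...only now is it safe to delete target partitions and load new data...

check_dynamic_partitions([f"dt={i}" for i in range(1000)])  # within the limit
try:
    check_dynamic_partitions([f"dt={i}" for i in range(1001)])
except ValueError as e:
    print("rejected early:", e)
```

Moving the check in front of the delete turns a data-loss bug into a clean, recoverable failure.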
[jira] [Resolved] (SPARK-37573) IsolatedClient fallbackVersion should be build in version, not always 2.7.4
[ https://issues.apache.org/jira/browse/SPARK-37573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37573. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34830 [https://github.com/apache/spark/pull/34830] > IsolatedClient fallbackVersion should be build in version, not always 2.7.4 > > > Key: SPARK-37573 > URL: https://issues.apache.org/jira/browse/SPARK-37573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0 > > > On Hadoop 3, falling back to 2.7.4 causes an error > {code} > [info] org.apache.spark.sql.hive.client.VersionsSuite *** ABORTED *** (31 > seconds, 320 milliseconds) > [info] java.lang.ClassFormatError: Truncated class file > [info] at java.lang.ClassLoader.defineClass1(Native Method) > [info] at java.lang.ClassLoader.defineClass(ClassLoader.java:756) > [info] at java.lang.ClassLoader.defineClass(ClassLoader.java:635) > [info] at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:266) > [info] at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:258) > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:405) > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > [info] at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:313) > [info] at > org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:50) > [info] at > org.apache.spark.sql.hive.client.VersionsSuite.$anonfun$new$2(VersionsSuite.scala:82) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at 
org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > [info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > [info] at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62) > [info] at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) > [info] at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) > [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) > [info] at > org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563) > [info] at org.scalatest.Suite.run(Suite.scala:1112) > [info] at org.scalatest.Suite.run$(Suite.scala:1094) > [ > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37573) IsolatedClient fallbackVersion should be build in version, not always 2.7.4
[ https://issues.apache.org/jira/browse/SPARK-37573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37573: Assignee: angerszhu > IsolatedClient fallbackVersion should be build in version, not always 2.7.4 > > > Key: SPARK-37573 > URL: https://issues.apache.org/jira/browse/SPARK-37573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > Hadoop 3 fallback to 2.7.4 cause error > {code} > [info] org.apache.spark.sql.hive.client.VersionsSuite *** ABORTED *** (31 > seconds, 320 milliseconds) > [info] java.lang.ClassFormatError: Truncated class file > [info] at java.lang.ClassLoader.defineClass1(Native Method) > [info] at java.lang.ClassLoader.defineClass(ClassLoader.java:756) > [info] at java.lang.ClassLoader.defineClass(ClassLoader.java:635) > [info] at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:266) > [info] at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:258) > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:405) > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > [info] at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:313) > [info] at > org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:50) > [info] at > org.apache.spark.sql.hive.client.VersionsSuite.$anonfun$new$2(VersionsSuite.scala:82) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > 
[info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > [info] at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62) > [info] at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) > [info] at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) > [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) > [info] at > org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563) > [info] at org.scalatest.Suite.run(Suite.scala:1112) > [info] at org.scalatest.Suite.run$(Suite.scala:1094) > [ > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37600) Upgrade to Hadoop 3.3.2
Chao Sun created SPARK-37600: Summary: Upgrade to Hadoop 3.3.2 Key: SPARK-37600 URL: https://issues.apache.org/jira/browse/SPARK-37600 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.3.0 Reporter: Chao Sun Upgrade Spark to use Hadoop 3.3.2 once it's released.
[jira] [Assigned] (SPARK-37561) Avoid loading all functions when obtaining hive's DelegationToken
[ https://issues.apache.org/jira/browse/SPARK-37561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37561: Assignee: dzcxzl > Avoid loading all functions when obtaining hive's DelegationToken > - > > Key: SPARK-37561 > URL: https://issues.apache.org/jira/browse/SPARK-37561 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Attachments: getDelegationToken_load_functions.png > > > At present, obtaining Hive's delegation token loads all functions. > This is unnecessary: loading the functions takes time and also increases the burden on the Hive metastore. > > !getDelegationToken_load_functions.png!
[jira] [Resolved] (SPARK-37561) Avoid loading all functions when obtaining hive's DelegationToken
[ https://issues.apache.org/jira/browse/SPARK-37561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37561. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34822 [https://github.com/apache/spark/pull/34822] > Avoid loading all functions when obtaining hive's DelegationToken > - > > Key: SPARK-37561 > URL: https://issues.apache.org/jira/browse/SPARK-37561 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.3.0 > > Attachments: getDelegationToken_load_functions.png > > > At present, obtaining Hive's delegation token loads all functions. > This is unnecessary: loading the functions takes time and also increases the burden on the Hive metastore. > > !getDelegationToken_load_functions.png!
[jira] [Resolved] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN
[ https://issues.apache.org/jira/browse/SPARK-37205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37205. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34635 [https://github.com/apache/spark/pull/34635] > Support mapreduce.job.send-token-conf when starting containers in YARN > -- > > Key: SPARK-37205 > URL: https://issues.apache.org/jira/browse/SPARK-37205 > Project: Spark > Issue Type: New Feature > Components: YARN >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0 > > > {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]) with which the RM is not required to statically have configs for all the secure HDFS clusters. > Currently it only works for MRv2, but it'd be nice if Spark could also use this feature. I think we only need to pass the config to {{ContainerLaunchContext}} in {{Client.createContainerLaunchContext}}.
[jira] [Assigned] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN
[ https://issues.apache.org/jira/browse/SPARK-37205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37205: Assignee: Chao Sun > Support mapreduce.job.send-token-conf when starting containers in YARN > -- > > Key: SPARK-37205 > URL: https://issues.apache.org/jira/browse/SPARK-37205 > Project: Spark > Issue Type: New Feature > Components: YARN >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]) with which the RM is not required to statically have configs for all the secure HDFS clusters. > Currently it only works for MRv2, but it'd be nice if Spark could also use this feature. I think we only need to pass the config to {{ContainerLaunchContext}} in {{Client.createContainerLaunchContext}}.
[jira] [Assigned] (SPARK-37445) Update hadoop-profile
[ https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37445: Assignee: angerszhu > Update hadoop-profile > - > > Key: SPARK-37445 > URL: https://issues.apache.org/jira/browse/SPARK-37445 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > The current Hadoop profile is hadoop-3.2; update it to hadoop-3.3.
[jira] [Resolved] (SPARK-37445) Update hadoop-profile
[ https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37445. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34715 [https://github.com/apache/spark/pull/34715] > Update hadoop-profile > - > > Key: SPARK-37445 > URL: https://issues.apache.org/jira/browse/SPARK-37445 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0 > > > The current Hadoop profile is hadoop-3.2; update it to hadoop-3.3.
[jira] [Updated] (SPARK-36529) Decouple CPU with IO work in vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-36529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36529: - Attachment: (was: image.png) > Decouple CPU with IO work in vectorized Parquet reader > -- > > Key: SPARK-36529 > URL: https://issues.apache.org/jira/browse/SPARK-36529 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Currently it seems the vectorized Parquet reader does almost everything in a sequential manner: > 1. read the row group using the file system API (perhaps from remote storage like S3) > 2. allocate buffers and store those row group bytes into them > 3. decompress the data pages > 4. in Spark, decode all the read columns one by one > 5. read the next row group and repeat from 1. > A lot of improvements can be done to decouple the IO- and CPU-intensive work. In addition, we could parallelize the row group loading and column decoding, utilizing all the cores available to a Spark task.
[jira] [Updated] (SPARK-36529) Decouple CPU with IO work in vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-36529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36529: - Attachment: image.png > Decouple CPU with IO work in vectorized Parquet reader > -- > > Key: SPARK-36529 > URL: https://issues.apache.org/jira/browse/SPARK-36529 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Currently it seems the vectorized Parquet reader does almost everything in a sequential manner: > 1. read the row group using the file system API (perhaps from remote storage like S3) > 2. allocate buffers and store those row group bytes into them > 3. decompress the data pages > 4. in Spark, decode all the read columns one by one > 5. read the next row group and repeat from 1. > A lot of improvements can be done to decouple the IO- and CPU-intensive work. In addition, we could parallelize the row group loading and column decoding, utilizing all the cores available to a Spark task.
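The sequential steps listed in SPARK-36529 suggest an obvious pipelining opportunity. The following is a minimal, hypothetical sketch of the idea, not Spark's actual reader: a bounded producer/consumer queue lets the IO for row group N+1 overlap with the CPU-bound decoding of row group N. The class name `PipelinedReaderSketch` and the simulated IO/decode work are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedReaderSketch {
  private static final byte[] POISON = new byte[0]; // end-of-stream marker

  public static List<Integer> readAll(List<byte[]> rowGroups) throws InterruptedException {
    // Bounded queue: caps how many row groups are buffered ahead of the decoder.
    BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(2);
    Thread io = new Thread(() -> {
      try {
        for (byte[] rg : rowGroups) queue.put(rg); // stand-in for remote IO (step 1)
        queue.put(POISON);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    io.start();
    List<Integer> decoded = new ArrayList<>();
    while (true) {
      byte[] rg = queue.take();
      if (rg == POISON) break;
      int sum = 0; // stand-in for CPU-bound column decoding (steps 3-4)
      for (byte b : rg) sum += b;
      decoded.add(sum);
    }
    io.join();
    return decoded;
  }

  public static void main(String[] args) throws InterruptedException {
    List<byte[]> groups = List.of(new byte[]{1, 2}, new byte[]{3, 4, 5});
    System.out.println(readAll(groups)); // [3, 12]
  }
}
```

A real implementation would additionally parallelize the decode side across columns; the queue bound then becomes the knob trading memory for IO/CPU overlap.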
[jira] [Resolved] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans
[ https://issues.apache.org/jira/browse/SPARK-35867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-35867. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34611 [https://github.com/apache/spark/pull/34611] > Enable vectorized read for VectorizedPlainValuesReader.readBooleans > --- > > Key: SPARK-35867 > URL: https://issues.apache.org/jira/browse/SPARK-35867 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Kazuyuki Tanimura >Priority: Minor > Fix For: 3.3.0 > > > Currently we decode PLAIN encoded booleans as follows: > {code:java} > public final void readBooleans(int total, WritableColumnVector c, int > rowId) { > // TODO: properly vectorize this > for (int i = 0; i < total; i++) { > c.putBoolean(rowId + i, readBoolean()); > } > } > {code} > Ideally we should vectorize this.
[jira] [Assigned] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans
[ https://issues.apache.org/jira/browse/SPARK-35867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-35867: Assignee: Kazuyuki Tanimura > Enable vectorized read for VectorizedPlainValuesReader.readBooleans > --- > > Key: SPARK-35867 > URL: https://issues.apache.org/jira/browse/SPARK-35867 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Kazuyuki Tanimura >Priority: Minor > > Currently we decode PLAIN encoded booleans as follows: > {code:java} > public final void readBooleans(int total, WritableColumnVector c, int > rowId) { > // TODO: properly vectorize this > for (int i = 0; i < total; i++) { > c.putBoolean(rowId + i, readBoolean()); > } > } > {code} > Ideally we should vectorize this.
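For context on the `readBooleans` TODO above: Parquet's PLAIN encoding bit-packs booleans eight per byte, least-significant bit first, so the per-value `readBoolean()` call can be replaced by direct shifts on the packed buffer. The sketch below illustrates that idea only; it is not the merged patch, and `UnpackBooleansSketch` is a hypothetical standalone name.

```java
public class UnpackBooleansSketch {
  // Unpack `total` bit-packed booleans (LSB-first within each byte).
  public static boolean[] unpack(byte[] packed, int total) {
    boolean[] out = new boolean[total];
    for (int i = 0; i < total; i++) {
      // One array access plus a shift per value; a real reader would hoist
      // the byte load out of the inner loop and handle buffer boundaries.
      out[i] = ((packed[i >> 3] >> (i & 7)) & 1) != 0;
    }
    return out;
  }

  public static void main(String[] args) {
    boolean[] v = unpack(new byte[]{0b00000101}, 3);
    System.out.println(v[0] + " " + v[1] + " " + v[2]); // true false true
  }
}
```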
[jira] [Created] (SPARK-37378) Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog
Chao Sun created SPARK-37378: Summary: Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog Key: SPARK-37378 URL: https://issues.apache.org/jira/browse/SPARK-37378 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun We need to add logic to convert a V2 {{Transform}} expression into its catalyst expression counterpart, and also load its function definition from the V2 FunctionCatalog provided.
[jira] [Created] (SPARK-37377) Refactor V2 Partitioning interface and remove deprecated usage of Distribution
Chao Sun created SPARK-37377: Summary: Refactor V2 Partitioning interface and remove deprecated usage of Distribution Key: SPARK-37377 URL: https://issues.apache.org/jira/browse/SPARK-37377 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun Currently {{Partitioning}} is defined as follows: {code:java} @Evolving public interface Partitioning { int numPartitions(); boolean satisfy(Distribution distribution); } {code} There are two issues with the interface: 1) it uses the deprecated {{Distribution}} interface, and should switch to {{org.apache.spark.sql.connector.distributions.Distribution}}. 2) currently there is no way to use this in a join where we want to compare the reported partitionings from both sides and decide whether they are "compatible" (and thus allow Spark to eliminate the shuffle).
[jira] [Created] (SPARK-37376) Introduce a new DataSource V2 interface HasPartitionKey
Chao Sun created SPARK-37376: Summary: Introduce a new DataSource V2 interface HasPartitionKey Key: SPARK-37376 URL: https://issues.apache.org/jira/browse/SPARK-37376 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun One of the prerequisites for the feature is to allow V2 input partitions to report their partition values to Spark, which can use them to check whether both sides of a join are co-partitioned, and also optionally group input partitions together.
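The idea behind SPARK-37376 can be illustrated with a simplified sketch. This is a toy, not Spark's actual interface: the real `HasPartitionKey` reports an `InternalRow`, while here a plain `String` key and the `FilePartition`/`PartitionKeyGrouping` names are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartitionKeyGrouping {
  // Simplified stand-in for an input partition that reports its partition key.
  interface HasPartitionKey {
    String partitionKey();
  }

  static class FilePartition implements HasPartitionKey {
    final String file;
    final String key;
    FilePartition(String file, String key) { this.file = file; this.key = key; }
    public String partitionKey() { return key; }
  }

  // Group partitions sharing a key; a planner could compare the resulting key
  // sets from both sides of a join to detect co-partitioning.
  static Map<String, List<HasPartitionKey>> groupByKey(List<? extends HasPartitionKey> parts) {
    Map<String, List<HasPartitionKey>> groups = new LinkedHashMap<>();
    for (HasPartitionKey p : parts) {
      groups.computeIfAbsent(p.partitionKey(), k -> new ArrayList<>()).add(p);
    }
    return groups;
  }

  public static void main(String[] args) {
    List<FilePartition> parts = List.of(
        new FilePartition("a.parquet", "ds=1"),
        new FilePartition("b.parquet", "ds=1"),
        new FilePartition("c.parquet", "ds=2"));
    System.out.println(groupByKey(parts).keySet()); // [ds=1, ds=2]
  }
}
```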
[jira] [Updated] (SPARK-37166) SPIP: Storage Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37166: - Parent: SPARK-37375 Issue Type: Sub-task (was: New Feature) > SPIP: Storage Partitioned Join > -- > > Key: SPARK-37166 > URL: https://issues.apache.org/jira/browse/SPARK-37166 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0 > > > This JIRA tracks the SPIP for storage partitioned join.
[jira] [Created] (SPARK-37375) Umbrella: Storage Partitioned Join
Chao Sun created SPARK-37375: Summary: Umbrella: Storage Partitioned Join Key: SPARK-37375 URL: https://issues.apache.org/jira/browse/SPARK-37375 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun This umbrella JIRA tracks the progress of implementing the Storage Partitioned Join feature for Spark.
[jira] [Resolved] (SPARK-37166) SPIP: Storage Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37166. -- Fix Version/s: 3.3.0 Assignee: Chao Sun Resolution: Fixed > SPIP: Storage Partitioned Join > -- > > Key: SPARK-37166 > URL: https://issues.apache.org/jira/browse/SPARK-37166 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0 > > > This JIRA tracks the SPIP for storage partitioned join.
[jira] [Updated] (SPARK-37342) Upgrade Apache Arrow to 6.0.0
[ https://issues.apache.org/jira/browse/SPARK-37342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37342: - Component/s: Build (was: Spark Core) > Upgrade Apache Arrow to 6.0.0 > - > > Key: SPARK-37342 > URL: https://issues.apache.org/jira/browse/SPARK-37342 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last > month.
[jira] [Created] (SPARK-37342) Upgrade Apache Arrow to 6.0.0
Chao Sun created SPARK-37342: Summary: Upgrade Apache Arrow to 6.0.0 Key: SPARK-37342 URL: https://issues.apache.org/jira/browse/SPARK-37342 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0 Reporter: Chao Sun Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last month.
[jira] [Resolved] (SPARK-37239) Avoid unnecessary `setReplication` in Yarn mode
[ https://issues.apache.org/jira/browse/SPARK-37239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37239. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34520 [https://github.com/apache/spark/pull/34520] > Avoid unnecessary `setReplication` in Yarn mode > --- > > Key: SPARK-37239 > URL: https://issues.apache.org/jira/browse/SPARK-37239 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.1.2 >Reporter: wang-zhun >Assignee: Yang Jie >Priority: Major > Fix For: 3.3.0 > > > We found a large number of replication logs in hdfs server > ``` > 2021-11-04,17:22:13,065 INFO > org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains > unchanged at 3 for > xxx/.sparkStaging/application_1635470728320_1144379/__spark_libs__303253482044663796.zip > 2021-11-04,17:22:13,069 INFO > org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains > unchanged at 3 for > xxx/.sparkStaging/application_1635470728320_1144383/__spark_libs__4747402134564993861.zip > 2021-11-04,17:22:13,070 INFO > org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains > unchanged at 3 for > xxx/.sparkStaging/application_1635470728320_1144373/__spark_libs__4377509773730188331.zip > ``` > https://github.com/apache/hadoop/blob/6f7b965808f71f44e2617c50d366a6375fdfbbfa/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java#L2439 > > `setReplication` needs to acquire write lock, we should reduce this > unnecessary operation -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37239) Avoid unnecessary `setReplication` in Yarn mode
[ https://issues.apache.org/jira/browse/SPARK-37239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37239: Assignee: Yang Jie > Avoid unnecessary `setReplication` in Yarn mode > --- > > Key: SPARK-37239 > URL: https://issues.apache.org/jira/browse/SPARK-37239 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.1.2 >Reporter: wang-zhun >Assignee: Yang Jie >Priority: Major > > We found a large number of replication logs in hdfs server > ``` > 2021-11-04,17:22:13,065 INFO > org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains > unchanged at 3 for > xxx/.sparkStaging/application_1635470728320_1144379/__spark_libs__303253482044663796.zip > 2021-11-04,17:22:13,069 INFO > org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains > unchanged at 3 for > xxx/.sparkStaging/application_1635470728320_1144383/__spark_libs__4747402134564993861.zip > 2021-11-04,17:22:13,070 INFO > org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains > unchanged at 3 for > xxx/.sparkStaging/application_1635470728320_1144373/__spark_libs__4377509773730188331.zip > ``` > https://github.com/apache/hadoop/blob/6f7b965808f71f44e2617c50d366a6375fdfbbfa/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java#L2439 > > `setReplication` needs to acquire write lock, we should reduce this > unnecessary operation -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
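The optimization described in SPARK-37239 amounts to checking a file's current replication before issuing `setReplication`, since the RPC takes a namenode write lock even when the factor is unchanged. Below is a hedged sketch of that idea using a toy in-memory "namenode"; the class and helper names are hypothetical, and this is not the actual patch (which would call Hadoop's `FileSystem.getFileStatus`/`setReplication`).

```java
import java.util.HashMap;
import java.util.Map;

public class SkipRedundantSetReplication {
  // Toy stand-in for namenode state: path -> replication factor.
  static Map<String, Short> replication = new HashMap<>();
  static int rpcCount = 0; // counts write-lock-taking setReplication calls

  static void setReplication(String path, short factor) {
    rpcCount++; // in HDFS this acquires the namenode write lock
    replication.put(path, factor);
  }

  // Only pay for the RPC when the replication factor actually changes.
  static void ensureReplication(String path, short desired) {
    Short current = replication.get(path);
    if (current == null || current != desired) {
      setReplication(path, desired);
    }
  }

  public static void main(String[] args) {
    replication.put("/user/.sparkStaging/app_1/__spark_libs__.zip", (short) 3);
    ensureReplication("/user/.sparkStaging/app_1/__spark_libs__.zip", (short) 3); // no-op
    ensureReplication("/user/.sparkStaging/app_1/__spark_libs__.zip", (short) 5); // real change
    System.out.println(rpcCount); // 1
  }
}
```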
[jira] [Updated] (SPARK-35437) Use expressions to filter Hive partitions at client side
[ https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35437: - Priority: Major (was: Minor) > Use expressions to filter Hive partitions at client side > > > Key: SPARK-35437 > URL: https://issues.apache.org/jira/browse/SPARK-35437 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Major > Fix For: 3.3.0 > > > When we have a table with a lot of partitions and there is no way to filter > it on the MetaStore Server, we will get all the partition details and filter > it on the client side. This is slow and puts a lot of pressure on the > MetaStore Server. > We can first pull all the partition names, filter by expressions, and then > obtain detailed information about the corresponding partitions from the > MetaStore Server.
[jira] [Resolved] (SPARK-35437) Use expressions to filter Hive partitions at client side
[ https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-35437. -- Resolution: Fixed Issue resolved by pull request 34431 [https://github.com/apache/spark/pull/34431] > Use expressions to filter Hive partitions at client side > > > Key: SPARK-35437 > URL: https://issues.apache.org/jira/browse/SPARK-35437 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > Fix For: 3.3.0 > > > When we have a table with a lot of partitions and there is no way to filter > it on the MetaStore Server, we will get all the partition details and filter > it on the client side. This is slow and puts a lot of pressure on the > MetaStore Server. > We can first pull all the partition names, filter by expressions, and then > obtain detailed information about the corresponding partitions from the > MetaStore Server.
[jira] [Assigned] (SPARK-35437) Use expressions to filter Hive partitions at client side
[ https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-35437: Assignee: dzcxzl > Use expressions to filter Hive partitions at client side > > > Key: SPARK-35437 > URL: https://issues.apache.org/jira/browse/SPARK-35437 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > Fix For: 3.3.0 > > > When we have a table with a lot of partitions and there is no way to filter > it on the MetaStore Server, we will get all the partition details and filter > it on the client side. This is slow and puts a lot of pressure on the > MetaStore Server. > We can first pull all the partition names, filter by expressions, and then > obtain detailed information about the corresponding partitions from the > MetaStore Server.
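The two-step approach described in SPARK-35437 (pull cheap partition names first, filter on the client, then fetch full metadata only for the survivors) can be sketched as follows. The helper names here are hypothetical, not the actual Spark/Hive client code, and this toy ignores the URL-encoding Hive applies to partition values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class ClientSidePartitionFilter {
  // Parse a partition name like "ds=2021-11-01/hr=10" into column -> value.
  static Map<String, String> parse(String partitionName) {
    Map<String, String> values = new HashMap<>();
    for (String kv : partitionName.split("/")) {
      String[] parts = kv.split("=", 2);
      values.put(parts[0], parts[1]);
    }
    return values;
  }

  // Evaluate the filter against names only; the kept names are all that would
  // be sent back to the metastore (e.g. via getPartitionsByNames) for details.
  static List<String> filterNames(List<String> names, Predicate<Map<String, String>> pred) {
    List<String> kept = new ArrayList<>();
    for (String n : names) {
      if (pred.test(parse(n))) kept.add(n);
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> names = List.of("ds=2021-11-01/hr=10", "ds=2021-11-02/hr=10");
    List<String> kept = filterNames(names, p -> p.get("ds").equals("2021-11-02"));
    System.out.println(kept); // [ds=2021-11-02/hr=10]
  }
}
```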
[jira] [Commented] (SPARK-36998) Handle concurrent eviction of same application in SHS
[ https://issues.apache.org/jira/browse/SPARK-36998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440066#comment-17440066 ] Chao Sun commented on SPARK-36998: -- Fixed > Handle concurrent eviction of same application in SHS > - > > Key: SPARK-36998 > URL: https://issues.apache.org/jira/browse/SPARK-36998 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Thejdeep Gudivada >Assignee: Thejdeep Gudivada >Priority: Minor > Fix For: 3.2.1, 3.3.0 > > > SHS throws this exception when trying to make room for parsing of a log file. > Reason for this is - there is a race condition to make space for processing > of two log files and the deleteDirectory method is overlapping. > {code:java} > 21/10/13 09:13:54 INFO HistoryServerDiskManager: Lease of 49.0 KiB may cause > usage to exceed max (101.7 GiB > 100.0 GiB) 21/10/13 09:13:54 WARN > HttpChannel: handleException > /api/v1/applications/application_1632281309592_2767775/1/jobs > java.io.IOException : Unable to delete directory > /grid/spark/sparkhistory-leveldb/apps/application_1631288241341_3657151_1.ldb. > 21/10/13 09:13:54 WARN HttpChannelState: unhandled due to prior sendError > javax.servlet.ServletException: > org.glassfish.jersey.server.ContainerException: java.io.IOException: Unable > to delete directory /grid > /spark/sparkhistory-leveldb/apps/application_1631288241341_3657151_1.ldb. 
at > org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:410) > at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:366) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:319) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205) > at > org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:791) > at > org.sparkproject.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1626) > at > org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) > at > org.sparkproject.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193) > at > org.sparkproject.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601) > at > org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233) > at > org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188) > at > org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186) > at > org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350) > at > org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:763) > at > org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:234) > at > org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) > at org.sparkproject.jetty.server.Server.handle(Server.java:516) at > 
org.sparkproject.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388) > at org.sparkproject.jetty.server.HttpChannel.dispatch(HttpChannel.java:633) > at org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:380) at > org.sparkproject.jetty.server.HttpConnection.onFillable(HttpConnection.java:279) > at > org.sparkproject.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) > at org.sparkproject.jetty.io.FillInterest.fillable(FillInterest.java:105) at > org.sparkproject.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) at > org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336) > at > org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
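The race described in SPARK-36998 (two requests concurrently evicting the same application directory, so one `deleteDirectory` call fails under the other) can be avoided by serializing eviction per application key. The sketch below is a minimal illustration of that guard, assuming a per-key lock registry; it is not the actual SHS patch.

```python
import threading
from collections import defaultdict

class EvictionManager:
    """Serializes evictions of the same app so deletes never overlap."""

    def __init__(self):
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()
        self.deleted = []

    def _lock_for(self, app_id):
        # defaultdict mutation is not thread-safe, so guard the lookup.
        with self._guard:
            return self._locks[app_id]

    def evict(self, app_id, exists):
        with self._lock_for(app_id):
            # Re-check under the lock: a racing request may have
            # already deleted this application's directory.
            if exists(app_id):
                self.deleted.append(app_id)
```

With the per-key lock, the second racing request observes that the directory is already gone and skips the delete instead of throwing `IOException: Unable to delete directory`.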
[jira] [Assigned] (SPARK-36998) Handle concurrent eviction of same application in SHS
[ https://issues.apache.org/jira/browse/SPARK-36998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-36998: Assignee: Thejdeep Gudivada (was: Thejdeep) > Handle concurrent eviction of same application in SHS > - > > Key: SPARK-36998 > URL: https://issues.apache.org/jira/browse/SPARK-36998 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Thejdeep Gudivada >Assignee: Thejdeep Gudivada >Priority: Minor > Fix For: 3.2.1, 3.3.0 > > > SHS throws this exception when trying to make room for parsing of a log file. > Reason for this is - there is a race condition to make space for processing > of two log files and the deleteDirectory method is overlapping. > {code:java} > 21/10/13 09:13:54 INFO HistoryServerDiskManager: Lease of 49.0 KiB may cause > usage to exceed max (101.7 GiB > 100.0 GiB) 21/10/13 09:13:54 WARN > HttpChannel: handleException > /api/v1/applications/application_1632281309592_2767775/1/jobs > java.io.IOException : Unable to delete directory > /grid/spark/sparkhistory-leveldb/apps/application_1631288241341_3657151_1.ldb. > 21/10/13 09:13:54 WARN HttpChannelState: unhandled due to prior sendError > javax.servlet.ServletException: > org.glassfish.jersey.server.ContainerException: java.io.IOException: Unable > to delete directory /grid > /spark/sparkhistory-leveldb/apps/application_1631288241341_3657151_1.ldb. 
at > org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:410) > at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:366) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:319) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205) > at > org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:791) > at > org.sparkproject.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1626) > at > org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) > at > org.sparkproject.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193) > at > org.sparkproject.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601) > at > org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233) > at > org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188) > at > org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186) > at > org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350) > at > org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:763) > at > org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:234) > at > org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) > at org.sparkproject.jetty.server.Server.handle(Server.java:516) at > 
org.sparkproject.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388) > at org.sparkproject.jetty.server.HttpChannel.dispatch(HttpChannel.java:633) > at org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:380) at > org.sparkproject.jetty.server.HttpConnection.onFillable(HttpConnection.java:279) > at > org.sparkproject.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) > at org.sparkproject.jetty.io.FillInterest.fillable(FillInterest.java:105) at > org.sparkproject.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) at > org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336) > at > org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37220) Do not split input file for Parquet reader with aggregate push down
[ https://issues.apache.org/jira/browse/SPARK-37220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440042#comment-17440042 ] Chao Sun commented on SPARK-37220: -- Thanks [~hyukjin.kwon]! > Do not split input file for Parquet reader with aggregate push down > --- > > Key: SPARK-37220 > URL: https://issues.apache.org/jira/browse/SPARK-37220 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Minor > Fix For: 3.3.0 > > > As a follow-up of > [https://github.com/apache/spark/pull/34298/files#r734795801], similar to ORC > aggregate push down, we can disallow splitting input files for the Parquet reader as > well. See the original comment for motivation. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37220) Do not split input file for Parquet reader with aggregate push down
[ https://issues.apache.org/jira/browse/SPARK-37220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37220. -- Fix Version/s: 3.3.0 Resolution: Fixed > Do not split input file for Parquet reader with aggregate push down > --- > > Key: SPARK-37220 > URL: https://issues.apache.org/jira/browse/SPARK-37220 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Priority: Minor > Fix For: 3.3.0 > > > As a follow-up of > [https://github.com/apache/spark/pull/34298/files#r734795801], similar to ORC > aggregate push down, we can disallow splitting input files for the Parquet reader as > well. See the original comment for motivation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
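The reason SPARK-37220 disallows splitting is that a pushed-down aggregate is answered from per-file footer statistics, so a file must be handled by exactly one task. A hedged sketch of that scan-planning rule (`ParquetScan` and `is_splitable` here only mirror the Spark-side concepts; this is not the real API):

```python
class ParquetScan:
    """Minimal model of a file scan that may carry a pushed-down aggregate."""

    def __init__(self, pushed_aggregate=None):
        # e.g. "COUNT(*)" or "MIN(col)" answered from footer statistics.
        self.pushed_aggregate = pushed_aggregate

    def is_splitable(self, path):
        # Normally a large Parquet file is split into several input
        # partitions. With a pushed-down aggregate, the whole file must
        # stay in one partition so its footer is read exactly once and
        # rows are neither dropped nor double-counted.
        return self.pushed_aggregate is None
```

This mirrors what the ORC aggregate push down referenced above already does.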
[jira] [Commented] (SPARK-37218) Parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark
[ https://issues.apache.org/jira/browse/SPARK-37218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17439554#comment-17439554 ] Chao Sun commented on SPARK-37218: -- [~dongjoon] please assign this to yourself - somehow I can't do it. > Parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark > -- > > Key: SPARK-37218 > URL: https://issues.apache.org/jira/browse/SPARK-37218 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Minor > Fix For: 3.2.1, 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37218) Parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark
[ https://issues.apache.org/jira/browse/SPARK-37218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37218. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34496 [https://github.com/apache/spark/pull/34496] > Parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark > -- > > Key: SPARK-37218 > URL: https://issues.apache.org/jira/browse/SPARK-37218 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Minor > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN
[ https://issues.apache.org/jira/browse/SPARK-37205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37205: - Description: {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]), with which the RM is not required to statically have config for all the secure HDFS clusters. Currently it only works for MRv2 but it'd be nice if Spark could also use this feature. I think we only need to pass the config to {{LaunchContainerContext}} in {{Client.createContainerLaunchContext}}. (was: {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]), with which the RM is not required to statically have config for all the secure HDFS clusters. Currently it only works for MRv2 but it'd be nice if Spark could also use this feature. I think we only need to pass the config to {{LaunchContainerContext}} before invoking {{NMClient.startContainer}}.) > Support mapreduce.job.send-token-conf when starting containers in YARN > -- > > Key: SPARK-37205 > URL: https://issues.apache.org/jira/browse/SPARK-37205 > Project: Spark > Issue Type: New Feature > Components: YARN >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see > [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]), with which the RM is > not required to statically have config for all the secure HDFS clusters. > Currently it only works for MRv2 but it'd be nice if Spark could also use this > feature. I think we only need to pass the config to > {{LaunchContainerContext}} in {{Client.createContainerLaunchContext}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN
Chao Sun created SPARK-37205: Summary: Support mapreduce.job.send-token-conf when starting containers in YARN Key: SPARK-37205 URL: https://issues.apache.org/jira/browse/SPARK-37205 Project: Spark Issue Type: New Feature Components: YARN Affects Versions: 3.3.0 Reporter: Chao Sun {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]), with which the RM is not required to statically have config for all the secure HDFS clusters. Currently it only works for MRv2 but it'd be nice if Spark could also use this feature. I think we only need to pass the config to {{LaunchContainerContext}} before invoking {{NMClient.startContainer}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
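The plumbing SPARK-37205 suggests (copy the token-related config entries into the container launch context so the RM can renew tokens against clusters it has no static config for) can be sketched like this. All names here are illustrative, not Spark's actual `Client` code; in Hadoop, `mapreduce.job.send-token-conf` holds a pattern selecting which config keys to ship.

```python
import re

def create_container_launch_context(conf):
    """conf: plain dict of Hadoop/Spark config key -> value.

    Hypothetical sketch: a real launch context carries commands,
    environment, tokens, etc.; only the tokens_conf plumbing is shown.
    """
    ctx = {"commands": ["./launch_executor.sh"], "tokens_conf": {}}
    pattern = conf.get("mapreduce.job.send-token-conf")
    if pattern is not None:
        # Ship only the config entries matching the pattern, e.g. the
        # dfs.* settings of a remote secure HDFS cluster.
        ctx["tokens_conf"] = {k: v for k, v in conf.items()
                              if k != "mapreduce.job.send-token-conf"
                              and re.search(pattern, k)}
    return ctx
```

When the config is absent, the context carries nothing extra and behavior is unchanged, which keeps the feature opt-in.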
[jira] [Commented] (SPARK-37166) SPIP: Storage Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436963#comment-17436963 ] Chao Sun commented on SPARK-37166: -- [~xkrogen] sure just linked. > SPIP: Storage Partitioned Join > -- > > Key: SPARK-37166 > URL: https://issues.apache.org/jira/browse/SPARK-37166 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > This JIRA tracks the SPIP for storage partitioned join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37166) SPIP: Storage Partitioned Join
Chao Sun created SPARK-37166: Summary: SPIP: Storage Partitioned Join Key: SPARK-37166 URL: https://issues.apache.org/jira/browse/SPARK-37166 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun This JIRA tracks the SPIP for storage partitioned join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37113) Upgrade Parquet to 1.12.2
Chao Sun created SPARK-37113: Summary: Upgrade Parquet to 1.12.2 Key: SPARK-37113 URL: https://issues.apache.org/jira/browse/SPARK-37113 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun Upgrade Parquet version to 1.12.2 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35703) Relax constraint for Spark bucket join and remove HashClusteredDistribution
[ https://issues.apache.org/jira/browse/SPARK-35703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35703: - Summary: Relax constraint for Spark bucket join and remove HashClusteredDistribution (was: Remove HashClusteredDistribution) > Relax constraint for Spark bucket join and remove HashClusteredDistribution > --- > > Key: SPARK-35703 > URL: https://issues.apache.org/jira/browse/SPARK-35703 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > Currently Spark has {{HashClusteredDistribution}} and > {{ClusteredDistribution}}. The only difference between the two is that the > former is more strict when deciding whether bucket join is allowed to avoid > shuffle: comparing to the latter, it requires *exact* match between the > clustering keys from the output partitioning (i.e., {{HashPartitioning}}) and > the join keys. However, this is unnecessary, as we should be able to avoid > shuffle when the set of clustering keys is a subset of join keys, just like > {{ClusteredDistribution}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
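The relaxation SPARK-35703 describes boils down to a subset check: if the data is hash-partitioned on a subset of the join keys, two rows that agree on all join keys necessarily agree on that subset, so they already sit in the same partition and no shuffle is needed. A minimal sketch of the two checks (illustrative only; the real logic lives in Catalyst's partitioning/distribution classes):

```python
def satisfies(partitioning_keys, join_keys, exact_match):
    """Can a HashPartitioning on partitioning_keys serve a join on join_keys?

    exact_match=True models the strict HashClusteredDistribution check;
    exact_match=False models the relaxed, ClusteredDistribution-style check.
    """
    if exact_match:
        # Old behavior: clustering keys must match the join keys exactly.
        return list(partitioning_keys) == list(join_keys)
    # Relaxed behavior: any non-empty subset of the join keys suffices,
    # since equal join keys imply equal subset keys, hence same partition.
    return len(partitioning_keys) > 0 and set(partitioning_keys).issubset(join_keys)
```

For example, a table bucketed on `a` joined on `(a, b)` passes the relaxed check but fails the strict one, which is exactly the unnecessary shuffle the issue targets.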
[jira] [Commented] (SPARK-37069) HiveClientImpl throws NoSuchMethodError: org.apache.hadoop.hive.ql.metadata.Hive.getWithoutRegisterFns
[ https://issues.apache.org/jira/browse/SPARK-37069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432624#comment-17432624 ] Chao Sun commented on SPARK-37069: -- Thanks for the ping [~zhouyifan279]! yes this is a bug, and let me see how to fix it. > HiveClientImpl throws NoSuchMethodError: > org.apache.hadoop.hive.ql.metadata.Hive.getWithoutRegisterFns > -- > > Key: SPARK-37069 > URL: https://issues.apache.org/jira/browse/SPARK-37069 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Zhou Yifan >Priority: Major > > If we run Spark SQL with external Hive 2.3.x (before 2.3.9) jars, following > error will be thrown: > {code:java} > Exception in thread "main" java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.metadata.Hive.getWithoutRegisterFns(Lorg/apache/hadoop/hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/metadata/Hive;Exception > in thread "main" java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.metadata.Hive.getWithoutRegisterFns(Lorg/apache/hadoop/hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/metadata/Hive; > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getHive$1(HiveClientImpl.scala:205) > at scala.Option.map(Option.scala:230) at > org.apache.spark.sql.hive.client.HiveClientImpl.getHive(HiveClientImpl.scala:204) > at > org.apache.spark.sql.hive.client.HiveClientImpl.client(HiveClientImpl.scala:267) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:292) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283) > at > org.apache.spark.sql.hive.client.HiveClientImpl.databaseExists(HiveClientImpl.scala:394) > at > 
org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:224) > at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102) > at > org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:224) > at > org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150) > at > org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:140) > at > org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:170) > at > org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:168) > at > org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$2(HiveSessionStateBuilder.scala:61) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:119) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:119) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:1004) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:990) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:982) > at > org.apache.spark.sql.execution.command.ShowTablesCommand.$anonfun$run$42(tables.scala:828) > at scala.Option.getOrElse(Option.scala:189) at > org.apache.spark.sql.execution.command.ShowTablesCommand.run(tables.scala:828) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) > at > 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(Q
[jira] [Commented] (SPARK-35640) Refactor Parquet vectorized reader to remove duplicated code paths
[ https://issues.apache.org/jira/browse/SPARK-35640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428522#comment-17428522 ] Chao Sun commented on SPARK-35640: -- [~catalinii] this change seems unrelated since it's only in Spark 3.2.0, but you mentioned the issue also happens in Spark 3.1.2. The issue seems to be also well-known, see SPARK-16544. > Refactor Parquet vectorized reader to remove duplicated code paths > -- > > Key: SPARK-35640 > URL: https://issues.apache.org/jira/browse/SPARK-35640 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.2.0 > > > Currently in Parquet vectorized code path, there are many code duplications > such as the following: > {code:java} > public void readIntegers( > int total, > WritableColumnVector c, > int rowId, > int level, > VectorizedValuesReader data) throws IOException { > int left = total; > while (left > 0) { > if (this.currentCount == 0) this.readNextGroup(); > int n = Math.min(left, this.currentCount); > switch (mode) { > case RLE: > if (currentValue == level) { > data.readIntegers(n, c, rowId); > } else { > c.putNulls(rowId, n); > } > break; > case PACKED: > for (int i = 0; i < n; ++i) { > if (currentBuffer[currentBufferIdx++] == level) { > c.putInt(rowId + i, data.readInteger()); > } else { > c.putNull(rowId + i); > } > } > break; > } > rowId += n; > left -= n; > currentCount -= n; > } > } > {code} > This makes it hard to maintain as any change on this will need to be > replicated in 20+ places. The issue becomes more serious when we are going to > implement column index and complex type support for the vectorized path. > The original intention is for performance. However now days JIT compilers > tend to be smart on this and will inline virtual calls as much as possible. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
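The duplication in the quoted `readIntegers()` comes from restating the same RLE/PACKED control flow once per physical type. The refactoring idea in SPARK-35640 is to keep the control flow in one place and inject the type-specific operations. A Python rendering of that shape (the actual Spark fix is Java, using a small updater interface; `read_many`/`read_one` stand in for the per-type bulk and single reads):

```python
def read_batch(runs, level, read_many, read_one):
    """Decode definition-level runs into values, with None marking nulls.

    runs: list of ('RLE', count, value) or ('PACKED', [levels]) runs.
    level: the max definition level, i.e. the level meaning "non-null".
    read_many(n): read n non-null values; read_one(): read one value.
    """
    out = []
    for run in runs:
        if run[0] == "RLE":
            # One definition level repeated n times: either all values
            # are present (bulk read) or all are null.
            _, n, value = run
            out.extend(read_many(n) if value == level else [None] * n)
        else:
            # PACKED: one definition level per value, decided one by one.
            for lvl in run[1]:
                out.append(read_one() if lvl == level else None)
    return out
```

The per-type readers become one-line closures over the underlying data stream, so adding column index or complex type support touches this loop once instead of the 20+ copies mentioned above.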
[jira] [Commented] (SPARK-36936) spark-hadoop-cloud broken on release and only published via 3rd party repositories
[ https://issues.apache.org/jira/browse/SPARK-36936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426255#comment-17426255 ] Chao Sun commented on SPARK-36936: -- [~colin.williams] Spark 3.2.0 is not released yet - it will be there soon. > spark-hadoop-cloud broken on release and only published via 3rd party > repositories > -- > > Key: SPARK-36936 > URL: https://issues.apache.org/jira/browse/SPARK-36936 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.1.1, 3.1.2 > Environment: name:=spark-demo > version := "0.0.1" > scalaVersion := "2.12.12" > lazy val app = (project in file("app")).settings( > assemblyPackageScala / assembleArtifact := false, > assembly / assemblyJarName := "uber.jar", > assembly / mainClass := Some("com.example.Main"), > // more settings here ... > ) > resolvers += "Cloudera" at > "https://repository.cloudera.com/artifactory/cloudera-repos/"; > libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2" % > "provided" > libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % > "3.1.1.3.1.7270.0-253" > libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % > "3.1.1.7.2.7.0-184" > libraryDependencies += "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.901" > libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test" > // test suite settings > fork in Test := true > javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", > "-XX:+CMSClassUnloadingEnabled") > // Show runtime of tests > testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-oD") > ___ > > import org.apache.spark.sql.SparkSession > object SparkApp { > def main(args: Array[String]){ > val spark = SparkSession.builder().master("local") > //.config("spark.jars.repositories", > "https://repository.cloudera.com/artifactory/cloudera-repos/";) > //.config("spark.jars.packages", > "org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253") > 
.appName("spark session").getOrCreate > val jsonDF = spark.read.json("s3a://path-to-bucket/compact.json") > val csvDF = spark.read.format("csv").load("s3a://path-to-bucket/some.csv") > jsonDF.show() > csvDF.show() > } > } >Reporter: Colin Williams >Priority: Major > > The spark docmentation suggests using `spark-hadoop-cloud` to read / write > from S3 in [https://spark.apache.org/docs/latest/cloud-integration.html] . > However artifacts are currently published via only 3rd party resolvers in > [https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud] > including Cloudera and Palantir. > > Then apache spark documentation is providing a 3rd party solution for object > stores including S3. Furthermore, if you follow the instructions and include > one of the 3rd party jars IE the Cloudera jar with the spark 3.1.2 release > and try to access object store, the following exception is returned. > > ``` > Exception in thread "main" java.lang.NoSuchMethodError: 'void > com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, > java.lang.Object, java.lang.Object)' > at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:894) > at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:870) > at > org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1605) > at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:363) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124) > at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361) > at > org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46) > at > 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:428) > ``` > It looks like there are classpath conflicts using the cloudera published > `spark-hadoop-cloud` with spark 3.1.2, again contradicting the documentation. > Then the
[jira] [Commented] (SPARK-36936) spark-hadoop-cloud broken on release and only published via 3rd party repositories
[ https://issues.apache.org/jira/browse/SPARK-36936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425162#comment-17425162 ] Chao Sun commented on SPARK-36936: -- [~colin.williams] which version of {{spark-hadoop-cloud}} you were using? I think the above error shouldn't happen if the version is the same as the Spark's version. We've already started to publish {{spark-hadoop-cloud}} as part of the Spark release procedure, see SPARK-35844. > spark-hadoop-cloud broken on release and only published via 3rd party > repositories > -- > > Key: SPARK-36936 > URL: https://issues.apache.org/jira/browse/SPARK-36936 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.1.1, 3.1.2 > Environment: name:=spark-demo > version := "0.0.1" > scalaVersion := "2.12.12" > lazy val app = (project in file("app")).settings( > assemblyPackageScala / assembleArtifact := false, > assembly / assemblyJarName := "uber.jar", > assembly / mainClass := Some("com.example.Main"), > // more settings here ... 
> ) > resolvers += "Cloudera" at > "https://repository.cloudera.com/artifactory/cloudera-repos/"; > libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2" % > "provided" > libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % > "3.1.1.3.1.7270.0-253" > libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % > "3.1.1.7.2.7.0-184" > libraryDependencies += "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.901" > libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test" > // test suite settings > fork in Test := true > javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", > "-XX:+CMSClassUnloadingEnabled") > // Show runtime of tests > testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-oD") > ___ > > import org.apache.spark.sql.SparkSession > object SparkApp { > def main(args: Array[String]){ > val spark = SparkSession.builder().master("local") > //.config("spark.jars.repositories", > "https://repository.cloudera.com/artifactory/cloudera-repos/";) > //.config("spark.jars.packages", > "org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253") > .appName("spark session").getOrCreate > val jsonDF = spark.read.json("s3a://path-to-bucket/compact.json") > val csvDF = spark.read.format("csv").load("s3a://path-to-bucket/some.csv") > jsonDF.show() > csvDF.show() > } > } >Reporter: Colin Williams >Priority: Major > > The spark docmentation suggests using `spark-hadoop-cloud` to read / write > from S3 in [https://spark.apache.org/docs/latest/cloud-integration.html] . > However artifacts are currently published via only 3rd party resolvers in > [https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud] > including Cloudera and Palantir. > > Then apache spark documentation is providing a 3rd party solution for object > stores including S3. 
Furthermore, if you follow the instructions and include > one of the 3rd party jars IE the Cloudera jar with the spark 3.1.2 release > and try to access object store, the following exception is returned. > > ``` > Exception in thread "main" java.lang.NoSuchMethodError: 'void > com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, > java.lang.Object, java.lang.Object)' > at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:894) > at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:870) > at > org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1605) > at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:363) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124) > at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361) > at > org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519) > at org.apache.spark.sql.DataFrameRead
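Per the comment above, the usual fix is to keep spark-hadoop-cloud at exactly the same version as the Spark artifacts and to resolve it from Maven Central rather than a vendor repository. A minimal build.sbt sketch, where the 3.1.2 / 3.2.0 versions are illustrative assumptions (hadoop-aws must match the Hadoop line your Spark release was built against):

```scala
// build.sbt -- keep spark-hadoop-cloud in lockstep with spark-sql
val sparkVersion = "3.1.2"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"          % sparkVersion % "provided",
  // Same version as Spark itself; published to Maven Central since SPARK-35844
  "org.apache.spark" %% "spark-hadoop-cloud" % sparkVersion,
  // hadoop-aws should match the Hadoop version bundled with this Spark release
  "org.apache.hadoop" % "hadoop-aws"         % "3.2.0"
)
```

Mixing a vendor build (e.g. 3.1.1.3.1.7270.0-253) with an Apache 3.1.2 release is what typically produces the Guava NoSuchMethodError shown above, since the two builds shade and relocate dependencies differently.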
[jira] [Updated] (SPARK-36891) Refactor SpecificParquetRecordReaderBase and add more coverage on vectorized Parquet decoding
[ https://issues.apache.org/jira/browse/SPARK-36891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36891: - Parent: SPARK-35743 Issue Type: Sub-task (was: Test) > Refactor SpecificParquetRecordReaderBase and add more coverage on vectorized > Parquet decoding > - > > Key: SPARK-36891 > URL: https://issues.apache.org/jira/browse/SPARK-36891 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0 > > > Add a new test suite to add more coverage for Parquet vectorized decoding, > focusing on different combinations of Parquet column index, dictionary, batch > size, page size, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36935) Enhance ParquetSchemaConverter to capture Parquet repetition & definition level
Chao Sun created SPARK-36935: Summary: Enhance ParquetSchemaConverter to capture Parquet repetition & definition level Key: SPARK-36935 URL: https://issues.apache.org/jira/browse/SPARK-36935 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun In order to support complex types for the Parquet vectorized reader, we'll need to capture the repetition & definition level information associated with the Catalyst type converted from the Parquet {{MessageType}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36891) Add new test suite to cover Parquet decoding
Chao Sun created SPARK-36891: Summary: Add new test suite to cover Parquet decoding Key: SPARK-36891 URL: https://issues.apache.org/jira/browse/SPARK-36891 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun Add a new test suite to add more coverage for Parquet vectorized decoding, focusing on different combinations of Parquet column index, dictionary, batch size, page size, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36879) Support Parquet v2 data page encodings for the vectorized path
Chao Sun created SPARK-36879: Summary: Support Parquet v2 data page encodings for the vectorized path Key: SPARK-36879 URL: https://issues.apache.org/jira/browse/SPARK-36879 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun Currently Spark only supports Parquet V1 encodings (i.e., PLAIN/DICTIONARY/RLE) in the vectorized path, and throws an exception otherwise: {code} java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY {code} It would be good to support v2 encodings too, including DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY as well as BYTE_STREAM_SPLIT as listed in https://github.com/apache/parquet-format/blob/master/Encodings.md -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36873) Add provided Guava dependency for network-yarn module
[ https://issues.apache.org/jira/browse/SPARK-36873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36873: - Issue Type: Bug (was: Improvement) > Add provided Guava dependency for network-yarn module > - > > Key: SPARK-36873 > URL: https://issues.apache.org/jira/browse/SPARK-36873 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > In Spark 3.1 and earlier the network-yarn module implicitly relies on guava > from hadoop-client dependency, which was changed by SPARK-33212 where we > moved to shaded Hadoop client which no longer expose the transitive guava > dependency. This was fine for a while since we were not using > {{createDependencyReducedPom}} so the module picks up the transitive > dependency from {{spark-network-common}}. However, this got changed by > SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no > longer able to find guava classes: > {code} > mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver > -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 > -Pspark-ganglia-lgpl -Pyarn > ... > [INFO] Compiling 1 Java source to > /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... 
> [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 > [ERROR] [Error] > /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: > package com.google.common.annotations does not exist > [ERROR] [Error] > /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: > package com.google.common.base does not exist > [ERROR] [Error] > /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: > package com.google.common.collect does not exist > [ERROR] [Error] > /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: > cannot find symbol > symbol: class VisibleForTesting > location: class org.apache.spark.network.yarn.YarnShuffleService > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36873) Add provided Guava dependency for network-yarn module
[ https://issues.apache.org/jira/browse/SPARK-36873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36873: - Description: In Spark 3.1 and earlier the network-yarn module implicitly relies on guava from hadoop-client dependency, which was changed by SPARK-33212 where we moved to shaded Hadoop client which no longer expose the transitive guava dependency. This was fine for a while since we were not using {{createDependencyReducedPom}} so the module picks up the transitive dependency from {{spark-network-common}}. However, this got changed by SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no longer able to find guava classes: {code} mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn ... [INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol symbol: class VisibleForTesting location: class org.apache.spark.network.yarn.YarnShuffleService {code} was: In Spark 3.1 and earlier the network-yarn module implicitly relies on guava from hadoop-client dependency, which got changed by 
SPARK-33212 where we moved to shaded Hadoop client which no longer expose the transitive guava dependency. This was fine for a while since we were not using {{createDependencyReducedPom}} so the module picks up the transitive dependency from {{spark-network-common}}. However, this got changed by SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no longer able to find guava classes: {code} mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn ... [INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol symbol: class VisibleForTesting location: class org.apache.spark.network.yarn.YarnShuffleService {code} > Add provided Guava dependency for network-yarn module > - > > Key: SPARK-36873 > URL: https://issues.apache.org/jira/browse/SPARK-36873 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > In Spark 3.1 and earlier the network-yarn module implicitly relies on guava > from hadoop-client dependency, which was changed by SPARK-33212 where we > moved to 
shaded Hadoop client which no longer expose the transitive guava > dependency. This was fine for a while since we were not using > {{createDependencyReducedPom}} so the module picks up the transitive > dependency from {{spark-network-common}}. However, this got changed by > SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no > longer able to find guava classes: > {code} > mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver > -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 > -Pspark-ganglia-lgpl -Pyarn > ... > [INFO] Compiling 1 Java source to > /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... > [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 > [ER
[jira] [Updated] (SPARK-36873) Add provided Guava dependency for network-yarn module
[ https://issues.apache.org/jira/browse/SPARK-36873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36873: - Description: In Spark 3.1 and earlier the network-yarn module implicitly relies on guava from hadoop-client dependency, which got changed by SPARK-33212 where we moved to shaded Hadoop client which no longer expose the transitive guava dependency. This was fine for a while since we were not using {{createDependencyReducedPom}} so the module picks up the transitive dependency from {{spark-network-common}}. However, this got changed by SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no longer able to find guava classes: {code} mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn ... [INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol symbol: class VisibleForTesting location: class org.apache.spark.network.yarn.YarnShuffleService {code} was: In Spark 3.1 and earlier the network-yarn module implicitly relies on guava from hadoop-client dependency, which got changed by 
SPARK-33212 where we have moved to shaded Hadoop client which no longer expose the transitive guava dependency. This was fine for a while since we were not using {{createDependencyReducedPom}} so the module picks up the transitive dependency from {{spark-network-common}}. However, this got changed by SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no longer able to find guava classes: {code} mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn ... [INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol symbol: class VisibleForTesting location: class org.apache.spark.network.yarn.YarnShuffleService {code} > Add provided Guava dependency for network-yarn module > - > > Key: SPARK-36873 > URL: https://issues.apache.org/jira/browse/SPARK-36873 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > In Spark 3.1 and earlier the network-yarn module implicitly relies on guava > from hadoop-client dependency, which got changed by SPARK-33212 where we > moved 
to shaded Hadoop client which no longer expose the transitive guava > dependency. This was fine for a while since we were not using > {{createDependencyReducedPom}} so the module picks up the transitive > dependency from {{spark-network-common}}. However, this got changed by > SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no > longer able to find guava classes: > {code} > mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver > -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 > -Pspark-ganglia-lgpl -Pyarn > ... > [INFO] Compiling 1 Java source to > /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... > [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
[jira] [Created] (SPARK-36873) Add provided Guava dependency for network-yarn module
Chao Sun created SPARK-36873: Summary: Add provided Guava dependency for network-yarn module Key: SPARK-36873 URL: https://issues.apache.org/jira/browse/SPARK-36873 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.2.0 Reporter: Chao Sun In Spark 3.1 and earlier the network-yarn module implicitly relies on guava from hadoop-client dependency, which got changed by SPARK-33212 where we have moved to shaded Hadoop client which no longer expose the transitive guava dependency. This was fine for a while since we were not using {{createDependencyReducedPom}} so the module picks up the transitive dependency from {{spark-network-common}}. However, this got changed by SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no longer able to find guava classes: {code} mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn ... [INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... 
[WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol symbol: class VisibleForTesting location: class org.apache.spark.network.yarn.YarnShuffleService {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
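The fix described in this ticket's summary amounts to declaring Guava explicitly with provided scope in the network-yarn pom instead of relying on a transitive path that SPARK-33212/SPARK-36835 removed. A sketch of the dependency block (the ${guava.version} property name follows the convention of Spark's root pom and is an assumption here):

```xml
<!-- common/network-yarn/pom.xml: make the compile-time Guava dependency explicit -->
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>${guava.version}</version>
  <scope>provided</scope>
</dependency>
```

With provided scope, com.google.common classes are on the compile classpath (fixing the "package does not exist" errors above) but are not added to the module's runtime dependencies, matching how the module previously obtained Guava transitively.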
[jira] [Created] (SPARK-36863) Update dependency manifests for all released artifacts
Chao Sun created SPARK-36863: Summary: Update dependency manifests for all released artifacts Key: SPARK-36863 URL: https://issues.apache.org/jira/browse/SPARK-36863 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.3.0 Reporter: Chao Sun We should update dependency manifests for all released artifacts. Currently we don't do so for modules such as {{hadoop-cloud}}, {{kinesis-asl}}, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36835) Spark 3.2.0 POMs are no longer "dependency reduced"
[ https://issues.apache.org/jira/browse/SPARK-36835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419499#comment-17419499 ] Chao Sun commented on SPARK-36835: -- Sorry for the regression [~joshrosen]. I forgot exactly why I added that but let me see if we can safely revert it. > Spark 3.2.0 POMs are no longer "dependency reduced" > --- > > Key: SPARK-36835 > URL: https://issues.apache.org/jira/browse/SPARK-36835 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 >Reporter: Josh Rosen >Priority: Blocker > > It looks like Spark 3.2.0's POMs are no longer "dependency reduced". As a > result, applications may pull in additional unnecessary dependencies when > depending on Spark. > Spark uses the Maven Shade plugin to create effective POMs and to bundle > shaded versions of certain libraries with Spark (namely, Jetty, Guava, and > JPPML). [By > default|https://maven.apache.org/plugins/maven-shade-plugin/shade-mojo.html#createDependencyReducedPom], > the Maven Shade plugin generates simplified POMs which remove dependencies > on artifacts that have been shaded. > SPARK-33212 / > [b6f46ca29742029efea2790af7fdefbc2fcf52de|https://github.com/apache/spark/commit/b6f46ca29742029efea2790af7fdefbc2fcf52de] > changed the configuration of the Maven Shade plugin, setting > {{createDependencyReducedPom}} to {{false}}. > As a result, the generated POMs now include compile-scope dependencies on the > shaded libraries. 
For example, compare the {{org.eclipse.jetty}} dependencies > in: > * Spark 3.1.2: > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/3.1.2/spark-core_2.12-3.1.2.pom] > * Spark 3.2.0 RC2: > [https://repository.apache.org/content/repositories/orgapachespark-1390/org/apache/spark/spark-core_2.12/3.2.0/spark-core_2.12-3.2.0.pom] > I think we should revert back to generating "dependency reduced" POMs to > ensure that Spark declares a proper set of dependencies and to avoid "unknown > unknown" consequences of changing our generated POM format. > /cc [~csun] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
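The regression described above comes down to a single flag on the Maven Shade plugin. Restoring dependency-reduced POMs is a configuration sketch like the following (plugin coordinates only; version and the rest of Spark's shade configuration omitted):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <!-- true (the plugin default) rewrites the installed POM to drop
         dependencies whose classes were shaded into the jar.
         SPARK-33212 set this to false, which leaked Jetty/Guava as
         compile-scope dependencies in the published 3.2.0 RC POMs. -->
    <createDependencyReducedPom>true</createDependencyReducedPom>
  </configuration>
</plugin>
```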
[jira] [Updated] (SPARK-36828) Remove Guava from Spark binary distribution
[ https://issues.apache.org/jira/browse/SPARK-36828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36828: - Issue Type: Improvement (was: Bug) > Remove Guava from Spark binary distribution > --- > > Key: SPARK-36828 > URL: https://issues.apache.org/jira/browse/SPARK-36828 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > After SPARK-36676, we should consider removing Guava from Spark's binary > distribution. It is currently only required by a few libraries such as > curator-client. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36828) Remove Guava from Spark binary distribution
Chao Sun created SPARK-36828: Summary: Remove Guava from Spark binary distribution Key: SPARK-36828 URL: https://issues.apache.org/jira/browse/SPARK-36828 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.3.0 Reporter: Chao Sun After SPARK-36676, we should consider removing Guava from Spark's binary distribution. It is currently only required by a few libraries such as curator-client. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36820) Disable LZ4 test for Hadoop 2.7 profile
Chao Sun created SPARK-36820: Summary: Disable LZ4 test for Hadoop 2.7 profile Key: SPARK-36820 URL: https://issues.apache.org/jira/browse/SPARK-36820 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun Hadoop 2.7 doesn't support lz4-java yet, so we should disable the test in {{FileSourceCodecSuite}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36820) Disable LZ4 test for Hadoop 2.7 profile
[ https://issues.apache.org/jira/browse/SPARK-36820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36820: - Issue Type: Test (was: Bug) > Disable LZ4 test for Hadoop 2.7 profile > --- > > Key: SPARK-36820 > URL: https://issues.apache.org/jira/browse/SPARK-36820 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Minor > > Hadoop 2.7 doesn't support lz4-java yet, so we should disable the test in > {{FileSourceCodecSuite}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36726) Upgrade Parquet to 1.12.1
[ https://issues.apache.org/jira/browse/SPARK-36726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36726: - Priority: Blocker (was: Major) > Upgrade Parquet to 1.12.1 > - > > Key: SPARK-36726 > URL: https://issues.apache.org/jira/browse/SPARK-36726 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Blocker > > Upgrade Apache Parquet to 1.12.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36726) Upgrade Parquet to 1.12.1
Chao Sun created SPARK-36726: Summary: Upgrade Parquet to 1.12.1 Key: SPARK-36726 URL: https://issues.apache.org/jira/browse/SPARK-36726 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun Upgrade Apache Parquet to 1.12.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35959) Add a new Maven profile "no-shaded-client" for older Hadoop 3.x versions
[ https://issues.apache.org/jira/browse/SPARK-35959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412897#comment-17412897 ] Chao Sun commented on SPARK-35959: -- [~hyukjin.kwon] No, I don't think it qualifies as a blocker anymore. In fact I'm thinking of abandoning the PR since it is not too useful. > Add a new Maven profile "no-shaded-client" for older Hadoop 3.x versions > - > > Key: SPARK-35959 > URL: https://issues.apache.org/jira/browse/SPARK-35959 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > Currently Spark uses the Hadoop shaded client by default. However, if Spark users > want to build Spark with an older version of Hadoop, such as 3.1.x, the shaded > client cannot be used (currently it only supports Hadoop 3.2.2+ and 3.3.1+). > Therefore, this proposes to offer a new Maven profile "no-shaded-client" for > this use case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35959) Add a new Maven profile "no-shaded-client" for older Hadoop 3.x versions
[ https://issues.apache.org/jira/browse/SPARK-35959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35959: - Priority: Major (was: Blocker) > Add a new Maven profile "no-shaded-client" for older Hadoop 3.x versions > - > > Key: SPARK-35959 > URL: https://issues.apache.org/jira/browse/SPARK-35959 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > Currently Spark uses Hadoop shaded client by default. However, if Spark users > want to build Spark with older version of Hadoop, such as 3.1.x, the shaded > client cannot be used (currently it only support Hadoop 3.2.2+ and 3.3.1+). > Therefore, this proposes to offer a new Maven profile "no-shaded-client" for > this use case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36696) spark.read.parquet loads empty dataset
[ https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412167#comment-17412167 ] Chao Sun commented on SPARK-36696: -- [This|https://github.com/apache/arrow/blob/master/cpp/src/parquet/metadata.cc#L1331] looks suspicious: why column chunk file offset = dictionary/data page offset + compressed size of the column chunk? > spark.read.parquet loads empty dataset > -- > > Key: SPARK-36696 > URL: https://issues.apache.org/jira/browse/SPARK-36696 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Blocker > Attachments: example.parquet > > > Here's a parquet file Spark 3.2/master can't read properly. > The file was stored by pandas and must contain 3650 rows, but Spark > 3.2/master returns an empty dataset. > {code:python} > >>> import pandas as pd > >>> len(pd.read_parquet('/path/to/example.parquet')) > 3650 > >>> spark.read.parquet('/path/to/example.parquet').count() > 0 > {code} > I guess it's caused by the parquet 1.12.0. > When I reverted two commits related to the parquet 1.12.0 from branch-3.2: > - > [https://github.com/apache/spark/commit/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa] > - > [https://github.com/apache/spark/commit/cbffc12f90e45d33e651e38cf886d7ab4bcf96da] > it reads the data successfully. > We need to add some workaround, or revert the commits. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36696) spark.read.parquet loads empty dataset
[ https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412164#comment-17412164 ] Chao Sun commented on SPARK-36696: -- This looks like the same issue as in PARQUET-2078. The file offset for the first row group is set to 31173 which causes {{filterFileMetaDataByMidpoint}} to filter out the only row group (range filter is [0, 37968], while startIndex is 31173 and total size is 35820). Seems there is a bug in Apache Arrow which writes incorrect file offset. cc [~gershinsky] to see if you know any info there. > spark.read.parquet loads empty dataset > -- > > Key: SPARK-36696 > URL: https://issues.apache.org/jira/browse/SPARK-36696 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Blocker > Attachments: example.parquet > > > Here's a parquet file Spark 3.2/master can't read properly. > The file was stored by pandas and must contain 3650 rows, but Spark > 3.2/master returns an empty dataset. > {code:python} > >>> import pandas as pd > >>> len(pd.read_parquet('/path/to/example.parquet')) > 3650 > >>> spark.read.parquet('/path/to/example.parquet').count() > 0 > {code} > I guess it's caused by the parquet 1.12.0. > When I reverted two commits related to the parquet 1.12.0 from branch-3.2: > - > [https://github.com/apache/spark/commit/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa] > - > [https://github.com/apache/spark/commit/cbffc12f90e45d33e651e38cf886d7ab4bcf96da] > it reads the data successfully. > We need to add some workaround, or revert the commits. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org