[jira] [Commented] (SPARK-33507) Improve and fix cache behavior in v1 and v2
[ https://issues.apache.org/jira/browse/SPARK-33507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268063#comment-17268063 ] Chao Sun commented on SPARK-33507: -- [~aokolnychyi] could you elaborate on the question? Currently Spark doesn't support caching streaming tables yet. > Improve and fix cache behavior in v1 and v2 > --- > > Key: SPARK-33507 > URL: https://issues.apache.org/jira/browse/SPARK-33507 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Priority: Critical > > This is an umbrella JIRA to track fixes & improvements for caching behavior > in Spark datasource v1 and v2, which includes: > - fix existing cache behavior issues in v1 and v2 > - fix inconsistent cache behavior between v1 and v2 > - implement missing features in v2 to align with those in v1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34052) A cached view should become invalid after a table is dropped
[ https://issues.apache.org/jira/browse/SPARK-34052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272333#comment-17272333 ] Chao Sun commented on SPARK-34052: -- [~hyukjin.kwon] [~cloud_fan] do you think we should include this in 3.1.1? Since we've changed how temp views work in SPARK-33142, it may be better to add this too to make it consistent. > A cached view should become invalid after a table is dropped > > > Key: SPARK-34052 > URL: https://issues.apache.org/jira/browse/SPARK-34052 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.2.0, 3.1.2 > > > It seems a view doesn't become invalid after a DSv2 table is dropped or > replaced. This is different from V1 and may cause correctness issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
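For illustration, a minimal sketch of the scenario described above (not the fix itself), assuming a spark-shell style session {{spark}} and a hypothetical DSv2 catalog {{testcat}} configured in the session; the catalog, namespace and view names are made up:

{code:scala}
// Sketch: after the underlying DSv2 table is dropped, the cached view
// built on top of it should no longer be reported as cached.
spark.sql("CREATE TABLE testcat.ns.t (id BIGINT) USING parquet")
spark.sql("CREATE TEMPORARY VIEW v AS SELECT id FROM testcat.ns.t")
spark.sql("CACHE TABLE v")

spark.sql("DROP TABLE testcat.ns.t")

// Expected after the fix: the cache entry for `v` is invalidated.
println(spark.catalog.isCached("v"))
{code}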
[jira] [Commented] (SPARK-27589) Spark file source V2
[ https://issues.apache.org/jira/browse/SPARK-27589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273161#comment-17273161 ] Chao Sun commented on SPARK-27589: -- [~xkrogen] FWIW I'm working on a POC for SPARK-32935 at the moment. There is also a design doc in progress. Hopefully we'll be able to share it soon. cc [~rdblue] too. > Spark file source V2 > > > Key: SPARK-27589 > URL: https://issues.apache.org/jira/browse/SPARK-27589 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > Re-implement file sources with data source V2 API -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34271) Use majorMinorPatchVersion for Hive version parsing
Chao Sun created SPARK-34271: Summary: Use majorMinorPatchVersion for Hive version parsing Key: SPARK-34271 URL: https://issues.apache.org/jira/browse/SPARK-34271 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun Currently {{IsolatedClientLoader}} needs to enumerate all Hive patch versions. Therefore, whenever we upgrade the Hive version we'd have to remember to update the method. It would be better if we just checked the major & minor versions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
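As a rough sketch of the idea (the helper name and shape are hypothetical, not the actual Spark change), matching on the major.minor prefix instead of enumerating every patch release could look like:

{code:scala}
// Hypothetical helper: derive the Hive client "family" from major.minor only,
// so new patch releases (e.g. another 2.3.x) don't require code changes.
def majorMinor(version: String): Option[(Int, Int)] = {
  val pattern = """(\d+)\.(\d+)(\..*)?""".r
  version match {
    case pattern(major, minor, _) => Some((major.toInt, minor.toInt))
    case _ => None
  }
}

// All of 2.3.7, 2.3.8, 2.3.9 map to the same (2, 3) family.
assert(majorMinor("2.3.9").contains((2, 3)))
{code}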
[jira] [Updated] (SPARK-34108) Cache lookup doesn't work in certain cases
[ https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-34108: - Description: Currently, caching a temporary or permanent view doesn't work in certain cases. For instance, in the following: {code:sql} CREATE TABLE t (key bigint, value string) USING parquet CREATE VIEW v1 AS SELECT key FROM t CACHE TABLE v1 SELECT key FROM t {code} The last SELECT query will hit the cached {{v1}}. On the other hand: {code:sql} CREATE TABLE t (key bigint, value string) USING parquet CREATE VIEW v1 AS SELECT key FROM t ORDER by key CACHE TABLE v1 SELECT key FROM t ORDER BY key {code} The SELECT won't hit the cache. It seems this is related to {{EliminateView}}. In the second case, it will insert an extra project operator which makes the comparison on canonicalized plan during cache lookup fail. was: Currently, caching a permanent view doesn't work in certain cases. For instance, in the following: {code:sql} CREATE TABLE t (key bigint, value string) USING parquet CREATE VIEW v1 AS SELECT key FROM t CACHE TABLE v1 SELECT key FROM t {code} The last SELECT query will hit the cached {{v1}}. On the other hand: {code:sql} CREATE TABLE t (key bigint, value string) USING parquet CREATE VIEW v1 AS SELECT key FROM t ORDER by key CACHE TABLE v1 SELECT key FROM t ORDER BY key {code} The SELECT won't hit the cache. It seems this is related to {{EliminateView}}. In the second case, it will insert an extra project operator which makes the comparison on canonicalized plan during cache lookup fail. > Cache lookup doesn't work in certain cases > -- > > Key: SPARK-34108 > URL: https://issues.apache.org/jira/browse/SPARK-34108 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Priority: Major > > Currently, caching a temporary or permanent view doesn't work in certain > cases. For instance, in the following: > {code:sql} > CREATE TABLE t (key bigint, value string) USING parquet > CREATE VIEW v1 AS SELECT key FROM t > CACHE TABLE v1 > SELECT key FROM t > {code} > The last SELECT query will hit the cached {{v1}}. On the other hand: > {code:sql} > CREATE TABLE t (key bigint, value string) USING parquet > CREATE VIEW v1 AS SELECT key FROM t ORDER by key > CACHE TABLE v1 > SELECT key FROM t ORDER BY key > {code} > The SELECT won't hit the cache. > It seems this is related to {{EliminateView}}. In the second case, it will > insert an extra project operator which makes the comparison on canonicalized > plan during cache lookup fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
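For reference, one way to check whether the second query actually picks up the cache (a sketch only, assuming the SQL statements above have been run in a spark-shell style session {{spark}}):

{code:scala}
import org.apache.spark.sql.execution.columnar.InMemoryRelation

// Sketch: if cache lookup succeeded, the optimized plan of the query should
// contain an InMemoryRelation where the view's plan fragment was replaced.
val df = spark.sql("SELECT key FROM t ORDER BY key")
val usesCache = df.queryExecution.optimizedPlan.collect {
  case r: InMemoryRelation => r
}.nonEmpty
println(s"served from cache: $usesCache")
{code}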
[jira] [Updated] (SPARK-34108) Cache lookup doesn't work in certain cases
[ https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-34108: - Summary: Cache lookup doesn't work in certain cases (was: Caching with permanent view doesn't work in certain cases) > Cache lookup doesn't work in certain cases > -- > > Key: SPARK-34108 > URL: https://issues.apache.org/jira/browse/SPARK-34108 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Priority: Major > > Currently, caching a permanent view doesn't work in certain cases. For > instance, in the following: > {code:sql} > CREATE TABLE t (key bigint, value string) USING parquet > CREATE VIEW v1 AS SELECT key FROM t > CACHE TABLE v1 > SELECT key FROM t > {code} > The last SELECT query will hit the cached {{v1}}. On the other hand: > {code:sql} > CREATE TABLE t (key bigint, value string) USING parquet > CREATE VIEW v1 AS SELECT key FROM t ORDER by key > CACHE TABLE v1 > SELECT key FROM t ORDER BY key > {code} > The SELECT won't hit the cache. > It seems this is related to {{EliminateView}}. In the second case, it will > insert an extra project operator which makes the comparison on canonicalized > plan during cache lookup fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34108) Cache lookup doesn't work in certain cases
[ https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-34108. -- Resolution: Duplicate > Cache lookup doesn't work in certain cases > -- > > Key: SPARK-34108 > URL: https://issues.apache.org/jira/browse/SPARK-34108 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Priority: Major > > Currently, caching a temporary or permanent view doesn't work in certain > cases. For instance, in the following: > {code:sql} > CREATE TABLE t (key bigint, value string) USING parquet > CREATE VIEW v1 AS SELECT key FROM t > CACHE TABLE v1 > SELECT key FROM t > {code} > The last SELECT query will hit the cached {{v1}}. On the other hand: > {code:sql} > CREATE TABLE t (key bigint, value string) USING parquet > CREATE VIEW v1 AS SELECT key FROM t ORDER by key > CACHE TABLE v1 > SELECT key FROM t ORDER BY key > {code} > The SELECT won't hit the cache. > It seems this is related to {{EliminateView}}. In the second case, it will > insert an extra project operator which makes the comparison on canonicalized > plan during cache lookup fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34347) CatalogImpl.uncacheTable should invalidate in cascade for temp views
Chao Sun created SPARK-34347: Summary: CatalogImpl.uncacheTable should invalidate in cascade for temp views Key: SPARK-34347 URL: https://issues.apache.org/jira/browse/SPARK-34347 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun When {{spark.sql.legacy.storeAnalyzedPlanForView}} is false, {{CatalogImpl.uncacheTable}} should invalidate caches for temp views in cascade. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
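A minimal sketch of the expected cascading behavior (view names are hypothetical; assumes a spark-shell style session {{spark}} with {{spark.sql.legacy.storeAnalyzedPlanForView}} set to false as in the description):

{code:scala}
// Sketch: uncaching a temp view should also invalidate caches built on top of it.
spark.range(10).createOrReplaceTempView("base")
spark.sql("CREATE TEMPORARY VIEW derived AS SELECT id * 2 AS id2 FROM base")
spark.sql("CACHE TABLE base")
spark.sql("CACHE TABLE derived")

spark.catalog.uncacheTable("base")

// Expected with cascading invalidation: the dependent cache is gone as well.
println(spark.catalog.isCached("derived"))
{code}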
[jira] [Created] (SPARK-34419) Move PartitionTransforms from java to scala directory
Chao Sun created SPARK-34419: Summary: Move PartitionTransforms from java to scala directory Key: SPARK-34419 URL: https://issues.apache.org/jira/browse/SPARK-34419 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun {{PartitionTransforms}} is currently under {{sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions}}. It should be under {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289200#comment-17289200 ] Chao Sun commented on SPARK-33212: -- Thanks for the report [~ouyangxc.zte]. Can you provide more details, such as error messages, stack traces, and steps to reproduce the issue? > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd party dependencies such as Guava, > protobuf, jetty etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+ and in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependencies cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only > use the public/client API from the Hadoop side. > * Provides better isolation from Hadoop dependencies. In the future Spark can > evolve more easily without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289652#comment-17289652 ] Chao Sun commented on SPARK-33212: -- Thanks for the details [~ouyangxc.zte]! {quote} Get AMIpFilter ClassNotFoundException , because there is no 'hadoop-client-minicluster.jar' in classpath {quote} This is interesting. The {{hadoop-client-minicluster.jar}} should only be used in tests - curious why it is needed here. Could you share the stack traces for the {{ClassNotFoundException}}? {quote} 2021-02-24 08:36:54,391 ERROR org.apache.spark.SparkContext: Error initializing SparkContext. java.lang.IllegalStateException: class org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a javax.servlet.Filter {quote} Could you also share the stack traces for this exception? And to confirm, you are using {{client}} as the deploy mode, is that correct? I'll try to reproduce this in my local environment. > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd party dependencies such as Guava, > protobuf, jetty etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+ and in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependencies cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only > use the public/client API from the Hadoop side. > * Provides better isolation from Hadoop dependencies. In the future Spark can > evolve more easily without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290127#comment-17290127 ] Chao Sun commented on SPARK-33212: -- Thanks again [~ouyangxc.zte]. {{org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter}} was not included in the {{hadoop-client}} jars since it is a server-side class and ideally should not be exposed to client applications such as Spark. [~dongjoon] Let me see how we can fix this either in Spark or Hadoop. > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd party dependencies such as Guava, > protobuf, jetty etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+ and in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependencies cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only > use the public/client API from the Hadoop side. > * Provides better isolation from Hadoop dependencies. In the future Spark can > evolve more easily without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290613#comment-17290613 ] Chao Sun commented on SPARK-33212: -- I was able to reproduce the error in my local environment, and found a potential fix in Spark. I think {{hadoop-yarn-server-web-proxy}} is needed by Spark - all the other YARN jars are already covered by {{hadoop-client-api}} and {{hadoop-client-runtime}}. I'll open a PR for this soon. > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd party dependencies such as Guava, > protobuf, jetty etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+ and in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependencies cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only > use the public/client API from the Hadoop side. > * Provides better isolation from Hadoop dependencies. In the future Spark can > evolve more easily without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290613#comment-17290613 ] Chao Sun edited comment on SPARK-33212 at 2/25/21, 2:21 AM: I was able to reproduce the error in my local environment, and found a potential fix in Spark. I think only {{hadoop-yarn-server-web-proxy}} is needed by Spark - all the other YARN jars are already covered by {{hadoop-client-api}} and {{hadoop-client-runtime}}. I'll open a PR for this soon. was (Author: csun): I was able to reproduce the error in my local environment, and find a potential fix in Spark. I think {{hadoop-yarn-server-web-proxy}} is needed by Spark - all the other YARN jars are already covered by {{hadoop-client-api}} and {{hadoop-client-runtime}}. I'll open a PR for this soon. > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd party dependencies such as Guava, > protobuf, jetty etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+ and in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependencies cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only > use the public/client API from the Hadoop side. > * Provides better isolation from Hadoop dependencies. In the future Spark can > evolve more easily without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290707#comment-17290707 ] Chao Sun commented on SPARK-33212: -- Yes. I think the only class Spark needs from this jar is {{org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter}}, which, together with the other two classes it depends on from the same package, does not have a Guava dependency except for {{VisibleForTesting}}. > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd party dependencies such as Guava, > protobuf, jetty etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+ and in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependencies cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only > use the public/client API from the Hadoop side. > * Provides better isolation from Hadoop dependencies. In the future Spark can > evolve more easily without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32703) Replace deprecated API calls from SpecificParquetRecordReaderBase
[ https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-32703: - Summary: Replace deprecated API calls from SpecificParquetRecordReaderBase (was: Enable dictionary filtering for Parquet vectorized reader) > Replace deprecated API calls from SpecificParquetRecordReaderBase > - > > Key: SPARK-32703 > URL: https://issues.apache.org/jira/browse/SPARK-32703 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Chao Sun >Priority: Minor > > Parquet vectorized reader still uses the old API for {{filterRowGroups}} and > only filters on statistics. It should switch to the new API and do dictionary > filtering as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32703) Replace deprecated API calls from SpecificParquetRecordReaderBase
[ https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-32703: - Description: Currently in {{SpecificParquetRecordReaderBase}} we use deprecated Parquet APIs in a few places, such as {{readFooter}}, {{ParquetInputSplit}}, the deprecated ctor for {{ParquetFileReader}}, {{filterRowGroups}}, etc. These are going to be removed in a future Parquet version, so we should move to the new APIs for better maintainability. (was: Parquet vectorized reader still uses the old API for {{filterRowGroups}} and only filters on statistics. It should switch to the new API and do dictionary filtering as well.) > Replace deprecated API calls from SpecificParquetRecordReaderBase > - > > Key: SPARK-32703 > URL: https://issues.apache.org/jira/browse/SPARK-32703 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Chao Sun >Priority: Minor > > Currently in {{SpecificParquetRecordReaderBase}} we use deprecated Parquet APIs in a > few places, such as {{readFooter}}, {{ParquetInputSplit}}, > the deprecated ctor for {{ParquetFileReader}}, {{filterRowGroups}}, etc. These > are going to be removed in a future Parquet version, so we should > move to the new APIs for better maintainability. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
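For context, a rough sketch of reading footer and row-group metadata through the newer, {{InputFile}}-based parquet-mr API instead of the deprecated static {{readFooter}} call; this only illustrates the API direction (the path is hypothetical), not necessarily the exact replacement used in the fix:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

val conf = new Configuration()
val inputFile = HadoopInputFile.fromPath(new Path("/tmp/example.parquet"), conf)

// Open through the non-deprecated InputFile-based entry point.
val reader = ParquetFileReader.open(inputFile)
try {
  val footer = reader.getFooter       // replaces the deprecated static readFooter
  val rowGroups = reader.getRowGroups // row groups after any configured filtering
  println(footer.getFileMetaData.getSchema)
  println(s"row groups: ${rowGroups.size()}")
} finally {
  reader.close()
}
{code}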
[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used
[ https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305109#comment-17305109 ] Chao Sun commented on SPARK-34780: -- Thanks for the report [~mikechen], the test case you provided is very useful. I'm not sure, though, how severe the issue is, since it only affects {{computeStats}}, and when the cache is actually materialized (e.g., via {{df2.count()}} after {{df2.cache()}}), the value from {{computeStats}} will be different anyway. Could you give more details? > Cached Table (parquet) with old Configs Used > > > Key: SPARK-34780 > URL: https://issues.apache.org/jira/browse/SPARK-34780 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.1 >Reporter: Michael Chen >Priority: Major > > When a dataframe is cached, the logical plan can contain copies of the spark > session meaning the SQLConfs are stored. Then if a different dataframe can > replace parts of its logical plan with a cached logical plan, the cached > SQLConfs will be used for the evaluation of the cached logical plan. This is > because HadoopFsRelation ignores sparkSession for equality checks (introduced > in https://issues.apache.org/jira/browse/SPARK-17358). > {code:java} > test("cache uses old SQLConf") { > import testImplicits._ > withTempDir { dir => > val tableDir = dir.getAbsoluteFile + "/table" > val df = Seq("a").toDF("key") > df.write.parquet(tableDir) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") > val compression1Stats = spark.read.parquet(tableDir).select("key"). > queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10") > val df2 = spark.read.parquet(tableDir).select("key") > df2.cache() > val compression10Stats = df2.queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") > val compression1StatsWithCache = > spark.read.parquet(tableDir).select("key"). > queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > // I expect these stats to be the same because file compression factor is > the same > assert(compression1Stats == compression1StatsWithCache) > // Instead, we can see the file compression factor is being cached and > used along with > // the logical plan > assert(compression10Stats == compression1StatsWithCache) > } > }{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30497) migrate DESCRIBE TABLE to the new framework
[ https://issues.apache.org/jira/browse/SPARK-30497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308067#comment-17308067 ] Chao Sun commented on SPARK-30497: -- [~cloud_fan] this is resolved, right? > migrate DESCRIBE TABLE to the new framework > --- > > Key: SPARK-30497 > URL: https://issues.apache.org/jira/browse/SPARK-30497 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used
[ https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308262#comment-17308262 ] Chao Sun commented on SPARK-34780: -- Sorry for the late reply [~mikechen]! There's something I'm still not quite clear about: when the cache is retrieved, an {{InMemoryRelation}} will be used to replace the plan fragment that is matched. Therefore, how can the old stale conf still be used in places like {{DataSourceScanExec}}? > Cached Table (parquet) with old Configs Used > > > Key: SPARK-34780 > URL: https://issues.apache.org/jira/browse/SPARK-34780 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.1 >Reporter: Michael Chen >Priority: Major > > When a dataframe is cached, the logical plan can contain copies of the spark > session meaning the SQLConfs are stored. Then if a different dataframe can > replace parts of its logical plan with a cached logical plan, the cached > SQLConfs will be used for the evaluation of the cached logical plan. This is > because HadoopFsRelation ignores sparkSession for equality checks (introduced > in https://issues.apache.org/jira/browse/SPARK-17358). > {code:java} > test("cache uses old SQLConf") { > import testImplicits._ > withTempDir { dir => > val tableDir = dir.getAbsoluteFile + "/table" > val df = Seq("a").toDF("key") > df.write.parquet(tableDir) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") > val compression1Stats = spark.read.parquet(tableDir).select("key"). > queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10") > val df2 = spark.read.parquet(tableDir).select("key") > df2.cache() > val compression10Stats = df2.queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") > val compression1StatsWithCache = > spark.read.parquet(tableDir).select("key"). > queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > // I expect these stats to be the same because file compression factor is > the same > assert(compression1Stats == compression1StatsWithCache) > // Instead, we can see the file compression factor is being cached and > used along with > // the logical plan > assert(compression10Stats == compression1StatsWithCache) > } > }{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used
[ https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308854#comment-17308854 ] Chao Sun commented on SPARK-34780: -- [~mikechen], yes you're right. I'm not sure if this is a big concern though, since it just means the plan fragment for the cache is executed with the stale conf. I guess as long as there is no correctness issue (and I'd be surprised if there were any), it should be fine? It seems a bit tricky to fix the issue, since the {{SparkSession}} is leaked to many places. I guess one way is to follow the idea of SPARK-33389 and change {{SessionState}} to always use the active conf. > Cached Table (parquet) with old Configs Used > > > Key: SPARK-34780 > URL: https://issues.apache.org/jira/browse/SPARK-34780 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.1 >Reporter: Michael Chen >Priority: Major > > When a dataframe is cached, the logical plan can contain copies of the spark > session meaning the SQLConfs are stored. Then if a different dataframe can > replace parts of its logical plan with a cached logical plan, the cached > SQLConfs will be used for the evaluation of the cached logical plan. This is > because HadoopFsRelation ignores sparkSession for equality checks (introduced > in https://issues.apache.org/jira/browse/SPARK-17358). > {code:java} > test("cache uses old SQLConf") { > import testImplicits._ > withTempDir { dir => > val tableDir = dir.getAbsoluteFile + "/table" > val df = Seq("a").toDF("key") > df.write.parquet(tableDir) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") > val compression1Stats = spark.read.parquet(tableDir).select("key"). > queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10") > val df2 = spark.read.parquet(tableDir).select("key") > df2.cache() > val compression10Stats = df2.queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") > val compression1StatsWithCache = > spark.read.parquet(tableDir).select("key"). > queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > // I expect these stats to be the same because file compression factor is > the same > assert(compression1Stats == compression1StatsWithCache) > // Instead, we can see the file compression factor is being cached and > used along with > // the logical plan > assert(compression10Stats == compression1StatsWithCache) > } > }{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36820) Disable LZ4 test for Hadoop 2.7 profile
Chao Sun created SPARK-36820: Summary: Disable LZ4 test for Hadoop 2.7 profile Key: SPARK-36820 URL: https://issues.apache.org/jira/browse/SPARK-36820 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun Hadoop 2.7 doesn't support lz4-java yet, so we should disable the test in {{FileSourceCodecSuite}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
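One possible way to guard such a test (a sketch only, assuming a spark-shell style session {{spark}} and a hypothetical output path; the actual change may simply exclude the codec under the Hadoop 2.7 profile) is to skip the lz4 round trip based on the runtime Hadoop version:

{code:scala}
import org.apache.hadoop.util.VersionInfo

// Per the issue, Hadoop 2.7 doesn't support lz4-java, so only run the
// lz4 round trip when the runtime Hadoop version is 3.x or newer.
val hadoopMajor = VersionInfo.getVersion.split("\\.").head.toInt
if (hadoopMajor >= 3) {
  val path = "/tmp/lz4_codec_test"
  spark.range(10).write.mode("overwrite").option("compression", "lz4").parquet(path)
  assert(spark.read.parquet(path).count() == 10)
}
{code}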
[jira] [Updated] (SPARK-36820) Disable LZ4 test for Hadoop 2.7 profile
[ https://issues.apache.org/jira/browse/SPARK-36820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36820: - Issue Type: Test (was: Bug) > Disable LZ4 test for Hadoop 2.7 profile > --- > > Key: SPARK-36820 > URL: https://issues.apache.org/jira/browse/SPARK-36820 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Minor > > Hadoop 2.7 doesn't support lz4-java yet, so we should disable the test in > {{FileSourceCodecSuite}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36828) Remove Guava from Spark binary distribution
Chao Sun created SPARK-36828: Summary: Remove Guava from Spark binary distribution Key: SPARK-36828 URL: https://issues.apache.org/jira/browse/SPARK-36828 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.3.0 Reporter: Chao Sun After SPARK-36676, we should consider removing Guava from Spark's binary distribution. It is currently only required by a few libraries such as curator-client. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36828) Remove Guava from Spark binary distribution
[ https://issues.apache.org/jira/browse/SPARK-36828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36828: - Issue Type: Improvement (was: Bug) > Remove Guava from Spark binary distribution > --- > > Key: SPARK-36828 > URL: https://issues.apache.org/jira/browse/SPARK-36828 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > After SPARK-36676, we should consider removing Guava from Spark's binary > distribution. It is currently only required by a few libraries such as > curator-client. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36835) Spark 3.2.0 POMs are no longer "dependency reduced"
[ https://issues.apache.org/jira/browse/SPARK-36835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419499#comment-17419499 ] Chao Sun commented on SPARK-36835: -- Sorry for the regression [~joshrosen]. I forgot exactly why I added that but let me see if we can safely revert it. > Spark 3.2.0 POMs are no longer "dependency reduced" > --- > > Key: SPARK-36835 > URL: https://issues.apache.org/jira/browse/SPARK-36835 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 >Reporter: Josh Rosen >Priority: Blocker > > It looks like Spark 3.2.0's POMs are no longer "dependency reduced". As a > result, applications may pull in additional unnecessary dependencies when > depending on Spark. > Spark uses the Maven Shade plugin to create effective POMs and to bundle > shaded versions of certain libraries with Spark (namely, Jetty, Guava, and > JPPML). [By > default|https://maven.apache.org/plugins/maven-shade-plugin/shade-mojo.html#createDependencyReducedPom], > the Maven Shade plugin generates simplified POMs which remove dependencies > on artifacts that have been shaded. > SPARK-33212 / > [b6f46ca29742029efea2790af7fdefbc2fcf52de|https://github.com/apache/spark/commit/b6f46ca29742029efea2790af7fdefbc2fcf52de] > changed the configuration of the Maven Shade plugin, setting > {{createDependencyReducedPom}} to {{false}}. > As a result, the generated POMs now include compile-scope dependencies on the > shaded libraries. For example, compare the {{org.eclipse.jetty}} dependencies > in: > * Spark 3.1.2: > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/3.1.2/spark-core_2.12-3.1.2.pom] > * Spark 3.2.0 RC2: > [https://repository.apache.org/content/repositories/orgapachespark-1390/org/apache/spark/spark-core_2.12/3.2.0/spark-core_2.12-3.2.0.pom] > I think we should revert back to generating "dependency reduced" POMs to > ensure that Spark declares a proper set of dependencies and to avoid "unknown > unknown" consequences of changing our generated POM format. > /cc [~csun] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36863) Update dependency manifests for all released artifacts
Chao Sun created SPARK-36863: Summary: Update dependency manifests for all released artifacts Key: SPARK-36863 URL: https://issues.apache.org/jira/browse/SPARK-36863 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.3.0 Reporter: Chao Sun We should update dependency manifests for all released artifacts. Currently we don't do so for modules such as {{hadoop-cloud}}, {{kinesis-asl}}, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36873) Add provided Guava dependency for network-yarn module
Chao Sun created SPARK-36873: Summary: Add provided Guava dependency for network-yarn module Key: SPARK-36873 URL: https://issues.apache.org/jira/browse/SPARK-36873 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.2.0 Reporter: Chao Sun In Spark 3.1 and earlier the network-yarn module implicitly relies on guava from hadoop-client dependency, which got changed by SPARK-33212 where we have moved to shaded Hadoop client which no longer expose the transitive guava dependency. This was fine for a while since we were not using {{createDependencyReducedPom}} so the module picks up the transitive dependency from {{spark-network-common}}. However, this got changed by SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no longer able to find guava classes: {code} mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn ... [INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol symbol: class VisibleForTesting location: class org.apache.spark.network.yarn.YarnShuffleService {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36873) Add provided Guava dependency for network-yarn module
[ https://issues.apache.org/jira/browse/SPARK-36873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36873: - Description: In Spark 3.1 and earlier the network-yarn module implicitly relies on guava from hadoop-client dependency, which got changed by SPARK-33212 where we moved to shaded Hadoop client which no longer expose the transitive guava dependency. This was fine for a while since we were not using {{createDependencyReducedPom}} so the module picks up the transitive dependency from {{spark-network-common}}. However, this got changed by SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no longer able to find guava classes: {code} mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn ... [INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol symbol: class VisibleForTesting location: class org.apache.spark.network.yarn.YarnShuffleService {code} was: In Spark 3.1 and earlier the network-yarn module implicitly relies on guava from hadoop-client dependency, which got changed by SPARK-33212 where we have moved to shaded Hadoop client which no longer expose the transitive guava dependency. This was fine for a while since we were not using {{createDependencyReducedPom}} so the module picks up the transitive dependency from {{spark-network-common}}. However, this got changed by SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no longer able to find guava classes: {code} mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn ... [INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... 
[WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol symbol: class VisibleForTesting location: class org.apache.spark.network.yarn.YarnShuffleService {code} > Add provided Guava dependency for network-yarn module > - > > Key: SPARK-36873 > URL: https://issues.apache.org/jira/browse/SPARK-36873 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > In Spark 3.1 and earlier the network-yarn module implicitly relies on guava > from hadoop-client dependency, which got changed by SPARK-33212 where we > moved to shaded Hadoop client which no longer expose the transitive guava > dependency. This was fine for a while since we were not using > {{createDependencyReducedPom}} so the module picks up the transitive > dependency from {{spark-network-common}}. However, this got changed by > SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no > longer able to find guava classes: > {code} > mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver > -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 > -Pspark-ganglia-lgpl -Pyarn > ... > [INFO] Compiling 1 Java source to > /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... > [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
[jira] [Updated] (SPARK-36873) Add provided Guava dependency for network-yarn module
[ https://issues.apache.org/jira/browse/SPARK-36873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36873: - Description: In Spark 3.1 and earlier the network-yarn module implicitly relies on guava from hadoop-client dependency, which was changed by SPARK-33212 where we moved to shaded Hadoop client which no longer expose the transitive guava dependency. This was fine for a while since we were not using {{createDependencyReducedPom}} so the module picks up the transitive dependency from {{spark-network-common}}. However, this got changed by SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no longer able to find guava classes: {code} mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn ... [INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol symbol: class VisibleForTesting location: class org.apache.spark.network.yarn.YarnShuffleService {code} was: In Spark 3.1 and earlier the network-yarn module implicitly relies on guava from hadoop-client dependency, which got changed by SPARK-33212 where we moved to shaded Hadoop client which no longer expose the transitive guava dependency. This was fine for a while since we were not using {{createDependencyReducedPom}} so the module picks up the transitive dependency from {{spark-network-common}}. However, this got changed by SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no longer able to find guava classes: {code} mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn ... [INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... 
[WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist [ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol symbol: class VisibleForTesting location: class org.apache.spark.network.yarn.YarnShuffleService {code} > Add provided Guava dependency for network-yarn module > - > > Key: SPARK-36873 > URL: https://issues.apache.org/jira/browse/SPARK-36873 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > In Spark 3.1 and earlier the network-yarn module implicitly relies on guava > from hadoop-client dependency, which was changed by SPARK-33212 where we > moved to shaded Hadoop client which no longer expose the transitive guava > dependency. This was fine for a while since we were not using > {{createDependencyReducedPom}} so the module picks up the transitive > dependency from {{spark-network-common}}. However, this got changed by > SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no > longer able to find guava classes: > {code} > mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver > -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 > -Pspark-ganglia-lgpl -Pyarn > ... > [INFO] Compiling 1 Java source to > /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... > [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 > [ER
[jira] [Updated] (SPARK-36873) Add provided Guava dependency for network-yarn module
[ https://issues.apache.org/jira/browse/SPARK-36873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36873: - Issue Type: Bug (was: Improvement) > Add provided Guava dependency for network-yarn module > - > > Key: SPARK-36873 > URL: https://issues.apache.org/jira/browse/SPARK-36873 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > In Spark 3.1 and earlier the network-yarn module implicitly relies on guava > from hadoop-client dependency, which was changed by SPARK-33212 where we > moved to shaded Hadoop client which no longer expose the transitive guava > dependency. This was fine for a while since we were not using > {{createDependencyReducedPom}} so the module picks up the transitive > dependency from {{spark-network-common}}. However, this got changed by > SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no > longer able to find guava classes: > {code} > mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver > -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 > -Pspark-ganglia-lgpl -Pyarn > ... > [INFO] Compiling 1 Java source to > /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ... > [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8 > [ERROR] [Error] > /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: > package com.google.common.annotations does not exist > [ERROR] [Error] > /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: > package com.google.common.base does not exist > [ERROR] [Error] > /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: > package com.google.common.collect does not exist > [ERROR] [Error] > /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: > cannot find symbol > symbol: class VisibleForTesting > location: class org.apache.spark.network.yarn.YarnShuffleService > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
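A minimal sketch of the kind of change the summary describes, i.e. declaring Guava directly in common/network-yarn/pom.xml with provided scope instead of relying on a transitive dependency (the exact placement and the {{guava.version}} property are assumptions, not the actual patch):
{code:xml}
<!-- Hypothetical addition to common/network-yarn/pom.xml: make the Guava dependency
     explicit with provided scope so compilation no longer depends on it arriving
     transitively through hadoop-client or spark-network-common. -->
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>${guava.version}</version>
  <scope>provided</scope>
</dependency>
{code}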
[jira] [Created] (SPARK-36879) Support Parquet v2 data page encodings for the vectorized path
Chao Sun created SPARK-36879: Summary: Support Parquet v2 data page encodings for the vectorized path Key: SPARK-36879 URL: https://issues.apache.org/jira/browse/SPARK-36879 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun Currently Spark only supports Parquet V1 encodings (i.e., PLAIN/DICTIONARY/RLE) in the vectorized path, and throws an exception otherwise: {code} java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY {code} It would be good to support v2 encodings too, including DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY as well as BYTE_STREAM_SPLIT, as listed in https://github.com/apache/parquet-format/blob/master/Encodings.md -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
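For context, a rough way to reproduce the error from a Spark shell is to write a file with v2 data pages and read it back with the vectorized reader enabled. This assumes writer options such as {{parquet.writer.version}} and {{parquet.enable.dictionary}} are forwarded to the underlying Parquet Hadoop configuration; the path is illustrative.
{code:scala}
// Write string data with Parquet v2 data pages; disabling the dictionary forces the
// v2 fallback encoding (DELTA_BYTE_ARRAY) for the string column.
spark.range(0, 1000)
  .selectExpr("cast(id as string) AS s")
  .write
  .option("parquet.writer.version", "PARQUET_2_0")
  .option("parquet.enable.dictionary", "false")
  .mode("overwrite")
  .parquet("/tmp/parquet_v2_pages")

// With spark.sql.parquet.enableVectorizedReader=true (the default), reading the file
// back is expected to fail with "Unsupported encoding: DELTA_BYTE_ARRAY".
spark.read.parquet("/tmp/parquet_v2_pages").show()
{code}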
[jira] [Created] (SPARK-36891) Add new test suite to cover Parquet decoding
Chao Sun created SPARK-36891: Summary: Add new test suite to cover Parquet decoding Key: SPARK-36891 URL: https://issues.apache.org/jira/browse/SPARK-36891 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun Add a new test suite to add more coverage for Parquet vectorized decoding, focusing on different combinations of Parquet column index, dictionary, batch size, page size, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
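As a rough illustration of the combinations such a suite could exercise, the sketch below writes the same data under different dictionary and page-size settings and reads it back under different vectorized batch sizes (the option names are existing Parquet/Spark configs; the test structure itself is only a sketch):
{code:scala}
// Sketch: vary page size and dictionary encoding on the write side, and the
// columnar batch size on the read side, then check the data survives a round trip.
val df = spark.range(0, 100000).selectExpr("id", "cast(id % 97 as string) AS s")

for {
  dictionary <- Seq("true", "false")
  pageSize   <- Seq("1024", "1048576")   // small pages create many page boundaries
} {
  val path = s"/tmp/parquet_decoding_dict_${dictionary}_page_$pageSize"
  df.coalesce(1)
    .write
    .option("parquet.enable.dictionary", dictionary)
    .option("parquet.page.size", pageSize)
    .mode("overwrite")
    .parquet(path)

  for (batchSize <- Seq(16, 1024, 4096)) {
    spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", batchSize.toString)
    assert(spark.read.parquet(path).count() == 100000)
  }
}
{code}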
[jira] [Created] (SPARK-36935) Enhance ParquetSchemaConverter to capture Parquet repetition & definition level
Chao Sun created SPARK-36935: Summary: Enhance ParquetSchemaConverter to capture Parquet repetition & definition level Key: SPARK-36935 URL: https://issues.apache.org/jira/browse/SPARK-36935 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun In order to support complex types in the Parquet vectorized reader, we'll need to capture the repetition & definition level information associated with the Catalyst type converted from the Parquet {{MessageType}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
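A small sketch of what capturing this could look like (the type and field names below are illustrative assumptions, not the actual API):
{code:scala}
import org.apache.spark.sql.types.DataType

// Illustrative holder: the Catalyst type converted from the Parquet schema, together
// with the max repetition/definition levels of the corresponding Parquet column(s).
case class ConvertedField(
    sparkType: DataType,
    maxRepetitionLevel: Int,
    maxDefinitionLevel: Int,
    children: Seq[ConvertedField] = Nil)
{code}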
[jira] [Updated] (SPARK-36891) Refactor SpecificParquetRecordReaderBase and add more coverage on vectorized Parquet decoding
[ https://issues.apache.org/jira/browse/SPARK-36891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36891: - Parent: SPARK-35743 Issue Type: Sub-task (was: Test) > Refactor SpecificParquetRecordReaderBase and add more coverage on vectorized > Parquet decoding > - > > Key: SPARK-36891 > URL: https://issues.apache.org/jira/browse/SPARK-36891 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0 > > > Add a new test suite to add more coverage for Parquet vectorized decoding, > focusing on different combinations of Parquet column index, dictionary, batch > size, page size, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36936) spark-hadoop-cloud broken on release and only published via 3rd party repositories
[ https://issues.apache.org/jira/browse/SPARK-36936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425162#comment-17425162 ] Chao Sun commented on SPARK-36936: -- [~colin.williams] which version of {{spark-hadoop-cloud}} you were using? I think the above error shouldn't happen if the version is the same as the Spark's version. We've already started to publish {{spark-hadoop-cloud}} as part of the Spark release procedure, see SPARK-35844. > spark-hadoop-cloud broken on release and only published via 3rd party > repositories > -- > > Key: SPARK-36936 > URL: https://issues.apache.org/jira/browse/SPARK-36936 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.1.1, 3.1.2 > Environment: name:=spark-demo > version := "0.0.1" > scalaVersion := "2.12.12" > lazy val app = (project in file("app")).settings( > assemblyPackageScala / assembleArtifact := false, > assembly / assemblyJarName := "uber.jar", > assembly / mainClass := Some("com.example.Main"), > // more settings here ... > ) > resolvers += "Cloudera" at > "https://repository.cloudera.com/artifactory/cloudera-repos/"; > libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2" % > "provided" > libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % > "3.1.1.3.1.7270.0-253" > libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % > "3.1.1.7.2.7.0-184" > libraryDependencies += "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.901" > libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test" > // test suite settings > fork in Test := true > javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", > "-XX:+CMSClassUnloadingEnabled") > // Show runtime of tests > testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-oD") > ___ > > import org.apache.spark.sql.SparkSession > object SparkApp { > def main(args: Array[String]){ > val spark = SparkSession.builder().master("local") > //.config("spark.jars.repositories", > "https://repository.cloudera.com/artifactory/cloudera-repos/";) > //.config("spark.jars.packages", > "org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253") > .appName("spark session").getOrCreate > val jsonDF = spark.read.json("s3a://path-to-bucket/compact.json") > val csvDF = spark.read.format("csv").load("s3a://path-to-bucket/some.csv") > jsonDF.show() > csvDF.show() > } > } >Reporter: Colin Williams >Priority: Major > > The spark docmentation suggests using `spark-hadoop-cloud` to read / write > from S3 in [https://spark.apache.org/docs/latest/cloud-integration.html] . > However artifacts are currently published via only 3rd party resolvers in > [https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud] > including Cloudera and Palantir. > > Then apache spark documentation is providing a 3rd party solution for object > stores including S3. Furthermore, if you follow the instructions and include > one of the 3rd party jars IE the Cloudera jar with the spark 3.1.2 release > and try to access object store, the following exception is returned. 
> > ``` > Exception in thread "main" java.lang.NoSuchMethodError: 'void > com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, > java.lang.Object, java.lang.Object)' > at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:894) > at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:870) > at > org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1605) > at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:363) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124) > at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361) > at > org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519) > at org.apache.spark.sql.DataFrameRead
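To make the version-matching point in the comment above concrete, a minimal sbt sketch (coordinates are for illustration; per SPARK-35844 the module is published as part of the Spark release starting with 3.2.0):
{code:scala}
// Keep spark-hadoop-cloud at exactly the same version as Spark itself, rather than
// mixing an Apache Spark release with a vendor build of the cloud module.
val sparkVersion = "3.2.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"          % sparkVersion % "provided",
  "org.apache.spark" %% "spark-hadoop-cloud" % sparkVersion
)
{code}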
[jira] [Commented] (SPARK-36936) spark-hadoop-cloud broken on release and only published via 3rd party repositories
[ https://issues.apache.org/jira/browse/SPARK-36936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426255#comment-17426255 ] Chao Sun commented on SPARK-36936: -- [~colin.williams] Spark 3.2.0 is not released yet - it will be there soon. > spark-hadoop-cloud broken on release and only published via 3rd party > repositories > -- > > Key: SPARK-36936 > URL: https://issues.apache.org/jira/browse/SPARK-36936 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.1.1, 3.1.2 > Environment: name:=spark-demo > version := "0.0.1" > scalaVersion := "2.12.12" > lazy val app = (project in file("app")).settings( > assemblyPackageScala / assembleArtifact := false, > assembly / assemblyJarName := "uber.jar", > assembly / mainClass := Some("com.example.Main"), > // more settings here ... > ) > resolvers += "Cloudera" at > "https://repository.cloudera.com/artifactory/cloudera-repos/"; > libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2" % > "provided" > libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % > "3.1.1.3.1.7270.0-253" > libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % > "3.1.1.7.2.7.0-184" > libraryDependencies += "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.901" > libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test" > // test suite settings > fork in Test := true > javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", > "-XX:+CMSClassUnloadingEnabled") > // Show runtime of tests > testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-oD") > ___ > > import org.apache.spark.sql.SparkSession > object SparkApp { > def main(args: Array[String]){ > val spark = SparkSession.builder().master("local") > //.config("spark.jars.repositories", > "https://repository.cloudera.com/artifactory/cloudera-repos/";) > //.config("spark.jars.packages", > "org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253") > .appName("spark session").getOrCreate > val jsonDF = spark.read.json("s3a://path-to-bucket/compact.json") > val csvDF = spark.read.format("csv").load("s3a://path-to-bucket/some.csv") > jsonDF.show() > csvDF.show() > } > } >Reporter: Colin Williams >Priority: Major > > The spark docmentation suggests using `spark-hadoop-cloud` to read / write > from S3 in [https://spark.apache.org/docs/latest/cloud-integration.html] . > However artifacts are currently published via only 3rd party resolvers in > [https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud] > including Cloudera and Palantir. > > Then apache spark documentation is providing a 3rd party solution for object > stores including S3. Furthermore, if you follow the instructions and include > one of the 3rd party jars IE the Cloudera jar with the spark 3.1.2 release > and try to access object store, the following exception is returned. 
> > ``` > Exception in thread "main" java.lang.NoSuchMethodError: 'void > com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, > java.lang.Object, java.lang.Object)' > at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:894) > at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:870) > at > org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1605) > at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:363) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124) > at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361) > at > org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:428) > ``` > It looks like there are classpath conflicts using the cloudera published > `spark-hadoop-cloud` with spark 3.1.2, again contradicting the documentation. > Then the
[jira] [Commented] (SPARK-35640) Refactor Parquet vectorized reader to remove duplicated code paths
[ https://issues.apache.org/jira/browse/SPARK-35640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428522#comment-17428522 ] Chao Sun commented on SPARK-35640: -- [~catalinii] This change seems unrelated since it's only in Spark 3.2.0, but you mentioned the issue also happens in Spark 3.1.2. The issue also seems to be well-known, see SPARK-16544. > Refactor Parquet vectorized reader to remove duplicated code paths > -- > > Key: SPARK-35640 > URL: https://issues.apache.org/jira/browse/SPARK-35640 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.2.0 > > > Currently in the Parquet vectorized code path, there is a lot of code duplication > such as the following: > {code:java} > public void readIntegers( > int total, > WritableColumnVector c, > int rowId, > int level, > VectorizedValuesReader data) throws IOException { > int left = total; > while (left > 0) { > if (this.currentCount == 0) this.readNextGroup(); > int n = Math.min(left, this.currentCount); > switch (mode) { > case RLE: > if (currentValue == level) { > data.readIntegers(n, c, rowId); > } else { > c.putNulls(rowId, n); > } > break; > case PACKED: > for (int i = 0; i < n; ++i) { > if (currentBuffer[currentBufferIdx++] == level) { > c.putInt(rowId + i, data.readInteger()); > } else { > c.putNull(rowId + i); > } > } > break; > } > rowId += n; > left -= n; > currentCount -= n; > } > } > {code} > This makes it hard to maintain as any change to it will need to be > replicated in 20+ places. The issue becomes more serious when we are going to > implement column index and complex type support for the vectorized path. > The original intention was performance. However, nowadays JIT compilers > tend to be smart about this and will inline virtual calls as much as possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37069) HiveClientImpl throws NoSuchMethodError: org.apache.hadoop.hive.ql.metadata.Hive.getWithoutRegisterFns
[ https://issues.apache.org/jira/browse/SPARK-37069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432624#comment-17432624 ] Chao Sun commented on SPARK-37069: -- Thanks for the ping [~zhouyifan279]! yes this is a bug, and let me see how to fix it. > HiveClientImpl throws NoSuchMethodError: > org.apache.hadoop.hive.ql.metadata.Hive.getWithoutRegisterFns > -- > > Key: SPARK-37069 > URL: https://issues.apache.org/jira/browse/SPARK-37069 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Zhou Yifan >Priority: Major > > If we run Spark SQL with external Hive 2.3.x (before 2.3.9) jars, following > error will be thrown: > {code:java} > Exception in thread "main" java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.metadata.Hive.getWithoutRegisterFns(Lorg/apache/hadoop/hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/metadata/Hive;Exception > in thread "main" java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.metadata.Hive.getWithoutRegisterFns(Lorg/apache/hadoop/hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/metadata/Hive; > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getHive$1(HiveClientImpl.scala:205) > at scala.Option.map(Option.scala:230) at > org.apache.spark.sql.hive.client.HiveClientImpl.getHive(HiveClientImpl.scala:204) > at > org.apache.spark.sql.hive.client.HiveClientImpl.client(HiveClientImpl.scala:267) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:292) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283) > at > org.apache.spark.sql.hive.client.HiveClientImpl.databaseExists(HiveClientImpl.scala:394) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:224) > at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102) > at > org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:224) > at > org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150) > at > org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:140) > at > org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:170) > at > org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:168) > at > org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$2(HiveSessionStateBuilder.scala:61) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:119) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:119) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:1004) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:990) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:982) > at > org.apache.spark.sql.execution.command.ShowTablesCommand.$anonfun$run$42(tables.scala:828) > at scala.Option.getOrElse(Option.scala:189) at > org.apache.spark.sql.execution.command.ShowTablesCommand.run(tables.scala:828) > at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(Q
[jira] [Updated] (SPARK-35703) Relax constraint for Spark bucket join and remove HashClusteredDistribution
[ https://issues.apache.org/jira/browse/SPARK-35703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35703: - Summary: Relax constraint for Spark bucket join and remove HashClusteredDistribution (was: Remove HashClusteredDistribution) > Relax constraint for Spark bucket join and remove HashClusteredDistribution > --- > > Key: SPARK-35703 > URL: https://issues.apache.org/jira/browse/SPARK-35703 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > Currently Spark has {{HashClusteredDistribution}} and > {{ClusteredDistribution}}. The only difference between the two is that the > former is stricter when deciding whether a bucket join is allowed to avoid a > shuffle: compared to the latter, it requires an *exact* match between the > clustering keys from the output partitioning (i.e., {{HashPartitioning}}) and > the join keys. However, this is unnecessary, as we should be able to avoid the > shuffle when the set of clustering keys is a subset of the join keys, just like > {{ClusteredDistribution}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
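A small illustration of the situation described above (table names are hypothetical): both sides are bucketed by {{id}} only but joined on both {{id}} and {{name}}. With the stricter {{HashClusteredDistribution}} the bucketing cannot be used because the clustering keys don't exactly match the join keys; after the relaxation, a subset match should be enough to avoid the shuffle.
{code:scala}
// Both tables are bucketed by `id` only.
spark.range(0, 1000).selectExpr("id", "cast(id as string) AS name")
  .write.bucketBy(8, "id").mode("overwrite").saveAsTable("bucketed_t1")
spark.range(0, 1000).selectExpr("id", "cast(id as string) AS name")
  .write.bucketBy(8, "id").mode("overwrite").saveAsTable("bucketed_t2")

// The join keys (id, name) are a superset of the bucketing keys (id).
spark.table("bucketed_t1")
  .join(spark.table("bucketed_t2"), Seq("id", "name"))
  .explain()
{code}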
[jira] [Created] (SPARK-37113) Upgrade Parquet to 1.12.2
Chao Sun created SPARK-37113: Summary: Upgrade Parquet to 1.12.2 Key: SPARK-37113 URL: https://issues.apache.org/jira/browse/SPARK-37113 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun Upgrade Parquet version to 1.12.2 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37166) SPIP: Storage Partitioned Join
Chao Sun created SPARK-37166: Summary: SPIP: Storage Partitioned Join Key: SPARK-37166 URL: https://issues.apache.org/jira/browse/SPARK-37166 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun This JIRA tracks the SPIP for storage partitioned join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37166) SPIP: Storage Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436963#comment-17436963 ] Chao Sun commented on SPARK-37166: -- [~xkrogen] sure just linked. > SPIP: Storage Partitioned Join > -- > > Key: SPARK-37166 > URL: https://issues.apache.org/jira/browse/SPARK-37166 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > This JIRA tracks the SPIP for storage partitioned join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN
Chao Sun created SPARK-37205: Summary: Support mapreduce.job.send-token-conf when starting containers in YARN Key: SPARK-37205 URL: https://issues.apache.org/jira/browse/SPARK-37205 Project: Spark Issue Type: New Feature Components: YARN Affects Versions: 3.3.0 Reporter: Chao Sun {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]) with which the RM is not required to statically have configs for all the secure HDFS clusters. Currently it only works for MRv2, but it'd be nice if Spark could also use this feature. I think we only need to pass the config to {{LaunchContainerContext}} before invoking {{NMClient.startContainer}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN
[ https://issues.apache.org/jira/browse/SPARK-37205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37205: - Description: {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]) with which the RM is not required to statically have configs for all the secure HDFS clusters. Currently it only works for MRv2, but it'd be nice if Spark could also use this feature. I think we only need to pass the config to {{LaunchContainerContext}} in {{Client.createContainerLaunchContext}}. (was: {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]) with which the RM is not required to statically have configs for all the secure HDFS clusters. Currently it only works for MRv2, but it'd be nice if Spark could also use this feature. I think we only need to pass the config to {{LaunchContainerContext}} before invoking {{NMClient.startContainer}}.) > Support mapreduce.job.send-token-conf when starting containers in YARN > -- > > Key: SPARK-37205 > URL: https://issues.apache.org/jira/browse/SPARK-37205 > Project: Spark > Issue Type: New Feature > Components: YARN >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see > [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]) with which the RM is > not required to statically have configs for all the secure HDFS clusters. > Currently it only works for MRv2, but it'd be nice if Spark could also use this > feature. I think we only need to pass the config to > {{LaunchContainerContext}} in {{Client.createContainerLaunchContext}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
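A rough sketch of what this could look like on the Spark side. It assumes the Hadoop version in use provides the YARN-5910 API ({{ContainerLaunchContext#setTokensConf}}); the helper and the omitted filtering logic are illustrative, not the actual patch.
{code:scala}
import java.nio.ByteBuffer

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.DataOutputBuffer
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext

// Serialize the subset of the Hadoop configuration that should travel with the
// delegation tokens and attach it to the AM container launch context, so the RM
// can renew tokens for secure clusters it has no static configuration for.
def attachTokenConf(clc: ContainerLaunchContext, hadoopConf: Configuration): Unit = {
  val tokenConf = new Configuration(false)
  // Copy into tokenConf only the entries selected by mapreduce.job.send-token-conf;
  // the actual filtering logic is omitted in this sketch.
  val dob = new DataOutputBuffer()
  tokenConf.write(dob)
  clc.setTokensConf(ByteBuffer.wrap(dob.getData, 0, dob.getLength))
}
{code}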
[jira] [Resolved] (SPARK-37218) Parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark
[ https://issues.apache.org/jira/browse/SPARK-37218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37218. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34496 [https://github.com/apache/spark/pull/34496] > Parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark > -- > > Key: SPARK-37218 > URL: https://issues.apache.org/jira/browse/SPARK-37218 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Minor > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37218) Parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark
[ https://issues.apache.org/jira/browse/SPARK-37218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17439554#comment-17439554 ] Chao Sun commented on SPARK-37218: -- [~dongjoon] please assign this to yourself - somehow I can't do it. > Parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark > -- > > Key: SPARK-37218 > URL: https://issues.apache.org/jira/browse/SPARK-37218 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Minor > Fix For: 3.2.1, 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37220) Do not split input file for Parquet reader with aggregate push down
[ https://issues.apache.org/jira/browse/SPARK-37220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37220. -- Fix Version/s: 3.3.0 Resolution: Fixed > Do not split input file for Parquet reader with aggregate push down > --- > > Key: SPARK-37220 > URL: https://issues.apache.org/jira/browse/SPARK-37220 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Priority: Minor > Fix For: 3.3.0 > > > As a followup of > [https://github.com/apache/spark/pull/34298/files#r734795801,] Similar to ORC > aggregate push down, we can disallow split input files for Parquet reader as > well. See original comment for motivation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37220) Do not split input file for Parquet reader with aggregate push down
[ https://issues.apache.org/jira/browse/SPARK-37220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440042#comment-17440042 ] Chao Sun commented on SPARK-37220: -- Thanks [~hyukjin.kwon]! > Do not split input file for Parquet reader with aggregate push down > --- > > Key: SPARK-37220 > URL: https://issues.apache.org/jira/browse/SPARK-37220 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Minor > Fix For: 3.3.0 > > > As a followup of > [https://github.com/apache/spark/pull/34298/files#r734795801,] Similar to ORC > aggregate push down, we can disallow split input files for Parquet reader as > well. See original comment for motivation. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36998) Handle concurrent eviction of same application in SHS
[ https://issues.apache.org/jira/browse/SPARK-36998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-36998: Assignee: Thejdeep Gudivada (was: Thejdeep) > Handle concurrent eviction of same application in SHS > - > > Key: SPARK-36998 > URL: https://issues.apache.org/jira/browse/SPARK-36998 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Thejdeep Gudivada >Assignee: Thejdeep Gudivada >Priority: Minor > Fix For: 3.2.1, 3.3.0 > > > SHS throws this exception when trying to make room for parsing of a log file. > Reason for this is - there is a race condition to make space for processing > of two log files and the deleteDirectory method is overlapping. > {code:java} > 21/10/13 09:13:54 INFO HistoryServerDiskManager: Lease of 49.0 KiB may cause > usage to exceed max (101.7 GiB > 100.0 GiB) 21/10/13 09:13:54 WARN > HttpChannel: handleException > /api/v1/applications/application_1632281309592_2767775/1/jobs > java.io.IOException : Unable to delete directory > /grid/spark/sparkhistory-leveldb/apps/application_1631288241341_3657151_1.ldb. > 21/10/13 09:13:54 WARN HttpChannelState: unhandled due to prior sendError > javax.servlet.ServletException: > org.glassfish.jersey.server.ContainerException: java.io.IOException: Unable > to delete directory /grid > /spark/sparkhistory-leveldb/apps/application_1631288241341_3657151_1.ldb. at > org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:410) > at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:366) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:319) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205) > at > org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:791) > at > org.sparkproject.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1626) > at > org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) > at > org.sparkproject.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193) > at > org.sparkproject.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601) > at > org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233) > at > org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188) > at > org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186) > at > org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350) > at > org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:763) > at > org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:234) > at > org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) > at org.sparkproject.jetty.server.Server.handle(Server.java:516) at > org.sparkproject.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388) > at org.sparkproject.jetty.server.HttpChannel.dispatch(HttpChannel.java:633) > at 
org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:380) at > org.sparkproject.jetty.server.HttpConnection.onFillable(HttpConnection.java:279) > at > org.sparkproject.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) > at org.sparkproject.jetty.io.FillInterest.fillable(FillInterest.java:105) at > org.sparkproject.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) at > org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336) > at > org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36998) Handle concurrent eviction of same application in SHS
[ https://issues.apache.org/jira/browse/SPARK-36998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440066#comment-17440066 ] Chao Sun commented on SPARK-36998: -- Fixed > Handle concurrent eviction of same application in SHS > - > > Key: SPARK-36998 > URL: https://issues.apache.org/jira/browse/SPARK-36998 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Thejdeep Gudivada >Assignee: Thejdeep Gudivada >Priority: Minor > Fix For: 3.2.1, 3.3.0 > > > SHS throws this exception when trying to make room for parsing of a log file. > Reason for this is - there is a race condition to make space for processing > of two log files and the deleteDirectory method is overlapping. > {code:java} > 21/10/13 09:13:54 INFO HistoryServerDiskManager: Lease of 49.0 KiB may cause > usage to exceed max (101.7 GiB > 100.0 GiB) 21/10/13 09:13:54 WARN > HttpChannel: handleException > /api/v1/applications/application_1632281309592_2767775/1/jobs > java.io.IOException : Unable to delete directory > /grid/spark/sparkhistory-leveldb/apps/application_1631288241341_3657151_1.ldb. > 21/10/13 09:13:54 WARN HttpChannelState: unhandled due to prior sendError > javax.servlet.ServletException: > org.glassfish.jersey.server.ContainerException: java.io.IOException: Unable > to delete directory /grid > /spark/sparkhistory-leveldb/apps/application_1631288241341_3657151_1.ldb. at > org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:410) > at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:366) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:319) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205) > at > org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:791) > at > org.sparkproject.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1626) > at > org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) > at > org.sparkproject.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193) > at > org.sparkproject.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601) > at > org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233) > at > org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188) > at > org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186) > at > org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350) > at > org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:763) > at > org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:234) > at > org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) > at org.sparkproject.jetty.server.Server.handle(Server.java:516) at > org.sparkproject.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388) > at 
org.sparkproject.jetty.server.HttpChannel.dispatch(HttpChannel.java:633) > at org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:380) at > org.sparkproject.jetty.server.HttpConnection.onFillable(HttpConnection.java:279) > at > org.sparkproject.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) > at org.sparkproject.jetty.io.FillInterest.fillable(FillInterest.java:105) at > org.sparkproject.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) at > org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336) > at > org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35437) Use expressions to filter Hive partitions at client side
[ https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-35437: Assignee: dzcxzl > Use expressions to filter Hive partitions at client side > > > Key: SPARK-35437 > URL: https://issues.apache.org/jira/browse/SPARK-35437 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > Fix For: 3.3.0 > > > When we have a table with a lot of partitions and there is no way to filter > it on the MetaStore Server, we will get all the partition details and filter > it on the client side. This is slow and puts a lot of pressure on the > MetaStore Server. > We can first pull all the partition names, filter by expressions, and then > obtain detailed information about the corresponding partitions from the > MetaStore Server. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35437) Use expressions to filter Hive partitions at client side
[ https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-35437. -- Resolution: Fixed Issue resolved by pull request 34431 [https://github.com/apache/spark/pull/34431] > Use expressions to filter Hive partitions at client side > > > Key: SPARK-35437 > URL: https://issues.apache.org/jira/browse/SPARK-35437 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > Fix For: 3.3.0 > > > When we have a table with a lot of partitions and there is no way to filter > it on the MetaStore Server, we will get all the partition details and filter > it on the client side. This is slow and puts a lot of pressure on the > MetaStore Server. > We can first pull all the partition names, filter by expressions, and then > obtain detailed information about the corresponding partitions from the > MetaStore Server. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35437) Use expressions to filter Hive partitions at client side
[ https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35437: - Priority: Major (was: Minor) > Use expressions to filter Hive partitions at client side > > > Key: SPARK-35437 > URL: https://issues.apache.org/jira/browse/SPARK-35437 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Major > Fix For: 3.3.0 > > > When we have a table with a lot of partitions and there is no way to filter > it on the MetaStore Server, we will get all the partition details and filter > it on the client side. This is slow and puts a lot of pressure on the > MetaStore Server. > We can first pull all the partition names, filter by expressions, and then > obtain detailed information about the corresponding partitions from the > MetaStore Server. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37239) Avoid unnecessary `setReplication` in Yarn mode
[ https://issues.apache.org/jira/browse/SPARK-37239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37239: Assignee: Yang Jie > Avoid unnecessary `setReplication` in Yarn mode > --- > > Key: SPARK-37239 > URL: https://issues.apache.org/jira/browse/SPARK-37239 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.1.2 >Reporter: wang-zhun >Assignee: Yang Jie >Priority: Major > > We found a large number of replication logs in hdfs server > ``` > 2021-11-04,17:22:13,065 INFO > org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains > unchanged at 3 for > xxx/.sparkStaging/application_1635470728320_1144379/__spark_libs__303253482044663796.zip > 2021-11-04,17:22:13,069 INFO > org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains > unchanged at 3 for > xxx/.sparkStaging/application_1635470728320_1144383/__spark_libs__4747402134564993861.zip > 2021-11-04,17:22:13,070 INFO > org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains > unchanged at 3 for > xxx/.sparkStaging/application_1635470728320_1144373/__spark_libs__4377509773730188331.zip > ``` > https://github.com/apache/hadoop/blob/6f7b965808f71f44e2617c50d366a6375fdfbbfa/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java#L2439 > > `setReplication` needs to acquire write lock, we should reduce this > unnecessary operation -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37239) Avoid unnecessary `setReplication` in Yarn mode
[ https://issues.apache.org/jira/browse/SPARK-37239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37239. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34520 [https://github.com/apache/spark/pull/34520] > Avoid unnecessary `setReplication` in Yarn mode > --- > > Key: SPARK-37239 > URL: https://issues.apache.org/jira/browse/SPARK-37239 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.1.2 >Reporter: wang-zhun >Assignee: Yang Jie >Priority: Major > Fix For: 3.3.0 > > > We found a large number of replication logs in hdfs server > ``` > 2021-11-04,17:22:13,065 INFO > org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains > unchanged at 3 for > xxx/.sparkStaging/application_1635470728320_1144379/__spark_libs__303253482044663796.zip > 2021-11-04,17:22:13,069 INFO > org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains > unchanged at 3 for > xxx/.sparkStaging/application_1635470728320_1144383/__spark_libs__4747402134564993861.zip > 2021-11-04,17:22:13,070 INFO > org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains > unchanged at 3 for > xxx/.sparkStaging/application_1635470728320_1144373/__spark_libs__4377509773730188331.zip > ``` > https://github.com/apache/hadoop/blob/6f7b965808f71f44e2617c50d366a6375fdfbbfa/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java#L2439 > > `setReplication` needs to acquire write lock, we should reduce this > unnecessary operation -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
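The direction of the fix can be illustrated with a small sketch (a hypothetical helper, not the actual patch): only call {{setReplication}} when the target replication factor actually differs from the file's current one, so the NameNode does not take a write lock for a no-op.
{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Skip the RPC (and the NameNode write lock behind it) when the replication
// factor is already what we want.
def setReplicationIfNeeded(fs: FileSystem, path: Path, replication: Short): Unit = {
  if (fs.getFileStatus(path).getReplication != replication) {
    fs.setReplication(path, replication)
  }
}
{code}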
[jira] [Created] (SPARK-37342) Upgrade Apache Arrow to 6.0.0
Chao Sun created SPARK-37342: Summary: Upgrade Apache Arrow to 6.0.0 Key: SPARK-37342 URL: https://issues.apache.org/jira/browse/SPARK-37342 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0 Reporter: Chao Sun Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last month. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37342) Upgrade Apache Arrow to 6.0.0
[ https://issues.apache.org/jira/browse/SPARK-37342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37342: - Component/s: Build (was: Spark Core) > Upgrade Apache Arrow to 6.0.0 > - > > Key: SPARK-37342 > URL: https://issues.apache.org/jira/browse/SPARK-37342 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last > month. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37166) SPIP: Storage Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37166. -- Fix Version/s: 3.3.0 Assignee: Chao Sun Resolution: Fixed > SPIP: Storage Partitioned Join > -- > > Key: SPARK-37166 > URL: https://issues.apache.org/jira/browse/SPARK-37166 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0 > > > This JIRA tracks the SPIP for storage partitioned join. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37375) Umbrella: Storage Partitioned Join
Chao Sun created SPARK-37375: Summary: Umbrella: Storage Partitioned Join Key: SPARK-37375 URL: https://issues.apache.org/jira/browse/SPARK-37375 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun This umbrella JIRA tracks the progress of implementing Storage Partitioned Join feature for Spark. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37166) SPIP: Storage Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37166: - Parent: SPARK-37375 Issue Type: Sub-task (was: New Feature) > SPIP: Storage Partitioned Join > -- > > Key: SPARK-37166 > URL: https://issues.apache.org/jira/browse/SPARK-37166 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0 > > > This JIRA tracks the SPIP for storage partitioned join. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37376) Introduce a new DataSource V2 interface HasPartitionKey
Chao Sun created SPARK-37376: Summary: Introduce a new DataSource V2 interface HasPartitionKey Key: SPARK-37376 URL: https://issues.apache.org/jira/browse/SPARK-37376 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun One of the prerequisites for the feature is to allow V2 input partitions to report their partition values to Spark, which can use them to check whether both sides of a join are co-partitioned, and also optionally group input partitions together. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
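A sketch of the shape such an interface could take (the trait below follows the summary's name but is still an assumption of this note, not the final API):
{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.InputPartition

// An input partition that reports the partition value it covers, so Spark can
// compare both sides of a join for co-partitioning and group partitions together.
trait HasPartitionKey extends InputPartition {
  def partitionKey(): InternalRow
}

// Example implementation for a hypothetical file-based source.
case class MyInputPartition(key: InternalRow, files: Seq[String]) extends HasPartitionKey {
  override def partitionKey(): InternalRow = key
}
{code}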
[jira] [Created] (SPARK-37377) Refactor V2 Partitioning interface and remove deprecated usage of Distribution
Chao Sun created SPARK-37377: Summary: Refactor V2 Partitioning interface and remove deprecated usage of Distribution Key: SPARK-37377 URL: https://issues.apache.org/jira/browse/SPARK-37377 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun Currently {{Partitioning}} is defined as follows: {code:java} @Evolving public interface Partitioning { int numPartitions(); boolean satisfy(Distribution distribution); } {code} There are two issues with the interface: 1) it uses a deprecated {{Distribution}} interface, and should switch to {{org.apache.spark.sql.connector.distributions.Distribution}}. 2) currently there is no way to use this in joins, where we want to compare reported partitionings from both sides and decide whether they are "compatible" (and thus allow Spark to eliminate the shuffle). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
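For the second point, one possible direction is to let the source report the expressions it is clustered on, so Spark can compare the partitionings reported by both sides of a join itself. The sketch below is purely illustrative; the name and shape are assumptions, not the final design.
{code:scala}
import org.apache.spark.sql.connector.expressions.Expression

// A partitioning that exposes the clustering expressions and the number of
// partitions, so the planner can decide whether two sides are "compatible"
// without going through the deprecated Distribution check.
trait KeyGroupedLikePartitioning {
  def numPartitions(): Int
  def clustering(): Array[Expression]
}
{code}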
[jira] [Created] (SPARK-37378) Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog
Chao Sun created SPARK-37378: Summary: Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog Key: SPARK-37378 URL: https://issues.apache.org/jira/browse/SPARK-37378 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun We need to add logic to convert a V2 {{Transform}} expression into its catalyst expression counterpart, and also load its function definition from the V2 FunctionCatalog provided. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
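A minimal sketch of the function-loading half (the helper is hypothetical; it only shows loading and binding the function that backs a transform such as {{bucket}} or {{years}}, not the full conversion into a catalyst expression):
{code:scala}
import org.apache.spark.sql.connector.catalog.{FunctionCatalog, Identifier}
import org.apache.spark.sql.connector.catalog.functions.BoundFunction
import org.apache.spark.sql.types.StructType

// Look up the function implementing a V2 transform by name in the provided
// FunctionCatalog and bind it to the input type of the transform's argument(s).
def loadTransformFunction(
    catalog: FunctionCatalog,
    name: String,
    inputType: StructType): BoundFunction = {
  val unbound = catalog.loadFunction(Identifier.of(Array.empty[String], name))
  unbound.bind(inputType)
}
{code}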
[jira] [Resolved] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans
[ https://issues.apache.org/jira/browse/SPARK-35867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-35867. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34611 [https://github.com/apache/spark/pull/34611] > Enable vectorized read for VectorizedPlainValuesReader.readBooleans > --- > > Key: SPARK-35867 > URL: https://issues.apache.org/jira/browse/SPARK-35867 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Kazuyuki Tanimura >Priority: Minor > Fix For: 3.3.0 > > > Currently we decode PLAIN-encoded booleans as follows: > {code:java} > public final void readBooleans(int total, WritableColumnVector c, int > rowId) { > // TODO: properly vectorize this > for (int i = 0; i < total; i++) { > c.putBoolean(rowId + i, readBoolean()); > } > } > {code} > Ideally we should vectorize this. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans
[ https://issues.apache.org/jira/browse/SPARK-35867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-35867: Assignee: Kazuyuki Tanimura > Enable vectorized read for VectorizedPlainValuesReader.readBooleans > --- > > Key: SPARK-35867 > URL: https://issues.apache.org/jira/browse/SPARK-35867 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Kazuyuki Tanimura >Priority: Minor > > Currently we decode PLAIN-encoded booleans as follows: > {code:java} > public final void readBooleans(int total, WritableColumnVector c, int > rowId) { > // TODO: properly vectorize this > for (int i = 0; i < total; i++) { > c.putBoolean(rowId + i, readBoolean()); > } > } > {code} > Ideally we should vectorize this. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36529) Decouple CPU with IO work in vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-36529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36529: - Attachment: image.png > Decouple CPU with IO work in vectorized Parquet reader > -- > > Key: SPARK-36529 > URL: https://issues.apache.org/jira/browse/SPARK-36529 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Currently it seems the vectorized Parquet reader does almost everything in a > sequential manner: > 1. read the row group using file system API (perhaps from remote storage like > S3) > 2. allocate buffers and store those row group bytes into them > 3. decompress the data pages > 4. in Spark, decode all the read columns one by one > 5. read the next row group and repeat from 1. > A lot of improvements can be done to decouple the IO- and CPU-intensive work. > In addition, we could parallelize the row group loading and column decoding, > and utilize all the cores available for a Spark task. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
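One way to picture the decoupling is a producer/consumer pipeline in which an IO thread prefetches row groups while the task thread decompresses and decodes them. The sketch below only illustrates that shape; the readRowGroup and decode methods are placeholders, and this is not the design proposed in the issue.

{code:java}
// Hedged sketch: overlap row-group IO with decoding using one prefetch thread and a
// bounded queue. readRowGroup/decode stand in for the IO and CPU steps listed above.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class PrefetchingReaderSketch {
  private static final byte[] EOF = new byte[0];
  private final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(2);

  void run(int numRowGroups) throws InterruptedException {
    Thread io = new Thread(() -> {
      try {
        for (int i = 0; i < numRowGroups; i++) {
          queue.put(readRowGroup(i));  // IO: fetch the next row group's compressed bytes
        }
        queue.put(EOF);                // sentinel: no more row groups
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    io.start();
    byte[] bytes;
    while ((bytes = queue.take()) != EOF) {
      decode(bytes);                   // CPU: decompress pages and decode columns
    }
    io.join();
  }

  private byte[] readRowGroup(int i) { return new byte[0]; }  // placeholder IO step
  private void decode(byte[] bytes) { }                        // placeholder CPU step
}
{code}

A further step, also mentioned in the description, would be to use more than one decoding thread so that column decoding itself is spread across the cores available to the task.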
[jira] [Updated] (SPARK-36529) Decouple CPU with IO work in vectorized Parquet reader
[ https://issues.apache.org/jira/browse/SPARK-36529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36529: - Attachment: (was: image.png) > Decouple CPU with IO work in vectorized Parquet reader > -- > > Key: SPARK-36529 > URL: https://issues.apache.org/jira/browse/SPARK-36529 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Currently it seems the vectorized Parquet reader does almost everything in a > sequential manner: > 1. read the row group using file system API (perhaps from remote storage like > S3) > 2. allocate buffers and store those row group bytes into them > 3. decompress the data pages > 4. in Spark, decode all the read columns one by one > 5. read the next row group and repeat from 1. > A lot of improvements can be done to decouple the IO- and CPU-intensive work. > In addition, we could parallelize the row group loading and column decoding, > and utilize all the cores available for a Spark task. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37445) Update hadoop-profile
[ https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37445. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34715 [https://github.com/apache/spark/pull/34715] > Update hadoop-profile > - > > Key: SPARK-37445 > URL: https://issues.apache.org/jira/browse/SPARK-37445 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0 > > > The current Hadoop profile is hadoop-3.2; update it to hadoop-3.3. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37445) Update hadoop-profile
[ https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37445: Assignee: angerszhu > Update hadoop-profile > - > > Key: SPARK-37445 > URL: https://issues.apache.org/jira/browse/SPARK-37445 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > The current Hadoop profile is hadoop-3.2; update it to hadoop-3.3. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN
[ https://issues.apache.org/jira/browse/SPARK-37205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37205: Assignee: Chao Sun > Support mapreduce.job.send-token-conf when starting containers in YARN > -- > > Key: SPARK-37205 > URL: https://issues.apache.org/jira/browse/SPARK-37205 > Project: Spark > Issue Type: New Feature > Components: YARN >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see > [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]), with which the RM is > not required to statically hold configuration for all the secure HDFS clusters. > Currently it only works for MRv2, but it'd be nice if Spark could also use this > feature. I think we only need to pass the config to the > {{ContainerLaunchContext}} in {{Client.createContainerLaunchContext}}. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN
[ https://issues.apache.org/jira/browse/SPARK-37205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37205. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34635 [https://github.com/apache/spark/pull/34635] > Support mapreduce.job.send-token-conf when starting containers in YARN > -- > > Key: SPARK-37205 > URL: https://issues.apache.org/jira/browse/SPARK-37205 > Project: Spark > Issue Type: New Feature > Components: YARN >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0 > > > {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see > [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]), with which the RM is > not required to statically hold configuration for all the secure HDFS clusters. > Currently it only works for MRv2, but it'd be nice if Spark could also use this > feature. I think we only need to pass the config to the > {{ContainerLaunchContext}} in {{Client.createContainerLaunchContext}}. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37561) Avoid loading all functions when obtaining hive's DelegationToken
[ https://issues.apache.org/jira/browse/SPARK-37561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37561. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34822 [https://github.com/apache/spark/pull/34822] > Avoid loading all functions when obtaining hive's DelegationToken > - > > Key: SPARK-37561 > URL: https://issues.apache.org/jira/browse/SPARK-37561 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.3.0 > > Attachments: getDelegationToken_load_functions.png > > > At present, when obtaining Hive's delegation token, all functions are > loaded. > This is unnecessary: loading the functions takes time, and it also > increases the burden on the Hive metastore. > > !getDelegationToken_load_functions.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37561) Avoid loading all functions when obtaining hive's DelegationToken
[ https://issues.apache.org/jira/browse/SPARK-37561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37561: Assignee: dzcxzl > Avoid loading all functions when obtaining hive's DelegationToken > - > > Key: SPARK-37561 > URL: https://issues.apache.org/jira/browse/SPARK-37561 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Attachments: getDelegationToken_load_functions.png > > > At present, when obtaining Hive's delegation token, all functions are > loaded. > This is unnecessary: loading the functions takes time, and it also > increases the burden on the Hive metastore. > > !getDelegationToken_load_functions.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37600) Upgrade to Hadoop 3.3.2
Chao Sun created SPARK-37600: Summary: Upgrade to Hadoop 3.3.2 Key: SPARK-37600 URL: https://issues.apache.org/jira/browse/SPARK-37600 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.3.0 Reporter: Chao Sun Upgrade Spark to use Hadoop 3.3.2 once it's released. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37573) IsolatedClient fallbackVersion should be build in version, not always 2.7.4
[ https://issues.apache.org/jira/browse/SPARK-37573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37573: Assignee: angerszhu > IsolatedClient fallbackVersion should be build in version, not always 2.7.4 > > > Key: SPARK-37573 > URL: https://issues.apache.org/jira/browse/SPARK-37573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > Hadoop 3 fallback to 2.7.4 cause error > {code} > [info] org.apache.spark.sql.hive.client.VersionsSuite *** ABORTED *** (31 > seconds, 320 milliseconds) > [info] java.lang.ClassFormatError: Truncated class file > [info] at java.lang.ClassLoader.defineClass1(Native Method) > [info] at java.lang.ClassLoader.defineClass(ClassLoader.java:756) > [info] at java.lang.ClassLoader.defineClass(ClassLoader.java:635) > [info] at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:266) > [info] at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:258) > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:405) > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > [info] at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:313) > [info] at > org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:50) > [info] at > org.apache.spark.sql.hive.client.VersionsSuite.$anonfun$new$2(VersionsSuite.scala:82) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > [info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > [info] at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62) > [info] at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) > [info] at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) > [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) > [info] at > 
org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563) > [info] at org.scalatest.Suite.run(Suite.scala:1112) > [info] at org.scalatest.Suite.run$(Suite.scala:1094) > [ > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37573) IsolatedClient fallbackVersion should be build in version, not always 2.7.4
[ https://issues.apache.org/jira/browse/SPARK-37573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37573. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34830 [https://github.com/apache/spark/pull/34830] > IsolatedClient fallbackVersion should be build in version, not always 2.7.4 > > > Key: SPARK-37573 > URL: https://issues.apache.org/jira/browse/SPARK-37573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0 > > > Hadoop 3 fallback to 2.7.4 cause error > {code} > [info] org.apache.spark.sql.hive.client.VersionsSuite *** ABORTED *** (31 > seconds, 320 milliseconds) > [info] java.lang.ClassFormatError: Truncated class file > [info] at java.lang.ClassLoader.defineClass1(Native Method) > [info] at java.lang.ClassLoader.defineClass(ClassLoader.java:756) > [info] at java.lang.ClassLoader.defineClass(ClassLoader.java:635) > [info] at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:266) > [info] at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:258) > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:405) > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > [info] at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:313) > [info] at > org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:50) > [info] at > org.apache.spark.sql.hive.client.VersionsSuite.$anonfun$new$2(VersionsSuite.scala:82) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > [info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > [info] at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62) > [info] at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) > [info] at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) > [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) > [info] at > 
org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) > [info] at > org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563) > [info] at org.scalatest.Suite.run(Suite.scala:1112) > [info] at org.scalatest.Suite.run$(Suite.scala:1094) > [ > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37217) The number of dynamic partitions should early check when writing to external tables
[ https://issues.apache.org/jira/browse/SPARK-37217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37217. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34493 [https://github.com/apache/spark/pull/34493] > The number of dynamic partitions should early check when writing to external > tables > --- > > Key: SPARK-37217 > URL: https://issues.apache.org/jira/browse/SPARK-37217 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.3.0 > > > [SPARK-29295|https://issues.apache.org/jira/browse/SPARK-29295] introduces a > mechanism in which writes to external tables use dynamic partitioning, and > the data in the target partitions is deleted first. > Assuming that 1001 partitions are written, the data of those 1001 partitions will > be deleted first, but because hive.exec.max.dynamic.partitions is 1000 by > default, loadDynamicPartitions will then fail, even though the data of the 1001 > partitions has already been deleted. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
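The gist of the fix can be sketched as a pre-flight check: count the dynamic partitions and compare against hive.exec.max.dynamic.partitions before any target partition data is deleted, so a write that would exceed the limit fails without destroying existing data. The names below are placeholders for illustration, not Spark's actual code.

{code:java}
// Hedged sketch of the early check: fail before deleting anything if the write would
// exceed hive.exec.max.dynamic.partitions. writtenPartitions/maxDynamicPartitions are
// assumed inputs for illustration.
import java.util.Set;

final class DynamicPartitionCheck {
  static void checkBeforeDelete(Set<String> writtenPartitions, int maxDynamicPartitions) {
    if (writtenPartitions.size() > maxDynamicPartitions) {
      throw new IllegalArgumentException(
          "Number of dynamic partitions created is " + writtenPartitions.size()
              + ", which is more than hive.exec.max.dynamic.partitions (= "
              + maxDynamicPartitions + "). Aborting before any partition data is deleted.");
    }
  }
}
{code}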
[jira] [Assigned] (SPARK-37217) The number of dynamic partitions should early check when writing to external tables
[ https://issues.apache.org/jira/browse/SPARK-37217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37217: Assignee: dzcxzl > The number of dynamic partitions should early check when writing to external > tables > --- > > Key: SPARK-37217 > URL: https://issues.apache.org/jira/browse/SPARK-37217 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > > [SPARK-29295|https://issues.apache.org/jira/browse/SPARK-29295] introduces a > mechanism in which writes to external tables use dynamic partitioning, and > the data in the target partitions is deleted first. > Assuming that 1001 partitions are written, the data of those 1001 partitions will > be deleted first, but because hive.exec.max.dynamic.partitions is 1000 by > default, loadDynamicPartitions will then fail, even though the data of the 1001 > partitions has already been deleted. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37481) Disappearance of skipped stages mislead the bug hunting
[ https://issues.apache.org/jira/browse/SPARK-37481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37481: - Fix Version/s: 3.2.1 (was: 3.2.0) > Disappearance of skipped stages mislead the bug hunting > > > Key: SPARK-37481 > URL: https://issues.apache.org/jira/browse/SPARK-37481 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.2.1, 3.3.0 > > > # > ## With FetchFailedException and Map Stage Retries > When rerunning spark-sql shell with the original SQL in > [https://gist.github.com/yaooqinn/6acb7b74b343a6a6dffe8401f6b7b45c#gistcomment-3977315] > !https://user-images.githubusercontent.com/8326978/143821530-ff498caa-abce-483d-a24b-315aacf7e0a0.png! > 1. stage 3 threw FetchFailedException and caused itself and its parent > stage(stage 2) to retry > 2. stage 2 was skipped before but its attemptId was still 0, so when its > retry happened it got removed from `Skipped Stages` > The DAG of Job 2 doesn't show that stage 2 is skipped anymore. > !https://user-images.githubusercontent.com/8326978/143824666-6390b64a-a45b-4bc8-b05d-c5abbb28cdef.png! > Besides, a retried stage usually has a subset of tasks from the original > stage. If we mark it as an original one, the metrics might lead us into > pitfalls. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37217) The number of dynamic partitions should early check when writing to external tables
[ https://issues.apache.org/jira/browse/SPARK-37217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37217: - Fix Version/s: 3.2.1 > The number of dynamic partitions should early check when writing to external > tables > --- > > Key: SPARK-37217 > URL: https://issues.apache.org/jira/browse/SPARK-37217 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.2.1, 3.3.0 > > > [SPARK-29295|https://issues.apache.org/jira/browse/SPARK-29295] introduces a > mechanism in which writes to external tables use dynamic partitioning, and > the data in the target partitions is deleted first. > Assuming that 1001 partitions are written, the data of those 1001 partitions will > be deleted first, but because hive.exec.max.dynamic.partitions is 1000 by > default, loadDynamicPartitions will then fail, even though the data of the 1001 > partitions has already been deleted. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled
[ https://issues.apache.org/jira/browse/SPARK-37633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37633. -- Fix Version/s: 3.3.0 3.2.1 Resolution: Fixed Issue resolved by pull request 34888 [https://github.com/apache/spark/pull/34888] > Unwrap cast should skip if downcast failed with ansi enabled > > > Key: SPARK-37633 > URL: https://issues.apache.org/jira/browse/SPARK-37633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > Fix For: 3.3.0, 3.2.1 > > > Currently, unwrap cast throws an ArithmeticException if the downcast fails with > ANSI enabled. Since UnwrapCastInBinaryComparison is an optimizer rule, we > should always skip on failure regardless of the ANSI config. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled
[ https://issues.apache.org/jira/browse/SPARK-37633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37633: Assignee: Manu Zhang > Unwrap cast should skip if downcast failed with ansi enabled > > > Key: SPARK-37633 > URL: https://issues.apache.org/jira/browse/SPARK-37633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > > Currently, unwrap cast throws an ArithmeticException if the downcast fails with > ANSI enabled. Since UnwrapCastInBinaryComparison is an optimizer rule, we > should always skip on failure regardless of the ANSI config. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled
[ https://issues.apache.org/jira/browse/SPARK-37633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37633: - Affects Version/s: (was: 3.0.3) > Unwrap cast should skip if downcast failed with ansi enabled > > > Key: SPARK-37633 > URL: https://issues.apache.org/jira/browse/SPARK-37633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > Fix For: 3.2.1, 3.3.0 > > > Currently, unwrap cast throws an ArithmeticException if the downcast fails with > ANSI enabled. Since UnwrapCastInBinaryComparison is an optimizer rule, we > should always skip on failure regardless of the ANSI config. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37974) Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
[ https://issues.apache.org/jira/browse/SPARK-37974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37974. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 35262 [https://github.com/apache/spark/pull/35262] > Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings > for Parquet V2 support > --- > > Key: SPARK-37974 > URL: https://issues.apache.org/jira/browse/SPARK-37974 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Parth Chandra >Assignee: Parth Chandra >Priority: Major > Fix For: 3.4.0 > > > SPARK-36879 implements the DELTA_BINARY_PACKED encoding which is for integer > values, but does not implement the DELTA_BYTE_ARRAY encoding which is for > string values. DELTA_BYTE_ARRAY encoding also requires the > DELTA_LENGTH_BYTE_ARRAY encoding. Both these encodings need vectorized > versions as the current implementation simply calls the non-vectorized > Parquet library methods. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
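To make the encodings concrete, the sketch below shows what decoding DELTA_LENGTH_BYTE_ARRAY amounts to once the per-value lengths have been read: the lengths are stored first (with DELTA_BINARY_PACKED) and the value bytes are concatenated afterwards, so each value is simply the next lengths[i] bytes. The method assumes the lengths and the concatenated data have already been extracted from the page; it illustrates the format, not Spark's vectorized implementation.

{code:java}
// Hedged illustration of DELTA_LENGTH_BYTE_ARRAY: per-value lengths come first
// (already decoded into `lengths` here), followed by the concatenated value bytes.
import org.apache.spark.sql.execution.vectorized.WritableColumnVector;

class DeltaLengthByteArraySketch {
  void decode(int[] lengths, byte[] data, WritableColumnVector out, int rowId) {
    int offset = 0;
    for (int i = 0; i < lengths.length; i++) {
      // Each value is the next lengths[i] bytes of the concatenated data section.
      out.putByteArray(rowId + i, data, offset, lengths[i]);
      offset += lengths[i];
    }
  }
}
{code}

DELTA_BYTE_ARRAY builds on this by additionally storing, for each value, the length of the prefix it shares with the previous value, so a decoder reconstructs each value as the previous value's prefix plus the stored suffix.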
[jira] [Assigned] (SPARK-37974) Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
[ https://issues.apache.org/jira/browse/SPARK-37974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37974: Assignee: Parth Chandra > Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings > for Parquet V2 support > --- > > Key: SPARK-37974 > URL: https://issues.apache.org/jira/browse/SPARK-37974 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Parth Chandra >Assignee: Parth Chandra >Priority: Major > > SPARK-36879 implements the DELTA_BINARY_PACKED encoding which is for integer > values, but does not implement the DELTA_BYTE_ARRAY encoding which is for > string values. DELTA_BYTE_ARRAY encoding also requires the > DELTA_LENGTH_BYTE_ARRAY encoding. Both these encodings need vectorized > versions as the current implementation simply calls the non-vectorized > Parquet library methods. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37974) Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
[ https://issues.apache.org/jira/browse/SPARK-37974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37974: - Fix Version/s: 3.3.0 (was: 3.4.0) > Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings > for Parquet V2 support > --- > > Key: SPARK-37974 > URL: https://issues.apache.org/jira/browse/SPARK-37974 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Parth Chandra >Assignee: Parth Chandra >Priority: Major > Fix For: 3.3.0 > > > SPARK-36879 implements the DELTA_BINARY_PACKED encoding which is for integer > values, but does not implement the DELTA_BYTE_ARRAY encoding which is for > string values. DELTA_BYTE_ARRAY encoding also requires the > DELTA_LENGTH_BYTE_ARRAY encoding. Both these encodings need vectorized > versions as the current implementation simply calls the non-vectorized > Parquet library methods. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37377) Initial implementation of Storage-Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37377: - Summary: Initial implementation of Storage-Partitioned Join (was: Refactor V2 Partitioning interface and remove deprecated usage of Distribution) > Initial implementation of Storage-Partitioned Join > -- > > Key: SPARK-37377 > URL: https://issues.apache.org/jira/browse/SPARK-37377 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.4.0 > > > Currently {{Partitioning}} is defined as follows: > {code:java} > @Evolving > public interface Partitioning { > int numPartitions(); > boolean satisfy(Distribution distribution); > } > {code} > There are two issues with the interface: 1) it uses a deprecated > {{Distribution}} interface, and should switch to > {{org.apache.spark.sql.connector.distributions.Distribution}}. 2) currently > there is no way to use this in joins where we want to compare reported > partitionings from both sides and decide whether they are "compatible" (thus > allowing Spark to eliminate a shuffle). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37377) Initial implementation of Storage-Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37377: - Description: This Jira tracks the initial implementation of storage-partitioned join. (was: Currently {{Partitioning}} is defined as follows: {code:java} @Evolving public interface Partitioning { int numPartitions(); boolean satisfy(Distribution distribution); } {code} There are two issues with the interface: 1) it uses a deprecated {{Distribution}} interface, and should switch to {{org.apache.spark.sql.connector.distributions.Distribution}}. 2) currently there is no way to use this in joins where we want to compare reported partitionings from both sides and decide whether they are "compatible" (thus allowing Spark to eliminate a shuffle). ) > Initial implementation of Storage-Partitioned Join > -- > > Key: SPARK-37377 > URL: https://issues.apache.org/jira/browse/SPARK-37377 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.4.0 > > > This Jira tracks the initial implementation of storage-partitioned join. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37378) Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog
[ https://issues.apache.org/jira/browse/SPARK-37378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37378. -- Resolution: Duplicate This JIRA is covered as part of SPARK-37377 > Convert V2 Transform expressions into catalyst expressions and load their > associated functions from V2 FunctionCatalog > -- > > Key: SPARK-37378 > URL: https://issues.apache.org/jira/browse/SPARK-37378 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > We need to add logic to convert a V2 {{Transform}} expression into its > catalyst expression counterpart, and also load its function definition from > the V2 FunctionCatalog provided. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37378) Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog
[ https://issues.apache.org/jira/browse/SPARK-37378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37378: - Fix Version/s: 3.4.0 > Convert V2 Transform expressions into catalyst expressions and load their > associated functions from V2 FunctionCatalog > -- > > Key: SPARK-37378 > URL: https://issues.apache.org/jira/browse/SPARK-37378 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > Fix For: 3.4.0 > > > We need to add logic to convert a V2 {{Transform}} expression into its > catalyst expression counterpart, and also load its function definition from > the V2 FunctionCatalog provided. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34863) Support nested column in Spark Parquet vectorized readers
[ https://issues.apache.org/jira/browse/SPARK-34863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-34863: Assignee: Chao Sun (was: Apache Spark) > Support nested column in Spark Parquet vectorized readers > - > > Key: SPARK-34863 > URL: https://issues.apache.org/jira/browse/SPARK-34863 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Assignee: Chao Sun >Priority: Minor > Fix For: 3.3.0 > > > The task is to support nested column type in Spark Parquet vectorized reader. > Currently Parquet vectorized reader does not support nested column type > (struct, array and map). We implemented nested column vectorized reader for > FB-ORC in our internal fork of Spark. We are seeing performance improvement > compared to non-vectorized reader when reading nested columns. In addition, > this can also help improve the non-nested column performance when reading > non-nested and nested columns together in one query. > > Parquet: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L173] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38786) Test Bug in StatisticsSuite "change stats after add/drop partition command"
[ https://issues.apache.org/jira/browse/SPARK-38786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-38786: Assignee: Kazuyuki Tanimura > Test Bug in StatisticsSuite "change stats after add/drop partition command" > --- > > Key: SPARK-38786 > URL: https://issues.apache.org/jira/browse/SPARK-38786 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Minor > > [https://github.com/apache/spark/blob/cbffc12f90e45d33e651e38cf886d7ab4bcf96da/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala#L979] > It should be `partDir2` instead of `partDir1`. Looks like it is a copy paste > bug. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38786) Test Bug in StatisticsSuite "change stats after add/drop partition command"
[ https://issues.apache.org/jira/browse/SPARK-38786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-38786. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36075 [https://github.com/apache/spark/pull/36075] > Test Bug in StatisticsSuite "change stats after add/drop partition command" > --- > > Key: SPARK-38786 > URL: https://issues.apache.org/jira/browse/SPARK-38786 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Minor > Fix For: 3.4.0 > > > [https://github.com/apache/spark/blob/cbffc12f90e45d33e651e38cf886d7ab4bcf96da/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala#L979] > It should be `partDir2` instead of `partDir1`. Looks like it is a copy paste > bug. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38840) Enable spark.sql.parquet.enableNestedColumnVectorizedReader on master branch by default
Chao Sun created SPARK-38840: Summary: Enable spark.sql.parquet.enableNestedColumnVectorizedReader on master branch by default Key: SPARK-38840 URL: https://issues.apache.org/jira/browse/SPARK-38840 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Chao Sun We can enable {{spark.sql.parquet.enableNestedColumnVectorizedReader}} on master branch by default, to make sure it is covered by more tests. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
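For anyone trying the feature before the default changes, the flag can be turned on per session; the snippet below is standard SparkSession usage, with the input path being a placeholder.

{code:java}
// Example (assumed usage): enable the nested-column vectorized Parquet reader for one
// session while it is still off by default. The config key comes from this issue; the
// input path is a placeholder.
import org.apache.spark.sql.SparkSession;

public class EnableNestedVectorizedReader {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("nested-vectorized-reader-demo")
        .master("local[*]")
        .config("spark.sql.parquet.enableNestedColumnVectorizedReader", "true")
        .getOrCreate();
    spark.read().parquet("/path/to/nested.parquet").show();  // placeholder path
    spark.stop();
  }
}
{code}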
[jira] [Created] (SPARK-38891) Skipping allocating vector for repetition & definition levels when possible
Chao Sun created SPARK-38891: Summary: Skipping allocating vector for repetition & definition levels when possible Key: SPARK-38891 URL: https://issues.apache.org/jira/browse/SPARK-38891 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Chao Sun Currently the vectorized Parquet reader will allocate vectors for repetition and definition levels in all cases. However in certain cases (e.g., when reading primitive types) this is not necessary and should be avoided. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
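A hedged sketch of the idea: allocate the level vectors only when the Parquet column can actually produce repetition or definition levels, instead of unconditionally. The field names and the use of plain int arrays are illustrative; the real reader uses its own vector types.

{code:java}
// Hedged sketch: skip allocating repetition/definition level storage for columns that
// cannot need it (required, non-nested primitives). Plain int[] stands in for the
// reader's actual level vectors.
class LevelVectorAllocationSketch {
  int[] repetitionLevels;  // stays null for non-repeated (non-nested) columns
  int[] definitionLevels;  // stays null for required columns that can never be null

  void prepare(int maxRepetitionLevel, int maxDefinitionLevel, int batchSize) {
    if (maxRepetitionLevel > 0) {
      repetitionLevels = new int[batchSize];  // only nested (repeated) data produces these
    }
    if (maxDefinitionLevel > 0) {
      definitionLevels = new int[batchSize];  // only optional/nested data produces these
    }
    // A required primitive column at the top level skips both allocations entirely.
  }
}
{code}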
[jira] [Resolved] (SPARK-38573) Support Auto Partition Statistics Collection
[ https://issues.apache.org/jira/browse/SPARK-38573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-38573. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36067 [https://github.com/apache/spark/pull/36067] > Support Auto Partition Statistics Collection > > > Key: SPARK-38573 > URL: https://issues.apache.org/jira/browse/SPARK-38573 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Major > Fix For: 3.4.0 > > > Currently https://issues.apache.org/jira/browse/SPARK-21127 supports storing > the aggregated stats at the table level for partitioned tables with the config > spark.sql.statistics.size.autoUpdate.enabled. > Supporting partition-level stats is useful for knowing which partitions are > outliers (skewed partitions), and the query optimizer works better with > partition-level stats in the case of partition pruning. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38573) Support Auto Partition Statistics Collection
[ https://issues.apache.org/jira/browse/SPARK-38573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-38573: Assignee: Kazuyuki Tanimura > Support Auto Partition Statistics Collection > > > Key: SPARK-38573 > URL: https://issues.apache.org/jira/browse/SPARK-38573 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Major > > Currently https://issues.apache.org/jira/browse/SPARK-21127 supports storing > the aggregated stats at the table level for partitioned tables with the config > spark.sql.statistics.size.autoUpdate.enabled. > Supporting partition-level stats is useful for knowing which partitions are > outliers (skewed partitions), and the query optimizer works better with > partition-level stats in the case of partition pruning. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38891) Skipping allocating vector for repetition & definition levels when possible
[ https://issues.apache.org/jira/browse/SPARK-38891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-38891. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36202 [https://github.com/apache/spark/pull/36202] > Skipping allocating vector for repetition & definition levels when possible > --- > > Key: SPARK-38891 > URL: https://issues.apache.org/jira/browse/SPARK-38891 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.4.0 > > > Currently the vectorized Parquet reader will allocate vectors for repetition > and definition levels in all cases. However in certain cases (e.g., when > reading primitive types) this is not necessary and should be avoided. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38891) Skipping allocating vector for repetition & definition levels when possible
[ https://issues.apache.org/jira/browse/SPARK-38891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-38891: Assignee: Chao Sun > Skipping allocating vector for repetition & definition levels when possible > --- > > Key: SPARK-38891 > URL: https://issues.apache.org/jira/browse/SPARK-38891 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > Currently the vectorized Parquet reader will allocate vectors for repetition > and definition levels in all cases. However in certain cases (e.g., when > reading primitive types) this is not necessary and should be avoided. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org