[jira] [Commented] (SPARK-33507) Improve and fix cache behavior in v1 and v2

2021-01-19 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268063#comment-17268063
 ] 

Chao Sun commented on SPARK-33507:
--

[~aokolnychyi] could you elaborate on the question? Currently Spark doesn't yet 
support caching streaming tables.

> Improve and fix cache behavior in v1 and v2
> ---
>
> Key: SPARK-33507
> URL: https://issues.apache.org/jira/browse/SPARK-33507
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Critical
>
> This is an umbrella JIRA to track fixes & improvements for caching behavior 
> in Spark datasource v1 and v2, which includes:
>   - fix existing cache behavior in v1 and v2.
>   - fix inconsistent cache behavior between v1 and v2
>   - implement missing features in v2 to align with those in v1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34052) A cached view should become invalid after a table is dropped

2021-01-26 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272333#comment-17272333
 ] 

Chao Sun commented on SPARK-34052:
--

[~hyukjin.kwon] [~cloud_fan] do you think we should include this in 3.1.1? 
Since we've changed how temp views work in SPARK-33142, it may be better to add 
this too to keep the behavior consistent.

> A cached view should become invalid after a table is dropped
> 
>
> Key: SPARK-34052
> URL: https://issues.apache.org/jira/browse/SPARK-34052
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>
> It seems a view doesn't become invalid after a DSv2 table is dropped or 
> replaced. This is different from V1 and may cause correctness issues.
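For context, a minimal sketch of the scenario (not taken from the JIRA; it 
assumes a SparkSession {{spark}} with a DSv2 catalog registered under the 
placeholder name {{testcat}} and a placeholder provider {{foo}}):
{code:scala}
// Hypothetical reproduction sketch; `testcat` and `foo` are assumed names.
spark.sql("CREATE TABLE testcat.ns.t (id BIGINT) USING foo")
spark.sql("CREATE TEMPORARY VIEW v AS SELECT id FROM testcat.ns.t")
spark.sql("CACHE TABLE v")

spark.sql("DROP TABLE testcat.ns.t")

// Expected (V1-like) behavior: the cache entry for `v` is invalidated.
// Behavior described in this issue: the cached data for `v` stays valid.
println(spark.catalog.isCached("v"))
{code}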



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27589) Spark file source V2

2021-01-27 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273161#comment-17273161
 ] 

Chao Sun commented on SPARK-27589:
--

[~xkrogen] FWIW I'm working on a POC for SPARK-32935 at the moment. There is 
also a design doc in progress; hopefully we'll be able to share it soon. cc 
[~rdblue] too.

> Spark file source V2
> 
>
> Key: SPARK-27589
> URL: https://issues.apache.org/jira/browse/SPARK-27589
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Re-implement file sources with data source V2 API



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34271) Use majorMinorPatchVersion for Hive version parsing

2021-01-27 Thread Chao Sun (Jira)
Chao Sun created SPARK-34271:


 Summary: Use majorMinorPatchVersion for Hive version parsing
 Key: SPARK-34271
 URL: https://issues.apache.org/jira/browse/SPARK-34271
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


Currently {{IsolatedClientLoader}} needs to enumerate all Hive patch versions, 
so whenever we upgrade the Hive version we have to remember to update the 
method. It would be better to just check the major and minor versions.
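As a rough illustration (not the actual {{IsolatedClientLoader}} code; the 
helper name is made up), checking only the major and minor components could 
look like this:
{code:scala}
// Hypothetical sketch: map a full Hive version string to its major.minor pair,
// so that e.g. "2.3.7", "2.3.8" and "2.3.9" all resolve to the same client.
def majorMinor(version: String): Option[(Int, Int)] = {
  val pattern = """^(\d+)\.(\d+)(\..*)?$""".r
  version match {
    case pattern(major, minor, _) => Some((major.toInt, minor.toInt))
    case _ => None
  }
}

assert(majorMinor("2.3.9") == Some((2, 3)))
assert(majorMinor("3.1.2") == Some((3, 1)))
{code}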



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34108) Cache lookup doesn't work in certain cases

2021-01-27 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-34108:
-
Description: 
Currently, caching a temporary or permanent view doesn't work in certain cases. 
For instance, in the following:
{code:sql}
CREATE TABLE t (key bigint, value string) USING parquet
CREATE VIEW v1 AS SELECT key FROM t
CACHE TABLE v1
SELECT key FROM t
{code}
The last SELECT query will hit the cached {{v1}}. On the other hand:
{code:sql}
CREATE TABLE t (key bigint, value string) USING parquet
CREATE VIEW v1 AS SELECT key FROM t ORDER by key
CACHE TABLE v1
SELECT key FROM t ORDER BY key
{code}
The SELECT won't hit the cache.

It seems this is related to {{EliminateView}}. In the second case, it inserts an 
extra Project operator, which causes the comparison of canonicalized plans 
during cache lookup to fail.

  was:
Currently, caching a permanent view doesn't work in certain cases. For 
instance, in the following:
{code:sql}
CREATE TABLE t (key bigint, value string) USING parquet
CREATE VIEW v1 AS SELECT key FROM t
CACHE TABLE v1
SELECT key FROM t
{code}
The last SELECT query will hit the cached {{v1}}. On the other hand:
{code:sql}
CREATE TABLE t (key bigint, value string) USING parquet
CREATE VIEW v1 AS SELECT key FROM t ORDER by key
CACHE TABLE v1
SELECT key FROM t ORDER BY key
{code}
The SELECT won't hit the cache.

It seems this is related to {{EliminateView}}. In the second case, it will 
insert an extra project operator which makes the comparison on canonicalized 
plan during cache lookup fail.


> Cache lookup doesn't work in certain cases
> --
>
> Key: SPARK-34108
> URL: https://issues.apache.org/jira/browse/SPARK-34108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently, caching a temporary or permanent view doesn't work in certain 
> cases. For instance, in the following:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t
> CACHE TABLE v1
> SELECT key FROM t
> {code}
> The last SELECT query will hit the cached {{v1}}. On the other hand:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t ORDER by key
> CACHE TABLE v1
> SELECT key FROM t ORDER BY key
> {code}
> The SELECT won't hit the cache.
> It seems this is related to {{EliminateView}}. In the second case, it inserts 
> an extra Project operator, which causes the comparison of canonicalized plans 
> during cache lookup to fail.
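For reference, a small sketch (not part of the JIRA; it assumes a SparkSession 
{{spark}} and the table/view created above) of how one can check whether the 
second query actually picks up the cached view:
{code:scala}
import org.apache.spark.sql.execution.columnar.InMemoryRelation

// After CACHE TABLE v1, inspect whether the follow-up query was rewritten
// to read from the cache during cache lookup.
val df = spark.sql("SELECT key FROM t ORDER BY key")
val hitsCache = df.queryExecution.withCachedData.collect {
  case r: InMemoryRelation => r
}.nonEmpty
println(s"cache hit: $hitsCache")
{code}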



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34108) Cache lookup doesn't work in certain cases

2021-01-27 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-34108:
-
Summary: Cache lookup doesn't work in certain cases  (was: Caching with 
permanent view doesn't work in certain cases)

> Cache lookup doesn't work in certain cases
> --
>
> Key: SPARK-34108
> URL: https://issues.apache.org/jira/browse/SPARK-34108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently, caching a permanent view doesn't work in certain cases. For 
> instance, in the following:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t
> CACHE TABLE v1
> SELECT key FROM t
> {code}
> The last SELECT query will hit the cached {{v1}}. On the other hand:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t ORDER by key
> CACHE TABLE v1
> SELECT key FROM t ORDER BY key
> {code}
> The SELECT won't hit the cache.
> It seems this is related to {{EliminateView}}. In the second case, it inserts 
> an extra Project operator, which causes the comparison of canonicalized plans 
> during cache lookup to fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34108) Cache lookup doesn't work in certain cases

2021-01-27 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-34108.
--
Resolution: Duplicate

> Cache lookup doesn't work in certain cases
> --
>
> Key: SPARK-34108
> URL: https://issues.apache.org/jira/browse/SPARK-34108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently, caching a temporary or permanent view doesn't work in certain 
> cases. For instance, in the following:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t
> CACHE TABLE v1
> SELECT key FROM t
> {code}
> The last SELECT query will hit the cached {{v1}}. On the other hand:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t ORDER by key
> CACHE TABLE v1
> SELECT key FROM t ORDER BY key
> {code}
> The SELECT won't hit the cache.
> It seems this is related to {{EliminateView}}. In the second case, it inserts 
> an extra Project operator, which causes the comparison of canonicalized plans 
> during cache lookup to fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34347) CatalogImpl.uncacheTable should invalidate in cascade for temp views

2021-02-03 Thread Chao Sun (Jira)
Chao Sun created SPARK-34347:


 Summary: CatalogImpl.uncacheTable should invalidate in cascade for 
temp views 
 Key: SPARK-34347
 URL: https://issues.apache.org/jira/browse/SPARK-34347
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


When {{spark.sql.legacy.storeAnalyzedPlanForView}} is false, 
{{CatalogImpl.uncacheTable}} should invalidate caches for temp views in cascade.
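A minimal sketch of the intended behavior (illustrative only; it assumes a 
SparkSession {{spark}} and {{spark.sql.legacy.storeAnalyzedPlanForView}} set to 
false):
{code:scala}
spark.range(10).createOrReplaceTempView("base")
spark.sql("CREATE TEMPORARY VIEW v1 AS SELECT id FROM base")
spark.sql("CREATE TEMPORARY VIEW v2 AS SELECT id FROM v1")
spark.catalog.cacheTable("v1")
spark.catalog.cacheTable("v2")

// Uncaching v1 should invalidate caches built on top of it (cascade),
// so v2 should no longer be reported as cached afterwards.
spark.catalog.uncacheTable("v1")
println(spark.catalog.isCached("v2"))
{code}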



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34419) Move PartitionTransforms from java to scala directory

2021-02-10 Thread Chao Sun (Jira)
Chao Sun created SPARK-34419:


 Summary: Move PartitionTransforms from java to scala directory
 Key: SPARK-34419
 URL: https://issues.apache.org/jira/browse/SPARK-34419
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


{{PartitionTransforms}} is currently under 
{{sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions}}. It 
should be under 
{{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-23 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289200#comment-17289200
 ] 

Chao Sun commented on SPARK-33212:
--

Thanks for the report [~ouyangxc.zte]. Can you provide more details, such as 
error messages, stack traces, and steps to reproduce the issue?

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, 
> protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client, etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve 
> Guava conflicts, Spark depends on Hadoop not leaking these dependencies.
>  * It makes the Spark/Hadoop dependency relationship cleaner. Currently Spark 
> uses both client-side and server-side Hadoop APIs from modules such as 
> hadoop-common, hadoop-yarn-server-common, etc. Moving to hadoop-client-api 
> allows us to use only the public/client APIs from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark 
> can evolve more easily without worrying about dependencies pulled in from the 
> Hadoop side (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` 
> and `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended to put these 
> two jars before other Hadoop jars in the class path. Otherwise, conflicts 
> such as those from Guava could happen if classes are loaded from the other 
> non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd-party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go into the release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-23 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289652#comment-17289652
 ] 

Chao Sun commented on SPARK-33212:
--

Thanks for the details [~ouyangxc.zte]!

{quote}
Get AMIpFilter ClassNotFoundException , because there is no 
'hadoop-client-minicluster.jar' in classpath
{quote}
This is interesting. The {{hadoop-client-minicluster.jar}} should only be used 
in tests, so I'm curious why it is needed here. Could you share the stack traces 
for the {{ClassNotFoundException}}?

{quote}
2021-02-24 08:36:54,391 ERROR org.apache.spark.SparkContext: Error initializing 
SparkContext.
java.lang.IllegalStateException: class 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
javax.servlet.Filter
{quote}
Could you also share the stack traces for this exception?

And to confirm, you are using {{client}} as the deploy mode, is that correct? 
I'll try to reproduce this in my local environment.


> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, 
> protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client, etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve 
> Guava conflicts, Spark depends on Hadoop not leaking these dependencies.
>  * It makes the Spark/Hadoop dependency relationship cleaner. Currently Spark 
> uses both client-side and server-side Hadoop APIs from modules such as 
> hadoop-common, hadoop-yarn-server-common, etc. Moving to hadoop-client-api 
> allows us to use only the public/client APIs from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark 
> can evolve more easily without worrying about dependencies pulled in from the 
> Hadoop side (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` 
> and `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended to put these 
> two jars before other Hadoop jars in the class path. Otherwise, conflicts 
> such as those from Guava could happen if classes are loaded from the other 
> non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd-party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go into the release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-24 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290127#comment-17290127
 ] 

Chao Sun commented on SPARK-33212:
--

Thanks again [~ouyangxc.zte]. 
{{org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter}} was not included 
in the {{hadoop-client}} jars since it is a server-side class and ideally 
should not be exposed to client applications such as Spark. 

[~dongjoon] Let me see how we can fix this either in Spark or Hadoop.

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, 
> protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client, etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve 
> Guava conflicts, Spark depends on Hadoop not leaking these dependencies.
>  * It makes the Spark/Hadoop dependency relationship cleaner. Currently Spark 
> uses both client-side and server-side Hadoop APIs from modules such as 
> hadoop-common, hadoop-yarn-server-common, etc. Moving to hadoop-client-api 
> allows us to use only the public/client APIs from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark 
> can evolve more easily without worrying about dependencies pulled in from the 
> Hadoop side (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` 
> and `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended to put these 
> two jars before other Hadoop jars in the class path. Otherwise, conflicts 
> such as those from Guava could happen if classes are loaded from the other 
> non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd-party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go into the release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-24 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290613#comment-17290613
 ] 

Chao Sun commented on SPARK-33212:
--

I was able to reproduce the error in my local environment and found a potential 
fix in Spark. I think {{hadoop-yarn-server-web-proxy}} is needed by Spark - all 
the other YARN jars are already covered by {{hadoop-client-api}} and 
{{hadoop-client-runtime}}. I'll open a PR for this soon.

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, 
> protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client, etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve 
> Guava conflicts, Spark depends on Hadoop not leaking these dependencies.
>  * It makes the Spark/Hadoop dependency relationship cleaner. Currently Spark 
> uses both client-side and server-side Hadoop APIs from modules such as 
> hadoop-common, hadoop-yarn-server-common, etc. Moving to hadoop-client-api 
> allows us to use only the public/client APIs from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark 
> can evolve more easily without worrying about dependencies pulled in from the 
> Hadoop side (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` 
> and `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended to put these 
> two jars before other Hadoop jars in the class path. Otherwise, conflicts 
> such as those from Guava could happen if classes are loaded from the other 
> non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd-party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go into the release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-24 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290613#comment-17290613
 ] 

Chao Sun edited comment on SPARK-33212 at 2/25/21, 2:21 AM:


I was able to reproduce the error in my local environment and found a potential 
fix in Spark. I think only {{hadoop-yarn-server-web-proxy}} is needed by Spark 
- all the other YARN jars are already covered by {{hadoop-client-api}} and 
{{hadoop-client-runtime}}. I'll open a PR for this soon.


was (Author: csun):
I was able to reproduce the error in my local environment, and find a potential 
fix in Spark. I think {{hadoop-yarn-server-web-proxy}} is needed by Spark - all 
the other YARN jars are already covered by {{hadoop-client-api}} and 
{{hadoop-client-runtime}}. I'll open a PR for this soon.

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, 
> protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client, etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve 
> Guava conflicts, Spark depends on Hadoop not leaking these dependencies.
>  * It makes the Spark/Hadoop dependency relationship cleaner. Currently Spark 
> uses both client-side and server-side Hadoop APIs from modules such as 
> hadoop-common, hadoop-yarn-server-common, etc. Moving to hadoop-client-api 
> allows us to use only the public/client APIs from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark 
> can evolve more easily without worrying about dependencies pulled in from the 
> Hadoop side (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` 
> and `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended to put these 
> two jars before other Hadoop jars in the class path. Otherwise, conflicts 
> such as those from Guava could happen if classes are loaded from the other 
> non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd-party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go into the release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-24 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290707#comment-17290707
 ] 

Chao Sun commented on SPARK-33212:
--

Yes. I think the only class Spark needs from this jar is 
{{org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter}}, which, together 
with the two other classes it depends on from the same package, has no Guava 
dependency except {{VisibleForTesting}}.

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, 
> protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client, etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve 
> Guava conflicts, Spark depends on Hadoop not leaking these dependencies.
>  * It makes the Spark/Hadoop dependency relationship cleaner. Currently Spark 
> uses both client-side and server-side Hadoop APIs from modules such as 
> hadoop-common, hadoop-yarn-server-common, etc. Moving to hadoop-client-api 
> allows us to use only the public/client APIs from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark 
> can evolve more easily without worrying about dependencies pulled in from the 
> Hadoop side (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` 
> and `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended to put these 
> two jars before other Hadoop jars in the class path. Otherwise, conflicts 
> such as those from Guava could happen if classes are loaded from the other 
> non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd-party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go into the release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32703) Replace deprecated API calls from SpecificParquetRecordReaderBase

2021-02-26 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-32703:
-
Summary: Replace deprecated API calls from SpecificParquetRecordReaderBase  
(was: Enable dictionary filtering for Parquet vectorized reader)

> Replace deprecated API calls from SpecificParquetRecordReaderBase
> -
>
> Key: SPARK-32703
> URL: https://issues.apache.org/jira/browse/SPARK-32703
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Minor
>
> Parquet vectorized reader still uses the old API for {{filterRowGroups}} and 
> only filters on statistics. It should switch to the new API and do dictionary 
> filtering as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32703) Replace deprecated API calls from SpecificParquetRecordReaderBase

2021-02-26 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-32703:
-
Description: Currently {{SpecificParquetRecordReaderBase}} uses deprecated 
Parquet APIs in a few places, such as {{readFooter}}, {{ParquetInputSplit}}, 
the deprecated constructor of {{ParquetFileReader}}, {{filterRowGroups}}, etc. 
These are going to be removed in future Parquet versions, so we should move to 
the new APIs for better maintainability. 
 (was: Parquet vectorized reader still uses the old API for {{filterRowGroups}} 
and only filters on statistics. It should switch to the new API and do 
dictionary filtering as well.)

> Replace deprecated API calls from SpecificParquetRecordReaderBase
> -
>
> Key: SPARK-32703
> URL: https://issues.apache.org/jira/browse/SPARK-32703
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Minor
>
> Currently {{SpecificParquetRecordReaderBase}} uses deprecated Parquet APIs in 
> a few places, such as {{readFooter}}, {{ParquetInputSplit}}, the deprecated 
> constructor of {{ParquetFileReader}}, {{filterRowGroups}}, etc. These are 
> going to be removed in future Parquet versions, so we should move to the new 
> APIs for better maintainability.
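As a hedged sketch of the direction (not the actual patch; the API names are 
from recent Parquet releases and should be double-checked against the version 
Spark builds with), the non-deprecated entry point looks roughly like this:
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Open the file through InputFile instead of the deprecated readFooter /
// ParquetFileReader constructor, then read the footer and row-group metadata.
val inputFile = HadoopInputFile.fromPath(new Path("/tmp/example.parquet"), new Configuration())
val reader = ParquetFileReader.open(inputFile)
try {
  val footer = reader.getFooter
  println(s"row groups: ${footer.getBlocks.size()}")
} finally {
  reader.close()
}
{code}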



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used

2021-03-19 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305109#comment-17305109
 ] 

Chao Sun commented on SPARK-34780:
--

Thanks for the report [~mikechen]; the test case you provided is very 
useful.

I'm not sure, though, how severe the issue is, since it only affects 
{{computeStats}}, and once the cache is actually materialized (e.g., via 
{{df2.count()}} after {{df2.cache()}}), the value from {{computeStats}} will be 
different anyway. Could you give more details?
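To make the point concrete, a small sketch of what I mean (my own illustration, 
reusing the {{df2}} from the test above):
{code:scala}
// Before materialization, InMemoryRelation.computeStats has no cached batches
// yet, so it falls back to the stats of the captured child plan, which carry
// the conf captured at cache time.
df2.cache()

// Materializing the cache...
df2.count()

// ...after which computeStats reports the actual in-memory size instead, so
// the captured FILE_COMPRESSION_FACTOR no longer shows up in the numbers.
{code}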

> Cached Table (parquet) with old Configs Used
> 
>
> Key: SPARK-34780
> URL: https://issues.apache.org/jira/browse/SPARK-34780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.1
>Reporter: Michael Chen
>Priority: Major
>
> When a DataFrame is cached, the logical plan can contain copies of the Spark 
> session, meaning the SQLConfs are stored along with it. Then, if a different 
> DataFrame replaces parts of its logical plan with a cached logical plan, the 
> cached SQLConfs will be used for the evaluation of the cached logical plan. 
> This is because HadoopFsRelation ignores sparkSession for equality checks 
> (introduced in https://issues.apache.org/jira/browse/SPARK-17358).
> {code:java}
> test("cache uses old SQLConf") {
>   import testImplicits._
>   withTempDir { dir =>
> val tableDir = dir.getAbsoluteFile + "/table"
> val df = Seq("a").toDF("key")
> df.write.parquet(tableDir)
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1Stats = spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
> val df2 = spark.read.parquet(tableDir).select("key")
> df2.cache()
> val compression10Stats = df2.queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1StatsWithCache = 
> spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> // I expect these stats to be the same because file compression factor is 
> the same
> assert(compression1Stats == compression1StatsWithCache)
> // Instead, we can see the file compression factor is being cached and 
> used along with
> // the logical plan
> assert(compression10Stats == compression1StatsWithCache)
>   }
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30497) migrate DESCRIBE TABLE to the new framework

2021-03-24 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308067#comment-17308067
 ] 

Chao Sun commented on SPARK-30497:
--

[~cloud_fan] this is resolved, right?

> migrate DESCRIBE TABLE to the new framework
> ---
>
> Key: SPARK-30497
> URL: https://issues.apache.org/jira/browse/SPARK-30497
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used

2021-03-24 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308262#comment-17308262
 ] 

Chao Sun commented on SPARK-34780:
--

Sorry for the late reply [~mikechen]! There's something I'm still not quite 
clear on: when the cache is retrieved, an {{InMemoryRelation}} will be used to 
replace the matched plan fragment. How, then, can the old stale conf still be 
used in places like {{DataSourceScanExec}}?

> Cached Table (parquet) with old Configs Used
> 
>
> Key: SPARK-34780
> URL: https://issues.apache.org/jira/browse/SPARK-34780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.1
>Reporter: Michael Chen
>Priority: Major
>
> When a DataFrame is cached, the logical plan can contain copies of the Spark 
> session, meaning the SQLConfs are stored along with it. Then, if a different 
> DataFrame replaces parts of its logical plan with a cached logical plan, the 
> cached SQLConfs will be used for the evaluation of the cached logical plan. 
> This is because HadoopFsRelation ignores sparkSession for equality checks 
> (introduced in https://issues.apache.org/jira/browse/SPARK-17358).
> {code:java}
> test("cache uses old SQLConf") {
>   import testImplicits._
>   withTempDir { dir =>
> val tableDir = dir.getAbsoluteFile + "/table"
> val df = Seq("a").toDF("key")
> df.write.parquet(tableDir)
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1Stats = spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
> val df2 = spark.read.parquet(tableDir).select("key")
> df2.cache()
> val compression10Stats = df2.queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1StatsWithCache = 
> spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> // I expect these stats to be the same because file compression factor is 
> the same
> assert(compression1Stats == compression1StatsWithCache)
> // Instead, we can see the file compression factor is being cached and 
> used along with
> // the logical plan
> assert(compression10Stats == compression1StatsWithCache)
>   }
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used

2021-03-25 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308854#comment-17308854
 ] 

Chao Sun commented on SPARK-34780:
--

[~mikechen], yes, you're right. I'm not sure if this is a big concern though, 
since it just means the plan fragment for the cache is executed with the stale 
conf. I guess as long as there is no correctness issue (and I'd be surprised if 
there were any), it should be fine?

It seems a bit tricky to fix the issue, since the {{SparkSession}} is leaked to 
many places. I guess one way is to follow the idea of SPARK-33389 and change 
{{SessionState}} to always use the active conf. 

> Cached Table (parquet) with old Configs Used
> 
>
> Key: SPARK-34780
> URL: https://issues.apache.org/jira/browse/SPARK-34780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.1
>Reporter: Michael Chen
>Priority: Major
>
> When a DataFrame is cached, the logical plan can contain copies of the Spark 
> session, meaning the SQLConfs are stored along with it. Then, if a different 
> DataFrame replaces parts of its logical plan with a cached logical plan, the 
> cached SQLConfs will be used for the evaluation of the cached logical plan. 
> This is because HadoopFsRelation ignores sparkSession for equality checks 
> (introduced in https://issues.apache.org/jira/browse/SPARK-17358).
> {code:java}
> test("cache uses old SQLConf") {
>   import testImplicits._
>   withTempDir { dir =>
> val tableDir = dir.getAbsoluteFile + "/table"
> val df = Seq("a").toDF("key")
> df.write.parquet(tableDir)
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1Stats = spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
> val df2 = spark.read.parquet(tableDir).select("key")
> df2.cache()
> val compression10Stats = df2.queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1StatsWithCache = 
> spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> // I expect these stats to be the same because file compression factor is 
> the same
> assert(compression1Stats == compression1StatsWithCache)
> // Instead, we can see the file compression factor is being cached and 
> used along with
> // the logical plan
> assert(compression10Stats == compression1StatsWithCache)
>   }
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36820) Disable LZ4 test for Hadoop 2.7 profile

2021-09-21 Thread Chao Sun (Jira)
Chao Sun created SPARK-36820:


 Summary: Disable LZ4 test for Hadoop 2.7 profile
 Key: SPARK-36820
 URL: https://issues.apache.org/jira/browse/SPARK-36820
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


Hadoop 2.7 doesn't support lz4-java yet, so we should disable the test in 
{{FileSourceCodecSuite}}. 
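For illustration, a hedged sketch of the kind of round trip such a test 
exercises (not the actual {{FileSourceCodecSuite}} code; it assumes a 
SparkSession {{spark}}):
{code:scala}
// With the Hadoop 2.7 profile, lz4-java is not on the classpath, so writing
// Parquet with the lz4 codec fails and the corresponding test should be skipped.
spark.conf.set("spark.sql.parquet.compression.codec", "lz4")
spark.range(100).write.mode("overwrite").parquet("/tmp/lz4_roundtrip")
assert(spark.read.parquet("/tmp/lz4_roundtrip").count() == 100)
{code}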



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36820) Disable LZ4 test for Hadoop 2.7 profile

2021-09-21 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-36820:
-
Issue Type: Test  (was: Bug)

> Disable LZ4 test for Hadoop 2.7 profile
> ---
>
> Key: SPARK-36820
> URL: https://issues.apache.org/jira/browse/SPARK-36820
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> Hadoop 2.7 doesn't support lz4-java yet, so we should disable the test in 
> {{FileSourceCodecSuite}}. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36828) Remove Guava from Spark binary distribution

2021-09-22 Thread Chao Sun (Jira)
Chao Sun created SPARK-36828:


 Summary: Remove Guava from Spark binary distribution
 Key: SPARK-36828
 URL: https://issues.apache.org/jira/browse/SPARK-36828
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.3.0
Reporter: Chao Sun


After SPARK-36676, we should consider removing Guava from Spark's binary 
distribution. It is currently only required by a few libraries such as 
curator-client.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36828) Remove Guava from Spark binary distribution

2021-09-22 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-36828:
-
Issue Type: Improvement  (was: Bug)

> Remove Guava from Spark binary distribution
> ---
>
> Key: SPARK-36828
> URL: https://issues.apache.org/jira/browse/SPARK-36828
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> After SPARK-36676, we should consider removing Guava from Spark's binary 
> distribution. It is currently only required by a few libraries such as 
> curator-client.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36835) Spark 3.2.0 POMs are no longer "dependency reduced"

2021-09-23 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419499#comment-17419499
 ] 

Chao Sun commented on SPARK-36835:
--

Sorry for the regression [~joshrosen]. I don't remember exactly why I added 
that, but let me see if we can safely revert it.

> Spark 3.2.0 POMs are no longer "dependency reduced"
> ---
>
> Key: SPARK-36835
> URL: https://issues.apache.org/jira/browse/SPARK-36835
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> It looks like Spark 3.2.0's POMs are no longer "dependency reduced". As a 
> result, applications may pull in additional unnecessary dependencies when 
> depending on Spark.
> Spark uses the Maven Shade plugin to create effective POMs and to bundle 
> shaded versions of certain libraries with Spark (namely, Jetty, Guava, and 
> JPMML). [By 
> default|https://maven.apache.org/plugins/maven-shade-plugin/shade-mojo.html#createDependencyReducedPom],
>  the Maven Shade plugin generates simplified POMs which remove dependencies 
> on artifacts that have been shaded.
> SPARK-33212 / 
> [b6f46ca29742029efea2790af7fdefbc2fcf52de|https://github.com/apache/spark/commit/b6f46ca29742029efea2790af7fdefbc2fcf52de]
>  changed the configuration of the Maven Shade plugin, setting 
> {{createDependencyReducedPom}} to {{false}}.
> As a result, the generated POMs now include compile-scope dependencies on the 
> shaded libraries. For example, compare the {{org.eclipse.jetty}} dependencies 
> in:
>  * Spark 3.1.2: 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/3.1.2/spark-core_2.12-3.1.2.pom]
>  * Spark 3.2.0 RC2: 
> [https://repository.apache.org/content/repositories/orgapachespark-1390/org/apache/spark/spark-core_2.12/3.2.0/spark-core_2.12-3.2.0.pom]
> I think we should revert back to generating "dependency reduced" POMs to 
> ensure that Spark declares a proper set of dependencies and to avoid "unknown 
> unknown" consequences of changing our generated POM format.
> /cc [~csun]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36863) Update dependency manifests for all released artifacts

2021-09-27 Thread Chao Sun (Jira)
Chao Sun created SPARK-36863:


 Summary: Update dependency manifests for all released artifacts
 Key: SPARK-36863
 URL: https://issues.apache.org/jira/browse/SPARK-36863
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.3.0
Reporter: Chao Sun


We should update the dependency manifests for all released artifacts. Currently 
we don't do this for modules such as {{hadoop-cloud}}, {{kinesis-asl}}, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36873) Add provided Guava dependency for network-yarn module

2021-09-27 Thread Chao Sun (Jira)
Chao Sun created SPARK-36873:


 Summary: Add provided Guava dependency for network-yarn module
 Key: SPARK-36873
 URL: https://issues.apache.org/jira/browse/SPARK-36873
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.2.0
Reporter: Chao Sun


In Spark 3.1 and earlier the network-yarn module implicitly relies on guava 
from hadoop-client dependency, which got changed by SPARK-33212 where we have 
moved to shaded Hadoop client which no longer expose the transitive guava 
dependency. This was fine for a while since we were not using 
{{createDependencyReducedPom}} so the module picks up the transitive dependency 
from {{spark-network-common}}. However, this got changed by SPARK-36835 when we 
restored {{createDependencyReducedPom}} and now it is no longer able to find 
guava classes:
{code}
mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl 
-Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn
...
[INFO] Compiling 1 Java source to 
/Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ...
[WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32:
 package com.google.common.annotations does not exist
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33:
 package com.google.common.base does not exist
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34:
 package com.google.common.collect does not exist
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118:
 cannot find symbol
  symbol:   class VisibleForTesting
  location: class org.apache.spark.network.yarn.YarnShuffleService
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36873) Add provided Guava dependency for network-yarn module

2021-09-27 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-36873:
-
Description: 
In Spark 3.1 and earlier the network-yarn module implicitly relies on guava 
from hadoop-client dependency, which got changed by SPARK-33212 where we moved 
to shaded Hadoop client which no longer expose the transitive guava dependency. 
This was fine for a while since we were not using 
{{createDependencyReducedPom}} so the module picks up the transitive dependency 
from {{spark-network-common}}. However, this got changed by SPARK-36835 when we 
restored {{createDependencyReducedPom}} and now it is no longer able to find 
guava classes:
{code}
mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl 
-Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn
...
[INFO] Compiling 1 Java source to 
/Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ...
[WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32:
 package com.google.common.annotations does not exist
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33:
 package com.google.common.base does not exist
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34:
 package com.google.common.collect does not exist
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118:
 cannot find symbol
  symbol:   class VisibleForTesting
  location: class org.apache.spark.network.yarn.YarnShuffleService
{code}

  was:
In Spark 3.1 and earlier the network-yarn module implicitly relies on guava 
from hadoop-client dependency, which got changed by SPARK-33212 where we have 
moved to shaded Hadoop client which no longer expose the transitive guava 
dependency. This was fine for a while since we were not using 
{{createDependencyReducedPom}} so the module picks up the transitive dependency 
from {{spark-network-common}}. However, this got changed by SPARK-36835 when we 
restored {{createDependencyReducedPom}} and now it is no longer able to find 
guava classes:
{code}
mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl 
-Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn
...
[INFO] Compiling 1 Java source to 
/Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ...
[WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32:
 package com.google.common.annotations does not exist
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33:
 package com.google.common.base does not exist
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34:
 package com.google.common.collect does not exist
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118:
 cannot find symbol
  symbol:   class VisibleForTesting
  location: class org.apache.spark.network.yarn.YarnShuffleService
{code}


> Add provided Guava dependency for network-yarn module
> -
>
> Key: SPARK-36873
> URL: https://issues.apache.org/jira/browse/SPARK-36873
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> In Spark 3.1 and earlier the network-yarn module implicitly relies on guava 
> from hadoop-client dependency, which got changed by SPARK-33212 where we 
> moved to shaded Hadoop client which no longer expose the transitive guava 
> dependency. This was fine for a while since we were not using 
> {{createDependencyReducedPom}} so the module picks up the transitive 
> dependency from {{spark-network-common}}. However, this got changed by 
> SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no 
> longer able to find guava classes:
> {code}
> mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver 
> -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 
> -Pspark-ganglia-lgpl -Pyarn
> ...
> [INFO] Compiling 1 Java source to 
> /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ...
> [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8

[jira] [Updated] (SPARK-36873) Add provided Guava dependency for network-yarn module

2021-09-27 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-36873:
-
Description: 
In Spark 3.1 and earlier the network-yarn module implicitly relies on Guava 
from the hadoop-client dependency. This was changed by SPARK-33212, where we 
moved to the shaded Hadoop client, which no longer exposes the transitive Guava 
dependency. This was fine for a while since we were not using 
{{createDependencyReducedPom}}, so the module picked up the transitive 
dependency from {{spark-network-common}}. However, this got changed by 
SPARK-36835 when we restored {{createDependencyReducedPom}}, and now the module 
is no longer able to find Guava classes:
{code}
mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl 
-Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn
...
[INFO] Compiling 1 Java source to 
/Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ...
[WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32:
 package com.google.common.annotations does not exist
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33:
 package com.google.common.base does not exist
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34:
 package com.google.common.collect does not exist
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118:
 cannot find symbol
  symbol:   class VisibleForTesting
  location: class org.apache.spark.network.yarn.YarnShuffleService
{code}

  was:
In Spark 3.1 and earlier the network-yarn module implicitly relies on guava 
from hadoop-client dependency, which got changed by SPARK-33212 where we moved 
to shaded Hadoop client which no longer exposes the transitive guava dependency. 
This was fine for a while since we were not using 
{{createDependencyReducedPom}} so the module picks up the transitive dependency 
from {{spark-network-common}}. However, this got changed by SPARK-36835 when we 
restored {{createDependencyReducedPom}} and now it is no longer able to find 
guava classes:
{code}
mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl 
-Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn
...
[INFO] Compiling 1 Java source to 
/Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ...
[WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32:
 package com.google.common.annotations does not exist
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33:
 package com.google.common.base does not exist
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34:
 package com.google.common.collect does not exist
[ERROR] [Error] 
/Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118:
 cannot find symbol
  symbol:   class VisibleForTesting
  location: class org.apache.spark.network.yarn.YarnShuffleService
{code}


> Add provided Guava dependency for network-yarn module
> -
>
> Key: SPARK-36873
> URL: https://issues.apache.org/jira/browse/SPARK-36873
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> In Spark 3.1 and earlier the network-yarn module implicitly relies on guava 
> from hadoop-client dependency, which was changed by SPARK-33212 where we 
> moved to shaded Hadoop client which no longer exposes the transitive guava 
> dependency. This was fine for a while since we were not using 
> {{createDependencyReducedPom}} so the module picks up the transitive 
> dependency from {{spark-network-common}}. However, this got changed by 
> SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no 
> longer able to find guava classes:
> {code}
> mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver 
> -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 
> -Pspark-ganglia-lgpl -Pyarn
> ...
> [INFO] Compiling 1 Java source to 
> /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ...
> [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
> [ER

[jira] [Updated] (SPARK-36873) Add provided Guava dependency for network-yarn module

2021-09-27 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-36873:
-
Issue Type: Bug  (was: Improvement)

> Add provided Guava dependency for network-yarn module
> -
>
> Key: SPARK-36873
> URL: https://issues.apache.org/jira/browse/SPARK-36873
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> In Spark 3.1 and earlier the network-yarn module implicitly relies on guava 
> from hadoop-client dependency, which was changed by SPARK-33212 where we 
> moved to shaded Hadoop client which no longer exposes the transitive guava 
> dependency. This was fine for a while since we were not using 
> {{createDependencyReducedPom}} so the module picks up the transitive 
> dependency from {{spark-network-common}}. However, this got changed by 
> SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no 
> longer able to find guava classes:
> {code}
> mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver 
> -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 
> -Pspark-ganglia-lgpl -Pyarn
> ...
> [INFO] Compiling 1 Java source to 
> /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ...
> [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
> [ERROR] [Error] 
> /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32:
>  package com.google.common.annotations does not exist
> [ERROR] [Error] 
> /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33:
>  package com.google.common.base does not exist
> [ERROR] [Error] 
> /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34:
>  package com.google.common.collect does not exist
> [ERROR] [Error] 
> /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118:
>  cannot find symbol
>   symbol:   class VisibleForTesting
>   location: class org.apache.spark.network.yarn.YarnShuffleService
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36879) Support Parquet v2 data page encodings for the vectorized path

2021-09-28 Thread Chao Sun (Jira)
Chao Sun created SPARK-36879:


 Summary: Support Parquet v2 data page encodings for the vectorized 
path
 Key: SPARK-36879
 URL: https://issues.apache.org/jira/browse/SPARK-36879
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


Currently Spark only supports Parquet V1 encodings (i.e., PLAIN/DICTIONARY/RLE) 
in the vectorized path, and throws an exception otherwise:
{code}
java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
{code}

It will be good to support v2 encodings too, including DELTA_BINARY_PACKED, 
DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY as well as BYTE_STREAM_SPLIT as 
listed in https://github.com/apache/parquet-format/blob/master/Encodings.md
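
For reference, a minimal repro sketch (assuming the standard parquet-mr writer 
property {{parquet.writer.version}}; the output path and app name below are 
arbitrary):
{code:java}
import org.apache.spark.sql.SparkSession;

public class ParquetV2EncodingRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[1]")
        .appName("parquet-v2-encoding-repro")
        .getOrCreate();

    // Ask parquet-mr to write v2 data pages; an INT64 column is then typically
    // encoded with DELTA_BINARY_PACKED instead of PLAIN.
    spark.sparkContext().hadoopConfiguration().set("parquet.writer.version", "v2");

    spark.range(0, 100000).toDF("id")
        .write().mode("overwrite").parquet("/tmp/parquet_v2_demo");

    // With the vectorized reader enabled (the default), reading this back currently
    // fails with: java.lang.UnsupportedOperationException: Unsupported encoding: ...
    spark.read().parquet("/tmp/parquet_v2_demo").show();
  }
}
{code}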



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36891) Add new test suite to cover Parquet decoding

2021-09-29 Thread Chao Sun (Jira)
Chao Sun created SPARK-36891:


 Summary: Add new test suite to cover Parquet decoding
 Key: SPARK-36891
 URL: https://issues.apache.org/jira/browse/SPARK-36891
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


Add a new test suite to provide more coverage for Parquet vectorized decoding, 
focusing on different combinations of Parquet column index, dictionary encoding, 
batch size, page size, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36935) Enhance ParquetSchemaConverter to capture Parquet repetition & definition level

2021-10-05 Thread Chao Sun (Jira)
Chao Sun created SPARK-36935:


 Summary: Enhance ParquetSchemaConverter to capture Parquet 
repetition & definition level
 Key: SPARK-36935
 URL: https://issues.apache.org/jira/browse/SPARK-36935
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


In order to support complex types in the Parquet vectorized reader, we'll need to 
capture the repetition & definition level information associated with the Catalyst 
type converted from the Parquet {{MessageType}}.
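
A rough sketch of the kind of information the converter would need to return per 
converted field (the class name here is hypothetical; the final shape may differ):
{code:java}
import org.apache.spark.sql.types.DataType;

/**
 * Hypothetical holder: a Catalyst type produced by the schema converter, annotated
 * with the Parquet repetition and definition levels of the column it came from,
 * which the vectorized reader needs in order to assemble nested (complex) values.
 */
final class ConvertedParquetField {
  private final DataType sparkType;
  private final int repetitionLevel;
  private final int definitionLevel;

  ConvertedParquetField(DataType sparkType, int repetitionLevel, int definitionLevel) {
    this.sparkType = sparkType;
    this.repetitionLevel = repetitionLevel;
    this.definitionLevel = definitionLevel;
  }

  DataType sparkType() { return sparkType; }
  int repetitionLevel() { return repetitionLevel; }
  int definitionLevel() { return definitionLevel; }
}
{code}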



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36891) Refactor SpecificParquetRecordReaderBase and add more coverage on vectorized Parquet decoding

2021-10-05 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-36891:
-
Parent: SPARK-35743
Issue Type: Sub-task  (was: Test)

> Refactor SpecificParquetRecordReaderBase and add more coverage on vectorized 
> Parquet decoding
> -
>
> Key: SPARK-36891
> URL: https://issues.apache.org/jira/browse/SPARK-36891
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.3.0
>
>
> Add a new test suite to provide more coverage for Parquet vectorized decoding, 
> focusing on different combinations of Parquet column index, dictionary encoding, 
> batch size, page size, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36936) spark-hadoop-cloud broken on release and only published via 3rd party repositories

2021-10-06 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425162#comment-17425162
 ] 

Chao Sun commented on SPARK-36936:
--

[~colin.williams] which version of {{spark-hadoop-cloud}} were you using? I 
think the above error shouldn't happen if the version is the same as Spark's 
version.

We've already started to publish {{spark-hadoop-cloud}} as part of the Spark 
release procedure, see SPARK-35844.

> spark-hadoop-cloud broken on release and only published via 3rd party 
> repositories
> --
>
> Key: SPARK-36936
> URL: https://issues.apache.org/jira/browse/SPARK-36936
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.1, 3.1.2
> Environment: name:=spark-demo
> version := "0.0.1"
> scalaVersion := "2.12.12"
> lazy val app = (project in file("app")).settings(
>  assemblyPackageScala / assembleArtifact := false,
>  assembly / assemblyJarName := "uber.jar",
>  assembly / mainClass := Some("com.example.Main"),
>  // more settings here ...
>  )
> resolvers += "Cloudera" at 
> "https://repository.cloudera.com/artifactory/cloudera-repos/";
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2" % 
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % 
> "3.1.1.3.1.7270.0-253"
> libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % 
> "3.1.1.7.2.7.0-184"
> libraryDependencies += "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.901"
> libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test"
> // test suite settings
> fork in Test := true
> javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", 
> "-XX:+CMSClassUnloadingEnabled")
> // Show runtime of tests
> testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-oD")
> ___
>  
> import org.apache.spark.sql.SparkSession
> object SparkApp {
>  def main(args: Array[String]){
>  val spark = SparkSession.builder().master("local")
>  //.config("spark.jars.repositories", 
> "https://repository.cloudera.com/artifactory/cloudera-repos/";)
>  //.config("spark.jars.packages", 
> "org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253")
>  .appName("spark session").getOrCreate
>  val jsonDF = spark.read.json("s3a://path-to-bucket/compact.json")
>  val csvDF = spark.read.format("csv").load("s3a://path-to-bucket/some.csv")
>  jsonDF.show()
>  csvDF.show()
>  }
> }
>Reporter: Colin Williams
>Priority: Major
>
> The Spark documentation suggests using `spark-hadoop-cloud` to read / write 
> from S3 in [https://spark.apache.org/docs/latest/cloud-integration.html] . 
> However artifacts are currently published via only 3rd party resolvers in 
> [https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud] 
> including Cloudera and Palantir.
>  
> So the Apache Spark documentation is pointing to a 3rd-party solution for object 
> stores including S3. Furthermore, if you follow the instructions and include 
> one of the 3rd-party jars, i.e. the Cloudera jar, with the Spark 3.1.2 release 
> and try to access an object store, the following exception is returned.
>  
> ```
> Exception in thread "main" java.lang.NoSuchMethodError: 'void 
> com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, 
> java.lang.Object, java.lang.Object)'
>  at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:894)
>  at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:870)
>  at 
> org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1605)
>  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:363)
>  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
>  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
>  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
>  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
>  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
>  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
>  at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
>  at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519)
>  at org.apache.spark.sql.DataFrameRead

[jira] [Commented] (SPARK-36936) spark-hadoop-cloud broken on release and only published via 3rd party repositories

2021-10-08 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426255#comment-17426255
 ] 

Chao Sun commented on SPARK-36936:
--

[~colin.williams] Spark 3.2.0 is not released yet - it will be there soon.

> spark-hadoop-cloud broken on release and only published via 3rd party 
> repositories
> --
>
> Key: SPARK-36936
> URL: https://issues.apache.org/jira/browse/SPARK-36936
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.1, 3.1.2
> Environment: name:=spark-demo
> version := "0.0.1"
> scalaVersion := "2.12.12"
> lazy val app = (project in file("app")).settings(
>  assemblyPackageScala / assembleArtifact := false,
>  assembly / assemblyJarName := "uber.jar",
>  assembly / mainClass := Some("com.example.Main"),
>  // more settings here ...
>  )
> resolvers += "Cloudera" at 
> "https://repository.cloudera.com/artifactory/cloudera-repos/";
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2" % 
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % 
> "3.1.1.3.1.7270.0-253"
> libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % 
> "3.1.1.7.2.7.0-184"
> libraryDependencies += "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.901"
> libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test"
> // test suite settings
> fork in Test := true
> javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", 
> "-XX:+CMSClassUnloadingEnabled")
> // Show runtime of tests
> testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-oD")
> ___
>  
> import org.apache.spark.sql.SparkSession
> object SparkApp {
>  def main(args: Array[String]){
>  val spark = SparkSession.builder().master("local")
>  //.config("spark.jars.repositories", 
> "https://repository.cloudera.com/artifactory/cloudera-repos/";)
>  //.config("spark.jars.packages", 
> "org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253")
>  .appName("spark session").getOrCreate
>  val jsonDF = spark.read.json("s3a://path-to-bucket/compact.json")
>  val csvDF = spark.read.format("csv").load("s3a://path-to-bucket/some.csv")
>  jsonDF.show()
>  csvDF.show()
>  }
> }
>Reporter: Colin Williams
>Priority: Major
>
> The Spark documentation suggests using `spark-hadoop-cloud` to read / write 
> from S3 in [https://spark.apache.org/docs/latest/cloud-integration.html] . 
> However artifacts are currently published via only 3rd party resolvers in 
> [https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud] 
> including Cloudera and Palantir.
>  
> So the Apache Spark documentation is pointing to a 3rd-party solution for object 
> stores including S3. Furthermore, if you follow the instructions and include 
> one of the 3rd-party jars, i.e. the Cloudera jar, with the Spark 3.1.2 release 
> and try to access an object store, the following exception is returned.
>  
> ```
> Exception in thread "main" java.lang.NoSuchMethodError: 'void 
> com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, 
> java.lang.Object, java.lang.Object)'
>  at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:894)
>  at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:870)
>  at 
> org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1605)
>  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:363)
>  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
>  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
>  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
>  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
>  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
>  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
>  at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
>  at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519)
>  at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:428)
> ```
> It looks like there are classpath conflicts using the cloudera published 
> `spark-hadoop-cloud` with spark 3.1.2, again contradicting the documentation.
> Then the 

[jira] [Commented] (SPARK-35640) Refactor Parquet vectorized reader to remove duplicated code paths

2021-10-13 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428522#comment-17428522
 ] 

Chao Sun commented on SPARK-35640:
--

[~catalinii] this change seems unrelated since it's only in Spark 3.2.0, but 
you mentioned the issue also happens in Spark 3.1.2. The issue also seems to be 
a well-known one, see SPARK-16544.

> Refactor Parquet vectorized reader to remove duplicated code paths
> --
>
> Key: SPARK-35640
> URL: https://issues.apache.org/jira/browse/SPARK-35640
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently in Parquet vectorized code path, there are many code duplications 
> such as the following:
> {code:java}
>   public void readIntegers(
>   int total,
>   WritableColumnVector c,
>   int rowId,
>   int level,
>   VectorizedValuesReader data) throws IOException {
> int left = total;
> while (left > 0) {
>   if (this.currentCount == 0) this.readNextGroup();
>   int n = Math.min(left, this.currentCount);
>   switch (mode) {
> case RLE:
>   if (currentValue == level) {
> data.readIntegers(n, c, rowId);
>   } else {
> c.putNulls(rowId, n);
>   }
>   break;
> case PACKED:
>   for (int i = 0; i < n; ++i) {
> if (currentBuffer[currentBufferIdx++] == level) {
>   c.putInt(rowId + i, data.readInteger());
> } else {
>   c.putNull(rowId + i);
> }
>   }
>   break;
>   }
>   rowId += n;
>   left -= n;
>   currentCount -= n;
> }
>   }
> {code}
> This makes it hard to maintain as any change here needs to be 
> replicated in 20+ places. The issue becomes more serious when we are going to 
> implement column index and complex type support for the vectorized path.
> The original intention was performance. However, nowadays JIT compilers 
> tend to be smart about this and will inline virtual calls as much as possible.
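
A simplified, self-contained sketch of the de-duplication idea described above 
(hypothetical names, not the actual Spark classes): keep the RLE/PACKED walk over 
definition levels in one generic loop and delegate the per-type work through a 
small interface that the JIT can usually inline.
{code:java}
public class ReadBatchSketch {
  enum Mode { RLE, PACKED }

  /** The only part that differs between readIntegers/readLongs/readBinaries/... */
  public interface ValueUpdater {
    void readValues(int n, int rowId);  // copy n consecutive non-null values starting at rowId
    void readValue(int rowId);          // copy one non-null value at rowId
    void putNulls(int rowId, int n);    // mark n rows as null starting at rowId
  }

  // State of the definition-level decoder (populated by readNextGroup in practice).
  private Mode mode;
  private int currentCount;
  private int currentValue;
  private int[] currentBuffer;
  private int currentBufferIdx;

  /** One shared loop replacing the ~20 near-identical readXxx methods. */
  public void readBatch(int total, int rowId, int maxDefLevel, ValueUpdater updater) {
    int left = total;
    while (left > 0) {
      if (currentCount == 0) readNextGroup();
      int n = Math.min(left, currentCount);
      switch (mode) {
        case RLE:
          if (currentValue == maxDefLevel) updater.readValues(n, rowId);
          else updater.putNulls(rowId, n);
          break;
        case PACKED:
          for (int i = 0; i < n; ++i) {
            if (currentBuffer[currentBufferIdx++] == maxDefLevel) updater.readValue(rowId + i);
            else updater.putNulls(rowId + i, 1);
          }
          break;
      }
      rowId += n;
      left -= n;
      currentCount -= n;
    }
  }

  // Decodes the next RLE or bit-packed run of definition levels; elided here.
  private void readNextGroup() { throw new UnsupportedOperationException("elided"); }
}
{code}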



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37069) HiveClientImpl throws NoSuchMethodError: org.apache.hadoop.hive.ql.metadata.Hive.getWithoutRegisterFns

2021-10-21 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432624#comment-17432624
 ] 

Chao Sun commented on SPARK-37069:
--

Thanks for the ping [~zhouyifan279]! Yes, this is a bug; let me see how to 
fix it.

> HiveClientImpl throws NoSuchMethodError: 
> org.apache.hadoop.hive.ql.metadata.Hive.getWithoutRegisterFns
> --
>
> Key: SPARK-37069
> URL: https://issues.apache.org/jira/browse/SPARK-37069
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Zhou Yifan
>Priority: Major
>
> If we run Spark SQL with external Hive 2.3.x (before 2.3.9) jars, following 
> error will be thrown:
> {code:java}
> Exception in thread "main" java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.metadata.Hive.getWithoutRegisterFns(Lorg/apache/hadoop/hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/metadata/Hive;Exception
>  in thread "main" java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.metadata.Hive.getWithoutRegisterFns(Lorg/apache/hadoop/hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/metadata/Hive;
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getHive$1(HiveClientImpl.scala:205)
>  at scala.Option.map(Option.scala:230) at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getHive(HiveClientImpl.scala:204)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.client(HiveClientImpl.scala:267)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:292)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.databaseExists(HiveClientImpl.scala:394)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:224)
>  at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:224)
>  at 
> org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
>  at 
> org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:140)
>  at 
> org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:170)
>  at 
> org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:168)
>  at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$2(HiveSessionStateBuilder.scala:61)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:119)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:119)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:1004)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:990)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:982)
>  at 
> org.apache.spark.sql.execution.command.ShowTablesCommand.$anonfun$run$42(tables.scala:828)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.execution.command.ShowTablesCommand.run(tables.scala:828)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(Q

[jira] [Updated] (SPARK-35703) Relax constraint for Spark bucket join and remove HashClusteredDistribution

2021-10-22 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-35703:
-
Summary: Relax constraint for Spark bucket join and remove 
HashClusteredDistribution  (was: Remove HashClusteredDistribution)

> Relax constraint for Spark bucket join and remove HashClusteredDistribution
> ---
>
> Key: SPARK-35703
> URL: https://issues.apache.org/jira/browse/SPARK-35703
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently Spark has {{HashClusteredDistribution}} and 
> {{ClusteredDistribution}}. The only difference between the two is that the 
> former is stricter when deciding whether a bucket join is allowed to avoid a 
> shuffle: compared to the latter, it requires an *exact* match between the 
> clustering keys from the output partitioning (i.e., {{HashPartitioning}}) and 
> the join keys. However, this is unnecessary, as we should be able to avoid a 
> shuffle when the set of clustering keys is a subset of join keys, just like 
> {{ClusteredDistribution}}. 
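
As an illustration of the relaxed check described above (table and column names are 
made up; one would inspect the plan for an {{Exchange}} on either side of the join):
{code:java}
import org.apache.spark.sql.SparkSession;

public class BucketJoinSubsetKeysExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[2]")
        .appName("bucket-join-subset-keys")
        .config("spark.sql.autoBroadcastJoinThreshold", "-1")  // force a sort-merge join
        .getOrCreate();

    // Both tables are bucketed by "id" only, but joined on ("id", "dt"). Under the
    // exact-match rule of HashClusteredDistribution this plans a shuffle; with the
    // relaxed, subset-based check (as in ClusteredDistribution) it should not.
    spark.sql("CREATE TABLE t1 (id INT, dt STRING, v INT) USING parquet "
        + "CLUSTERED BY (id) INTO 8 BUCKETS");
    spark.sql("CREATE TABLE t2 (id INT, dt STRING, w INT) USING parquet "
        + "CLUSTERED BY (id) INTO 8 BUCKETS");

    spark.sql("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id AND t1.dt = t2.dt").explain();
  }
}
{code}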



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37113) Upgrade Parquet to 1.12.2

2021-10-25 Thread Chao Sun (Jira)
Chao Sun created SPARK-37113:


 Summary: Upgrade Parquet to 1.12.2
 Key: SPARK-37113
 URL: https://issues.apache.org/jira/browse/SPARK-37113
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


Upgrade Parquet version to 1.12.2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37166) SPIP: Storage Partitioned Join

2021-10-29 Thread Chao Sun (Jira)
Chao Sun created SPARK-37166:


 Summary: SPIP: Storage Partitioned Join
 Key: SPARK-37166
 URL: https://issues.apache.org/jira/browse/SPARK-37166
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


This JIRA tracks the SPIP for storage partitioned join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37166) SPIP: Storage Partitioned Join

2021-11-01 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436963#comment-17436963
 ] 

Chao Sun commented on SPARK-37166:
--

[~xkrogen] sure just linked.

> SPIP: Storage Partitioned Join
> --
>
> Key: SPARK-37166
> URL: https://issues.apache.org/jira/browse/SPARK-37166
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> This JIRA tracks the SPIP for storage partitioned join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN

2021-11-03 Thread Chao Sun (Jira)
Chao Sun created SPARK-37205:


 Summary: Support mapreduce.job.send-token-conf when starting 
containers in YARN
 Key: SPARK-37205
 URL: https://issues.apache.org/jira/browse/SPARK-37205
 Project: Spark
  Issue Type: New Feature
  Components: YARN
Affects Versions: 3.3.0
Reporter: Chao Sun


{{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see 
[YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]) with which the RM is 
not required to statically have the config for all the secure HDFS clusters. 
Currently it only works for MRv2, but it'd be nice if Spark could also use this 
feature. I think we only need to pass the config to {{ContainerLaunchContext}} 
before invoking {{NMClient.startContainer}}.
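
A rough sketch of that idea (this assumes the {{ContainerLaunchContext#setTokensConf}} 
API added by YARN-5910 is available in the Hadoop version Spark builds against; the 
key selection below is only illustrative):
{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

final class TokensConfSketch {
  /**
   * Serialize the selected HDFS/security configs and attach them to the AM container
   * launch context, so the RM can renew tokens against clusters it has no static
   * config for.
   */
  static void attachTokensConf(Configuration conf, ContainerLaunchContext amContainer)
      throws IOException {
    Configuration selected = new Configuration(false);
    // In practice only keys matching a configured list/regex would be copied here.
    String nameservices = conf.get("dfs.nameservices");
    if (nameservices != null) {
      selected.set("dfs.nameservices", nameservices);
    }

    DataOutputBuffer dob = new DataOutputBuffer();
    selected.write(dob);
    amContainer.setTokensConf(ByteBuffer.wrap(dob.getData(), 0, dob.getLength()));
  }
}
{code}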



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN

2021-11-03 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-37205:
-
Description: {{mapreduce.job.send-token-conf}} is a useful feature in 
Hadoop (see [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910] with 
which RM is not required to statically have config for all the secure HDFS 
clusters. Currently it only works for MRv2 but it'd be nice if Spark can also 
use this feature. I think we only need to pass the config to 
{{LaunchContainerContext}} in {{Client.createContainerLaunchContext}}.  (was: 
{{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see 
[YARN-5910|https://issues.apache.org/jira/browse/YARN-5910] with which RM is 
not required to statically have config for all the secure HDFS clusters. 
Currently it only works for MRv2 but it'd be nice if Spark can also use this 
feature. I think we only need to pass the config to {{LaunchContainerContext}} 
before invoking {{NMClient.startContainer}}.)

> Support mapreduce.job.send-token-conf when starting containers in YARN
> --
>
> Key: SPARK-37205
> URL: https://issues.apache.org/jira/browse/SPARK-37205
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see 
> [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910] with which RM is 
> not required to statically have config for all the secure HDFS clusters. 
> Currently it only works for MRv2 but it'd be nice if Spark can also use this 
> feature. I think we only need to pass the config to 
> {{LaunchContainerContext}} in {{Client.createContainerLaunchContext}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37218) Parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark

2021-11-05 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-37218.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34496
[https://github.com/apache/spark/pull/34496]

> Parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark
> --
>
> Key: SPARK-37218
> URL: https://issues.apache.org/jira/browse/SPARK-37218
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37218) Parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark

2021-11-05 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17439554#comment-17439554
 ] 

Chao Sun commented on SPARK-37218:
--

[~dongjoon] please assign this to yourself - somehow I can't do it.

> Parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark
> --
>
> Key: SPARK-37218
> URL: https://issues.apache.org/jira/browse/SPARK-37218
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.2.1, 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37220) Do not split input file for Parquet reader with aggregate push down

2021-11-06 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-37220.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

> Do not split input file for Parquet reader with aggregate push down
> ---
>
> Key: SPARK-37220
> URL: https://issues.apache.org/jira/browse/SPARK-37220
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Priority: Minor
> Fix For: 3.3.0
>
>
> As a followup of 
> [https://github.com/apache/spark/pull/34298/files#r734795801,] Similar to ORC 
> aggregate push down, we can disallow split input files for Parquet reader as 
> well. See original comment for motivation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37220) Do not split input file for Parquet reader with aggregate push down

2021-11-07 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440042#comment-17440042
 ] 

Chao Sun commented on SPARK-37220:
--

Thanks [~hyukjin.kwon]!

> Do not split input file for Parquet reader with aggregate push down
> ---
>
> Key: SPARK-37220
> URL: https://issues.apache.org/jira/browse/SPARK-37220
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.3.0
>
>
> As a followup of 
> [https://github.com/apache/spark/pull/34298/files#r734795801,] Similar to ORC 
> aggregate push down, we can disallow split input files for Parquet reader as 
> well. See original comment for motivation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36998) Handle concurrent eviction of same application in SHS

2021-11-07 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-36998:


Assignee: Thejdeep Gudivada  (was: Thejdeep)

> Handle concurrent eviction of same application in SHS
> -
>
> Key: SPARK-36998
> URL: https://issues.apache.org/jira/browse/SPARK-36998
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thejdeep Gudivada
>Assignee: Thejdeep Gudivada
>Priority: Minor
> Fix For: 3.2.1, 3.3.0
>
>
> SHS throws this exception when trying to make room for parsing a log file. 
> The reason is a race condition: space is being made for the processing of two 
> log files at the same time, and the deleteDirectory calls overlap.
> {code:java}
> 21/10/13 09:13:54 INFO HistoryServerDiskManager: Lease of 49.0 KiB may cause 
> usage to exceed max (101.7 GiB > 100.0 GiB) 21/10/13 09:13:54 WARN 
> HttpChannel: handleException 
> /api/v1/applications/application_1632281309592_2767775/1/jobs 
> java.io.IOException : Unable to delete directory 
> /grid/spark/sparkhistory-leveldb/apps/application_1631288241341_3657151_1.ldb.
>  21/10/13 09:13:54 WARN HttpChannelState: unhandled due to prior sendError 
> javax.servlet.ServletException: 
> org.glassfish.jersey.server.ContainerException: java.io.IOException: Unable 
> to delete directory /grid 
> /spark/sparkhistory-leveldb/apps/application_1631288241341_3657151_1.ldb. at 
> org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:410) 
> at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346) 
> at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:366)
>  at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:319)
>  at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
>  at 
> org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:791) 
> at 
> org.sparkproject.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1626)
>  at 
> org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) 
> at 
> org.sparkproject.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193) 
> at 
> org.sparkproject.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
>  at 
> org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
>  at 
> org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
>  at 
> org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435)
>  at 
> org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
>  at 
> org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
>  at 
> org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
>  at 
> org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350)
>  at 
> org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>  at 
> org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:763)
>  at 
> org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:234)
>  at 
> org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>  at org.sparkproject.jetty.server.Server.handle(Server.java:516) at 
> org.sparkproject.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
>  at org.sparkproject.jetty.server.HttpChannel.dispatch(HttpChannel.java:633) 
> at org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:380) at 
> org.sparkproject.jetty.server.HttpConnection.onFillable(HttpConnection.java:279)
>  at 
> org.sparkproject.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
>  at org.sparkproject.jetty.io.FillInterest.fillable(FillInterest.java:105) at 
> org.sparkproject.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) at 
> org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
>  at 
> org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36998) Handle concurrent eviction of same application in SHS

2021-11-07 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440066#comment-17440066
 ] 

Chao Sun commented on SPARK-36998:
--

Fixed

> Handle concurrent eviction of same application in SHS
> -
>
> Key: SPARK-36998
> URL: https://issues.apache.org/jira/browse/SPARK-36998
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thejdeep Gudivada
>Assignee: Thejdeep Gudivada
>Priority: Minor
> Fix For: 3.2.1, 3.3.0
>
>
> SHS throws this exception when trying to make room for parsing a log file. 
> The reason is a race condition: space is being made for the processing of two 
> log files at the same time, and the deleteDirectory calls overlap.
> {code:java}
> 21/10/13 09:13:54 INFO HistoryServerDiskManager: Lease of 49.0 KiB may cause 
> usage to exceed max (101.7 GiB > 100.0 GiB) 21/10/13 09:13:54 WARN 
> HttpChannel: handleException 
> /api/v1/applications/application_1632281309592_2767775/1/jobs 
> java.io.IOException : Unable to delete directory 
> /grid/spark/sparkhistory-leveldb/apps/application_1631288241341_3657151_1.ldb.
>  21/10/13 09:13:54 WARN HttpChannelState: unhandled due to prior sendError 
> javax.servlet.ServletException: 
> org.glassfish.jersey.server.ContainerException: java.io.IOException: Unable 
> to delete directory /grid 
> /spark/sparkhistory-leveldb/apps/application_1631288241341_3657151_1.ldb. at 
> org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:410) 
> at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346) 
> at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:366)
>  at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:319)
>  at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
>  at 
> org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:791) 
> at 
> org.sparkproject.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1626)
>  at 
> org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) 
> at 
> org.sparkproject.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193) 
> at 
> org.sparkproject.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
>  at 
> org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
>  at 
> org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
>  at 
> org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435)
>  at 
> org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
>  at 
> org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
>  at 
> org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
>  at 
> org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350)
>  at 
> org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>  at 
> org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:763)
>  at 
> org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:234)
>  at 
> org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>  at org.sparkproject.jetty.server.Server.handle(Server.java:516) at 
> org.sparkproject.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
>  at org.sparkproject.jetty.server.HttpChannel.dispatch(HttpChannel.java:633) 
> at org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:380) at 
> org.sparkproject.jetty.server.HttpConnection.onFillable(HttpConnection.java:279)
>  at 
> org.sparkproject.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
>  at org.sparkproject.jetty.io.FillInterest.fillable(FillInterest.java:105) at 
> org.sparkproject.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) at 
> org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
>  at 
> org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35437) Use expressions to filter Hive partitions at client side

2021-11-07 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-35437:


Assignee: dzcxzl

> Use expressions to filter Hive partitions at client side
> 
>
> Key: SPARK-35437
> URL: https://issues.apache.org/jira/browse/SPARK-35437
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Minor
> Fix For: 3.3.0
>
>
> When we have a table with a lot of partitions and there is no way to filter 
> it on the MetaStore Server, we will get all the partition details and filter 
> them on the client side. This is slow and puts a lot of pressure on the 
> MetaStore Server.
> We can first pull all the partition names, filter by expressions, and then 
> obtain detailed information about the corresponding partitions from the 
> MetaStore Server.
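
A rough sketch of the client-side strategy described above (error handling and the 
actual expression evaluation are elided; the name-based predicate is an assumption 
here, standing in for evaluating the filter expressions against parsed partition 
values):
{code:java}
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;

final class ClientSidePartitionPruning {
  /** Fetch only partition names first, filter them locally, then fetch full metadata
   *  for the surviving partitions instead of for every partition of the table. */
  static List<Partition> prunedPartitions(
      IMetaStoreClient client, String db, String table,
      Predicate<String> partitionNamePredicate) throws Exception {
    List<String> names = client.listPartitionNames(db, table, (short) -1);
    List<String> kept = names.stream()
        .filter(partitionNamePredicate)
        .collect(Collectors.toList());
    return client.getPartitionsByNames(db, table, kept);
  }
}
{code}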



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35437) Use expressions to filter Hive partitions at client side

2021-11-07 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-35437.
--
Resolution: Fixed

Issue resolved by pull request 34431
[https://github.com/apache/spark/pull/34431]

> Use expressions to filter Hive partitions at client side
> 
>
> Key: SPARK-35437
> URL: https://issues.apache.org/jira/browse/SPARK-35437
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Minor
> Fix For: 3.3.0
>
>
> When we have a table with a lot of partitions and there is no way to filter 
> it on the MetaStore Server, we will get all the partition details and filter 
> them on the client side. This is slow and puts a lot of pressure on the 
> MetaStore Server.
> We can first pull all the partition names, filter by expressions, and then 
> obtain detailed information about the corresponding partitions from the 
> MetaStore Server.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35437) Use expressions to filter Hive partitions at client side

2021-11-07 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-35437:
-
Priority: Major  (was: Minor)

> Use expressions to filter Hive partitions at client side
> 
>
> Key: SPARK-35437
> URL: https://issues.apache.org/jira/browse/SPARK-35437
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Major
> Fix For: 3.3.0
>
>
> When we have a table with a lot of partitions and there is no way to filter 
> it on the MetaStore Server, we will get all the partition details and filter 
> them on the client side. This is slow and puts a lot of pressure on the 
> MetaStore Server.
> We can first pull all the partition names, filter by expressions, and then 
> obtain detailed information about the corresponding partitions from the 
> MetaStore Server.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37239) Avoid unnecessary `setReplication` in Yarn mode

2021-11-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-37239:


Assignee: Yang Jie

> Avoid unnecessary `setReplication` in Yarn mode
> ---
>
> Key: SPARK-37239
> URL: https://issues.apache.org/jira/browse/SPARK-37239
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.1.2
>Reporter: wang-zhun
>Assignee: Yang Jie
>Priority: Major
>
> We found a large number of replication logs in hdfs server   
> ```
> 2021-11-04,17:22:13,065 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains 
> unchanged at 3 for 
> xxx/.sparkStaging/application_1635470728320_1144379/__spark_libs__303253482044663796.zip
> 2021-11-04,17:22:13,069 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains 
> unchanged at 3 for 
> xxx/.sparkStaging/application_1635470728320_1144383/__spark_libs__4747402134564993861.zip
> 2021-11-04,17:22:13,070 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains 
> unchanged at 3 for 
> xxx/.sparkStaging/application_1635470728320_1144373/__spark_libs__4377509773730188331.zip
> ```
> https://github.com/apache/hadoop/blob/6f7b965808f71f44e2617c50d366a6375fdfbbfa/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java#L2439
>   
> `setReplication` needs to acquire the write lock, so we should avoid this 
> unnecessary operation.
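
A minimal sketch of the guard (illustrative only; the actual change lives in Spark's 
YARN {{Client}} and may key off the configured replication factor instead):
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class ReplicationGuard {
  /** Skip the setReplication RPC (and the namenode write lock it takes) when the
   *  requested factor already matches the file's current replication. */
  static void maybeSetReplication(FileSystem fs, Path path, short desired) throws IOException {
    short current = fs.getFileStatus(path).getReplication();
    if (current != desired) {
      fs.setReplication(path, desired);
    }
  }
}
{code}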



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37239) Avoid unnecessary `setReplication` in Yarn mode

2021-11-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-37239.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34520
[https://github.com/apache/spark/pull/34520]

> Avoid unnecessary `setReplication` in Yarn mode
> ---
>
> Key: SPARK-37239
> URL: https://issues.apache.org/jira/browse/SPARK-37239
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.1.2
>Reporter: wang-zhun
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.3.0
>
>
> We found a large number of replication logs in hdfs server   
> ```
> 2021-11-04,17:22:13,065 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains 
> unchanged at 3 for 
> xxx/.sparkStaging/application_1635470728320_1144379/__spark_libs__303253482044663796.zip
> 2021-11-04,17:22:13,069 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains 
> unchanged at 3 for 
> xxx/.sparkStaging/application_1635470728320_1144383/__spark_libs__4747402134564993861.zip
> 2021-11-04,17:22:13,070 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory: Replication remains 
> unchanged at 3 for 
> xxx/.sparkStaging/application_1635470728320_1144373/__spark_libs__4377509773730188331.zip
> ```
> https://github.com/apache/hadoop/blob/6f7b965808f71f44e2617c50d366a6375fdfbbfa/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java#L2439
>   
> `setReplication` needs to acquire the write lock, so we should avoid this 
> unnecessary operation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37342) Upgrade Apache Arrow to 6.0.0

2021-11-15 Thread Chao Sun (Jira)
Chao Sun created SPARK-37342:


 Summary: Upgrade Apache Arrow to 6.0.0
 Key: SPARK-37342
 URL: https://issues.apache.org/jira/browse/SPARK-37342
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Chao Sun


Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last 
month.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37342) Upgrade Apache Arrow to 6.0.0

2021-11-15 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-37342:
-
Component/s: Build
 (was: Spark Core)

> Upgrade Apache Arrow to 6.0.0
> -
>
> Key: SPARK-37342
> URL: https://issues.apache.org/jira/browse/SPARK-37342
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last 
> month.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37166) SPIP: Storage Partitioned Join

2021-11-18 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-37166.
--
Fix Version/s: 3.3.0
 Assignee: Chao Sun
   Resolution: Fixed

> SPIP: Storage Partitioned Join
> --
>
> Key: SPARK-37166
> URL: https://issues.apache.org/jira/browse/SPARK-37166
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.3.0
>
>
> This JIRA tracks the SPIP for storage partitioned join.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37375) Umbrella: Storage Partitioned Join

2021-11-18 Thread Chao Sun (Jira)
Chao Sun created SPARK-37375:


 Summary: Umbrella: Storage Partitioned Join
 Key: SPARK-37375
 URL: https://issues.apache.org/jira/browse/SPARK-37375
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


This umbrella JIRA tracks the progress of implementing the Storage Partitioned Join 
feature for Spark.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37166) SPIP: Storage Partitioned Join

2021-11-18 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-37166:
-
Parent: SPARK-37375
Issue Type: Sub-task  (was: New Feature)

> SPIP: Storage Partitioned Join
> --
>
> Key: SPARK-37166
> URL: https://issues.apache.org/jira/browse/SPARK-37166
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.3.0
>
>
> This JIRA tracks the SPIP for storage partitioned join.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37376) Introduce a new DataSource V2 interface HasPartitionKey

2021-11-18 Thread Chao Sun (Jira)
Chao Sun created SPARK-37376:


 Summary: Introduce a new DataSource V2 interface HasPartitionKey 
 Key: SPARK-37376
 URL: https://issues.apache.org/jira/browse/SPARK-37376
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


One of the prerequisites for the feature is to allow V2 input partitions to 
report their partition values to Spark, which can use them to check whether both 
sides of a join are co-partitioned, and also to optionally group input partitions 
together.
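
As a rough illustration of the idea, here is a minimal self-contained sketch; the 
trait names and signatures below are assumptions for the example, not the actual 
Spark connector API:
{code:scala}
// Stand-ins for the connector types; only the shape of the idea matters here.
trait InputPartition extends Serializable

trait HasPartitionKey extends InputPartition {
  // The partition value shared by every row in this input partition.
  def partitionKey: Seq[Any]
}

final case class FileSplit(path: String, key: Seq[Any]) extends HasPartitionKey {
  override def partitionKey: Seq[Any] = key
}

object CoPartitionCheck {
  // In this sketch, two scans are co-partitioned when they report the same set
  // of partition keys; Spark could then group matching partitions together and
  // plan the join without a shuffle.
  def coPartitioned(left: Seq[HasPartitionKey], right: Seq[HasPartitionKey]): Boolean =
    left.map(_.partitionKey).toSet == right.map(_.partitionKey).toSet
}
{code}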



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37377) Refactor V2 Partitioning interface and remove deprecated usage of Distribution

2021-11-18 Thread Chao Sun (Jira)
Chao Sun created SPARK-37377:


 Summary: Refactor V2 Partitioning interface and remove deprecated 
usage of Distribution
 Key: SPARK-37377
 URL: https://issues.apache.org/jira/browse/SPARK-37377
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


Currently {{Partitioning}} is defined as follows:
{code:scala}
@Evolving
public interface Partitioning {
  int numPartitions();
  boolean satisfy(Distribution distribution);
}
{code}

There are two issues with the interface: 1) it uses the deprecated 
{{Distribution}} interface and should switch to 
{{org.apache.spark.sql.connector.distributions.Distribution}}; 2) there is 
currently no way to use this in a join, where we want to compare the reported 
partitionings from both sides and decide whether they are "compatible" (and 
thus allow Spark to eliminate the shuffle). 
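
As a rough sketch of what the refactoring could enable (illustrative stand-in 
types, not a concrete API proposal), a partitioning that reports the expressions 
it is keyed on lets the planner run a simple compatibility check between the two 
sides of a join:
{code:scala}
sealed trait PartitionExpr
final case class ColumnRef(name: String) extends PartitionExpr
final case class Bucket(numBuckets: Int, column: String) extends PartitionExpr

trait ReportedPartitioning {
  def numPartitions: Int
  def keys: Seq[PartitionExpr]
}

final case class KeyGroupedPartitioning(numPartitions: Int, keys: Seq[PartitionExpr])
  extends ReportedPartitioning

object JoinCompatibility {
  // In this sketch both sides are "compatible" when they are keyed on the same
  // expressions and split into the same number of partitions, so the join can
  // be planned without introducing a shuffle.
  def compatible(left: ReportedPartitioning, right: ReportedPartitioning): Boolean =
    left.numPartitions == right.numPartitions && left.keys == right.keys
}

// JoinCompatibility.compatible(
//   KeyGroupedPartitioning(16, Seq(Bucket(16, "id"))),
//   KeyGroupedPartitioning(16, Seq(Bucket(16, "id"))))  // => true, shuffle can be avoided
{code}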



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37378) Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog

2021-11-18 Thread Chao Sun (Jira)
Chao Sun created SPARK-37378:


 Summary: Convert V2 Transform expressions into catalyst 
expressions and load their associated functions from V2 FunctionCatalog
 Key: SPARK-37378
 URL: https://issues.apache.org/jira/browse/SPARK-37378
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


We need to add logic to convert a V2 {{Transform}} expression into its catalyst 
expression counterpart, and also load its function definition from the V2 
FunctionCatalog provided.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-11-29 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-35867.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34611
[https://github.com/apache/spark/pull/34611]

> Enable vectorized read for VectorizedPlainValuesReader.readBooleans
> ---
>
> Key: SPARK-35867
> URL: https://issues.apache.org/jira/browse/SPARK-35867
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Kazuyuki Tanimura
>Priority: Minor
> Fix For: 3.3.0
>
>
> Currently we decode PLAIN-encoded booleans as follows:
> {code:java}
>   public final void readBooleans(int total, WritableColumnVector c, int 
> rowId) {
> // TODO: properly vectorize this
> for (int i = 0; i < total; i++) {
>   c.putBoolean(rowId + i, readBoolean());
> }
>   }
> {code}
> Ideally we should vectorize this.
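
As an illustration of the vectorized idea (stand-in types, and the input is 
assumed to start byte-aligned; this is not the real VectorizedPlainValuesReader), 
PLAIN booleans are bit-packed eight per byte, so a whole byte can be decoded per 
outer iteration instead of one value per call:
{code:scala}
object PlainBooleanDecoder {
  trait BooleanColumnSink {
    def putBoolean(rowId: Int, value: Boolean): Unit
  }

  // Decodes `total` bit-packed booleans from `packed` into `c`, starting at `rowId`.
  def readBooleans(total: Int, c: BooleanColumnSink, rowId: Int, packed: Array[Byte]): Unit = {
    var i = 0
    while (i < total) {
      val b = packed(i >> 3)            // byte holding the next (up to) 8 values
      val n = math.min(8, total - i)    // how many of its bits we still need
      var bit = 0
      while (bit < n) {
        // Parquet packs booleans LSB-first within each byte.
        c.putBoolean(rowId + i + bit, ((b >> bit) & 1) == 1)
        bit += 1
      }
      i += n
    }
  }
}
{code}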



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-11-29 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-35867:


Assignee: Kazuyuki Tanimura

> Enable vectorized read for VectorizedPlainValuesReader.readBooleans
> ---
>
> Key: SPARK-35867
> URL: https://issues.apache.org/jira/browse/SPARK-35867
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Kazuyuki Tanimura
>Priority: Minor
>
> Currently we decode PLAIN-encoded booleans as follows:
> {code:java}
>   public final void readBooleans(int total, WritableColumnVector c, int 
> rowId) {
> // TODO: properly vectorize this
> for (int i = 0; i < total; i++) {
>   c.putBoolean(rowId + i, readBoolean());
> }
>   }
> {code}
> Ideally we should vectorize this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36529) Decouple CPU with IO work in vectorized Parquet reader

2021-12-03 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-36529:
-
Attachment: image.png

> Decouple CPU with IO work in vectorized Parquet reader
> --
>
> Key: SPARK-36529
> URL: https://issues.apache.org/jira/browse/SPARK-36529
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently it seems the vectorized Parquet reader does almost everything in a 
> sequential manner:
> 1. read the row group using file system API (perhaps from remote storage like 
> S3)
> 2. allocate buffers and store those row group bytes into them
> 3. decompress the data pages
> 4. in Spark, decode all the read columns one by one
> 5. read the next row group and repeat from 1.
> A lot of improvements can be done to decouple the IO- and CPU-intensive work. 
> In addition, we could parallelize the row group loading and column decoding, 
> and utilize all the cores available to a Spark task.
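
As a rough sketch of the decoupling (stand-in types only, not the actual reader 
classes), one option is to prefetch and decompress the next row group on a 
dedicated IO thread while the current row group is decoded on the task thread:
{code:scala}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object PipelinedRowGroupReader {
  final case class RowGroupBytes(index: Int, pages: Array[Byte])  // result of IO + decompression
  final case class DecodedBatch(index: Int, numRows: Int)         // result of CPU-bound decoding

  def readAndDecompress(index: Int): RowGroupBytes = RowGroupBytes(index, Array.emptyByteArray)
  def decode(rg: RowGroupBytes): DecodedBatch = DecodedBatch(rg.index, 0)

  // One dedicated thread does the IO so it overlaps with decoding on the task thread.
  private implicit val ioPool: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

  def readAll(numRowGroups: Int): Seq[DecodedBatch] = {
    require(numRowGroups > 0)
    var next: Future[RowGroupBytes] = Future(readAndDecompress(0))
    (0 until numRowGroups).map { i =>
      val current = Await.result(next, Duration.Inf)                     // wait for prefetched bytes
      if (i + 1 < numRowGroups) next = Future(readAndDecompress(i + 1))  // kick off IO for the next group
      decode(current)                                                    // CPU-bound decode meanwhile
    }
  }
}
{code}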



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36529) Decouple CPU with IO work in vectorized Parquet reader

2021-12-03 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-36529:
-
Attachment: (was: image.png)

> Decouple CPU with IO work in vectorized Parquet reader
> --
>
> Key: SPARK-36529
> URL: https://issues.apache.org/jira/browse/SPARK-36529
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently it seems the vectorized Parquet reader does almost everything in a 
> sequential manner:
> 1. read the row group using file system API (perhaps from remote storage like 
> S3)
> 2. allocate buffers and store those row group bytes into them
> 3. decompress the data pages
> 4. in Spark, decode all the read columns one by one
> 5. read the next row group and repeat from 1.
> A lot of improvements can be done to decouple the IO- and CPU-intensive work. 
> In addition, we could parallelize the row group loading and column decoding, 
> and utilize all the cores available to a Spark task.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37445) Update hadoop-profile

2021-12-07 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-37445.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34715
[https://github.com/apache/spark/pull/34715]

> Update hadoop-profile
> -
>
> Key: SPARK-37445
> URL: https://issues.apache.org/jira/browse/SPARK-37445
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> The current Hadoop profile is hadoop-3.2; update it to hadoop-3.3.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37445) Update hadoop-profile

2021-12-07 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-37445:


Assignee: angerszhu

> Update hadoop-profile
> -
>
> Key: SPARK-37445
> URL: https://issues.apache.org/jira/browse/SPARK-37445
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> The current Hadoop profile is hadoop-3.2; update it to hadoop-3.3.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN

2021-12-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-37205:


Assignee: Chao Sun

> Support mapreduce.job.send-token-conf when starting containers in YARN
> --
>
> Key: SPARK-37205
> URL: https://issues.apache.org/jira/browse/SPARK-37205
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see 
> [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]), with which the 
> RM is not required to statically have configs for all the secure HDFS 
> clusters. Currently it only works for MRv2, but it'd be nice if Spark could 
> also use this feature. I think we only need to pass the config to the 
> {{ContainerLaunchContext}} in {{Client.createContainerLaunchContext}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN

2021-12-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-37205.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34635
[https://github.com/apache/spark/pull/34635]

> Support mapreduce.job.send-token-conf when starting containers in YARN
> --
>
> Key: SPARK-37205
> URL: https://issues.apache.org/jira/browse/SPARK-37205
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.3.0
>
>
> {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see 
> [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]), with which the 
> RM is not required to statically have configs for all the secure HDFS 
> clusters. Currently it only works for MRv2, but it'd be nice if Spark could 
> also use this feature. I think we only need to pass the config to the 
> {{ContainerLaunchContext}} in {{Client.createContainerLaunchContext}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37561) Avoid loading all functions when obtaining hive's DelegationToken

2021-12-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-37561.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34822
[https://github.com/apache/spark/pull/34822]

> Avoid loading all functions when obtaining hive's DelegationToken
> -
>
> Key: SPARK-37561
> URL: https://issues.apache.org/jira/browse/SPARK-37561
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Trivial
> Fix For: 3.3.0
>
> Attachments: getDelegationToken_load_functions.png
>
>
> At present, when obtaining Hive's delegation token, all functions are loaded.
> This is unnecessary: loading the functions takes time, and it also 
> increases the burden on the Hive metastore.
>  
> !getDelegationToken_load_functions.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37561) Avoid loading all functions when obtaining hive's DelegationToken

2021-12-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-37561:


Assignee: dzcxzl

> Avoid loading all functions when obtaining hive's DelegationToken
> -
>
> Key: SPARK-37561
> URL: https://issues.apache.org/jira/browse/SPARK-37561
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Trivial
> Attachments: getDelegationToken_load_functions.png
>
>
> At present, when obtaining Hive's delegation token, all functions are loaded.
> This is unnecessary: loading the functions takes time, and it also 
> increases the burden on the Hive metastore.
>  
> !getDelegationToken_load_functions.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37600) Upgrade to Hadoop 3.3.2

2021-12-09 Thread Chao Sun (Jira)
Chao Sun created SPARK-37600:


 Summary: Upgrade to Hadoop 3.3.2
 Key: SPARK-37600
 URL: https://issues.apache.org/jira/browse/SPARK-37600
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.3.0
Reporter: Chao Sun


Upgrade Spark to use Hadoop 3.3.2 once it's released.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37573) IsolatedClient fallbackVersion should be build in version, not always 2.7.4

2021-12-09 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-37573:


Assignee: angerszhu

> IsolatedClient  fallbackVersion should be build in version, not always 2.7.4
> 
>
> Key: SPARK-37573
> URL: https://issues.apache.org/jira/browse/SPARK-37573
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> With Hadoop 3, falling back to 2.7.4 causes an error:
> {code}
> [info] org.apache.spark.sql.hive.client.VersionsSuite *** ABORTED *** (31 
> seconds, 320 milliseconds)
> [info]   java.lang.ClassFormatError: Truncated class file
> [info]   at java.lang.ClassLoader.defineClass1(Native Method)
> [info]   at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
> [info]   at java.lang.ClassLoader.defineClass(ClassLoader.java:635)
> [info]   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:266)
> [info]   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:258)
> [info]   at java.lang.ClassLoader.loadClass(ClassLoader.java:405)
> [info]   at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
> [info]   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:313)
> [info]   at 
> org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:50)
> [info]   at 
> org.apache.spark.sql.hive.client.VersionsSuite.$anonfun$new$2(VersionsSuite.scala:82)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
> [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
> [info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
> [info]   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
> [info]   at scala.collection.immutable.List.foreach(List.scala:431)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
> [info]   at org.scalatest.Suite.run(Suite.scala:1112)
> [info]   at org.scalatest.Suite.run$(Suite.scala:1094)
> [
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37573) IsolatedClient fallbackVersion should be build in version, not always 2.7.4

2021-12-09 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-37573.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34830
[https://github.com/apache/spark/pull/34830]

> IsolatedClient  fallbackVersion should be build in version, not always 2.7.4
> 
>
> Key: SPARK-37573
> URL: https://issues.apache.org/jira/browse/SPARK-37573
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> With Hadoop 3, falling back to 2.7.4 causes an error:
> {code}
> [info] org.apache.spark.sql.hive.client.VersionsSuite *** ABORTED *** (31 
> seconds, 320 milliseconds)
> [info]   java.lang.ClassFormatError: Truncated class file
> [info]   at java.lang.ClassLoader.defineClass1(Native Method)
> [info]   at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
> [info]   at java.lang.ClassLoader.defineClass(ClassLoader.java:635)
> [info]   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:266)
> [info]   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:258)
> [info]   at java.lang.ClassLoader.loadClass(ClassLoader.java:405)
> [info]   at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
> [info]   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:313)
> [info]   at 
> org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:50)
> [info]   at 
> org.apache.spark.sql.hive.client.VersionsSuite.$anonfun$new$2(VersionsSuite.scala:82)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
> [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
> [info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
> [info]   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
> [info]   at scala.collection.immutable.List.foreach(List.scala:431)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
> [info]   at org.scalatest.Suite.run(Suite.scala:1112)
> [info]   at org.scalatest.Suite.run$(Suite.scala:1094)
> [
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37217) The number of dynamic partitions should early check when writing to external tables

2021-12-13 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-37217.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34493
[https://github.com/apache/spark/pull/34493]

> The number of dynamic partitions should early check when writing to external 
> tables
> ---
>
> Key: SPARK-37217
> URL: https://issues.apache.org/jira/browse/SPARK-37217
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Trivial
> Fix For: 3.3.0
>
>
> [SPARK-29295|https://issues.apache.org/jira/browse/SPARK-29295] introduces a 
> mechanism where writes to external tables use dynamic partitioning, and 
> the data in the target partitions is deleted first.
> Assuming that 1001 partitions are written, the data of those 1001 partitions 
> will be deleted first, but because hive.exec.max.dynamic.partitions is 1000 by 
> default, loadDynamicPartitions will then fail even though the data of the 1001 
> partitions has already been deleted.
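
A minimal sketch of the intended early check (the config name comes from the 
issue description; the method itself is illustrative, not the actual write path):
{code:scala}
object DynamicPartitionCheck {
  // Validate the dynamic partition count before any target data is touched,
  // instead of hitting the limit inside loadDynamicPartitions after the old
  // partition data has already been deleted.
  def assertWithinLimit(numDynamicPartitions: Int, maxDynamicPartitions: Int): Unit = {
    if (numDynamicPartitions > maxDynamicPartitions) {
      throw new IllegalArgumentException(
        s"Number of dynamic partitions is $numDynamicPartitions, which exceeds " +
        s"hive.exec.max.dynamic.partitions=$maxDynamicPartitions; aborting before any data is deleted.")
    }
  }
}

// DynamicPartitionCheck.assertWithinLimit(1001, 1000)  // fails fast, nothing deleted yet
{code}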



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37217) The number of dynamic partitions should early check when writing to external tables

2021-12-13 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-37217:


Assignee: dzcxzl

> The number of dynamic partitions should early check when writing to external 
> tables
> ---
>
> Key: SPARK-37217
> URL: https://issues.apache.org/jira/browse/SPARK-37217
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Trivial
>
> [SPARK-29295|https://issues.apache.org/jira/browse/SPARK-29295] introduces a 
> mechanism where writes to external tables use dynamic partitioning, and 
> the data in the target partitions is deleted first.
> Assuming that 1001 partitions are written, the data of those 1001 partitions 
> will be deleted first, but because hive.exec.max.dynamic.partitions is 1000 by 
> default, loadDynamicPartitions will then fail even though the data of the 1001 
> partitions has already been deleted.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37481) Disappearance of skipped stages mislead the bug hunting

2021-12-13 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-37481:
-
Fix Version/s: 3.2.1
   (was: 3.2.0)

> Disappearance of skipped stages mislead the bug hunting 
> 
>
> Key: SPARK-37481
> URL: https://issues.apache.org/jira/browse/SPARK-37481
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.2.1, 3.3.0
>
>
> With FetchFailedException and Map Stage Retries
> When rerunning spark-sql shell with the original SQL in 
> [https://gist.github.com/yaooqinn/6acb7b74b343a6a6dffe8401f6b7b45c#gistcomment-3977315]
> !https://user-images.githubusercontent.com/8326978/143821530-ff498caa-abce-483d-a24b-315aacf7e0a0.png!
> 1. stage 3 threw FetchFailedException and caused itself and its parent 
> stage(stage 2) to retry
> 2. stage 2 was skipped before but its attemptId was still 0, so when its 
> retry happened it got removed from `Skipped Stages` 
> The DAG of Job 2 doesn't show that stage 2 is skipped anymore.
> !https://user-images.githubusercontent.com/8326978/143824666-6390b64a-a45b-4bc8-b05d-c5abbb28cdef.png!
> Besides, a retried stage usually has a subset of tasks from the original 
> stage. If we mark it as an original one, the metrics might lead us into 
> pitfalls.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37217) The number of dynamic partitions should early check when writing to external tables

2021-12-14 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-37217:
-
Fix Version/s: 3.2.1

> The number of dynamic partitions should early check when writing to external 
> tables
> ---
>
> Key: SPARK-37217
> URL: https://issues.apache.org/jira/browse/SPARK-37217
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Trivial
> Fix For: 3.2.1, 3.3.0
>
>
> [SPARK-29295|https://issues.apache.org/jira/browse/SPARK-29295] introduces a 
> mechanism where writes to external tables use dynamic partitioning, and 
> the data in the target partitions is deleted first.
> Assuming that 1001 partitions are written, the data of those 1001 partitions 
> will be deleted first, but because hive.exec.max.dynamic.partitions is 1000 by 
> default, loadDynamicPartitions will then fail even though the data of the 1001 
> partitions has already been deleted.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled

2021-12-15 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-37633.
--
Fix Version/s: 3.3.0
   3.2.1
   Resolution: Fixed

Issue resolved by pull request 34888
[https://github.com/apache/spark/pull/34888]

> Unwrap cast should skip if downcast failed with ansi enabled
> 
>
> Key: SPARK-37633
> URL: https://issues.apache.org/jira/browse/SPARK-37633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.3.0, 3.2.1
>
>
> Currently, unwrap cast throws ArithmeticException if the downcast fails with 
> ANSI mode enabled. Since UnwrapCastInBinaryComparison is an optimizer rule, we 
> should always skip on failure regardless of the ANSI config.
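
A small self-contained sketch of the intended behaviour (simplified stand-ins, 
not the actual optimizer rule): attempt the ANSI narrowing, and fall back to 
keeping the original expression instead of propagating the exception:
{code:scala}
import scala.util.Try

object UnwrapCastSketch {
  // Stand-in for the ANSI-mode narrowing that can throw ArithmeticException.
  def ansiLongToInt(v: Long): Int = {
    if (v < Int.MinValue || v > Int.MaxValue) {
      throw new ArithmeticException(s"$v does not fit into an int")
    }
    v.toInt
  }

  // Return Some(narrowed literal) when the comparison can be rewritten on the
  // narrower type, or None to skip the rewrite and keep the original cast.
  def tryNarrowLiteral(literal: Long): Option[Int] = Try(ansiLongToInt(literal)).toOption
}

// UnwrapCastSketch.tryNarrowLiteral(42L)       // Some(42): safe to unwrap the cast
// UnwrapCastSketch.tryNarrowLiteral(1L << 40)  // None: overflow, the rule skips quietly
{code}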



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled

2021-12-15 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-37633:


Assignee: Manu Zhang

> Unwrap cast should skip if downcast failed with ansi enabled
> 
>
> Key: SPARK-37633
> URL: https://issues.apache.org/jira/browse/SPARK-37633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
>
> Currently, unwrap cast throws ArithmeticException if the downcast fails with 
> ANSI mode enabled. Since UnwrapCastInBinaryComparison is an optimizer rule, we 
> should always skip on failure regardless of the ANSI config.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled

2021-12-15 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-37633:
-
Affects Version/s: (was: 3.0.3)

> Unwrap cast should skip if downcast failed with ansi enabled
> 
>
> Key: SPARK-37633
> URL: https://issues.apache.org/jira/browse/SPARK-37633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.2.1, 3.3.0
>
>
> Currently, unwrap cast throws ArithmeticException if the downcast fails with 
> ANSI mode enabled. Since UnwrapCastInBinaryComparison is an optimizer rule, we 
> should always skip on failure regardless of the ANSI config.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37974) Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support

2022-03-31 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-37974.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 35262
[https://github.com/apache/spark/pull/35262]

> Implement vectorized  DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings 
> for Parquet V2 support
> ---
>
> Key: SPARK-37974
> URL: https://issues.apache.org/jira/browse/SPARK-37974
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Parth Chandra
>Assignee: Parth Chandra
>Priority: Major
> Fix For: 3.4.0
>
>
> SPARK-36879 implements the DELTA_BINARY_PACKED encoding which is for integer 
> values, but does not implement the DELTA_BYTE_ARRAY encoding which is for 
> string values. DELTA_BYTE_ARRAY encoding also requires the 
> DELTA_LENGTH_BYTE_ARRAY encoding. Both these encodings need vectorized 
> versions as the current implementation simply calls the non-vectorized 
> Parquet library methods.
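
For reference, a conceptual sketch of what DELTA_BYTE_ARRAY decoding does (the 
inputs are assumed to be already unpacked from their DELTA_BINARY_PACKED and 
DELTA_LENGTH_BYTE_ARRAY streams; this is not the vectorized implementation 
itself):
{code:scala}
object DeltaByteArraySketch {
  // value(i) = first prefixLens(i) bytes of value(i-1), followed by suffixes(i).
  // The format guarantees prefixLens(0) == 0, so the first value is just its suffix.
  def decode(prefixLens: Array[Int], suffixes: Array[Array[Byte]]): Array[Array[Byte]] = {
    val out = new Array[Array[Byte]](suffixes.length)
    var previous = Array.emptyByteArray
    var i = 0
    while (i < suffixes.length) {
      val value = new Array[Byte](prefixLens(i) + suffixes(i).length)
      System.arraycopy(previous, 0, value, 0, prefixLens(i))                      // shared prefix
      System.arraycopy(suffixes(i), 0, value, prefixLens(i), suffixes(i).length)  // new suffix
      out(i) = value
      previous = value
      i += 1
    }
    out
  }
}

// DeltaByteArraySketch.decode(
//   Array(0, 5, 6),
//   Array("Hello", " world", "s").map(_.getBytes("UTF-8")))
//   => "Hello", "Hello world", "Hello s"
{code}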



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37974) Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support

2022-03-31 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-37974:


Assignee: Parth Chandra

> Implement vectorized  DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings 
> for Parquet V2 support
> ---
>
> Key: SPARK-37974
> URL: https://issues.apache.org/jira/browse/SPARK-37974
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Parth Chandra
>Assignee: Parth Chandra
>Priority: Major
>
> SPARK-36879 implements the DELTA_BINARY_PACKED encoding which is for integer 
> values, but does not implement the DELTA_BYTE_ARRAY encoding which is for 
> string values. DELTA_BYTE_ARRAY encoding also requires the 
> DELTA_LENGTH_BYTE_ARRAY encoding. Both these encodings need vectorized 
> versions as the current implementation simply calls the non-vectorized 
> Parquet library methods.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37974) Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support

2022-03-31 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-37974:
-
Fix Version/s: 3.3.0
   (was: 3.4.0)

> Implement vectorized  DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings 
> for Parquet V2 support
> ---
>
> Key: SPARK-37974
> URL: https://issues.apache.org/jira/browse/SPARK-37974
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Parth Chandra
>Assignee: Parth Chandra
>Priority: Major
> Fix For: 3.3.0
>
>
> SPARK-36879 implements the DELTA_BINARY_PACKED encoding which is for integer 
> values, but does not implement the DELTA_BYTE_ARRAY encoding which is for 
> string values. DELTA_BYTE_ARRAY encoding also requires the 
> DELTA_LENGTH_BYTE_ARRAY encoding. Both these encodings need vectorized 
> versions as the current implementation simply calls the non-vectorized 
> Parquet library methods.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37377) Initial implementation of Storage-Partitioned Join

2022-04-04 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-37377:
-
Summary: Initial implementation of Storage-Partitioned Join  (was: Refactor 
V2 Partitioning interface and remove deprecated usage of Distribution)

> Initial implementation of Storage-Partitioned Join
> --
>
> Key: SPARK-37377
> URL: https://issues.apache.org/jira/browse/SPARK-37377
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently {{Partitioning}} is defined as follows:
> {code:scala}
> @Evolving
> public interface Partitioning {
>   int numPartitions();
>   boolean satisfy(Distribution distribution);
> }
> {code}
> There are two issues with the interface: 1) it uses the deprecated 
> {{Distribution}} interface and should switch to 
> {{org.apache.spark.sql.connector.distributions.Distribution}}; 2) there is 
> currently no way to use this in a join, where we want to compare the reported 
> partitionings from both sides and decide whether they are "compatible" (and 
> thus allow Spark to eliminate the shuffle). 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37377) Initial implementation of Storage-Partitioned Join

2022-04-04 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-37377:
-
Description: This Jira tracks the initial implementation of 
storage-partitioned join.  (was: Currently {{Partitioning}} is defined as 
follows:
{code:scala}
@Evolving
public interface Partitioning {
  int numPartitions();
  boolean satisfy(Distribution distribution);
}
{code}

There are two issues with the interface: 1) it uses the deprecated 
{{Distribution}} interface and should switch to 
{{org.apache.spark.sql.connector.distributions.Distribution}}; 2) there is 
currently no way to use this in a join, where we want to compare the reported 
partitionings from both sides and decide whether they are "compatible" (and 
thus allow Spark to eliminate the shuffle). )

> Initial implementation of Storage-Partitioned Join
> --
>
> Key: SPARK-37377
> URL: https://issues.apache.org/jira/browse/SPARK-37377
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.4.0
>
>
> This Jira tracks the initial implementation of storage-partitioned join.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37378) Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog

2022-04-04 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-37378.
--
Resolution: Duplicate

This JIRA is covered as part of SPARK-37377

> Convert V2 Transform expressions into catalyst expressions and load their 
> associated functions from V2 FunctionCatalog
> --
>
> Key: SPARK-37378
> URL: https://issues.apache.org/jira/browse/SPARK-37378
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> We need to add logic to convert a V2 {{Transform}} expression into its 
> catalyst expression counterpart, and also load its function definition from 
> the V2 FunctionCatalog provided.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37378) Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog

2022-04-04 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-37378:
-
Fix Version/s: 3.4.0

> Convert V2 Transform expressions into catalyst expressions and load their 
> associated functions from V2 FunctionCatalog
> --
>
> Key: SPARK-37378
> URL: https://issues.apache.org/jira/browse/SPARK-37378
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
> Fix For: 3.4.0
>
>
> We need to add logic to convert a V2 {{Transform}} expression into its 
> catalyst expression counterpart, and also load its function definition from 
> the V2 FunctionCatalog provided.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34863) Support nested column in Spark Parquet vectorized readers

2022-04-04 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-34863:


Assignee: Chao Sun  (was: Apache Spark)

> Support nested column in Spark Parquet vectorized readers
> -
>
> Key: SPARK-34863
> URL: https://issues.apache.org/jira/browse/SPARK-34863
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Chao Sun
>Priority: Minor
> Fix For: 3.3.0
>
>
> The task is to support nested column types in the Spark Parquet vectorized 
> reader. Currently the Parquet vectorized reader does not support nested column 
> types (struct, array and map). We implemented a nested column vectorized 
> reader for FB-ORC in our internal fork of Spark. We are seeing performance 
> improvements compared to the non-vectorized reader when reading nested 
> columns. In addition, this can also help improve non-nested column performance 
> when reading non-nested and nested columns together in one query.
>  
> Parquet: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L173]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38786) Test Bug in StatisticsSuite "change stats after add/drop partition command"

2022-04-05 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-38786:


Assignee: Kazuyuki Tanimura

> Test Bug in StatisticsSuite "change stats after add/drop partition command"
> ---
>
> Key: SPARK-38786
> URL: https://issues.apache.org/jira/browse/SPARK-38786
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Minor
>
> [https://github.com/apache/spark/blob/cbffc12f90e45d33e651e38cf886d7ab4bcf96da/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala#L979]
> It should be `partDir2` instead of `partDir1`. Looks like it is a copy paste 
> bug.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38786) Test Bug in StatisticsSuite "change stats after add/drop partition command"

2022-04-05 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-38786.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36075
[https://github.com/apache/spark/pull/36075]

> Test Bug in StatisticsSuite "change stats after add/drop partition command"
> ---
>
> Key: SPARK-38786
> URL: https://issues.apache.org/jira/browse/SPARK-38786
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Minor
> Fix For: 3.4.0
>
>
> [https://github.com/apache/spark/blob/cbffc12f90e45d33e651e38cf886d7ab4bcf96da/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala#L979]
> It should be `partDir2` instead of `partDir1`. Looks like it is a copy paste 
> bug.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38840) Enable spark.sql.parquet.enableNestedColumnVectorizedReader on master branch by default

2022-04-08 Thread Chao Sun (Jira)
Chao Sun created SPARK-38840:


 Summary: Enable 
spark.sql.parquet.enableNestedColumnVectorizedReader on master branch by default
 Key: SPARK-38840
 URL: https://issues.apache.org/jira/browse/SPARK-38840
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Chao Sun


We can enable {{spark.sql.parquet.enableNestedColumnVectorizedReader}} on 
master branch by default, to make sure it is covered by more tests.
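
Until the default changes, the flag can also be turned on explicitly per session; 
a minimal example (the config key is taken from this issue, while the app name 
and input path are placeholders):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("nested-vectorized-reader-demo")
  .master("local[*]")
  .config("spark.sql.parquet.enableNestedColumnVectorizedReader", "true")
  .getOrCreate()

// Parquet columns of struct/array/map type can now take the vectorized path.
val df = spark.read.parquet("/tmp/nested_table")  // placeholder path
df.printSchema()
{code}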



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38891) Skipping allocating vector for repetition & definition levels when possible

2022-04-13 Thread Chao Sun (Jira)
Chao Sun created SPARK-38891:


 Summary: Skipping allocating vector for repetition & definition 
levels when possible
 Key: SPARK-38891
 URL: https://issues.apache.org/jira/browse/SPARK-38891
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


Currently the vectorized Parquet reader will allocate vectors for repetition 
and definition levels in all cases. However in certain cases (e.g., when 
reading primitive types) this is not necessary and should be avoided.
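
A simplified sketch of the optimization (stand-in types, not the real reader): 
allocate the level vectors only when the column's max repetition or definition 
level is greater than zero:
{code:scala}
object LevelVectorAllocation {
  final case class ParquetColumn(maxRepetitionLevel: Int, maxDefinitionLevel: Int)
  final class LevelVector(val capacity: Int)  // stand-in for a WritableColumnVector

  // A required top-level primitive column has max levels of zero and needs
  // neither vector; nested or optional columns still get theirs.
  def allocateLevels(col: ParquetColumn, batchSize: Int): (Option[LevelVector], Option[LevelVector]) = {
    val repetition = if (col.maxRepetitionLevel > 0) Some(new LevelVector(batchSize)) else None
    val definition = if (col.maxDefinitionLevel > 0) Some(new LevelVector(batchSize)) else None
    (repetition, definition)
  }
}
{code}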



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38573) Support Auto Partition Statistics Collection

2022-04-15 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-38573.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36067
[https://github.com/apache/spark/pull/36067]

> Support Auto Partition Statistics Collection
> 
>
> Key: SPARK-38573
> URL: https://issues.apache.org/jira/browse/SPARK-38573
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently https://issues.apache.org/jira/browse/SPARK-21127 supports storing 
> the aggregated stats at the table level for partitioned tables with the config 
> spark.sql.statistics.size.autoUpdate.enabled.
> Supporting partition-level stats is useful for knowing which partitions are 
> outliers (skewed partitions), and the query optimizer works better with 
> partition-level stats in the case of partition pruning.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38573) Support Auto Partition Statistics Collection

2022-04-15 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-38573:


Assignee: Kazuyuki Tanimura

> Support Auto Partition Statistics Collection
> 
>
> Key: SPARK-38573
> URL: https://issues.apache.org/jira/browse/SPARK-38573
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
>
> Currently https://issues.apache.org/jira/browse/SPARK-21127 supports storing 
> the aggregated stats at the table level for partitioned tables with the config 
> spark.sql.statistics.size.autoUpdate.enabled.
> Supporting partition-level stats is useful for knowing which partitions are 
> outliers (skewed partitions), and the query optimizer works better with 
> partition-level stats in the case of partition pruning.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38891) Skipping allocating vector for repetition & definition levels when possible

2022-05-04 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-38891.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36202
[https://github.com/apache/spark/pull/36202]

> Skipping allocating vector for repetition & definition levels when possible
> ---
>
> Key: SPARK-38891
> URL: https://issues.apache.org/jira/browse/SPARK-38891
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently the vectorized Parquet reader will allocate vectors for repetition 
> and definition levels in all cases. However in certain cases (e.g., when 
> reading primitive types) this is not necessary and should be avoided.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38891) Skipping allocating vector for repetition & definition levels when possible

2022-05-04 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-38891:


Assignee: Chao Sun

> Skipping allocating vector for repetition & definition levels when possible
> ---
>
> Key: SPARK-38891
> URL: https://issues.apache.org/jira/browse/SPARK-38891
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> Currently the vectorized Parquet reader will allocate vectors for repetition 
> and definition levels in all cases. However in certain cases (e.g., when 
> reading primitive types) this is not necessary and should be avoided.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


