[jira] [Commented] (SPARK-37068) Confusing tgz filename for download

2021-10-22 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433211#comment-17433211
 ] 

Sean R. Owen commented on SPARK-37068:
--

Yes, it's too late to change it, but the 'hadoop-3.2' in the file name really means '... or 
later'. The code is compiled against Hadoop 3.3. We'll eventually fix the 
profile names, and thus the release tarball name, but that is the right download.

> Confusing tgz filename for download
> ---
>
> Key: SPARK-37068
> URL: https://issues.apache.org/jira/browse/SPARK-37068
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation
>Affects Versions: 3.2.0
>Reporter: James Yu
>Priority: Minor
> Attachments: spark-download-issue.png
>
>
> In the Spark download webpage [https://spark.apache.org/downloads.html], the 
> package type dropdown says "Hadoop 3.3", but the Download Spark tgz filename 
> contains "hadoop3.2" in it.  It is confusing; which version is correct?
>  
> Download Apache Spark(TM)
>  # Choose a Spark release: 3.2.0 (Oct 13 2021)
>  # Choose a package type: Pre-built for Apache Hadoop 3.3 and later
>  # Download Spark: spark-3.2.0-bin-hadoop3.2.tgz
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37084) Set spark.sql.files.openCostInBytes to bytesConf

2021-10-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37084.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34353
[https://github.com/apache/spark/pull/34353]

> Set spark.sql.files.openCostInBytes to bytesConf
> 
>
> Key: SPARK-37084
> URL: https://issues.apache.org/jira/browse/SPARK-37084
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang He
>Assignee: Yang He
>Priority: Major
> Fix For: 3.3.0
>
>
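Below is a hedged sketch, not taken from the issue (whose description is empty), of what declaring spark.sql.files.openCostInBytes as a bytes conf would enable: size strings accepted alongside plain byte counts. The values are arbitrary examples.

{code:python}
# Hypothetical usage once the config is a bytes conf (values are examples only).
spark.conf.set("spark.sql.files.openCostInBytes", "4MB")      # size string (assumed to parse)
spark.conf.set("spark.sql.files.openCostInBytes", "4194304")  # plain byte count still works
{code}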




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37084) Set spark.sql.files.openCostInBytes to bytesConf

2021-10-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37084:


Assignee: Yang He

> Set spark.sql.files.openCostInBytes to bytesConf
> 
>
> Key: SPARK-37084
> URL: https://issues.apache.org/jira/browse/SPARK-37084
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang He
>Assignee: Yang He
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37068) Confusing tgz filename for download

2021-10-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433205#comment-17433205
 ] 

Hyukjin Kwon commented on SPARK-37068:
--

The name of the tar file would have to be changed; see also SPARK-33880. It 
really means Hadoop 3 support in general.
Since the release is already out, it cannot be fixed at this moment, though.

cc [~srowen] [~sunchao] FYI.

> Confusing tgz filename for download
> ---
>
> Key: SPARK-37068
> URL: https://issues.apache.org/jira/browse/SPARK-37068
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation
>Affects Versions: 3.2.0
>Reporter: James Yu
>Priority: Minor
> Attachments: spark-download-issue.png
>
>
> In the Spark download webpage [https://spark.apache.org/downloads.html], the 
> package type dropdown says "Hadoop 3.3", but the Download Spark tgz filename 
> contains "hadoop3.2" in it.  It is confusing; which version is correct?
>  
> Download Apache Spark(TM)
>  # Choose a Spark release: 3.2.0 (Oct 13 2021)
>  # Choose a package type: Pre-built for Apache Hadoop 3.3 and later
>  # Download Spark: spark-3.2.0-bin-hadoop3.2.tgz
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37096) Where clause and where operator will report error on varchar column type

2021-10-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433201#comment-17433201
 ] 

Hyukjin Kwon commented on SPARK-37096:
--

cc [~cloud_fan] FYI

> Where clause and where operator will report error on varchar column type
> 
>
> Key: SPARK-37096
> URL: https://issues.apache.org/jira/browse/SPARK-37096
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.1, 3.1.2
> Environment: HDP3.1.4
>Reporter: Ye Li
>Priority: Major
>
> create table test1(col1 int, col2 varchar(120)) stored as orc;
>  insert into test1 values(123, 'abc');
>  insert into test1 values(1234, 'abcd');
>  
> sparkSession.sql('select * from test1')
>  is OK, but
> sparkSession.sql('select * from test1 where col2 = "abc"')
>  or
>  sparkSession.sql('select * from test1').where('col2 = "abc"')
> report the error:
> java.lang.UnsupportedOperationException: DataType: varchar(120)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37100) Pandas groupby UDFs would benefit from automatically redistributing data on the groupby key in order to prevent network issues running udf

2021-10-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37100:
-
Fix Version/s: (was: 3.2.1)

> Pandas groupby UDFs would benefit from automatically redistributing data on 
> the groupby key in order to prevent network issues running udf
> --
>
> Key: SPARK-37100
> URL: https://issues.apache.org/jira/browse/SPARK-37100
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Richard Williamson
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When running high-cardinality pandas UDF groupby steps (100,000s+ of unique 
> groups), jobs will either fail or have a high number of task failures due to 
> network errors on larger clusters (100+ nodes). This was not the specific code 
> causing issues, but it should be close to representative:
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.functions import rand
> from fancyimpute import IterativeSVD
> import numpy as np
> import pandas as pd
>
> df = spark.range(0, 10).withColumn('v', rand())
> @pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
> def solver(pdf):
>     pd.DataFrame(data=IterativeSVD(verbose=False).fit_transform(pdf.to_numpy()))
>     return pdf
>
> df.groupby('id').apply(solver).count()
>  
> df.repartition('id') is required to fix it; can we make this happen 
> automatically without any adverse impacts?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37096) Where clause and where operator will report error on varchar column type

2021-10-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37096:
-
Priority: Major  (was: Critical)

> Where clause and where operator will report error on varchar column type
> 
>
> Key: SPARK-37096
> URL: https://issues.apache.org/jira/browse/SPARK-37096
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.1, 3.1.2
> Environment: HDP3.1.4
>Reporter: Ye Li
>Priority: Major
>
> create table test1(col1 int, col2 varchar(120)) stored as orc;
>  insert into test1 values(123, 'abc');
>  insert into test1 values(1234, 'abcd');
>  
> sparkSession.sql('select * from test1')
>  is OK, but
> sparkSession.sql('select * from test1 where col2 = "abc"')
>  or
>  sparkSession.sql('select * from test1').where('col2 = "abc"')
> report the error:
> java.lang.UnsupportedOperationException: DataType: varchar(120)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37100) Pandas groupby UDFs would benefit from automatically redistributing data on the groupby key in order to prevent network issues running udf

2021-10-22 Thread Richard Williamson (Jira)
Richard Williamson created SPARK-37100:
--

 Summary: Pandas groupby UDFs would benefit from automatically 
redistributing data on the groupby key in order to prevent network issues 
running udf
 Key: SPARK-37100
 URL: https://issues.apache.org/jira/browse/SPARK-37100
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.1.2
Reporter: Richard Williamson
 Fix For: 3.2.1


When running high-cardinality pandas UDF groupby steps (100,000s+ of unique 
groups), jobs will either fail or have a high number of task failures due to 
network errors on larger clusters (100+ nodes). This was not the specific code 
causing issues, but it should be close to representative:


from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.functions import rand
from fancyimpute import IterativeSVD
import numpy as np
import pandas as pd

df = spark.range(0, 10).withColumn('v', rand())
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def solver(pdf):
    pd.DataFrame(data=IterativeSVD(verbose=False).fit_transform(pdf.to_numpy()))
    return pdf

df.groupby('id').apply(solver).count()

df.repartition('id') is required to fix it; can we make this happen 
automatically without any adverse impacts?
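Below is a minimal sketch of the workaround mentioned above, reusing the {{df}} and {{solver}} defined in the snippet; it only illustrates the reporter's suggestion, and whether Spark should do this automatically is the open question.

{code:python}
# Explicitly co-locate each group on a single partition before the grouped-map UDF.
df.repartition("id").groupby("id").apply(solver).count()
{code}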



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36554) Error message while trying to use spark sql functions directly on dataframe columns without using select expression

2021-10-22 Thread Nicolas Azrak (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433137#comment-17433137
 ] 

Nicolas Azrak commented on SPARK-36554:
---

[~lekshmiii] I've added a test to validate this is working. If you are using 
Spark in a project and need this fix, you would have to compile it using the 
patch I've submitted in the PR.

> Error message while trying to use spark sql functions directly on dataframe 
> columns without using select expression
> ---
>
> Key: SPARK-36554
> URL: https://issues.apache.org/jira/browse/SPARK-36554
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Examples, PySpark
>Affects Versions: 3.1.1
>Reporter: Lekshmi Ramachandran
>Priority: Minor
>  Labels: documentation, features, functions, spark-sql
> Attachments: Screen Shot .png
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The code below generates a dataframe successfully. Here the make_date function 
> is used inside a select expression:
>  
> from pyspark.sql.functions import  expr, make_date
> df = spark.createDataFrame([(2020, 6, 26), (1000, 2, 29), (-44, 1, 1)],['Y', 
> 'M', 'D'])
> df.select("*",expr("make_date(Y,M,D) as lk")).show()
>  
> The code below fails with the message "cannot import name 'make_date' from 
> 'pyspark.sql.functions'". Here the make_date function is called directly on 
> dataframe columns, without a select expression:
>  
> from pyspark.sql.functions import make_date
> df = spark.createDataFrame([(2020, 6, 26), (1000, 2, 29), (-44, 1, 1)],['Y', 
> 'M', 'D'])
> df.select(make_date(df.Y,df.M,df.D).alias("datefield")).show()
>  
> The error message generated is misleading when it says "cannot import 
> make_date from pyspark.sql.functions".
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-36554) Error message while trying to use spark sql functions directly on dataframe columns without using select expression

2021-10-22 Thread Lekshmi Ramachandran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432714#comment-17432714
 ] 

Lekshmi Ramachandran edited comment on SPARK-36554 at 10/22/21, 5:27 PM:
-

@Nicolas Azrak So how do I test if it is working?


was (Author: lekshmiii):
@Nicolas Azrak  So how do it test if it is working ?

> Error message while trying to use spark sql functions directly on dataframe 
> columns without using select expression
> ---
>
> Key: SPARK-36554
> URL: https://issues.apache.org/jira/browse/SPARK-36554
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Examples, PySpark
>Affects Versions: 3.1.1
>Reporter: Lekshmi Ramachandran
>Priority: Minor
>  Labels: documentation, features, functions, spark-sql
> Attachments: Screen Shot .png
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The code below generates a dataframe successfully. Here the make_date function 
> is used inside a select expression:
>  
> from pyspark.sql.functions import  expr, make_date
> df = spark.createDataFrame([(2020, 6, 26), (1000, 2, 29), (-44, 1, 1)],['Y', 
> 'M', 'D'])
> df.select("*",expr("make_date(Y,M,D) as lk")).show()
>  
> The code below fails with the message "cannot import name 'make_date' from 
> 'pyspark.sql.functions'". Here the make_date function is called directly on 
> dataframe columns, without a select expression:
>  
> from pyspark.sql.functions import make_date
> df = spark.createDataFrame([(2020, 6, 26), (1000, 2, 29), (-44, 1, 1)],['Y', 
> 'M', 'D'])
> df.select(make_date(df.Y,df.M,df.D).alias("datefield")).show()
>  
> The error message generated is misleading when it says "cannot import 
> make_date from pyspark.sql.functions".
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37091) Support Java 17 in SparkR SystemRequirements

2021-10-22 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-37091:
-
Priority: Trivial  (was: Major)

> Support Java 17 in SparkR SystemRequirements
> 
>
> Key: SPARK-37091
> URL: https://issues.apache.org/jira/browse/SPARK-37091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 3.3.0
>Reporter: Darek
>Priority: Trivial
>  Labels: newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Please bump Java version to <= 17 in 
> [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]
> Currently it is set to be:
> {code:java}
> SystemRequirements: Java (>= 8, < 12){code}
>  [PR|https://github.com/apache/spark/pull/34371] has been created for this 
> issue already.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37091) Support Java 17 in SparkR SystemRequirements

2021-10-22 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433078#comment-17433078
 ] 

Dongjoon Hyun edited comment on SPARK-37091 at 10/22/21, 5:13 PM:
--

BTW, [~Bidek]. Please don't set `Target Version` next time. Apache Spark 
community has a policy for that.
- [https://spark.apache.org/contributing.html]

{code}
Do not set the following fields:
- Fix Version. This is assigned by committers only when resolved.
- Target Version. This is assigned by committers to indicate a PR has been 
accepted for possible fix by the target version.
{code}


was (Author: dongjoon):
BTW, [~Bidek]. Please don't set `Target Version`.
- [https://spark.apache.org/contributing.html]

{code}
Do not set the following fields:
- Fix Version. This is assigned by committers only when resolved.
- Target Version. This is assigned by committers to indicate a PR has been 
accepted for possible fix by the target version.
{code}

> Support Java 17 in SparkR SystemRequirements
> 
>
> Key: SPARK-37091
> URL: https://issues.apache.org/jira/browse/SPARK-37091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 3.3.0
>Reporter: Darek
>Priority: Major
>  Labels: newbie
> Fix For: 3.2.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Please bump Java version to <= 17 in 
> [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]
> Currently it is set to be:
> {code:java}
> SystemRequirements: Java (>= 8, < 12){code}
>  [PR|https://github.com/apache/spark/pull/34371] has been created for this 
> issue already.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37091) Support Java 17 in SparkR SystemRequirements

2021-10-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37091:
--
Fix Version/s: (was: 3.2.1)

> Support Java 17 in SparkR SystemRequirements
> 
>
> Key: SPARK-37091
> URL: https://issues.apache.org/jira/browse/SPARK-37091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 3.3.0
>Reporter: Darek
>Priority: Major
>  Labels: newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Please bump Java version to <= 17 in 
> [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]
> Currently it is set to be:
> {code:java}
> SystemRequirements: Java (>= 8, < 12){code}
>  [PR|https://github.com/apache/spark/pull/34371] has been created for this 
> issue already.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37091) Support Java 17 in SparkR SystemRequirements

2021-10-22 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433078#comment-17433078
 ] 

Dongjoon Hyun edited comment on SPARK-37091 at 10/22/21, 5:13 PM:
--

BTW, [~Bidek]. Please don't set `Fix Version` and `Target Version` next time. 
Apache Spark community has a policy for that. The fields have different meaning 
in the community.
- [https://spark.apache.org/contributing.html]

{code}
Do not set the following fields:
- Fix Version. This is assigned by committers only when resolved.
- Target Version. This is assigned by committers to indicate a PR has been 
accepted for possible fix by the target version.
{code}


was (Author: dongjoon):
BTW, [~Bidek]. Please don't set `Target Version` next time. Apache Spark 
community has a policy for that.
- [https://spark.apache.org/contributing.html]

{code}
Do not set the following fields:
- Fix Version. This is assigned by committers only when resolved.
- Target Version. This is assigned by committers to indicate a PR has been 
accepted for possible fix by the target version.
{code}

> Support Java 17 in SparkR SystemRequirements
> 
>
> Key: SPARK-37091
> URL: https://issues.apache.org/jira/browse/SPARK-37091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 3.3.0
>Reporter: Darek
>Priority: Major
>  Labels: newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Please bump Java version to <= 17 in 
> [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]
> Currently it is set to be:
> {code:java}
> SystemRequirements: Java (>= 8, < 12){code}
>  [PR|https://github.com/apache/spark/pull/34371] has been created for this 
> issue already.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37091) Support Java 17 in SparkR SystemRequirements

2021-10-22 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433078#comment-17433078
 ] 

Dongjoon Hyun commented on SPARK-37091:
---

BTW, [~Bidek]. Please don't set `Target Version`.
- [https://spark.apache.org/contributing.html]

{code}
Do not set the following fields:
- Fix Version. This is assigned by committers only when resolved.
- Target Version. This is assigned by committers to indicate a PR has been 
accepted for possible fix by the target version.
{code}

> Support Java 17 in SparkR SystemRequirements
> 
>
> Key: SPARK-37091
> URL: https://issues.apache.org/jira/browse/SPARK-37091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 3.3.0
>Reporter: Darek
>Priority: Major
>  Labels: newbie
> Fix For: 3.2.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Please bump Java version to <= 17 in 
> [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]
> Currently it is set to be:
> {code:java}
> SystemRequirements: Java (>= 8, < 12){code}
>  [PR|https://github.com/apache/spark/pull/34371] has been created for this 
> issue already.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37091) Support Java 17 in SparkR SystemRequirements

2021-10-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37091:
--
Target Version/s:   (was: 3.3.0)

> Support Java 17 in SparkR SystemRequirements
> 
>
> Key: SPARK-37091
> URL: https://issues.apache.org/jira/browse/SPARK-37091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 3.3.0
>Reporter: Darek
>Priority: Major
>  Labels: newbie
> Fix For: 3.2.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Please bump Java version to <= 17 in 
> [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]
> Currently it is set to be:
> {code:java}
> SystemRequirements: Java (>= 8, < 12){code}
>  [PR|https://github.com/apache/spark/pull/34371] has been created for this 
> issue already.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37091) Support Java 17 in SparkR SystemRequirements

2021-10-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37091:
--
Summary: Support Java 17 in SparkR SystemRequirements  (was: Bump 
SystemRequirements to use Java 17)

> Support Java 17 in SparkR SystemRequirements
> 
>
> Key: SPARK-37091
> URL: https://issues.apache.org/jira/browse/SPARK-37091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 3.3.0
>Reporter: Darek
>Priority: Major
>  Labels: newbie
> Fix For: 3.2.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Please bump Java version to <= 17 in 
> [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]
> Currently it is set to be:
> {code:java}
> SystemRequirements: Java (>= 8, < 12){code}
>  [PR|https://github.com/apache/spark/pull/34371] has been created for this 
> issue already.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37091) Bump SystemRequirements to use Java 17

2021-10-22 Thread Darek (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darek updated SPARK-37091:
--
Description: 
Please bump Java version to <= 17 in 
[DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]

Currently it is set to be:
{code:java}
SystemRequirements: Java (>= 8, < 12){code}
 [PR|https://github.com/apache/spark/pull/34371] has been created for this 
issue already.

  was:
Please bump Java version to <= 17 in 
[DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]

Currently it is set to be:
{code:java}
SystemRequirements: Java (>= 8, < 12){code}
 
[PR|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION#L16]
 has been created for this issue already.


> Bump SystemRequirements to use Java 17
> --
>
> Key: SPARK-37091
> URL: https://issues.apache.org/jira/browse/SPARK-37091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 3.3.0
>Reporter: Darek
>Priority: Major
>  Labels: newbie
> Fix For: 3.2.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Please bump Java version to <= 17 in 
> [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]
> Currently it is set to be:
> {code:java}
> SystemRequirements: Java (>= 8, < 12){code}
>  [PR|https://github.com/apache/spark/pull/34371] has been created for this 
> issue already.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37091) Bump SystemRequirements to use Java 17

2021-10-22 Thread Darek (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darek updated SPARK-37091:
--
Description: 
Please bump Java version to <= 17 in 
[DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]

Currently it is set to be:
{code:java}
SystemRequirements: Java (>= 8, < 12){code}
 
[PR|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION#L16]
 has been created for this issue already.

  was:
Please bump Java version to <= 17 in 
[DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]

Currently it is set to be:
{code:java}
SystemRequirements: Java (>= 8, < 12){code}
 


> Bump SystemRequirements to use Java 17
> --
>
> Key: SPARK-37091
> URL: https://issues.apache.org/jira/browse/SPARK-37091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 3.3.0
>Reporter: Darek
>Priority: Major
>  Labels: newbie
> Fix For: 3.2.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Please bump Java version to <= 17 in 
> [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]
> Currently it is set to be:
> {code:java}
> SystemRequirements: Java (>= 8, < 12){code}
>  
> [PR|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION#L16]
>  has been created for this issue already.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37091) Bump SystemRequirements to use Java 17

2021-10-22 Thread Darek (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darek updated SPARK-37091:
--
 Target Version/s: 3.3.0  (was: 3.2.0)
Affects Version/s: 3.3.0  (was: 3.2.0)
  Description: 
Please bump Java version to <= 17 in 
[DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]

Currently it is set to be:
{code:java}
SystemRequirements: Java (>= 8, < 12){code}
 

  was:
Please bump Java version to > 11 in 
[DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]

Currently it is set to be:
{code:java}
SystemRequirements: Java (>= 8, < 12){code}
 

  Summary: Bump SystemRequirements to use Java 17  (was: Bump 
SystemRequirements to use Java > 11)

> Bump SystemRequirements to use Java 17
> --
>
> Key: SPARK-37091
> URL: https://issues.apache.org/jira/browse/SPARK-37091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 3.3.0
>Reporter: Darek
>Priority: Major
>  Labels: newbie
> Fix For: 3.2.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Please bump Java version to <= 17 in 
> [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]
> Currently it is set to be:
> {code:java}
> SystemRequirements: Java (>= 8, < 12){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37091) Bump SystemRequirements to use Java > 11

2021-10-22 Thread Darek (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darek updated SPARK-37091:
--
Parent: SPARK-33772
Issue Type: Sub-task  (was: Improvement)

> Bump SystemRequirements to use Java > 11
> 
>
> Key: SPARK-37091
> URL: https://issues.apache.org/jira/browse/SPARK-37091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 3.2.0
>Reporter: Darek
>Priority: Major
>  Labels: newbie
> Fix For: 3.2.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Please bump Java version to > 11 in 
> [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]
> Currently it is set to be:
> {code:java}
> SystemRequirements: Java (>= 8, < 12){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35703) Relax constraint for Spark bucket join and remove HashClusteredDistribution

2021-10-22 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-35703:
-
Summary: Relax constraint for Spark bucket join and remove 
HashClusteredDistribution  (was: Remove HashClusteredDistribution)

> Relax constraint for Spark bucket join and remove HashClusteredDistribution
> ---
>
> Key: SPARK-35703
> URL: https://issues.apache.org/jira/browse/SPARK-35703
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently Spark has {{HashClusteredDistribution}} and 
> {{ClusteredDistribution}}. The only difference between the two is that the 
> former is stricter when deciding whether a bucket join is allowed to avoid a 
> shuffle: compared to the latter, it requires an *exact* match between the 
> clustering keys from the output partitioning (i.e., {{HashPartitioning}}) and 
> the join keys. However, this is unnecessary, as we should be able to avoid a 
> shuffle when the set of clustering keys is a subset of the join keys, just like 
> {{ClusteredDistribution}} does. 
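Below is a hedged illustration of the scenario described above, with made-up table names: both tables are bucketed on a single column and joined on a superset of that column. Under {{HashClusteredDistribution}} this join still shuffles; the proposal is to let the bucketed layout satisfy it, as {{ClusteredDistribution}} already would.

{code:python}
# Tables bucketed by 'a'; the join keys (a, b) are a superset of the bucket key.
spark.sql("""
    CREATE TABLE t1 (a INT, b INT, v INT)
    USING parquet CLUSTERED BY (a) INTO 8 BUCKETS
""")
spark.sql("""
    CREATE TABLE t2 (a INT, b INT, w INT)
    USING parquet CLUSTERED BY (a) INTO 8 BUCKETS
""")
# Inspect the plan to see whether an Exchange (shuffle) is inserted for the join.
spark.table("t1").join(spark.table("t2"), ["a", "b"]).explain()
{code}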



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37091) Bump SystemRequirements to use Java > 11

2021-10-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37091:


Assignee: Apache Spark

> Bump SystemRequirements to use Java > 11
> 
>
> Key: SPARK-37091
> URL: https://issues.apache.org/jira/browse/SPARK-37091
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.2.0
>Reporter: Darek
>Assignee: Apache Spark
>Priority: Major
>  Labels: newbie
> Fix For: 3.2.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Please bump Java version to > 11 in 
> [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]
> Currently it is set to be:
> {code:java}
> SystemRequirements: Java (>= 8, < 12){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37091) Bump SystemRequirements to use Java > 11

2021-10-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433051#comment-17433051
 ] 

Apache Spark commented on SPARK-37091:
--

User 'Bidek56' has created a pull request for this issue:
https://github.com/apache/spark/pull/34371

> Bump SystemRequirements to use Java > 11
> 
>
> Key: SPARK-37091
> URL: https://issues.apache.org/jira/browse/SPARK-37091
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.2.0
>Reporter: Darek
>Priority: Major
>  Labels: newbie
> Fix For: 3.2.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Please bump Java version to > 11 in 
> [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]
> Currently it is set to be:
> {code:java}
> SystemRequirements: Java (>= 8, < 12){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37091) Bump SystemRequirements to use Java > 11

2021-10-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433053#comment-17433053
 ] 

Apache Spark commented on SPARK-37091:
--

User 'Bidek56' has created a pull request for this issue:
https://github.com/apache/spark/pull/34371

> Bump SystemRequirements to use Java > 11
> 
>
> Key: SPARK-37091
> URL: https://issues.apache.org/jira/browse/SPARK-37091
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.2.0
>Reporter: Darek
>Priority: Major
>  Labels: newbie
> Fix For: 3.2.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Please bump Java version to > 11 in 
> [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]
> Currently it is set to be:
> {code:java}
> SystemRequirements: Java (>= 8, < 12){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37091) Bump SystemRequirements to use Java > 11

2021-10-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37091:


Assignee: (was: Apache Spark)

> Bump SystemRequirements to use Java > 11
> 
>
> Key: SPARK-37091
> URL: https://issues.apache.org/jira/browse/SPARK-37091
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.2.0
>Reporter: Darek
>Priority: Major
>  Labels: newbie
> Fix For: 3.2.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Please bump Java version to > 11 in 
> [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION]
> Currently it is set to be:
> {code:java}
> SystemRequirements: Java (>= 8, < 12){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37047) Add overloads for lpad and rpad for BINARY strings

2021-10-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433048#comment-17433048
 ] 

Apache Spark commented on SPARK-37047:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/34370

> Add overloads for lpad and rpad for BINARY strings
> --
>
> Key: SPARK-37047
> URL: https://issues.apache.org/jira/browse/SPARK-37047
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Menelaos Karavelas
>Assignee: Menelaos Karavelas
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, `lpad` and `rpad` accept BINARY strings as input (both in terms of 
> input string to be padded and padding pattern), and these strings get cast to 
> UTF8 strings. The result of the operation is a UTF8 string which may be 
> invalid as it can contain non-UTF8 characters.
> What we would like to do is to overload `lpad` and `rpad` to accept BINARY 
> strings as inputs (both for the string to be padded and the padding pattern) 
> and produce a left or right padded BINARY string as output.
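Below is a sketch of the intended usage, not the shipped API; the hex literals are arbitrary examples.

{code:python}
# Pad a BINARY value with a BINARY pad; under this proposal the result stays
# BINARY (e.g. 0x00000001) instead of being cast to a possibly invalid UTF8 string.
spark.sql("SELECT lpad(x'01', 4, x'00') AS left_padded, "
          "rpad(x'01', 4, x'00') AS right_padded").show()
{code}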



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37089) ParquetFileFormat registers task completion listeners lazily, causing Python writer thread to segfault when off-heap vectorized reader is enabled

2021-10-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433020#comment-17433020
 ] 

Apache Spark commented on SPARK-37089:
--

User 'ankurdave' has created a pull request for this issue:
https://github.com/apache/spark/pull/34369

> ParquetFileFormat registers task completion listeners lazily, causing Python 
> writer thread to segfault when off-heap vectorized reader is enabled
> -
>
> Key: SPARK-37089
> URL: https://issues.apache.org/jira/browse/SPARK-37089
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Major
>
> The task completion listener that closes the vectorized reader is registered 
> lazily in ParquetFileFormat#buildReaderWithPartitionValues(). Since task 
> completion listeners are executed in reverse order of registration, it always 
> runs before the Python writer thread can be interrupted.
> This contradicts the assumption in 
> https://issues.apache.org/jira/browse/SPARK-37088 / 
> https://github.com/apache/spark/pull/34245 that task completion listeners are 
> registered bottom-up, preventing that fix from working properly.
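Below is a plain-Python sketch, not Spark internals, of the ordering problem described above: completion listeners fire last-in-first-out, so the lazily registered listener that closes the reader runs before the earlier-registered listener that interrupts the Python writer thread.

{code:python}
listeners = []
listeners.append(lambda: print("interrupt Python writer thread"))    # registered early (SPARK-37088)
listeners.append(lambda: print("close off-heap vectorized reader"))  # registered lazily by the reader
for listener in reversed(listeners):  # completion listeners run in reverse registration order
    listener()
# "close off-heap vectorized reader" prints first: the reader's off-heap memory is
# freed while the writer thread may still be reading it, hence the reported segfault.
{code}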



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37067) DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon

2021-10-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37067.
-
Fix Version/s: 3.3.0
   3.2.1
 Assignee: Linhong Liu
   Resolution: Fixed

> DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon
> 
>
> Key: SPARK-37067
> URL: https://issues.apache.org/jira/browse/SPARK-37067
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Linhong Liu
>Assignee: Linhong Liu
>Priority: Major
> Fix For: 3.2.1, 3.3.0
>
>
> For a zone id with a format like "+" or "+0730", it can be parsed by 
> `ZoneId.of()` but will be rejected by Spark's 
> `DateTimeUtils.stringToTimestamp()`. It means we will return null for some 
> valid datetime strings, such as `2021-10-11T03:58:03.000+0700`.
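Below is a repro sketch of the behavior described above, assuming the pre-fix parser: an offset without a colon is accepted by java.time but rejected by Spark's string-to-timestamp conversion, so the cast yields NULL.

{code:python}
spark.sql(
    "SELECT CAST('2021-10-11T03:58:03.000+0700' AS TIMESTAMP) AS ts"
).show()
# Before the fix: ts is NULL; after the fix: the string parses to a timestamp.
{code}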



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37072) Pass all UTs in `repl` with Java 17

2021-10-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432964#comment-17432964
 ] 

Apache Spark commented on SPARK-37072:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34368

> Pass all UTs in `repl` with Java 17
> ---
>
> Key: SPARK-37072
> URL: https://issues.apache.org/jira/browse/SPARK-37072
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Major
>
> Run `mvn clean install -pl repl` with Java 17
> {code:java}
> Run completed in 30 seconds, 826 milliseconds.
> Total number of tests run: 42
> Suites: completed 6, aborted 0
> Tests: succeeded 33, failed 9, canceled 0, ignored 0, pending 0
> *** 9 TESTS FAILED ***
> {code}
> The tests failed for similar reasons:
> {code:java}
> - broadcast vars *** FAILED ***
>   isContain was true Interpreter output contained 'Exception':
>   Welcome to
>                       __
>        / __/__  ___ _/ /__
>       _\ \/ _ \/ _ `/ __/  '_/
>      /___/ .__/\_,_/_/ /_/\_\   version 3.3.0-SNAPSHOT
>         /_/
>            
>   Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 17)
>   Type in expressions to have them evaluated.
>   Type :help for more information.
>   
>   scala> 
>   scala> array: Array[Int] = Array(0, 0, 0, 0, 0)
>   
>   scala> broadcastArray: org.apache.spark.broadcast.Broadcast[Array[Int]] = 
> Broadcast(0)
>   
>   scala> java.lang.IllegalAccessException: Can not set final $iw field 
> $Lambda$2879/0x00080188b928.arg$1 to $iw
>     at 
> java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:76)
>     at 
> java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:80)
>     at 
> java.base/jdk.internal.reflect.UnsafeQualifiedObjectFieldAccessorImpl.set(UnsafeQualifiedObjectFieldAccessorImpl.java:79)
>     at java.base/java.lang.reflect.Field.set(Field.java:799)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:398)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>     at org.apache.spark.SparkContext.clean(SparkContext.scala:2490)
>     at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:414)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
>     at org.apache.spark.rdd.RDD.map(RDD.scala:413)
>     ... 95 elided
>   
>   scala> 
>   scala> java.lang.IllegalAccessException: Can not set final $iw field 
> $Lambda$2907/0x0008019536f8.arg$1 to $iw
>     at 
> java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:76)
>     at 
> java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:80)
>     at 
> java.base/jdk.internal.reflect.UnsafeQualifiedObjectFieldAccessorImpl.set(UnsafeQualifiedObjectFieldAccessorImpl.java:79)
>     at java.base/java.lang.reflect.Field.set(Field.java:799)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:398)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>     at org.apache.spark.SparkContext.clean(SparkContext.scala:2490)
>     at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:414)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
>     at org.apache.spark.rdd.RDD.map(RDD.scala:413)
>     ... 95 elided
>   
>   scala>      | 
>   scala> :quit (ReplSuite.scala:83)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37072) Pass all UTs in `repl` with Java 17

2021-10-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37072:


Assignee: (was: Apache Spark)

> Pass all UTs in `repl` with Java 17
> ---
>
> Key: SPARK-37072
> URL: https://issues.apache.org/jira/browse/SPARK-37072
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Major
>
> Run `mvn clean install -pl repl` with Java 17
> {code:java}
> Run completed in 30 seconds, 826 milliseconds.
> Total number of tests run: 42
> Suites: completed 6, aborted 0
> Tests: succeeded 33, failed 9, canceled 0, ignored 0, pending 0
> *** 9 TESTS FAILED ***
> {code}
> The tests failed for similar reasons:
> {code:java}
> - broadcast vars *** FAILED ***
>   isContain was true Interpreter output contained 'Exception':
>   Welcome to
>                       __
>        / __/__  ___ _/ /__
>       _\ \/ _ \/ _ `/ __/  '_/
>      /___/ .__/\_,_/_/ /_/\_\   version 3.3.0-SNAPSHOT
>         /_/
>            
>   Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 17)
>   Type in expressions to have them evaluated.
>   Type :help for more information.
>   
>   scala> 
>   scala> array: Array[Int] = Array(0, 0, 0, 0, 0)
>   
>   scala> broadcastArray: org.apache.spark.broadcast.Broadcast[Array[Int]] = 
> Broadcast(0)
>   
>   scala> java.lang.IllegalAccessException: Can not set final $iw field 
> $Lambda$2879/0x00080188b928.arg$1 to $iw
>     at 
> java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:76)
>     at 
> java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:80)
>     at 
> java.base/jdk.internal.reflect.UnsafeQualifiedObjectFieldAccessorImpl.set(UnsafeQualifiedObjectFieldAccessorImpl.java:79)
>     at java.base/java.lang.reflect.Field.set(Field.java:799)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:398)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>     at org.apache.spark.SparkContext.clean(SparkContext.scala:2490)
>     at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:414)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
>     at org.apache.spark.rdd.RDD.map(RDD.scala:413)
>     ... 95 elided
>   
>   scala> 
>   scala> java.lang.IllegalAccessException: Can not set final $iw field 
> $Lambda$2907/0x0008019536f8.arg$1 to $iw
>     at 
> java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:76)
>     at 
> java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:80)
>     at 
> java.base/jdk.internal.reflect.UnsafeQualifiedObjectFieldAccessorImpl.set(UnsafeQualifiedObjectFieldAccessorImpl.java:79)
>     at java.base/java.lang.reflect.Field.set(Field.java:799)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:398)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>     at org.apache.spark.SparkContext.clean(SparkContext.scala:2490)
>     at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:414)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
>     at org.apache.spark.rdd.RDD.map(RDD.scala:413)
>     ... 95 elided
>   
>   scala>      | 
>   scala> :quit (ReplSuite.scala:83)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37072) Pass all UTs in `repl` with Java 17

2021-10-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432963#comment-17432963
 ] 

Apache Spark commented on SPARK-37072:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34368

> Pass all UTs in `repl` with Java 17
> ---
>
> Key: SPARK-37072
> URL: https://issues.apache.org/jira/browse/SPARK-37072
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Major
>
> Run `mvn clean install -pl repl` with Java 17
> {code:java}
> Run completed in 30 seconds, 826 milliseconds.
> Total number of tests run: 42
> Suites: completed 6, aborted 0
> Tests: succeeded 33, failed 9, canceled 0, ignored 0, pending 0
> *** 9 TESTS FAILED ***
> {code}
> The tests failed for similar reasons:
> {code:java}
> - broadcast vars *** FAILED ***
>   isContain was true Interpreter output contained 'Exception':
>   Welcome to
>                       __
>        / __/__  ___ _/ /__
>       _\ \/ _ \/ _ `/ __/  '_/
>      /___/ .__/\_,_/_/ /_/\_\   version 3.3.0-SNAPSHOT
>         /_/
>            
>   Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 17)
>   Type in expressions to have them evaluated.
>   Type :help for more information.
>   
>   scala> 
>   scala> array: Array[Int] = Array(0, 0, 0, 0, 0)
>   
>   scala> broadcastArray: org.apache.spark.broadcast.Broadcast[Array[Int]] = 
> Broadcast(0)
>   
>   scala> java.lang.IllegalAccessException: Can not set final $iw field 
> $Lambda$2879/0x00080188b928.arg$1 to $iw
>     at 
> java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:76)
>     at 
> java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:80)
>     at 
> java.base/jdk.internal.reflect.UnsafeQualifiedObjectFieldAccessorImpl.set(UnsafeQualifiedObjectFieldAccessorImpl.java:79)
>     at java.base/java.lang.reflect.Field.set(Field.java:799)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:398)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>     at org.apache.spark.SparkContext.clean(SparkContext.scala:2490)
>     at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:414)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
>     at org.apache.spark.rdd.RDD.map(RDD.scala:413)
>     ... 95 elided
>   
>   scala> 
>   scala> java.lang.IllegalAccessException: Can not set final $iw field 
> $Lambda$2907/0x0008019536f8.arg$1 to $iw
>     at 
> java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:76)
>     at 
> java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:80)
>     at 
> java.base/jdk.internal.reflect.UnsafeQualifiedObjectFieldAccessorImpl.set(UnsafeQualifiedObjectFieldAccessorImpl.java:79)
>     at java.base/java.lang.reflect.Field.set(Field.java:799)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:398)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>     at org.apache.spark.SparkContext.clean(SparkContext.scala:2490)
>     at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:414)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
>     at org.apache.spark.rdd.RDD.map(RDD.scala:413)
>     ... 95 elided
>   
>   scala>      | 
>   scala> :quit (ReplSuite.scala:83)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37072) Pass all UTs in `repl` with Java 17

2021-10-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37072:


Assignee: Apache Spark

> Pass all UTs in `repl` with Java 17
> ---
>
> Key: SPARK-37072
> URL: https://issues.apache.org/jira/browse/SPARK-37072
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
> Run `mvn clean install -pl repl` with Java 17
> {code:java}
> Run completed in 30 seconds, 826 milliseconds.
> Total number of tests run: 42
> Suites: completed 6, aborted 0
> Tests: succeeded 33, failed 9, canceled 0, ignored 0, pending 0
> *** 9 TESTS FAILED ***
> {code}
> The tests failed for similar reasons:
> {code:java}
> - broadcast vars *** FAILED ***
>   isContain was true Interpreter output contained 'Exception':
>   Welcome to
>                       __
>        / __/__  ___ _/ /__
>       _\ \/ _ \/ _ `/ __/  '_/
>      /___/ .__/\_,_/_/ /_/\_\   version 3.3.0-SNAPSHOT
>         /_/
>            
>   Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 17)
>   Type in expressions to have them evaluated.
>   Type :help for more information.
>   
>   scala> 
>   scala> array: Array[Int] = Array(0, 0, 0, 0, 0)
>   
>   scala> broadcastArray: org.apache.spark.broadcast.Broadcast[Array[Int]] = 
> Broadcast(0)
>   
>   scala> java.lang.IllegalAccessException: Can not set final $iw field 
> $Lambda$2879/0x00080188b928.arg$1 to $iw
>     at 
> java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:76)
>     at 
> java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:80)
>     at 
> java.base/jdk.internal.reflect.UnsafeQualifiedObjectFieldAccessorImpl.set(UnsafeQualifiedObjectFieldAccessorImpl.java:79)
>     at java.base/java.lang.reflect.Field.set(Field.java:799)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:398)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>     at org.apache.spark.SparkContext.clean(SparkContext.scala:2490)
>     at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:414)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
>     at org.apache.spark.rdd.RDD.map(RDD.scala:413)
>     ... 95 elided
>   
>   scala> 
>   scala> java.lang.IllegalAccessException: Can not set final $iw field 
> $Lambda$2907/0x0008019536f8.arg$1 to $iw
>     at 
> java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:76)
>     at 
> java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:80)
>     at 
> java.base/jdk.internal.reflect.UnsafeQualifiedObjectFieldAccessorImpl.set(UnsafeQualifiedObjectFieldAccessorImpl.java:79)
>     at java.base/java.lang.reflect.Field.set(Field.java:799)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:398)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>     at org.apache.spark.SparkContext.clean(SparkContext.scala:2490)
>     at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:414)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
>     at org.apache.spark.rdd.RDD.map(RDD.scala:413)
>     ... 95 elided
>   
>   scala>      | 
>   scala> :quit (ReplSuite.scala:83)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37006) MapStatus adds localDirs to avoid the rpc request by method getHostLocalDirs when shuffle reading

2021-10-22 Thread jinhai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432914#comment-17432914
 ] 

jinhai commented on SPARK-37006:


hi [~Ngone51], can you review this issue for me?

> MapStatus adds localDirs to avoid the rpc request by method getHostLocalDirs 
> when shuffle reading
> -
>
> Key: SPARK-37006
> URL: https://issues.apache.org/jira/browse/SPARK-37006
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.1.2
>Reporter: jinhai
>Priority: Major
>
> When executing the ShuffleBlockFetcherIterator.fetchHostLocalBlocks method, we 
> need to send an RPC request through ExternalBlockStoreClient or 
> NettyBlockTransferService to obtain the hostLocalDirs value, and then fetch the 
> shuffle data according to the blockId and localDirs.
> We can add localDirs to the BlockManagerId carried by MapStatus, so that we can 
> read localDirs directly when fetching host-local blocks, without sending RPC 
> requests.
> The benefits are:
> 1. No need to send an RPC request for the localDirs value in fetchHostLocalBlocks;
> 2. When the external shuffle service is enabled, there is no need to register 
> ExecutorShuffleInfo in the ExternalShuffleBlockResolver class, nor to persist 
> the ExecutorShuffleInfo data in ExternalShuffleBlockResolver through LevelDB.
> 3. There is also no need to cache host-local dirs in the HostLocalDirManager 
> class.
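
As a rough sketch of the idea only (the class and field names below are assumptions 
for illustration, not Spark's actual MapStatus/BlockManagerId definitions), the 
proposal amounts to letting the map status carry the executor's local directories so 
the read side can resolve host-local blocks without the extra getHostLocalDirs RPC:

{code:scala}
// Illustrative sketch, not Spark's real API.
case class BlockManagerIdSketch(
    executorId: String,
    host: String,
    port: Int,
    localDirs: Seq[String])         // proposed extra field: the executor's local dirs

case class MapStatusSketch(
    location: BlockManagerIdSketch, // where the map output lives
    mapId: Long)

// Reducer side: with the dirs embedded in the map status, fetching a host-local
// block no longer needs an RPC via ExternalBlockStoreClient / NettyBlockTransferService.
def hostLocalDirsFor(status: MapStatusSketch): Seq[String] =
  status.location.localDirs
{code}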



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-37006) MapStatus adds localDirs to avoid the rpc request by method getHostLocalDirs when shuffle reading

2021-10-22 Thread jinhai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jinhai updated SPARK-37006:
---
Comment: was deleted

(was: hi [~Ngone51], can you review this issue for me?)

> MapStatus adds localDirs to avoid the rpc request by method getHostLocalDirs 
> when shuffle reading
> -
>
> Key: SPARK-37006
> URL: https://issues.apache.org/jira/browse/SPARK-37006
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.1.2
>Reporter: jinhai
>Priority: Major
>
> When executing the ShuffleBlockFetcherIterator.fetchHostLocalBlocks method, we 
> need to send an RPC request through ExternalBlockStoreClient or 
> NettyBlockTransferService to obtain the hostLocalDirs value, and then fetch the 
> shuffle data according to the blockId and localDirs.
> We can add localDirs to the BlockManagerId carried by MapStatus, so that we can 
> read localDirs directly when fetching host-local blocks, without sending RPC 
> requests.
> The benefits are:
> 1. No need to send an RPC request for the localDirs value in fetchHostLocalBlocks;
> 2. When the external shuffle service is enabled, there is no need to register 
> ExecutorShuffleInfo in the ExternalShuffleBlockResolver class, nor to persist 
> the ExecutorShuffleInfo data in ExternalShuffleBlockResolver through LevelDB.
> 3. There is also no need to cache host-local dirs in the HostLocalDirManager 
> class.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37006) MapStatus adds localDirs to avoid the rpc request by method getHostLocalDirs when shuffle reading

2021-10-22 Thread jinhai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429079#comment-17429079
 ] 

jinhai edited comment on SPARK-37006 at 10/22/21, 11:01 AM:


Alternatively, could we generate localDirs from the appId and execId, just like 
DiskBlockManager.getFile does, so that we don't need to store localDirs in MapStatus 
and only need to add the appId to MapStatus?


was (Author: csbliss):
Or whether we can generate localDirs based on appId and execId, just like 
DiskBlockManager.getFile, so that we don't need to save localDirs in MapStatus, 
just add appId.

> MapStatus adds localDirs to avoid the rpc request by method getHostLocalDirs 
> when shuffle reading
> -
>
> Key: SPARK-37006
> URL: https://issues.apache.org/jira/browse/SPARK-37006
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.1.2
>Reporter: jinhai
>Priority: Major
>
> When executing the ShuffleBlockFetcherIterator.fetchHostLocalBlocks method, we 
> need to send an RPC request through ExternalBlockStoreClient or 
> NettyBlockTransferService to obtain the hostLocalDirs value, and then fetch the 
> shuffle data according to the blockId and localDirs.
> We can add localDirs to the BlockManagerId carried by MapStatus, so that we can 
> read localDirs directly when fetching host-local blocks, without sending RPC 
> requests.
> The benefits are:
> 1. No need to send an RPC request for the localDirs value in fetchHostLocalBlocks;
> 2. When the external shuffle service is enabled, there is no need to register 
> ExecutorShuffleInfo in the ExternalShuffleBlockResolver class, nor to persist 
> the ExecutorShuffleInfo data in ExternalShuffleBlockResolver through LevelDB.
> 3. There is also no need to cache host-local dirs in the HostLocalDirManager 
> class.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37006) MapStatus adds localDirs to avoid the rpc request by method getHostLocalDirs when shuffle reading

2021-10-22 Thread jinhai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jinhai updated SPARK-37006:
---
Description: 
When executing the ShuffleBlockFetcherIterator.fetchHostLocalBlocks method, we need 
to send an RPC request through ExternalBlockStoreClient or NettyBlockTransferService 
to obtain the hostLocalDirs value, and then fetch the shuffle data according to the 
blockId and localDirs.

We can add localDirs to the BlockManagerId carried by MapStatus, so that we can read 
localDirs directly when fetching host-local blocks, without sending RPC requests.

The benefits are:
1. No need to send an RPC request for the localDirs value in fetchHostLocalBlocks;
2. When the external shuffle service is enabled, there is no need to register 
ExecutorShuffleInfo in the ExternalShuffleBlockResolver class, nor to persist the 
ExecutorShuffleInfo data in ExternalShuffleBlockResolver through LevelDB.
3. There is also no need to cache host-local dirs in the HostLocalDirManager class.

  was:
In shuffle reading, in order to get the hostLocalDirs value when executing 
fetchHostLocalBlocks, we need ExternalBlockStoreClient or 
NettyBlockTransferService to make a rpc request.

And when externalShuffleServiceEnabled, there is no need to registerExecutor 
and so on in the ExternalShuffleBlockResolver class.

Throughout the spark shuffle module, a lot of code logic is written to deal 
with localDirs.

We can directly add localDirs to the BlockManagerId class of MapStatus to get 
datafile and indexfile.


> MapStatus adds localDirs to avoid the rpc request by method getHostLocalDirs 
> when shuffle reading
> -
>
> Key: SPARK-37006
> URL: https://issues.apache.org/jira/browse/SPARK-37006
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.1.2
>Reporter: jinhai
>Priority: Major
>
> When executing the ShuffleBlockFetcherIterator.fetchHostLocalBlocks method, we 
> need to send an RPC request through ExternalBlockStoreClient or 
> NettyBlockTransferService to obtain the hostLocalDirs value, and then fetch the 
> shuffle data according to the blockId and localDirs.
> We can add localDirs to the BlockManagerId carried by MapStatus, so that we can 
> read localDirs directly when fetching host-local blocks, without sending RPC 
> requests.
> The benefits are:
> 1. No need to send an RPC request for the localDirs value in fetchHostLocalBlocks;
> 2. When the external shuffle service is enabled, there is no need to register 
> ExecutorShuffleInfo in the ExternalShuffleBlockResolver class, nor to persist 
> the ExecutorShuffleInfo data in ExternalShuffleBlockResolver through LevelDB.
> 3. There is also no need to cache host-local dirs in the HostLocalDirManager 
> class.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37099) Impl a rank-based filter to optimize top-k computation

2021-10-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37099:


Assignee: (was: Apache Spark)

> Impl a rank-based filter to optimize top-k computation
> --
>
> Key: SPARK-37099
> URL: https://issues.apache.org/jira/browse/SPARK-37099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: zhengruifeng
>Priority: Major
> Attachments: skewed_window.png
>
>
> In JD, we found that more than 80% of window function usage follows this 
> pattern:
> {code:java}
>  select (... row_number() over(partition by ... order by ...) as rn)
> where rn ==[\<=] k{code}
>   
> However, the existing physical plan is not optimal:
>  
> 1. We should select the local top-k records within each partition and then 
> compute the global top-k; this helps reduce the shuffle amount.
>  
> 2. Skewed windows: some partitions are skewed and take a long time to finish 
> the computation.
>  
> A real-world skewed-window case in our system is attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37099) Impl a rank-based filter to optimize top-k computation

2021-10-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432910#comment-17432910
 ] 

Apache Spark commented on SPARK-37099:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/34367

> Impl a rank-based filter to optimize top-k computation
> --
>
> Key: SPARK-37099
> URL: https://issues.apache.org/jira/browse/SPARK-37099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: zhengruifeng
>Priority: Major
> Attachments: skewed_window.png
>
>
> In JD, we found that more than 80% of window function usage follows this 
> pattern:
> {code:java}
>  select (... row_number() over(partition by ... order by ...) as rn)
> where rn ==[\<=] k{code}
>   
> However, the existing physical plan is not optimal:
>  
> 1. We should select the local top-k records within each partition and then 
> compute the global top-k; this helps reduce the shuffle amount.
>  
> 2. Skewed windows: some partitions are skewed and take a long time to finish 
> the computation.
>  
> A real-world skewed-window case in our system is attached.
>  
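
As a purely illustrative sketch of the two-phase idea (not the physical-plan rule in 
the pull request above), the same effect can be mimicked at the application level: 
keep only a bounded top-k per key inside each input partition, then run the exact 
row_number() window on the much smaller intermediate result. The column names, the 
(String, Int) schema, and k below are assumptions chosen for the example.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import scala.collection.mutable

object TopKSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("topk-sketch").getOrCreate()
    import spark.implicits._

    val k = 2
    val df = Seq(("a", 10), ("a", 7), ("a", 5), ("b", 9), ("b", 2), ("b", 1)).toDF("key", "value")

    // Phase 1: bounded top-k per key inside each input partition (no shuffle yet),
    // so at most k rows per (partition, key) reach the window below.
    val pruned = df.rdd.mapPartitions { rows =>
      val heaps = mutable.Map.empty[String, mutable.PriorityQueue[Int]]
      rows.foreach { r =>
        val q = heaps.getOrElseUpdate(r.getString(0),
          mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)) // min-heap of kept values
        q.enqueue(r.getInt(1))
        if (q.size > k) q.dequeue()                                // drop the smallest
      }
      heaps.iterator.flatMap { case (key, q) => q.iterator.map(v => (key, v)) }
    }.toDF("key", "value")

    // Phase 2: the exact global top-k via the original window pattern.
    val w = Window.partitionBy($"key").orderBy($"value".desc)
    pruned.withColumn("rn", row_number().over(w)).where($"rn" <= k).show()

    spark.stop()
  }
}
{code}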



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37099) Impl a rank-based filter to optimize top-k computation

2021-10-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37099:


Assignee: Apache Spark

> Impl a rank-based filter to optimize top-k computation
> --
>
> Key: SPARK-37099
> URL: https://issues.apache.org/jira/browse/SPARK-37099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
> Attachments: skewed_window.png
>
>
> In JD, we found that more than 80% of window function usage follows this 
> pattern:
> {code:java}
>  select (... row_number() over(partition by ... order by ...) as rn)
> where rn ==[\<=] k{code}
>   
> However, the existing physical plan is not optimal:
>  
> 1. We should select the local top-k records within each partition and then 
> compute the global top-k; this helps reduce the shuffle amount.
>  
> 2. Skewed windows: some partitions are skewed and take a long time to finish 
> the computation.
>  
> A real-world skewed-window case in our system is attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37099) Impl a rank-based filter to optimize top-k computation

2021-10-22 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-37099:
-
Description: 
In JD, we found that more than 80% of window function usage follows this pattern:

 select (... row_number() over(partition by ... order by ...) as rn)
 where rn ==[\<=] k

However, the existing physical plan is not optimal:

1. We should select the local top-k records within each partition and then 
compute the global top-k; this helps reduce the shuffle amount.

2. Skewed windows: some partitions are skewed and take a long time to finish 
the computation.

 

A real-world skewed-window case in our system is attached.

 

  was:
in JD, we found that more than 80% usage of window function follows this 
pattern:

 
 select (... row_number() over(partition by ... order by ...) as rn)
 where rn ==[\<=] k
  

However, existing physical plan is not optimum:

 

1, we should select local top-k records within each partitions, and then 
compute the global top-k. this can help reduce the shuffle amount;

 

2, skewed-window: some partition is skewed and take a long time to finish 
computation.

 

A real-world skewed-window case in our system is attached.

 


> Impl a rank-based filter to optimize top-k computation
> --
>
> Key: SPARK-37099
> URL: https://issues.apache.org/jira/browse/SPARK-37099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: zhengruifeng
>Priority: Major
> Attachments: skewed_window.png
>
>
> In JD, we found that more than 80% of window function usage follows this 
> pattern:
>  
>  select (... row_number() over(partition by ... order by ...) as rn)
>  where rn ==[\<=] k
>   
> However, the existing physical plan is not optimal:
>  
> 1. We should select the local top-k records within each partition and then 
> compute the global top-k; this helps reduce the shuffle amount.
>  
> 2. Skewed windows: some partitions are skewed and take a long time to finish 
> the computation.
>  
> A real-world skewed-window case in our system is attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37099) Impl a rank-based filter to optimize top-k computation

2021-10-22 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-37099:
-
Attachment: skewed_window.png

> Impl a rank-based filter to optimize top-k computation
> --
>
> Key: SPARK-37099
> URL: https://issues.apache.org/jira/browse/SPARK-37099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: zhengruifeng
>Priority: Major
> Attachments: skewed_window.png
>
>
> In JD, we found that more than 80% of window function usage follows this 
> pattern:
>  
>  select (... row_number() over(partition by ... order by ...) as rn)
>  where rn ==[\<=] k
>   
> However, the existing physical plan is not optimal:
>  
> 1. We should select the local top-k records within each partition and then 
> compute the global top-k; this helps reduce the shuffle amount.
>  
> 2. Skewed windows: some partitions are skewed and take a long time to finish 
> the computation.
>  
> A real-world skewed-window case in our system is attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37099) Impl a rank-based filter to optimize top-k computation

2021-10-22 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-37099:
-
Description: 
In JD, we found that more than 80% of window function usage follows this pattern:
{code:java}
 select (... row_number() over(partition by ... order by ...) as rn)
where rn ==[\<=] k{code}

However, the existing physical plan is not optimal:

1. We should select the local top-k records within each partition and then 
compute the global top-k; this helps reduce the shuffle amount.

2. Skewed windows: some partitions are skewed and take a long time to finish 
the computation.

 

A real-world skewed-window case in our system is attached.

 

  was:
in JD, we found that more than 80% usage of window function follows this 
pattern:

 
 select (... row_number() over(partition by ... order by ...) as rn)
 where rn ==[\<=] k
  

However, existing physical plan is not optimum:

 

1, we should select local top-k records within each partitions, and then 
compute the global top-k. this can help reduce the shuffle amount;

 

2, skewed-window: some partition is skewed and take a long time to finish 
computation. 

 

A real-world skewed-window case in our system is attached.

 


> Impl a rank-based filter to optimize top-k computation
> --
>
> Key: SPARK-37099
> URL: https://issues.apache.org/jira/browse/SPARK-37099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: zhengruifeng
>Priority: Major
> Attachments: skewed_window.png
>
>
> In JD, we found that more than 80% of window function usage follows this 
> pattern:
> {code:java}
>  select (... row_number() over(partition by ... order by ...) as rn)
> where rn ==[\<=] k{code}
>   
> However, the existing physical plan is not optimal:
>  
> 1. We should select the local top-k records within each partition and then 
> compute the global top-k; this helps reduce the shuffle amount.
>  
> 2. Skewed windows: some partitions are skewed and take a long time to finish 
> the computation.
>  
> A real-world skewed-window case in our system is attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37099) Impl a rank-based filter to optimize top-k computation

2021-10-22 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-37099:


 Summary: Impl a rank-based filter to optimize top-k computation
 Key: SPARK-37099
 URL: https://issues.apache.org/jira/browse/SPARK-37099
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: zhengruifeng


In JD, we found that more than 80% of window function usage follows this pattern:

 select (... row_number() over(partition by ... order by ...) as rn)
 where rn ==[\<=] k

However, the existing physical plan is not optimal:

1. We should select the local top-k records within each partition and then 
compute the global top-k; this helps reduce the shuffle amount.

2. Skewed windows: some partitions are skewed and take a long time to finish 
the computation.

 

This is a real-world skewed-window case in our system:

!image-2021-10-22-18-46-58-496.png!

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37099) Impl a rank-based filter to optimize top-k computation

2021-10-22 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-37099:
-
Description: 
In JD, we found that more than 80% of window function usage follows this pattern:

 select (... row_number() over(partition by ... order by ...) as rn)
 where rn ==[\<=] k

However, the existing physical plan is not optimal:

1. We should select the local top-k records within each partition and then 
compute the global top-k; this helps reduce the shuffle amount.

2. Skewed windows: some partitions are skewed and take a long time to finish 
the computation.

 

A real-world skewed-window case in our system is attached.

 

  was:
in JD, we found that more than 80% usage of window function follows this 
pattern:

 
select (... row_number() over(partition by ... order by ...) as rn)
   where rn ==[\<=] k
 

However, existing physical plan is not optimum:

 

1, we should select local top-k records within each partitions, and then 
compute the global top-k. this can help reduce the shuffle amount;

 

2, skewed-window: some partition is skewed and take a long time to finish 
computation.

 

This is a real-world skewed-window case in our system:

!image-2021-10-22-18-46-58-496.png!

 


> Impl a rank-based filter to optimize top-k computation
> --
>
> Key: SPARK-37099
> URL: https://issues.apache.org/jira/browse/SPARK-37099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: zhengruifeng
>Priority: Major
>
> In JD, we found that more than 80% of window function usage follows this 
> pattern:
>  
>  select (... row_number() over(partition by ... order by ...) as rn)
>  where rn ==[\<=] k
>   
> However, the existing physical plan is not optimal:
>  
> 1. We should select the local top-k records within each partition and then 
> compute the global top-k; this helps reduce the shuffle amount.
>  
> 2. Skewed windows: some partitions are skewed and take a long time to finish 
> the computation.
>  
> A real-world skewed-window case in our system is attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37016) Publicise UpperCaseCharStream

2021-10-22 Thread dohongdayi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432902#comment-17432902
 ] 

dohongdayi commented on SPARK-37016:


Does anyone care about this issue?

> Publicise UpperCaseCharStream
> -
>
> Key: SPARK-37016
> URL: https://issues.apache.org/jira/browse/SPARK-37016
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.8, 3.0.3, 3.1.1, 3.1.2, 3.2.0
>Reporter: dohongdayi
>Priority: Major
>
> Many Spark extension projects are copying `UpperCaseCharStream` because it is 
> private inside the `parser` package, such as:
> [Delta 
> Lake|https://github.com/delta-io/delta/blob/625de3b305f109441ad04b20dba91dd6c4e1d78e/core/src/main/scala/io/delta/sql/parser/DeltaSqlParser.scala#L290]
> [Hudi|https://github.com/apache/hudi/blob/3f8ca1a3552bb866163d3b1648f68d9c4824e21d/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/parser/HoodieCommonSqlParser.scala#L112]
> [Iceberg|https://github.com/apache/iceberg/blob/c3ac4c6ca74a0013b4705d5bd5d17fade8e6f499/spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSparkSqlExtensionsParser.scala#L175]
> [Submarine|https://github.com/apache/submarine/blob/2faebb8efd69833853f62d89b4f1fea1b1718148/submarine-security/spark-security/src/main/scala/org/apache/submarine/spark/security/parser/UpperCaseCharStream.scala#L31]
> [Kyuubi|https://github.com/apache/incubator-kyuubi/blob/8a5134e3223844714fc58833a6859d4df5b68d57/dev/kyuubi-extension-spark-common/src/main/scala/org/apache/kyuubi/sql/zorder/ZorderSparkSqlExtensionsParserBase.scala#L108]
> [Spark-ACID|https://github.com/qubole/spark-acid/blob/19bd6db757677c40f448e85c74d9995ba97d5942/src/main/scala/com/qubole/spark/datasources/hiveacid/sql/catalyst/parser/ParseDriver.scala#L13]
> We can publicise `UpperCaseCharStream` to eliminate code duplication.
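
For reference, the wrapper these projects duplicate is essentially an ANTLR 
CharStream that upper-cases what the lexer sees while preserving the original text. 
A hedged sketch of that pattern (based on the copies linked above, not necessarily 
Spark's exact source) looks like:

{code:scala}
import org.antlr.v4.runtime.{CharStream, CodePointCharStream, IntStream}
import org.antlr.v4.runtime.misc.Interval

// Sketch of the commonly duplicated wrapper: the lexer sees upper-cased
// characters, but getText still returns the original input, so string
// literals keep their case.
class UpperCaseCharStream(wrapped: CodePointCharStream) extends CharStream {
  override def consume(): Unit = wrapped.consume()
  override def getSourceName(): String = wrapped.getSourceName
  override def index(): Int = wrapped.index()
  override def mark(): Int = wrapped.mark()
  override def release(marker: Int): Unit = wrapped.release(marker)
  override def seek(where: Int): Unit = wrapped.seek(where)
  override def size(): Int = wrapped.size()
  override def getText(interval: Interval): String = wrapped.getText(interval)

  override def LA(i: Int): Int = {
    val la = wrapped.LA(i)
    if (la == 0 || la == IntStream.EOF) la else Character.toUpperCase(la)
  }
}
{code}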



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37098) Alter table properties should invalidate cache

2021-10-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37098:


Assignee: Apache Spark

> Alter table properties should invalidate cache
> --
>
> Key: SPARK-37098
> URL: https://issues.apache.org/jira/browse/SPARK-37098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>
> The table properties can change the behavior of writing, e.g. a Parquet 
> table with `parquet.compression`.
> If you execute the following SQL, you get a file with snappy 
> compression rather than zstd.
> {code:java}
> CREATE TABLE t (c int) STORED AS PARQUET;
> -- cache table metadata
> SELECT * FROM t;
> ALTER TABLE t SET TBLPROPERTIES('parquet.compression'='zstd');
> INSERT INTO TABLE t values(1);
> {code}
> So we should invalidate the table cache after altering table properties.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37098) Alter table properties should invalidate cache

2021-10-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37098:


Assignee: (was: Apache Spark)

> Alter table properties should invalidate cache
> --
>
> Key: SPARK-37098
> URL: https://issues.apache.org/jira/browse/SPARK-37098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> The table properties can change the behavior of writing, e.g. a Parquet 
> table with `parquet.compression`.
> If you execute the following SQL, you get a file with snappy 
> compression rather than zstd.
> {code:java}
> CREATE TABLE t (c int) STORED AS PARQUET;
> -- cache table metadata
> SELECT * FROM t;
> ALTER TABLE t SET TBLPROPERTIES('parquet.compression'='zstd');
> INSERT INTO TABLE t values(1);
> {code}
> So we should invalidate the table cache after altering table properties.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37097) yarn-cluster mode, unregister timeout cause spark retry but AM container exit with code 0

2021-10-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37097:


Assignee: Apache Spark

> yarn-cluster mode, unregister timeout cause spark retry but AM container exit 
> with code 0
> -
>
> Key: SPARK-37097
> URL: https://issues.apache.org/jira/browse/SPARK-37097
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> 1. In cluster mode, the AM shutdown hook is triggered.
> 2. The AM's unregister call to the RM times out, but the shutdown hook catches the 
> exception, so the AM container exits with code 0.
> 3. Since the RM has lost its connection to the AM, it treats this container as failed.
> 4. The client side then gets an application report with final status FAILED even 
> though the AM container exited with code 0, and retries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37097) yarn-cluster mode, unregister timeout cause spark retry but AM container exit with code 0

2021-10-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37097:


Assignee: (was: Apache Spark)

> yarn-cluster mode, unregister timeout cause spark retry but AM container exit 
> with code 0
> -
>
> Key: SPARK-37097
> URL: https://issues.apache.org/jira/browse/SPARK-37097
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> 1. In cluster mode, the AM shutdown hook is triggered.
> 2. The AM's unregister call to the RM times out, but the shutdown hook catches the 
> exception, so the AM container exits with code 0.
> 3. Since the RM has lost its connection to the AM, it treats this container as failed.
> 4. The client side then gets an application report with final status FAILED even 
> though the AM container exited with code 0, and retries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37097) yarn-cluster mode, unregister timeout cause spark retry but AM container exit with code 0

2021-10-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432892#comment-17432892
 ] 

Apache Spark commented on SPARK-37097:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34366

> yarn-cluster mode, unregister timeout cause spark retry but AM container exit 
> with code 0
> -
>
> Key: SPARK-37097
> URL: https://issues.apache.org/jira/browse/SPARK-37097
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> 1. In cluster mode, the AM shutdown hook is triggered.
> 2. The AM's unregister call to the RM times out, but the shutdown hook catches the 
> exception, so the AM container exits with code 0.
> 3. Since the RM has lost its connection to the AM, it treats this container as failed.
> 4. The client side then gets an application report with final status FAILED even 
> though the AM container exited with code 0, and retries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37098) Alter table properties should invalidate cache

2021-10-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432891#comment-17432891
 ] 

Apache Spark commented on SPARK-37098:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/34365

> Alter table properties should invalidate cache
> --
>
> Key: SPARK-37098
> URL: https://issues.apache.org/jira/browse/SPARK-37098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> The table properties can change the behavior of writing, e.g. a Parquet 
> table with `parquet.compression`.
> If you execute the following SQL, you get a file with snappy 
> compression rather than zstd.
> {code:java}
> CREATE TABLE t (c int) STORED AS PARQUET;
> -- cache table metadata
> SELECT * FROM t;
> ALTER TABLE t SET TBLPROPERTIES('parquet.compression'='zstd');
> INSERT INTO TABLE t values(1);
> {code}
> So we should invalidate the table cache after altering table properties.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37098) Alter table properties should invalidate cache

2021-10-22 Thread XiDuo You (Jira)
XiDuo You created SPARK-37098:
-

 Summary: Alter table properties should invalidate cache
 Key: SPARK-37098
 URL: https://issues.apache.org/jira/browse/SPARK-37098
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0, 3.1.2, 3.0.3, 3.3.0
Reporter: XiDuo You


The table properties can change the behavior of writing, e.g. a Parquet table 
with `parquet.compression`.

If you execute the following SQL, you get a file with snappy compression 
rather than zstd.
{code:java}
CREATE TABLE t (c int) STORED AS PARQUET;
-- cache table metadata
SELECT * FROM t;
ALTER TABLE t SET TBLPROPERTIES('parquet.compression'='zstd');
INSERT INTO TABLE t values(1);
{code}
So we should invalidate the table cache after altering table properties.
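
Until a fix lands, a possible workaround (a sketch only, assuming the table `t` from 
the example above and an existing `spark` session) is to refresh the cached table 
metadata explicitly after the ALTER:

{code:scala}
// Workaround sketch, not the proposed fix: drop the cached catalog entry for `t`
// after changing its properties so the next write re-reads `parquet.compression`.
spark.sql("ALTER TABLE t SET TBLPROPERTIES('parquet.compression'='zstd')")
spark.catalog.refreshTable("t")  // invalidate cached metadata (and cached data) for t
spark.sql("INSERT INTO TABLE t VALUES (1)")
{code}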



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37097) yarn-cluster mode, unregister timeout cause spark retry but AM container exit with code 0

2021-10-22 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-37097:
--
Description: 

1. In cluster mode, the AM shutdown hook is triggered.
2. The AM's unregister call to the RM times out, but the shutdown hook catches the 
exception, so the AM container exits with code 0.
3. Since the RM has lost its connection to the AM, it treats this container as failed.
4. The client side then gets an application report with final status FAILED even 
though the AM container exited with code 0, and retries.

  was:
Cluster mode AM shutdown hook triggered, am unregister from RM timeout, but AM 
shutdown hook have try catch, so AM container exit with code 0. But since RM 
lose connection with AM, then treat this container as failed.

Then client side got application report as final status failed but am container 
exit code 0. Then retry.


> yarn-cluster mode, unregister timeout cause spark retry but AM container exit 
> with code 0
> -
>
> Key: SPARK-37097
> URL: https://issues.apache.org/jira/browse/SPARK-37097
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> 1. In cluster mode, the AM shutdown hook is triggered.
> 2. The AM's unregister call to the RM times out, but the shutdown hook catches the 
> exception, so the AM container exits with code 0.
> 3. Since the RM has lost its connection to the AM, it treats this container as failed.
> 4. The client side then gets an application report with final status FAILED even 
> though the AM container exited with code 0, and retries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37073) Pass all UTs in `external/avro` with Java 17

2021-10-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37073.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34364
[https://github.com/apache/spark/pull/34364]

> Pass all UTs in `external/avro` with Java 17
> 
>
> Key: SPARK-37073
> URL: https://issues.apache.org/jira/browse/SPARK-37073
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.3.0
>
>
> Run `mvn clean install -pl external/avro` with Java 17
>  
>  
> {code:java}
> Run completed in 43 seconds, 988 milliseconds.
> Total number of tests run: 283
> Suites: completed 14, aborted 0
> Tests: succeeded 281, failed 2, canceled 0, ignored 2, pending 0
> *** 2 TESTS FAILED ***
> {code}
>  
> {code:java}
> - support user provided non-nullable avro schema for nullable catalyst schema 
> without any null record *** FAILED ***
>   "Job aborted due to stage failure: Task 1 in stage 144.0 failed 1 times, 
> most recent failure: Lost task 1.0 in stage 144.0 (TID 250) (localhost 
> executor driver): org.apache.spark.SparkException: Task failed while writing 
> rows.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:516)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:345)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$20(FileFormatWriter.scala:252)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:136)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>   at java.base/java.lang.Thread.run(Thread.java:833)
>   Caused by: org.apache.avro.file.DataFileWriter$AppendWriteException: 
> java.lang.NullPointerException: Cannot invoke "Object.getClass()" because 
> "datum" is null of string in string in field Name of test_schema in 
> test_schema
>   at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:317)
>   at 
> org.apache.spark.sql.avro.SparkAvroKeyRecordWriter.write(SparkAvroKeyOutputFormat.java:84)
>   at 
> org.apache.spark.sql.avro.SparkAvroKeyRecordWriter.write(SparkAvroKeyOutputFormat.java:62)
>   at 
> org.apache.spark.sql.avro.AvroOutputWriter.write(AvroOutputWriter.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:175)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithMetrics(FileFormatDataWriter.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithIterator(FileFormatDataWriter.scala:92)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:328)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1502)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:335)
>   ... 9 more
>   Caused by: java.lang.NullPointerException: Cannot invoke 
> "Object.getClass()" because "datum" is null of string in string in field Name 
> of test_schema in test_schema
>   at 
> org.apache.avro.generic.GenericDatumWriter.npe(GenericDatumWriter.java:184)
>   at 
> org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:160)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:73)
>   at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:314)
>   ... 18 more
>   Caused by: java.lang.NullPointerException: Cannot invoke 
> "Object.getClass()" because "datum" is null
>   at 
> org.apache.avro.specific.SpecificDatumWriter.writeString(SpecificDatumWriter.java:68)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:151)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:83)
>   at 
> org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:158)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:221)
>   at 
> 

[jira] [Assigned] (SPARK-37073) Pass all UTs in `external/avro` with Java 17

2021-10-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37073:


Assignee: Yang Jie

> Pass all UTs in `external/avro` with Java 17
> 
>
> Key: SPARK-37073
> URL: https://issues.apache.org/jira/browse/SPARK-37073
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> Run `mvn clean install -pl external/avro` with Java 17
>  
>  
> {code:java}
> Run completed in 43 seconds, 988 milliseconds.
> Total number of tests run: 283
> Suites: completed 14, aborted 0
> Tests: succeeded 281, failed 2, canceled 0, ignored 2, pending 0
> *** 2 TESTS FAILED ***
> {code}
>  
> {code:java}
> - support user provided non-nullable avro schema for nullable catalyst schema 
> without any null record *** FAILED ***
>   "Job aborted due to stage failure: Task 1 in stage 144.0 failed 1 times, 
> most recent failure: Lost task 1.0 in stage 144.0 (TID 250) (localhost 
> executor driver): org.apache.spark.SparkException: Task failed while writing 
> rows.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:516)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:345)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$20(FileFormatWriter.scala:252)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:136)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>   at java.base/java.lang.Thread.run(Thread.java:833)
>   Caused by: org.apache.avro.file.DataFileWriter$AppendWriteException: 
> java.lang.NullPointerException: Cannot invoke "Object.getClass()" because 
> "datum" is null of string in string in field Name of test_schema in 
> test_schema
>   at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:317)
>   at 
> org.apache.spark.sql.avro.SparkAvroKeyRecordWriter.write(SparkAvroKeyOutputFormat.java:84)
>   at 
> org.apache.spark.sql.avro.SparkAvroKeyRecordWriter.write(SparkAvroKeyOutputFormat.java:62)
>   at 
> org.apache.spark.sql.avro.AvroOutputWriter.write(AvroOutputWriter.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:175)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithMetrics(FileFormatDataWriter.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithIterator(FileFormatDataWriter.scala:92)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:328)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1502)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:335)
>   ... 9 more
>   Caused by: java.lang.NullPointerException: Cannot invoke 
> "Object.getClass()" because "datum" is null of string in string in field Name 
> of test_schema in test_schema
>   at 
> org.apache.avro.generic.GenericDatumWriter.npe(GenericDatumWriter.java:184)
>   at 
> org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:160)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:73)
>   at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:314)
>   ... 18 more
>   Caused by: java.lang.NullPointerException: Cannot invoke 
> "Object.getClass()" because "datum" is null
>   at 
> org.apache.avro.specific.SpecificDatumWriter.writeString(SpecificDatumWriter.java:68)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:151)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:83)
>   at 
> org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:158)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:221)
>   at 
> org.apache.avro.specific.SpecificDatumWriter.writeField(SpecificDatumWriter.java:101)
>   at 
> 

[jira] [Created] (SPARK-37097) yarn-cluster mode, unregister timeout cause spark retry but AM container exit with code 0

2021-10-22 Thread angerszhu (Jira)
angerszhu created SPARK-37097:
-

 Summary: yarn-cluster mode, unregister timeout cause spark retry 
but AM container exit with code 0
 Key: SPARK-37097
 URL: https://issues.apache.org/jira/browse/SPARK-37097
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu


In cluster mode, the AM shutdown hook is triggered and the AM's unregister call to the 
RM times out; the shutdown hook catches the exception, so the AM container exits with 
code 0. But since the RM has lost its connection to the AM, it treats this container 
as failed.

The client side then gets an application report with final status FAILED even though 
the AM container exited with code 0, and retries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37096) Where clause and where operator will report error on varchar column type

2021-10-22 Thread Ye Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Li updated SPARK-37096:
--
Description: 
create table test1(col1 int, col2 varchar(120)) stored as orc;
 insert into test1 values(123, 'abc');
 insert into test1 values(1234, 'abcd');

 

sparkSession.sql("select * from test1")
 is OK, but

sparkSession.sql("select * from test1 where col2 = 'abc'")
 or
 sparkSession.sql("select * from test1").where("col2 = 'abc'")

report the error:

java.lang.UnsupportedOperationException: DataType: varchar(120)

  was:
create table test1(col1 int, col2 varchar(120)) stored as orc;
insert into test1 values(123, 'abc');
insert into test1 values(1234, 'abcd');

 

sparkSession.sql(‘select * from bdctemp.liye_test202110212’)
is OK,but

sparkSession.sql(‘select * from bdctemp.liye_test202110212 where col2 = “abc”’)
or
sparkSession.sql(‘select * from bdctemp.liye_test202110212’).where(‘col2 = 
“abc”’)

report error:

java.lang.UnsuppotedOperationException: DataType: varchar(120)

Environment: HDP3.1.4
   Priority: Critical  (was: Major)

> Where clause and where operator will report error on varchar column type
> 
>
> Key: SPARK-37096
> URL: https://issues.apache.org/jira/browse/SPARK-37096
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.1, 3.1.2
> Environment: HDP3.1.4
>Reporter: Ye Li
>Priority: Critical
>
> create table test1(col1 int, col2 varchar(120)) stored as orc;
>  insert into test1 values(123, 'abc');
>  insert into test1 values(1234, 'abcd');
>  
> sparkSession.sql("select * from test1")
>  is OK, but
> sparkSession.sql("select * from test1 where col2 = 'abc'")
>  or
>  sparkSession.sql("select * from test1").where("col2 = 'abc'")
> report the error:
> java.lang.UnsupportedOperationException: DataType: varchar(120)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37096) Where clause and where operator will report error on varchar column type

2021-10-22 Thread Ye Li (Jira)
Ye Li created SPARK-37096:
-

 Summary: Where clause and where operator will report error on 
varchar column type
 Key: SPARK-37096
 URL: https://issues.apache.org/jira/browse/SPARK-37096
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.1.2, 3.1.1
Reporter: Ye Li


create table test1(col1 int, col2 varchar(120)) stored as orc;
insert into test1 values(123, 'abc');
insert into test1 values(1234, 'abcd');

 

sparkSession.sql("select * from bdctemp.liye_test202110212")
is OK, but

sparkSession.sql("select * from bdctemp.liye_test202110212 where col2 = 'abc'")
or
sparkSession.sql("select * from bdctemp.liye_test202110212").where("col2 = 'abc'")

report the error:

java.lang.UnsupportedOperationException: DataType: varchar(120)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org