[jira] [Commented] (SPARK-37031) Unify v1 and v2 DESCRIBE NAMESPACE tests

2021-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434668#comment-17434668
 ] 

Apache Spark commented on SPARK-37031:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/34399

> Unify v1 and v2 DESCRIBE NAMESPACE tests
> 
>
> Key: SPARK-37031
> URL: https://issues.apache.org/jira/browse/SPARK-37031
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.3.0
>
>
> Extract the DESCRIBE NAMESPACE tests to a common place so they run against both 
> V1 and V2 datasources. Some tests can be placed in V1- and V2-specific test suites.
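For illustration only, the shared checks can live in a base trait that both the v1 and v2 suites mix in; the names below are hypothetical sketches, not the suites introduced by the PR:

{code:scala}
// Hedged sketch of the "common place" pattern; trait, test, and catalog names
// here are hypothetical and not taken from the actual PR.
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

trait DescribeNamespaceSuiteBase extends AnyFunSuite {
  def spark: SparkSession
  def catalog: String  // "spark_catalog" for the v1 suite, a registered v2 test catalog otherwise

  test("DESCRIBE NAMESPACE shows the namespace comment") {
    val ns = s"$catalog.ns1"
    spark.sql(s"CREATE NAMESPACE $ns COMMENT 'test namespace'")
    try {
      val rows = spark.sql(s"DESCRIBE NAMESPACE EXTENDED $ns").collect()
      assert(rows.exists(_.toString.contains("test namespace")))
    } finally {
      spark.sql(s"DROP NAMESPACE $ns")
    }
  }
}

// The v1 and v2 suites then only provide the catalog (plus any catalog-specific tests), e.g.:
// class DescribeNamespaceV1Suite extends DescribeNamespaceSuiteBase { ... def catalog = "spark_catalog" }
{code}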



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37031) Unify v1 and v2 DESCRIBE NAMESPACE tests

2021-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434667#comment-17434667
 ] 

Apache Spark commented on SPARK-37031:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/34399

> Unify v1 and v2 DESCRIBE NAMESPACE tests
> 
>
> Key: SPARK-37031
> URL: https://issues.apache.org/jira/browse/SPARK-37031
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.3.0
>
>
> Extract the DESCRIBE NAMESPACE tests to a common place so they run against both 
> V1 and V2 datasources. Some tests can be placed in V1- and V2-specific test suites.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37031) Unify v1 and v2 DESCRIBE NAMESPACE tests

2021-10-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37031.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34305
[https://github.com/apache/spark/pull/34305]

> Unify v1 and v2 DESCRIBE NAMESPACE tests
> 
>
> Key: SPARK-37031
> URL: https://issues.apache.org/jira/browse/SPARK-37031
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.3.0
>
>
> Extract the DESCRIBE NAMESPACE tests to a common place so they run against both 
> V1 and V2 datasources. Some tests can be placed in V1- and V2-specific test suites.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37031) Unify v1 and v2 DESCRIBE NAMESPACE tests

2021-10-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37031:
---

Assignee: Terry Kim

> Unify v1 and v2 DESCRIBE NAMESPACE tests
> 
>
> Key: SPARK-37031
> URL: https://issues.apache.org/jira/browse/SPARK-37031
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>
> Extract the DESCRIBE NAMESPACE tests to a common place so they run against both 
> V1 and V2 datasources. Some tests can be placed in V1- and V2-specific test suites.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37127) Support non-literal frame bound value for window functions

2021-10-26 Thread Kernel Force (Jira)
Kernel Force created SPARK-37127:


 Summary: Support non-literal frame bound value for window functions
 Key: SPARK-37127
 URL: https://issues.apache.org/jira/browse/SPARK-37127
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.3
 Environment: Spark-3.0.3
Reporter: Kernel Force



{code:sql}
sql("""
with va as (
select 15 a, 100 b
 union all
select 15, 120
 union all
select 15, 130
 union all
select 15, 150
)
select t.*, 
   min(t.b) over(partition by t.a order by t.b range between 0.15*t.b 
preceding and current row) c 
  from va t 
""").show
{code}

throws 


{code:java}
org.apache.spark.sql.catalyst.parser.ParseException:
Frame bound value must be a literal.(line 12, pos 65)
{code}

The non-literal expression *0.15*t.b* likely leads to this exception.

However, non-literal frame bound values are already supported by Oracle, and they 
are very useful.
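Until non-literal bounds are supported, one possible workaround (a hedged sketch over the same sample data, not a tested or recommended rewrite) is to express the range frame as a self-join, since join conditions may use arbitrary expressions:

{code:scala}
// Assumed workaround sketch: "range between 0.15*t.b preceding and current row"
// rewritten as a self-join with an expression-based range predicate.
sql("""
with va as (
select 15 a, 100 b
 union all
select 15, 120
 union all
select 15, 130
 union all
select 15, 150
)
select t.a, t.b, min(t2.b) c
  from va t
  join va t2
    on t2.a = t.a
   and t2.b between t.b - 0.15 * t.b and t.b
 group by t.a, t.b
""").show
{code}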



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37125) Support AnsiInterval radix sort

2021-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434654#comment-17434654
 ] 

Apache Spark commented on SPARK-37125:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/34398

> Support AnsiInterval radix sort
> ---
>
> Key: SPARK-37125
> URL: https://issues.apache.org/jira/browse/SPARK-37125
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> Radix sort is faster than Timsort; the benchmark results can be seen in 
> `SortBenchmark`.
> The `AnsiInterval` data types are comparable:
> - `YearMonthIntervalType` -> int ordering
> - `DayTimeIntervalType` -> long ordering
> Since we also support radix sort when the ordering column's data type is int or 
> long, `AnsiInterval` radix sort can be supported as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37125) Support AnsiInterval radix sort

2021-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434653#comment-17434653
 ] 

Apache Spark commented on SPARK-37125:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/34398

> Support AnsiInterval radix sort
> ---
>
> Key: SPARK-37125
> URL: https://issues.apache.org/jira/browse/SPARK-37125
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> Radix sort is faster than Timsort; the benchmark results can be seen in 
> `SortBenchmark`.
> The `AnsiInterval` data types are comparable:
> - `YearMonthIntervalType` -> int ordering
> - `DayTimeIntervalType` -> long ordering
> Since we also support radix sort when the ordering column's data type is int or 
> long, `AnsiInterval` radix sort can be supported as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37125) Support AnsiInterval radix sort

2021-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37125:


Assignee: Apache Spark

> Support AnsiInterval radix sort
> ---
>
> Key: SPARK-37125
> URL: https://issues.apache.org/jira/browse/SPARK-37125
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>
> Radix sort is faster than Timsort; the benchmark results can be seen in 
> `SortBenchmark`.
> The `AnsiInterval` data types are comparable:
> - `YearMonthIntervalType` -> int ordering
> - `DayTimeIntervalType` -> long ordering
> Since we also support radix sort when the ordering column's data type is int or 
> long, `AnsiInterval` radix sort can be supported as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37125) Support AnsiInterval radix sort

2021-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37125:


Assignee: (was: Apache Spark)

> Support AnsiInterval radix sort
> ---
>
> Key: SPARK-37125
> URL: https://issues.apache.org/jira/browse/SPARK-37125
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> Radix sort is faster than Timsort; the benchmark results can be seen in 
> `SortBenchmark`.
> The `AnsiInterval` data types are comparable:
> - `YearMonthIntervalType` -> int ordering
> - `DayTimeIntervalType` -> long ordering
> Since we also support radix sort when the ordering column's data type is int or 
> long, `AnsiInterval` radix sort can be supported as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37126) Support TimestampNTZ in PySpark

2021-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37126.
--
Fix Version/s: 3.3.0
   Resolution: Done

> Support TimestampNTZ in PySpark
> ---
>
> Key: SPARK-37126
> URL: https://issues.apache.org/jira/browse/SPARK-37126
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.3.0
>
>
> This ticket aims to add TimestampNTZ support in PySpark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37126) Support TimestampNTZ in PySpark

2021-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37126:
-
Issue Type: Improvement  (was: Epic)

> Support TimestampNTZ in PySpark
> ---
>
> Key: SPARK-37126
> URL: https://issues.apache.org/jira/browse/SPARK-37126
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> This ticket aims to add TimestampNTZ support in PySpark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36661) Support TimestampNTZ in Py4J

2021-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36661:


Assignee: Hyukjin Kwon  (was: Hyukjin Kwon)

> Support TimestampNTZ in Py4J
> 
>
> Key: SPARK-36661
> URL: https://issues.apache.org/jira/browse/SPARK-36661
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37125) Support AnsiInterval radix sort

2021-10-26 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-37125:
--
Parent: SPARK-27790
Issue Type: Sub-task  (was: Improvement)

> Support AnsiInterval radix sort
> ---
>
> Key: SPARK-37125
> URL: https://issues.apache.org/jira/browse/SPARK-37125
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> Radix sort is faster than Timsort; the benchmark results can be seen in 
> `SortBenchmark`.
> The `AnsiInterval` data types are comparable:
> - `YearMonthIntervalType` -> int ordering
> - `DayTimeIntervalType` -> long ordering
> Since we also support radix sort when the ordering column's data type is int or 
> long, `AnsiInterval` radix sort can be supported as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37126) Support TimestampNTZ in PySpark

2021-10-26 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-37126:


 Summary: Support TimestampNTZ in PySpark
 Key: SPARK-37126
 URL: https://issues.apache.org/jira/browse/SPARK-37126
 Project: Spark
  Issue Type: Epic
  Components: PySpark, SQL
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon
Assignee: Hyukjin Kwon


This ticket aims to add TimestampNTZ support in PySpark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37125) Support AnsiInterval radix sort

2021-10-26 Thread XiDuo You (Jira)
XiDuo You created SPARK-37125:
-

 Summary: Support AnsiInterval radix sort
 Key: SPARK-37125
 URL: https://issues.apache.org/jira/browse/SPARK-37125
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: XiDuo You


Radix sort is faster than Timsort; the benchmark results can be seen in 
`SortBenchmark`.

The `AnsiInterval` data types are comparable:
- `YearMonthIntervalType` -> int ordering
- `DayTimeIntervalType` -> long ordering

We also support radix sort when the ordering column's data type is int or 
long.

So `AnsiInterval` radix sort can be supported as well.
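As a rough illustration (an assumed example, not taken from the linked PR or from `SortBenchmark`), ANSI interval columns can be generated and sorted like this; once radix sort covers them, such a sort should behave like sorting the underlying int (months) or long (microseconds) values:

{code:scala}
// Hedged illustration: a year-month interval is backed by an int number of months,
// a day-time interval by a long number of microseconds, so ordering is integer ordering.
val df = spark.sql("""
  SELECT make_ym_interval(0, CAST(id % 100 AS INT)) AS ym
  FROM range(1000000)
""")
// "noop" sink is the usual benchmark idiom: it forces the sort without writing output.
df.orderBy("ym").write.format("noop").mode("overwrite").save()
{code}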



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37120) Add Java17 GitHub Action build and test job

2021-10-26 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434651#comment-17434651
 ] 

Yang Jie commented on SPARK-37120:
--

Thanks, [~dongjoon]

> Add Java17 GitHub Action build and test job
> ---
>
> Key: SPARK-37120
> URL: https://issues.apache.org/jira/browse/SPARK-37120
> Project: Spark
>  Issue Type: Sub-task
>  Components: jenkins
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Major
>
> I ran
> {code:java}
> build/mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
> {code}
> to build and test the whole project (HEAD is 
> 87591c9b22157cfd241af6ea2533359c3fba1bb2) with Java 17, and it seems that all the 
> UTs have passed.
>  
> {code:java}
> [INFO] 
> 
> [INFO] Reactor Summary for Spark Project Parent POM 3.3.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  1.971 
> s]
> [INFO] Spark Project Tags . SUCCESS [  2.170 
> s]
> [INFO] Spark Project Sketch ... SUCCESS [ 14.008 
> s]
> [INFO] Spark Project Local DB . SUCCESS [  2.466 
> s]
> [INFO] Spark Project Networking ... SUCCESS [ 49.650 
> s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  7.095 
> s]
> [INFO] Spark Project Unsafe ... SUCCESS [  1.826 
> s]
> [INFO] Spark Project Launcher . SUCCESS [  1.851 
> s]
> [INFO] Spark Project Core . SUCCESS [24:40 
> min]
> [INFO] Spark Project ML Local Library . SUCCESS [ 17.816 
> s]
> [INFO] Spark Project GraphX ... SUCCESS [01:27 
> min]
> [INFO] Spark Project Streaming  SUCCESS [04:57 
> min]
> [INFO] Spark Project Catalyst . SUCCESS [07:56 
> min]
> [INFO] Spark Project SQL .. SUCCESS [  01:01 
> h]
> [INFO] Spark Project ML Library ... SUCCESS [16:46 
> min]
> [INFO] Spark Project Tools  SUCCESS [  0.748 
> s]
> [INFO] Spark Project Hive . SUCCESS [  01:11 
> h]
> [INFO] Spark Project REPL . SUCCESS [01:26 
> min]
> [INFO] Spark Project YARN Shuffle Service . SUCCESS [  0.967 
> s]
> [INFO] Spark Project YARN . SUCCESS [06:54 
> min]
> [INFO] Spark Project Mesos  SUCCESS [ 46.913 
> s]
> [INFO] Spark Project Kubernetes ... SUCCESS [01:08 
> min]
> [INFO] Spark Project Hive Thrift Server ... SUCCESS [19:12 
> min]
> [INFO] Spark Ganglia Integration .. SUCCESS [  4.610 
> s]
> [INFO] Spark Project Hadoop Cloud Integration . SUCCESS [ 11.400 
> s]
> [INFO] Spark Project Assembly . SUCCESS [  2.496 
> s]
> [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [ 19.870 
> s]
> [INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:20 
> min]
> [INFO] Kafka 0.10+ Source for Structured Streaming  SUCCESS [35:06 
> min]
> [INFO] Spark Kinesis Integration .. SUCCESS [ 29.667 
> s]
> [INFO] Spark Project Examples . SUCCESS [ 32.189 
> s]
> [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [  0.949 
> s]
> [INFO] Spark Avro . SUCCESS [01:55 
> min]
> [INFO] Spark Project Kinesis Assembly . SUCCESS [  1.104 
> s]
> [INFO] 
> 
> [INFO] BUILD SUCCESS
> [INFO] 
> 
> [INFO] Total time:  04:19 h
> [INFO] Finished at: 2021-10-26T20:02:56+08:00
> [INFO] 
> 
> {code}
> So should we add a Jenkins build and test job for Java 17?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36348) unexpected Index loaded: pd.Index([10, 20, None], name="x")

2021-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434644#comment-17434644
 ] 

Apache Spark commented on SPARK-36348:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/34397

> unexpected Index loaded: pd.Index([10, 20, None], name="x")
> ---
>
> Key: SPARK-36348
> URL: https://issues.apache.org/jira/browse/SPARK-36348
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0
>
>
> {code:python}
> pidx = pd.Index([10, 20, 15, 30, 45, None], name="x")
> psidx = ps.Index(pidx)
> self.assert_eq(psidx.astype(str), pidx.astype(str))
> {code}
> [left pandas on spark]:  Index(['10.0', '20.0', '15.0', '30.0', '45.0', 
> 'nan'], dtype='object', name='x')
> [right pandas]: Index(['10', '20', '15', '30', '45', 'None'], dtype='object', 
> name='x')
> The index is loaded as float64, so follow-up steps like astype differ from 
> pandas.
> [1] 
> https://github.com/apache/spark/blob/bcc595c112a23d8e3024ace50f0dbc7eab7144b2/python/pyspark/pandas/tests/indexes/test_base.py#L2249



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35437) Use expressions to filter Hive partitions at client side

2021-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35437:


Assignee: (was: Apache Spark)

> Use expressions to filter Hive partitions at client side
> 
>
> Key: SPARK-35437
> URL: https://issues.apache.org/jira/browse/SPARK-35437
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: dzcxzl
>Priority: Minor
> Fix For: 3.3.0
>
>
> When we have a table with a lot of partitions and there is no way to filter 
> them on the MetaStore Server, we will get all the partition details and filter 
> them on the client side. This is slow and puts a lot of pressure on the 
> MetaStore Server.
> We can first pull all the partition names, filter by expressions, and then 
> obtain detailed information about the corresponding partitions from the 
> MetaStore Server.
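A self-contained sketch of the proposed flow is below; all names are hypothetical stand-ins, not the actual Hive metastore client API:

{code:scala}
// Hedged sketch (hypothetical names): 1) fetch only partition names,
// 2) filter them with an expression on the client, 3) request full metadata
// only for the matching partitions.
object ClientSidePartitionPruning {
  type PartitionName = String                              // e.g. "dt=2021-10-26"
  final case class Partition(name: PartitionName, location: String)

  // stand-ins for metastore calls; real calls would go over Thrift
  def listPartitionNames(): Seq[PartitionName] =
    (1 to 31).map(d => f"dt=2021-10-$d%02d")
  def getPartitionsByNames(names: Seq[PartitionName]): Seq[Partition] =
    names.map(n => Partition(n, s"hdfs://warehouse/tbl/$n"))

  def main(args: Array[String]): Unit = {
    val names = listPartitionNames()                       // cheap, names only
    val wanted = names.filter(_.stripPrefix("dt=") > "2021-10-20")  // client-side filter
    val partitions = getPartitionsByNames(wanted)          // details only for matches
    partitions.foreach(println)
  }
}
{code}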



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35437) Use expressions to filter Hive partitions at client side

2021-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35437:


Assignee: Apache Spark

> Use expressions to filter Hive partitions at client side
> 
>
> Key: SPARK-35437
> URL: https://issues.apache.org/jira/browse/SPARK-35437
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: dzcxzl
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.3.0
>
>
> When we have a table with a lot of partitions and there is no way to filter 
> them on the MetaStore Server, we will get all the partition details and filter 
> them on the client side. This is slow and puts a lot of pressure on the 
> MetaStore Server.
> We can first pull all the partition names, filter by expressions, and then 
> obtain detailed information about the corresponding partitions from the 
> MetaStore Server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36348) unexpected Index loaded: pd.Index([10, 20, None], name="x")

2021-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434643#comment-17434643
 ] 

Apache Spark commented on SPARK-36348:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/34397

> unexpected Index loaded: pd.Index([10, 20, None], name="x")
> ---
>
> Key: SPARK-36348
> URL: https://issues.apache.org/jira/browse/SPARK-36348
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0
>
>
> {code:python}
> pidx = pd.Index([10, 20, 15, 30, 45, None], name="x")
> psidx = ps.Index(pidx)
> self.assert_eq(psidx.astype(str), pidx.astype(str))
> {code}
> [left pandas on spark]:  Index(['10.0', '20.0', '15.0', '30.0', '45.0', 
> 'nan'], dtype='object', name='x')
> [right pandas]: Index(['10', '20', '15', '30', '45', 'None'], dtype='object', 
> name='x')
> The index is loaded as float64, so follow-up steps like astype differ from 
> pandas.
> [1] 
> https://github.com/apache/spark/blob/bcc595c112a23d8e3024ace50f0dbc7eab7144b2/python/pyspark/pandas/tests/indexes/test_base.py#L2249



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-35437) Use expressions to filter Hive partitions at client side

2021-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-35437:
--
  Assignee: (was: dzcxzl)

Reverted at 
https://github.com/apache/spark/commit/fb9d6aeb788d2e869e09f18014c966b51aa3af20

> Use expressions to filter Hive partitions at client side
> 
>
> Key: SPARK-35437
> URL: https://issues.apache.org/jira/browse/SPARK-35437
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: dzcxzl
>Priority: Minor
> Fix For: 3.3.0
>
>
> When we have a table with a lot of partitions and there is no way to filter 
> them on the MetaStore Server, we will get all the partition details and filter 
> them on the client side. This is slow and puts a lot of pressure on the 
> MetaStore Server.
> We can first pull all the partition names, filter by expressions, and then 
> obtain detailed information about the corresponding partitions from the 
> MetaStore Server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37124) Support Writable ArrowColumnarVector

2021-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37124:


Assignee: Apache Spark

> Support Writable ArrowColumnarVector
> 
>
> Key: SPARK-37124
> URL: https://issues.apache.org/jira/browse/SPARK-37124
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chendi.Xue
>Assignee: Apache Spark
>Priority: Major
>
> This Jira aims to add the Arrow format as an alternative ColumnVector 
> solution.
> The current ArrowColumnVector is not fully equivalent to 
> OnHeap/OffHeapColumnVector in Spark, the Arrow API is now more stable, and 
> pandas UDFs perform much better than Python UDFs.
> I am therefore proposing to fully support the Arrow format as an alternative to 
> ColumnVector, just like the other two.
> What I did in this PR is create a new class in the same package as 
> OnHeap/OffHeapColumnVector that extends WritableColumnVector to support 
> all put APIs.
> The UTs cover all data formats, testing both writing to and reading from the 
> column vector. I also added 3 UTs for loading from 
> ArrowRecordBatch and allocateColumns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37124) Support Writable ArrowColumnarVector

2021-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37124:


Assignee: (was: Apache Spark)

> Support Writable ArrowColumnarVector
> 
>
> Key: SPARK-37124
> URL: https://issues.apache.org/jira/browse/SPARK-37124
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chendi.Xue
>Priority: Major
>
> This Jira aims to add the Arrow format as an alternative ColumnVector 
> solution.
> The current ArrowColumnVector is not fully equivalent to 
> OnHeap/OffHeapColumnVector in Spark, the Arrow API is now more stable, and 
> pandas UDFs perform much better than Python UDFs.
> I am therefore proposing to fully support the Arrow format as an alternative to 
> ColumnVector, just like the other two.
> What I did in this PR is create a new class in the same package as 
> OnHeap/OffHeapColumnVector that extends WritableColumnVector to support 
> all put APIs.
> The UTs cover all data formats, testing both writing to and reading from the 
> column vector. I also added 3 UTs for loading from 
> ArrowRecordBatch and allocateColumns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37124) Support Writable ArrowColumnarVector

2021-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434624#comment-17434624
 ] 

Apache Spark commented on SPARK-37124:
--

User 'xuechendi' has created a pull request for this issue:
https://github.com/apache/spark/pull/34396

> Support Writable ArrowColumnarVector
> 
>
> Key: SPARK-37124
> URL: https://issues.apache.org/jira/browse/SPARK-37124
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chendi.Xue
>Priority: Major
>
> This Jira aims to add the Arrow format as an alternative ColumnVector 
> solution.
> The current ArrowColumnVector is not fully equivalent to 
> OnHeap/OffHeapColumnVector in Spark, the Arrow API is now more stable, and 
> pandas UDFs perform much better than Python UDFs.
> I am therefore proposing to fully support the Arrow format as an alternative to 
> ColumnVector, just like the other two.
> What I did in this PR is create a new class in the same package as 
> OnHeap/OffHeapColumnVector that extends WritableColumnVector to support 
> all put APIs.
> The UTs cover all data formats, testing both writing to and reading from the 
> column vector. I also added 3 UTs for loading from 
> ArrowRecordBatch and allocateColumns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37124) Support Writable ArrowColumnarVector

2021-10-26 Thread Chendi.Xue (Jira)
Chendi.Xue created SPARK-37124:
--

 Summary: Support Writable ArrowColumnarVector
 Key: SPARK-37124
 URL: https://issues.apache.org/jira/browse/SPARK-37124
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chendi.Xue


This Jira aims to add the Arrow format as an alternative ColumnVector 
solution.

The current ArrowColumnVector is not fully equivalent to OnHeap/OffHeapColumnVector 
in Spark, the Arrow API is now more stable, and pandas UDFs perform much better 
than Python UDFs.

I am therefore proposing to fully support the Arrow format as an alternative to 
ColumnVector, just like the other two.

What I did in this PR is create a new class in the same package as 
OnHeap/OffHeapColumnVector that extends WritableColumnVector to support all 
put APIs.

The UTs cover all data formats, testing both writing to and reading from the 
column vector. I also added 3 UTs for loading from ArrowRecordBatch and 
allocateColumns.
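For context, a minimal illustration of the underlying idea, using the plain Arrow Java API rather than the proposed Spark class, is shown below; Arrow vectors already expose mutable set APIs that a writable ArrowColumnVector's put* methods could delegate to:

{code:scala}
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector

// Hedged illustration only (not the proposed class): writing values and nulls
// into an Arrow IntVector through its mutable API.
object WritableArrowSketch {
  def main(args: Array[String]): Unit = {
    val allocator = new RootAllocator()
    val vec = new IntVector("c0", allocator)
    vec.allocateNew(4)
    vec.setSafe(0, 10)
    vec.setSafe(1, 20)
    vec.setNull(2)                 // nulls are tracked in the validity buffer
    vec.setSafe(3, 40)
    vec.setValueCount(4)
    (0 until 4).foreach(i => println(if (vec.isNull(i)) "null" else vec.get(i)))
    vec.close()
    allocator.close()
  }
}
{code}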



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37123) Support Writable ArrowColumnarVector

2021-10-26 Thread Chendi.Xue (Jira)
Chendi.Xue created SPARK-37123:
--

 Summary: Support Writable ArrowColumnarVector
 Key: SPARK-37123
 URL: https://issues.apache.org/jira/browse/SPARK-37123
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chendi.Xue


This Jira aims to add the Arrow format as an alternative ColumnVector 
solution.

The current ArrowColumnVector is not fully equivalent to OnHeap/OffHeapColumnVector 
in Spark, the Arrow API is now more stable, and pandas UDFs perform much better 
than Python UDFs.

I am therefore proposing to fully support the Arrow format as an alternative to 
ColumnVector, just like the other two.

What I did in this PR is create a new class in the same package as 
OnHeap/OffHeapColumnVector that extends WritableColumnVector to support all 
put APIs.

The UTs cover all data formats, testing both writing to and reading from the 
column vector. I also added 3 UTs for loading from ArrowRecordBatch and 
allocateColumns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37122) java.lang.IllegalArgumentException Related to Prometheus

2021-10-26 Thread Biswa Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biswa Singh updated SPARK-37122:

Affects Version/s: (was: 3.0.2)

> java.lang.IllegalArgumentException Related to Prometheus
> 
>
> Key: SPARK-37122
> URL: https://issues.apache.org/jira/browse/SPARK-37122
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Biswa Singh
>Priority: Critical
>
> This issue is similar to 
> https://issues.apache.org/jira/browse/SPARK-35237?focusedCommentId=17340723=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17340723.
>  We receive the following warning continuously:
>  
> 21:00:26.277 [rpc-server-4-2] WARN  o.a.s.n.s.TransportChannelHandler - 
> Exception in connection from 
> /10.198.3.179:51184java.lang.IllegalArgumentException: Too large frame: 
> 5135603447297303916 at 
> org.sparkproject.guava.base.Preconditions.checkArgument(Preconditions.java:119)
>  at 
> org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:148)
>  at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:98)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>  at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>  at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>  at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) 
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581) 
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
>  at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>  at java.base/java.lang.Thread.run(Unknown Source)
>  
> Below are other details related to Prometheus and my findings; please scroll 
> down to see the details:
>  
> {noformat}
> Prometheus Scrape Configuration
> ===
> - job_name: 'kubernetes-pods'
>   kubernetes_sd_configs:
> - role: pod
>   relabel_configs:
> - action: labelmap
>   regex: __meta_kubernetes_pod_label_(.+)
> - source_labels: [__meta_kubernetes_namespace]
>   action: replace
>   target_label: kubernetes_namespace
> - source_labels: [__meta_kubernetes_pod_name]
>   action: replace
>   target_label: kubernetes_pod_name
> - source_labels: 
> [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
>   action: keep
>   regex: true
> - source_labels: 
> [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
>   action: replace
>   target_label: __scheme__
>   regex: (https?)
> - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
>   action: replace
>   target_label: __metrics_path__
>   regex: (.+)
> - source_labels: [__address__, 
> __meta_kubernetes_pod_prometheus_io_port]
>   action: replace
>   target_label: __address__
>   regex: ([^:]+)(?::\d+)?;(\d+)
>   replacement: $1:$2
> tcptrack command output in spark3 pod
> ==
> 10.198.22.240:51258  10.198.40.143:7079  CLOSED 10s 0 B/s
> 10.198.22.240:51258  10.198.40.143:7079  CLOSED 10s 0 B/s
> 10.198.22.240:50354  10.198.40.143:7079  CLOSED 40s 0 B/s
> 10.198.22.240:33152  10.198.40.143:4040  ESTABLISHED 2s 0 B/s
> 10.198.22.240:47726  10.198.40.143:8090  ESTABLISHED 9s 0 B/s
> 10.198.22.240 = prometheus pod 
> ip10.198.40.143 = testpod ip 
> Issue
> ==
> Though the scrape config is expected to scrape on port 8090, I see Prometheus 
> trying to initiate scrapes on ports like 7079, 7078, 4040, etc. on
> the spark3 pod, and hence the exception in the spark3 pod. But is this really a 
> 

[jira] [Updated] (SPARK-37122) java.lang.IllegalArgumentException Related to Prometheus

2021-10-26 Thread Biswa Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biswa Singh updated SPARK-37122:

Description: 
This issue is similar to 
https://issues.apache.org/jira/browse/SPARK-35237?focusedCommentId=17340723=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17340723.
 We receive the following warning continuously:

 

21:00:26.277 [rpc-server-4-2] WARN  o.a.s.n.s.TransportChannelHandler - 
Exception in connection from 
/10.198.3.179:51184java.lang.IllegalArgumentException: Too large frame: 
5135603447297303916 at 
org.sparkproject.guava.base.Preconditions.checkArgument(Preconditions.java:119) 
at 
org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:148)
 at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:98)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
 at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
 at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
 at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) 
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581) at 
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
 at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 
at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 at java.base/java.lang.Thread.run(Unknown Source)

 

Below are other details related to Prometheus and my findings; please scroll 
down to see the details:

 
{noformat}
Prometheus Scrape Configuration
===
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
- role: pod
  relabel_configs:
- action: labelmap
  regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
  action: replace
  target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
  action: replace
  target_label: kubernetes_pod_name
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
  action: keep
  regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
  action: replace
  target_label: __scheme__
  regex: (https?)
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
  action: replace
  target_label: __metrics_path__
  regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_prometheus_io_port]
  action: replace
  target_label: __address__
  regex: ([^:]+)(?::\d+)?;(\d+)
  replacement: $1:$2

tcptrack command output in spark3 pod
==
10.198.22.240:51258  10.198.40.143:7079  CLOSED 10s 0 B/s
10.198.22.240:51258  10.198.40.143:7079  CLOSED 10s 0 B/s
10.198.22.240:50354  10.198.40.143:7079  CLOSED 40s 0 B/s
10.198.22.240:33152  10.198.40.143:4040  ESTABLISHED 2s 0 B/s
10.198.22.240:47726  10.198.40.143:8090  ESTABLISHED 9s 0 B/s

10.198.22.240 = prometheus pod 

ip10.198.40.143 = testpod ip 

Issue
==
Though the scrape config is expected to scrape on port 8090, I see Prometheus 
trying to initiate scrapes on ports like 7079, 7078, 4040, etc. on
the spark3 pod, and hence the exception in the spark3 pod. But is this really a 
Prometheus issue or something on the Spark side? We don't see any such exception in 
any of the other pods. All our pods, including spark3, are annotated with:

annotations:
   prometheus.io/port: "8090"
   prometheus.io/scrape: "true"

We get the metrics and everything works fine; there is just this extra warning for 
the exception.{noformat}
 

  was:
This issue is similar to 
https://issues.apache.org/jira/browse/SPARK-35237?focusedCommentId=17340723=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17340723.
 We receive the Following warning continuously:

 

21:00:26.277 [rpc-server-4-2] WARN  

[jira] [Updated] (SPARK-37122) java.lang.IllegalArgumentException Related to Prometheus

2021-10-26 Thread Biswa Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biswa Singh updated SPARK-37122:

Description: 
This issue is similar to 
https://issues.apache.org/jira/browse/SPARK-35237?focusedCommentId=17340723=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17340723.
 We receive the following warning continuously:

 

21:00:26.277 [rpc-server-4-2] WARN  o.a.s.n.s.TransportChannelHandler - 
Exception in connection from 
/10.198.3.179:51184java.lang.IllegalArgumentException: Too large frame: 
5135603447297303916 at 
org.sparkproject.guava.base.Preconditions.checkArgument(Preconditions.java:119) 
at 
org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:148)
 at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:98)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
 at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
 at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
 at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) 
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581) at 
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
 at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 
at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 at java.base/java.lang.Thread.run(Unknown Source)

 

Below are other details related to prometheus. Please scroll down to find out 
details of the issue:

 
{noformat}
Prometheus Scrape Configuration
===
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
- role: pod
  relabel_configs:
- action: labelmap
  regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
  action: replace
  target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
  action: replace
  target_label: kubernetes_pod_name
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
  action: keep
  regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
  action: replace
  target_label: __scheme__
  regex: (https?)
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
  action: replace
  target_label: __metrics_path__
  regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_prometheus_io_port]
  action: replace
  target_label: __address__
  regex: ([^:]+)(?::\d+)?;(\d+)
  replacement: $1:$2

tcptrack command output in spark3 pod
==
10.198.22.240:51258  10.198.40.143:7079  CLOSED 10s 0 B/s
10.198.22.240:51258  10.198.40.143:7079  CLOSED 10s 0 B/s
10.198.22.240:50354  10.198.40.143:7079  CLOSED 40s 0 B/s
10.198.22.240:33152  10.198.40.143:4040  ESTABLISHED 2s 0 B/s
10.198.22.240:47726  10.198.40.143:8090  ESTABLISHED 9s 0 B/s

10.198.22.240 = prometheus pod 

ip10.198.40.143 = testpod ip 

Issue
==
Though the scrape config is expected to scrape on port 8090, I see Prometheus 
trying to initiate scrapes on ports like 7079, 7078, 4040, etc. on
the spark3 pod, and hence the exception in the spark3 pod. But is this really a 
Prometheus issue or something on the Spark side? We don't see any such exception in 
any of the other pods. All our pods, including spark3, are annotated with:

annotations:
   prometheus.io/port: "8090"
   prometheus.io/scrape: "true"

We get the metrics and everything works fine; there is just this extra warning for 
the exception.{noformat}
 

  was:
This issue is similar to 
https://issues.apache.org/jira/browse/SPARK-35237?focusedCommentId=17340723=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17340723.
 We receive the Following warning:

 

21:00:26.277 [rpc-server-4-2] WARN  o.a.s.n.s.TransportChannelHandler 

[jira] [Updated] (SPARK-37122) java.lang.IllegalArgumentException Related to Prometheus

2021-10-26 Thread Biswa Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biswa Singh updated SPARK-37122:

Description: 
This issue is similar to 
https://issues.apache.org/jira/browse/SPARK-35237?focusedCommentId=17340723=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17340723.
 We receive the following warning:

 

21:00:26.277 [rpc-server-4-2] WARN  o.a.s.n.s.TransportChannelHandler - 
Exception in connection from 
/10.198.3.179:51184java.lang.IllegalArgumentException: Too large frame: 
5135603447297303916 at 
org.sparkproject.guava.base.Preconditions.checkArgument(Preconditions.java:119) 
at 
org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:148)
 at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:98)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
 at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
 at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
 at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) 
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581) at 
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
 at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 
at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 at java.base/java.lang.Thread.run(Unknown Source)

 

Below are other details related to prometheus. Please scroll down to find out 
details of the issue:

 
{noformat}
Prometheus Scrape Configuration
===
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
- role: pod
  relabel_configs:
- action: labelmap
  regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
  action: replace
  target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
  action: replace
  target_label: kubernetes_pod_name
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
  action: keep
  regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
  action: replace
  target_label: __scheme__
  regex: (https?)
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
  action: replace
  target_label: __metrics_path__
  regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_prometheus_io_port]
  action: replace
  target_label: __address__
  regex: ([^:]+)(?::\d+)?;(\d+)
  replacement: $1:$2

tcptrack command output in spark3 pod
==
10.198.22.240:51258  10.198.40.143:7079  CLOSED 10s 0 B/s
10.198.22.240:51258  10.198.40.143:7079  CLOSED 10s 0 B/s
10.198.22.240:50354  10.198.40.143:7079  CLOSED 40s 0 B/s
10.198.22.240:33152  10.198.40.143:4040  ESTABLISHED 2s 0 B/s
10.198.22.240:47726  10.198.40.143:8090  ESTABLISHED 9s 0 B/s

10.198.22.240 = prometheus pod 

ip10.198.40.143 = testpod ip 

Issue
==
Though the scrape config is expected to scrape on port 8090, I see Prometheus 
trying to initiate scrapes on ports like 7079, 7078, 4040, etc. on
the spark3 pod, and hence the exception in the spark3 pod. But is this really a 
Prometheus issue or something on the Spark side? We don't see any such exception in 
any of the other pods. All our pods, including spark3, are annotated with:

annotations:
   prometheus.io/port: "8090"
   prometheus.io/scrape: "true"

We get the metrics and everything works fine; there is just this extra warning for 
the exception.{noformat}
 

  was:
This issue is similar to 
https://issues.apache.org/jira/browse/SPARK-35237?focusedCommentId=17340723=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17340723.
 We receive the Following warning:

 

 

21:00:26.277 [rpc-server-4-2] WARN  o.a.s.n.s.TransportChannelHandler - 

[jira] [Commented] (SPARK-37109) Install Java 17 on all of the Jenkins workers

2021-10-26 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434581#comment-17434581
 ] 

Shane Knapp commented on SPARK-37109:
-

yep, jenkins is going away at the end of this year...  all support is currently 
'best effort'. 

> Install Java 17 on all of the Jenkins workers
> -
>
> Key: SPARK-37109
> URL: https://issues.apache.org/jira/browse/SPARK-37109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37122) java.lang.IllegalArgumentException Related to Prometheus

2021-10-26 Thread Biswa Singh (Jira)
Biswa Singh created SPARK-37122:
---

 Summary: java.lang.IllegalArgumentException Related to Prometheus
 Key: SPARK-37122
 URL: https://issues.apache.org/jira/browse/SPARK-37122
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.1.1, 3.0.2
Reporter: Biswa Singh


This issue is similar to 
https://issues.apache.org/jira/browse/SPARK-35237?focusedCommentId=17340723=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17340723.
 We receive the following warning:

 

 

21:00:26.277 [rpc-server-4-2] WARN  o.a.s.n.s.TransportChannelHandler - 
Exception in connection from 
/10.198.3.179:51184java.lang.IllegalArgumentException: Too large frame: 
5135603447297303916 at 
org.sparkproject.guava.base.Preconditions.checkArgument(Preconditions.java:119) 
at 
org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:148)
 at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:98)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
 at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
 at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
 at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) 
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581) at 
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
 at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 
at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 at java.base/java.lang.Thread.run(Unknown Source)

 

Below are other details related to prometheus.

 
{noformat}

Prometheus Scrape Configuration
===
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
- role: pod
  relabel_configs:
- action: labelmap
  regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
  action: replace
  target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
  action: replace
  target_label: kubernetes_pod_name
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
  action: keep
  regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
  action: replace
  target_label: __scheme__
  regex: (https?)
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
  action: replace
  target_label: __metrics_path__
  regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_prometheus_io_port]
  action: replace
  target_label: __address__
  regex: ([^:]+)(?::\d+)?;(\d+)
  replacement: $1:$2

tcptrack command output in spark3 pod
==
10.198.22.240:51258  10.198.40.143:7079  CLOSED 10s 0 B/s
10.198.22.240:51258  10.198.40.143:7079  CLOSED 10s 0 B/s
10.198.22.240:50354  10.198.40.143:7079  CLOSED 40s 0 B/s
10.198.22.240:33152  10.198.40.143:4040  ESTABLISHED 2s 0 B/s
10.198.22.240:47726  10.198.40.143:8090  ESTABLISHED 9s 0 B/s

10.198.22.240 = prometheus pod 

ip10.198.40.143 = testpod ip 

Issue
==
Though the scrape config is expected to scrape on port 8090, I see Prometheus 
trying to initiate scrapes on ports like 7079, 7078, 4040, etc. on
the spark3 pod, and hence the exception in the spark3 pod. But is this really a 
Prometheus issue or something on the Spark side? We don't see any such exception in 
any of the other pods. All our pods, including spark3, are annotated with:

annotations:
   prometheus.io/port: "8090"
   prometheus.io/scrape: "true"

We do get the metrics and everything works fine; we just see this extra warning
for the exception.{noformat}
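
For context, a "Too large frame" error from TransportFrameDecoder is commonly what happens when a
plain HTTP request (such as a Prometheus scrape) hits one of Spark's internal RPC/block-manager
ports, because the first bytes of the request are read as a frame length. Below is a minimal
sketch (assumed configuration, not taken from the report above) of how the driver's Prometheus
endpoint is usually exposed on one known port in Spark 3.x, so the scrape target can be pinned to
it:

{code:scala}
// Sketch: enable the built-in PrometheusServlet sink so driver metrics are
// served from the Spark UI port, and point the scrape annotation at that port.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("prometheus-servlet-sketch")                           // hypothetical app name
  .config("spark.ui.prometheus.enabled", "true")                  // executor metrics under /metrics/executors/prometheus
  .config("spark.metrics.conf.*.sink.prometheusServlet.class",
    "org.apache.spark.metrics.sink.PrometheusServlet")
  .config("spark.metrics.conf.*.sink.prometheusServlet.path",
    "/metrics/prometheus")
  .getOrCreate()
{code}

Connections to the RPC ports (e.g. 7078/7079) will always trigger the frame-decoder warning, so
keeping the scrape on a single metrics port avoids the noise.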
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37120) Add Java17 GitHub Action build and test job

2021-10-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434558#comment-17434558
 ] 

Dongjoon Hyun commented on SPARK-37120:
---

cc [~hyukjin.kwon]

> Add Java17 GitHub Action build and test job
> ---
>
> Key: SPARK-37120
> URL: https://issues.apache.org/jira/browse/SPARK-37120
> Project: Spark
>  Issue Type: Sub-task
>  Components: jenkins
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Major
>
> Now run
> {code:java}
> build/mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
> {code}
> to build and test whole project(Head is 
> 87591c9b22157cfd241af6ea2533359c3fba1bb2) with Java 17. It seems that all the 
> UTs have passed.
>  
> {code:java}
> [INFO] 
> 
> [INFO] Reactor Summary for Spark Project Parent POM 3.3.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  1.971 
> s]
> [INFO] Spark Project Tags . SUCCESS [  2.170 
> s]
> [INFO] Spark Project Sketch ... SUCCESS [ 14.008 
> s]
> [INFO] Spark Project Local DB . SUCCESS [  2.466 
> s]
> [INFO] Spark Project Networking ... SUCCESS [ 49.650 
> s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  7.095 
> s]
> [INFO] Spark Project Unsafe ... SUCCESS [  1.826 
> s]
> [INFO] Spark Project Launcher . SUCCESS [  1.851 
> s]
> [INFO] Spark Project Core . SUCCESS [24:40 
> min]
> [INFO] Spark Project ML Local Library . SUCCESS [ 17.816 
> s]
> [INFO] Spark Project GraphX ... SUCCESS [01:27 
> min]
> [INFO] Spark Project Streaming  SUCCESS [04:57 
> min]
> [INFO] Spark Project Catalyst . SUCCESS [07:56 
> min]
> [INFO] Spark Project SQL .. SUCCESS [  01:01 
> h]
> [INFO] Spark Project ML Library ... SUCCESS [16:46 
> min]
> [INFO] Spark Project Tools  SUCCESS [  0.748 
> s]
> [INFO] Spark Project Hive . SUCCESS [  01:11 
> h]
> [INFO] Spark Project REPL . SUCCESS [01:26 
> min]
> [INFO] Spark Project YARN Shuffle Service . SUCCESS [  0.967 
> s]
> [INFO] Spark Project YARN . SUCCESS [06:54 
> min]
> [INFO] Spark Project Mesos  SUCCESS [ 46.913 
> s]
> [INFO] Spark Project Kubernetes ... SUCCESS [01:08 
> min]
> [INFO] Spark Project Hive Thrift Server ... SUCCESS [19:12 
> min]
> [INFO] Spark Ganglia Integration .. SUCCESS [  4.610 
> s]
> [INFO] Spark Project Hadoop Cloud Integration . SUCCESS [ 11.400 
> s]
> [INFO] Spark Project Assembly . SUCCESS [  2.496 
> s]
> [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [ 19.870 
> s]
> [INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:20 
> min]
> [INFO] Kafka 0.10+ Source for Structured Streaming  SUCCESS [35:06 
> min]
> [INFO] Spark Kinesis Integration .. SUCCESS [ 29.667 
> s]
> [INFO] Spark Project Examples . SUCCESS [ 32.189 
> s]
> [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [  0.949 
> s]
> [INFO] Spark Avro . SUCCESS [01:55 
> min]
> [INFO] Spark Project Kinesis Assembly . SUCCESS [  1.104 
> s]
> [INFO] 
> 
> [INFO] BUILD SUCCESS
> [INFO] 
> 
> [INFO] Total time:  04:19 h
> [INFO] Finished at: 2021-10-26T20:02:56+08:00
> [INFO] 
> 
> {code}
> So should we add a Jenkins build and test job for Java 17?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37120) Add Java17 GitHub Action build and test job

2021-10-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434557#comment-17434557
 ] 

Dongjoon Hyun commented on SPARK-37120:
---

I updated the JIRA title to target the GitHub Action job.

> Add Java17 GitHub Action build and test job
> ---
>
> Key: SPARK-37120
> URL: https://issues.apache.org/jira/browse/SPARK-37120
> Project: Spark
>  Issue Type: Sub-task
>  Components: jenkins
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Major
>
> Now run
> {code:java}
> build/mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
> {code}
> to build and test whole project(Head is 
> 87591c9b22157cfd241af6ea2533359c3fba1bb2) with Java 17. It seems that all the 
> UTs have passed.
>  
> {code:java}
> [INFO] 
> 
> [INFO] Reactor Summary for Spark Project Parent POM 3.3.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  1.971 
> s]
> [INFO] Spark Project Tags . SUCCESS [  2.170 
> s]
> [INFO] Spark Project Sketch ... SUCCESS [ 14.008 
> s]
> [INFO] Spark Project Local DB . SUCCESS [  2.466 
> s]
> [INFO] Spark Project Networking ... SUCCESS [ 49.650 
> s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  7.095 
> s]
> [INFO] Spark Project Unsafe ... SUCCESS [  1.826 
> s]
> [INFO] Spark Project Launcher . SUCCESS [  1.851 
> s]
> [INFO] Spark Project Core . SUCCESS [24:40 
> min]
> [INFO] Spark Project ML Local Library . SUCCESS [ 17.816 
> s]
> [INFO] Spark Project GraphX ... SUCCESS [01:27 
> min]
> [INFO] Spark Project Streaming  SUCCESS [04:57 
> min]
> [INFO] Spark Project Catalyst . SUCCESS [07:56 
> min]
> [INFO] Spark Project SQL .. SUCCESS [  01:01 
> h]
> [INFO] Spark Project ML Library ... SUCCESS [16:46 
> min]
> [INFO] Spark Project Tools  SUCCESS [  0.748 
> s]
> [INFO] Spark Project Hive . SUCCESS [  01:11 
> h]
> [INFO] Spark Project REPL . SUCCESS [01:26 
> min]
> [INFO] Spark Project YARN Shuffle Service . SUCCESS [  0.967 
> s]
> [INFO] Spark Project YARN . SUCCESS [06:54 
> min]
> [INFO] Spark Project Mesos  SUCCESS [ 46.913 
> s]
> [INFO] Spark Project Kubernetes ... SUCCESS [01:08 
> min]
> [INFO] Spark Project Hive Thrift Server ... SUCCESS [19:12 
> min]
> [INFO] Spark Ganglia Integration .. SUCCESS [  4.610 
> s]
> [INFO] Spark Project Hadoop Cloud Integration . SUCCESS [ 11.400 
> s]
> [INFO] Spark Project Assembly . SUCCESS [  2.496 
> s]
> [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [ 19.870 
> s]
> [INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:20 
> min]
> [INFO] Kafka 0.10+ Source for Structured Streaming  SUCCESS [35:06 
> min]
> [INFO] Spark Kinesis Integration .. SUCCESS [ 29.667 
> s]
> [INFO] Spark Project Examples . SUCCESS [ 32.189 
> s]
> [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [  0.949 
> s]
> [INFO] Spark Avro . SUCCESS [01:55 
> min]
> [INFO] Spark Project Kinesis Assembly . SUCCESS [  1.104 
> s]
> [INFO] 
> 
> [INFO] BUILD SUCCESS
> [INFO] 
> 
> [INFO] Total time:  04:19 h
> [INFO] Finished at: 2021-10-26T20:02:56+08:00
> [INFO] 
> 
> {code}
> So should we add a Jenkins build and test job for Java 17?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37120) Add Java17 GitHub Action build and test job

2021-10-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37120:
--
Summary: Add Java17 GitHub Action build and test job  (was: Add a Jenkins 
build and test job for Java 17)

> Add Java17 GitHub Action build and test job
> ---
>
> Key: SPARK-37120
> URL: https://issues.apache.org/jira/browse/SPARK-37120
> Project: Spark
>  Issue Type: Sub-task
>  Components: jenkins
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Major
>
> Now run
> {code:java}
> build/mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
> {code}
> to build and test whole project(Head is 
> 87591c9b22157cfd241af6ea2533359c3fba1bb2) with Java 17. It seems that all the 
> UTs have passed.
>  
> {code:java}
> [INFO] 
> 
> [INFO] Reactor Summary for Spark Project Parent POM 3.3.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  1.971 
> s]
> [INFO] Spark Project Tags . SUCCESS [  2.170 
> s]
> [INFO] Spark Project Sketch ... SUCCESS [ 14.008 
> s]
> [INFO] Spark Project Local DB . SUCCESS [  2.466 
> s]
> [INFO] Spark Project Networking ... SUCCESS [ 49.650 
> s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  7.095 
> s]
> [INFO] Spark Project Unsafe ... SUCCESS [  1.826 
> s]
> [INFO] Spark Project Launcher . SUCCESS [  1.851 
> s]
> [INFO] Spark Project Core . SUCCESS [24:40 
> min]
> [INFO] Spark Project ML Local Library . SUCCESS [ 17.816 
> s]
> [INFO] Spark Project GraphX ... SUCCESS [01:27 
> min]
> [INFO] Spark Project Streaming  SUCCESS [04:57 
> min]
> [INFO] Spark Project Catalyst . SUCCESS [07:56 
> min]
> [INFO] Spark Project SQL .. SUCCESS [  01:01 
> h]
> [INFO] Spark Project ML Library ... SUCCESS [16:46 
> min]
> [INFO] Spark Project Tools  SUCCESS [  0.748 
> s]
> [INFO] Spark Project Hive . SUCCESS [  01:11 
> h]
> [INFO] Spark Project REPL . SUCCESS [01:26 
> min]
> [INFO] Spark Project YARN Shuffle Service . SUCCESS [  0.967 
> s]
> [INFO] Spark Project YARN . SUCCESS [06:54 
> min]
> [INFO] Spark Project Mesos  SUCCESS [ 46.913 
> s]
> [INFO] Spark Project Kubernetes ... SUCCESS [01:08 
> min]
> [INFO] Spark Project Hive Thrift Server ... SUCCESS [19:12 
> min]
> [INFO] Spark Ganglia Integration .. SUCCESS [  4.610 
> s]
> [INFO] Spark Project Hadoop Cloud Integration . SUCCESS [ 11.400 
> s]
> [INFO] Spark Project Assembly . SUCCESS [  2.496 
> s]
> [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [ 19.870 
> s]
> [INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:20 
> min]
> [INFO] Kafka 0.10+ Source for Structured Streaming  SUCCESS [35:06 
> min]
> [INFO] Spark Kinesis Integration .. SUCCESS [ 29.667 
> s]
> [INFO] Spark Project Examples . SUCCESS [ 32.189 
> s]
> [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [  0.949 
> s]
> [INFO] Spark Avro . SUCCESS [01:55 
> min]
> [INFO] Spark Project Kinesis Assembly . SUCCESS [  1.104 
> s]
> [INFO] 
> 
> [INFO] BUILD SUCCESS
> [INFO] 
> 
> [INFO] Total time:  04:19 h
> [INFO] Finished at: 2021-10-26T20:02:56+08:00
> [INFO] 
> 
> {code}
> So should we add a Jenkins build and test job for Java 17?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37098) Alter table properties should invalidate cache

2021-10-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37098:
--
Fix Version/s: 3.0.4

> Alter table properties should invalidate cache
> --
>
> Key: SPARK-37098
> URL: https://issues.apache.org/jira/browse/SPARK-37098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.1.3, 3.0.4, 3.2.1, 3.3.0
>
>
> Table properties can change the behavior of writing, e.g. a Parquet 
> table with `parquet.compression`.
> If you execute the following SQL, we will get the file with snappy 
> compression rather than zstd.
> {code:java}
> CREATE TABLE t (c int) STORED AS PARQUET;
> // cache table metadata
> SELECT * FROM t;
> ALTER TABLE t SET TBLPROPERTIES('parquet.compression'='zstd');
> INSERT INTO TABLE t values(1);
> {code}
> So we should invalidate the table cache after alter table properties.
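
A possible interim workaround (untested sketch, assuming the table is named {{t}} as in the
example above) is to refresh the cached metadata explicitly after changing the properties:

{code:scala}
// Sketch: refresh the cached catalog entry so the next write picks up the new
// parquet.compression value instead of the stale snappy default.
spark.sql("ALTER TABLE t SET TBLPROPERTIES('parquet.compression'='zstd')")
spark.catalog.refreshTable("t")        // invalidates the cached table metadata
spark.sql("INSERT INTO TABLE t VALUES (1)")
{code}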



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37114) Support Submitting Jobs to Cloud Providers Managed Spark Clusters

2021-10-26 Thread Naga Vijayapuram (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naga Vijayapuram updated SPARK-37114:
-
Priority: Minor  (was: Trivial)

> Support Submitting Jobs to Cloud Providers Managed Spark Clusters
> -
>
> Key: SPARK-37114
> URL: https://issues.apache.org/jira/browse/SPARK-37114
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 3.2.0
>Reporter: Naga Vijayapuram
>Priority: Minor
>
> "spark-submit" can be enhanced to submit jobs to prominent cloud providers' 
> managed Spark clusters. For example, to submit a job to Google Cloud Dataproc, 
> "spark-submit" could issue "gcloud dataproc jobs submit spark ..." when the 
> "--master gcd://cluster-name" arg is used. Once this feature is accepted and 
> prioritized, it can be rolled out in current and future versions of Spark and 
> also backported to a few previous versions. I can raise the pull request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36877) Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing reruns

2021-10-26 Thread Shardul Mahadik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434515#comment-17434515
 ] 

Shardul Mahadik commented on SPARK-36877:
-

Was able to get around this by re-using the RDD for further DF operations
{code:scala}
val df = /* some expensive multi-table/multi-stage join */
val rdd = df.rdd
val numPartitions = rdd.getNumPartitions
val dfFromRdd = spark.createDataset(rdd)(df.encoder)
dfFromRdd.repartition(x).write.
{code}
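
Spelled out end-to-end as a self-contained sketch (hypothetical stand-in join, output path and
partition count; the truncated {{.write.}} call above is left as reported):

{code:scala}
// Sketch: materialize the plan once via .rdd, then wrap the already-computed
// RDD in a new Dataset so the subsequent write does not re-run the stages.
val df = spark.range(10).toDF("id")                   // stand-in for the expensive join
val rdd = df.rdd                                      // runs the AQE jobs once
val numPartitions = rdd.getNumPartitions
val dfFromRdd = spark.createDataset(rdd)(df.encoder)  // reuses the computed RDD
dfFromRdd.repartition(5).write.mode("overwrite").orc("/tmp/orc_out")  // hypothetical path
{code}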

> Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing 
> reruns
> --
>
> Key: SPARK-36877
> URL: https://issues.apache.org/jira/browse/SPARK-36877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Shardul Mahadik
>Priority: Major
> Attachments: Screen Shot 2021-09-28 at 09.32.20.png
>
>
> In one of our jobs we perform the following operation:
> {code:scala}
> val df = /* some expensive multi-table/multi-stage join */
> val numPartitions = df.rdd.getNumPartitions
> df.repartition(x).write.
> {code}
> With AQE enabled, we found that the expensive stages were being run twice 
> causing significant performance regression after enabling AQE; once when 
> calling {{df.rdd}} and again when calling {{df.write}}.
> A more concrete example:
> {code:scala}
> scala> sql("SET spark.sql.adaptive.enabled=true")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
> res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val df1 = spark.range(10).withColumn("id2", $"id")
> df1: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df2 = df1.join(spark.range(10), "id").join(spark.range(10), 
> "id").join(spark.range(10), "id")
> df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df3 = df2.groupBy("id2").count()
> df3: org.apache.spark.sql.DataFrame = [id2: bigint, count: bigint]
> scala> df3.rdd.getNumPartitions
> res2: Int = 10
> scala> df3.repartition(5).write.mode("overwrite").orc("/tmp/orc1")
> {code}
> In the screenshot below, you can see that the first 3 stages (0 to 4) were 
> rerun again (5 to 9).
> I have two questions:
> 1) Should calling df.rdd trigger actual job execution when AQE is enabled?
> 2) Should calling df.write later cause rerun of the stages? If df.rdd has 
> already partially executed the stages, shouldn't it reuse the result from 
> previous stages?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36877) Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing reruns

2021-10-26 Thread Shardul Mahadik (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shardul Mahadik resolved SPARK-36877.
-
Resolution: Not A Problem

> Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing 
> reruns
> --
>
> Key: SPARK-36877
> URL: https://issues.apache.org/jira/browse/SPARK-36877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Shardul Mahadik
>Priority: Major
> Attachments: Screen Shot 2021-09-28 at 09.32.20.png
>
>
> In one of our jobs we perform the following operation:
> {code:scala}
> val df = /* some expensive multi-table/multi-stage join */
> val numPartitions = df.rdd.getNumPartitions
> df.repartition(x).write.
> {code}
> With AQE enabled, we found that the expensive stages were being run twice 
> causing significant performance regression after enabling AQE; once when 
> calling {{df.rdd}} and again when calling {{df.write}}.
> A more concrete example:
> {code:scala}
> scala> sql("SET spark.sql.adaptive.enabled=true")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
> res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val df1 = spark.range(10).withColumn("id2", $"id")
> df1: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df2 = df1.join(spark.range(10), "id").join(spark.range(10), 
> "id").join(spark.range(10), "id")
> df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df3 = df2.groupBy("id2").count()
> df3: org.apache.spark.sql.DataFrame = [id2: bigint, count: bigint]
> scala> df3.rdd.getNumPartitions
> res2: Int = 10
> scala> df3.repartition(5).write.mode("overwrite").orc("/tmp/orc1")
> {code}
> In the screenshot below, you can see that the first 3 stages (0 to 4) were 
> rerun again (5 to 9).
> I have two questions:
> 1) Should calling df.rdd trigger actual job execution when AQE is enabled?
> 2) Should calling df.write later cause rerun of the stages? If df.rdd has 
> already partially executed the stages, shouldn't it reuse the result from 
> previous stages?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36895) Add Create Index syntax support

2021-10-26 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-36895.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34148
[https://github.com/apache/spark/pull/34148]

> Add Create Index syntax support
> ---
>
> Key: SPARK-36895
> URL: https://issues.apache.org/jira/browse/SPARK-36895
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36895) Add Create Index syntax support

2021-10-26 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-36895:
---

Assignee: Huaxin Gao

> Add Create Index syntax support
> ---
>
> Key: SPARK-36895
> URL: https://issues.apache.org/jira/browse/SPARK-36895
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37121) TestUtils.isPythonVersionAtLeast38 returns incorrect results

2021-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37121:


Assignee: Apache Spark

> TestUtils.isPythonVersionAtLeast38 returns incorrect results
> 
>
> Key: SPARK-37121
> URL: https://issues.apache.org/jira/browse/SPARK-37121
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Erik Krogen
>Assignee: Apache Spark
>Priority: Major
>
> I was working on {{HiveExternalCatalogVersionsSuite}} recently and noticed 
> that it was never running against the Spark 2.x release lines, only the 3.x 
> ones. The problem was coming from here, specifically the Python 3.8+ version 
> check:
> {code}
> versions
>   .filter(v => v.startsWith("3") || !TestUtils.isPythonVersionAtLeast38())
>   .filter(v => v.startsWith("3") || 
> !SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_9))
> {code}
> I found that {{TestUtils.isPythonVersionAtLeast38()}} was always returning 
> true, even when my system installation of Python3 was 3.7. Thinking it was an 
> environment issue, I pulled up a debugger to check which version of Python 
> the test JVM was seeing, and it was in fact Python 3.7.
> Turns out the issue is with the {{isPythonVersionAtLeast38}} method:
> {code}
>   def isPythonVersionAtLeast38(): Boolean = {
> val attempt = if (Utils.isWindows) {
>   Try(Process(Seq("cmd.exe", "/C", "python3 --version"))
> .run(ProcessLogger(s => s.startsWith("Python 3.8") || 
> s.startsWith("Python 3.9")))
> .exitValue())
> } else {
>   Try(Process(Seq("sh", "-c", "python3 --version"))
> .run(ProcessLogger(s => s.startsWith("Python 3.8") || 
> s.startsWith("Python 3.9")))
> .exitValue())
> }
> attempt.isSuccess && attempt.get == 0
>   }
> {code}
> It's trying to evaluate the version of Python using a {{ProcessLogger}}, but 
> the logger accepts a {{String => Unit}} function, i.e., it does not make use 
> of the return value in any way (since it's meant for logging). So the result 
> of the {{startsWith}} checks are thrown away, and {{attempt.isSuccess && 
> attempt.get == 0}} will always be true as long as your system has a 
> {{python3}} binary of any version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37121) TestUtils.isPythonVersionAtLeast38 returns incorrect results

2021-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37121:


Assignee: (was: Apache Spark)

> TestUtils.isPythonVersionAtLeast38 returns incorrect results
> 
>
> Key: SPARK-37121
> URL: https://issues.apache.org/jira/browse/SPARK-37121
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Erik Krogen
>Priority: Major
>
> I was working on {{HiveExternalCatalogVersionsSuite}} recently and noticed 
> that it was never running against the Spark 2.x release lines, only the 3.x 
> ones. The problem was coming from here, specifically the Python 3.8+ version 
> check:
> {code}
> versions
>   .filter(v => v.startsWith("3") || !TestUtils.isPythonVersionAtLeast38())
>   .filter(v => v.startsWith("3") || 
> !SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_9))
> {code}
> I found that {{TestUtils.isPythonVersionAtLeast38()}} was always returning 
> true, even when my system installation of Python3 was 3.7. Thinking it was an 
> environment issue, I pulled up a debugger to check which version of Python 
> the test JVM was seeing, and it was in fact Python 3.7.
> Turns out the issue is with the {{isPythonVersionAtLeast38}} method:
> {code}
>   def isPythonVersionAtLeast38(): Boolean = {
> val attempt = if (Utils.isWindows) {
>   Try(Process(Seq("cmd.exe", "/C", "python3 --version"))
> .run(ProcessLogger(s => s.startsWith("Python 3.8") || 
> s.startsWith("Python 3.9")))
> .exitValue())
> } else {
>   Try(Process(Seq("sh", "-c", "python3 --version"))
> .run(ProcessLogger(s => s.startsWith("Python 3.8") || 
> s.startsWith("Python 3.9")))
> .exitValue())
> }
> attempt.isSuccess && attempt.get == 0
>   }
> {code}
> It's trying to evaluate the version of Python using a {{ProcessLogger}}, but 
> the logger accepts a {{String => Unit}} function, i.e., it does not make use 
> of the return value in any way (since it's meant for logging). So the result 
> of the {{startsWith}} checks are thrown away, and {{attempt.isSuccess && 
> attempt.get == 0}} will always be true as long as your system has a 
> {{python3}} binary of any version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37121) TestUtils.isPythonVersionAtLeast38 returns incorrect results

2021-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434503#comment-17434503
 ] 

Apache Spark commented on SPARK-37121:
--

User 'xkrogen' has created a pull request for this issue:
https://github.com/apache/spark/pull/34395

> TestUtils.isPythonVersionAtLeast38 returns incorrect results
> 
>
> Key: SPARK-37121
> URL: https://issues.apache.org/jira/browse/SPARK-37121
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Erik Krogen
>Priority: Major
>
> I was working on {{HiveExternalCatalogVersionsSuite}} recently and noticed 
> that it was never running against the Spark 2.x release lines, only the 3.x 
> ones. The problem was coming from here, specifically the Python 3.8+ version 
> check:
> {code}
> versions
>   .filter(v => v.startsWith("3") || !TestUtils.isPythonVersionAtLeast38())
>   .filter(v => v.startsWith("3") || 
> !SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_9))
> {code}
> I found that {{TestUtils.isPythonVersionAtLeast38()}} was always returning 
> true, even when my system installation of Python3 was 3.7. Thinking it was an 
> environment issue, I pulled up a debugger to check which version of Python 
> the test JVM was seeing, and it was in fact Python 3.7.
> Turns out the issue is with the {{isPythonVersionAtLeast38}} method:
> {code}
>   def isPythonVersionAtLeast38(): Boolean = {
> val attempt = if (Utils.isWindows) {
>   Try(Process(Seq("cmd.exe", "/C", "python3 --version"))
> .run(ProcessLogger(s => s.startsWith("Python 3.8") || 
> s.startsWith("Python 3.9")))
> .exitValue())
> } else {
>   Try(Process(Seq("sh", "-c", "python3 --version"))
> .run(ProcessLogger(s => s.startsWith("Python 3.8") || 
> s.startsWith("Python 3.9")))
> .exitValue())
> }
> attempt.isSuccess && attempt.get == 0
>   }
> {code}
> It's trying to evaluate the version of Python using a {{ProcessLogger}}, but 
> the logger accepts a {{String => Unit}} function, i.e., it does not make use 
> of the return value in any way (since it's meant for logging). So the result 
> of the {{startsWith}} checks are thrown away, and {{attempt.isSuccess && 
> attempt.get == 0}} will always be true as long as your system has a 
> {{python3}} binary of any version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37121) TestUtils.isPythonVersionAtLeast38 returns incorrect results

2021-10-26 Thread Erik Krogen (Jira)
Erik Krogen created SPARK-37121:
---

 Summary: TestUtils.isPythonVersionAtLeast38 returns incorrect 
results
 Key: SPARK-37121
 URL: https://issues.apache.org/jira/browse/SPARK-37121
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 3.2.0
Reporter: Erik Krogen


I was working on {{HiveExternalCatalogVersionsSuite}} recently and noticed that 
it was never running against the Spark 2.x release lines, only the 3.x ones. 
The problem was coming from here, specifically the Python 3.8+ version check:
{code}
versions
  .filter(v => v.startsWith("3") || !TestUtils.isPythonVersionAtLeast38())
  .filter(v => v.startsWith("3") || 
!SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_9))
{code}

I found that {{TestUtils.isPythonVersionAtLeast38()}} was always returning 
true, even when my system installation of Python3 was 3.7. Thinking it was an 
environment issue, I pulled up a debugger to check which version of Python the 
test JVM was seeing, and it was in fact Python 3.7.

Turns out the issue is with the {{isPythonVersionAtLeast38}} method:
{code}
  def isPythonVersionAtLeast38(): Boolean = {
val attempt = if (Utils.isWindows) {
  Try(Process(Seq("cmd.exe", "/C", "python3 --version"))
.run(ProcessLogger(s => s.startsWith("Python 3.8") || 
s.startsWith("Python 3.9")))
.exitValue())
} else {
  Try(Process(Seq("sh", "-c", "python3 --version"))
.run(ProcessLogger(s => s.startsWith("Python 3.8") || 
s.startsWith("Python 3.9")))
.exitValue())
}
attempt.isSuccess && attempt.get == 0
  }
{code}
It's trying to evaluate the version of Python using a {{ProcessLogger}}, but 
the logger accepts a {{String => Unit}} function, i.e., it does not make use of 
the return value in any way (since it's meant for logging). So the results of 
the {{startsWith}} checks are thrown away, and {{attempt.isSuccess && 
attempt.get == 0}} will always be true as long as your system has a {{python3}} 
binary of any version.
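
A sketch of one possible fix (not necessarily what the eventual PR does): capture the
interpreter's output and compare the parsed version numerically, instead of relying on
{{ProcessLogger}}'s side-effect-only callback.

{code:scala}
// Sketch: read `python3 --version` output and compare major/minor explicitly.
import scala.sys.process._
import scala.util.Try

def isPythonVersionAtLeast(major: Int, minor: Int): Boolean =
  Try {
    val out = Seq("python3", "--version").!!.trim                        // e.g. "Python 3.7.9"
    val Array(maj, min) = out.stripPrefix("Python ").split("\\.").take(2).map(_.toInt)
    maj > major || (maj == major && min >= minor)
  }.getOrElse(false)                                                     // no python3 or unparsable output => false

// isPythonVersionAtLeast(3, 8) would replace the hard-coded "Python 3.8"/"Python 3.9" prefix checks.
{code}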



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37118) Add KMeans distanceMeasure param to PythonMLLibAPI

2021-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434380#comment-17434380
 ] 

Apache Spark commented on SPARK-37118:
--

User 'remykarem' has created a pull request for this issue:
https://github.com/apache/spark/pull/34394

> Add KMeans distanceMeasure param to PythonMLLibAPI
> --
>
> Key: SPARK-37118
> URL: https://issues.apache.org/jira/browse/SPARK-37118
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 3.2.1
>Reporter: Raimi bin Karim
>Priority: Trivial
> Fix For: 3.2.1
>
>
> SPARK-22119 added KMeans {{distanceMeasure}} to the Python API.
> We should include this parameter too in the 
> {{PythonMLLibAPI.trainKMeansModel}} method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37119) parse_url can not handle `{` and `}` correctly

2021-10-26 Thread Liu Shuo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434326#comment-17434326
 ] 

Liu Shuo commented on SPARK-37119:
--

As discussed in `https://github.com/apache/spark/pull/30333` and 
`https://github.com/apache/spark/pull/30399`, close this JIRA.

> parse_url can not handle `{` and `}` correctly
> --
>
> Key: SPARK-37119
> URL: https://issues.apache.org/jira/browse/SPARK-37119
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.2.0, 3.3.0
>Reporter: Liu Shuo
>Priority: Critical
>
> when we execute the follow sql command
> {code:java}
> select parse_url('http://facebook.com/path/p1.php?query={aa}', 'QUERY')
> {code}
> the expected result:
>     query=\{aa}
> the actual result:
>     null
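
A possible workaround sketch (untested; assuming, as the linked PRs discuss, that the null comes
from the URL parser rejecting raw braces): percent-encode the braces before calling
{{parse_url}}.

{code:scala}
// Sketch: with %7B / %7D instead of raw '{' / '}' the query part can be
// extracted; the value comes back still percent-encoded.
spark.sql(
  "SELECT parse_url('http://facebook.com/path/p1.php?query=%7Baa%7D', 'QUERY')"
).show(false)   // expected: query=%7Baa%7D rather than NULL (untested)
{code}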



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37120) Add a Jenkins build and test job for Java 17

2021-10-26 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434319#comment-17434319
 ] 

Yang Jie commented on SPARK-37120:
--

ping [~sowen] [~dongjoon], do we need to do this now? Who should we ask to 
help finish it?

  

> Add a Jenkins build and test job for Java 17
> 
>
> Key: SPARK-37120
> URL: https://issues.apache.org/jira/browse/SPARK-37120
> Project: Spark
>  Issue Type: Sub-task
>  Components: jenkins
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Major
>
> Now run
> {code:java}
> build/mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
> {code}
> to build and test whole project(Head is 
> 87591c9b22157cfd241af6ea2533359c3fba1bb2) with Java 17. It seems that all the 
> UTs have passed.
>  
> {code:java}
> [INFO] 
> 
> [INFO] Reactor Summary for Spark Project Parent POM 3.3.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  1.971 
> s]
> [INFO] Spark Project Tags . SUCCESS [  2.170 
> s]
> [INFO] Spark Project Sketch ... SUCCESS [ 14.008 
> s]
> [INFO] Spark Project Local DB . SUCCESS [  2.466 
> s]
> [INFO] Spark Project Networking ... SUCCESS [ 49.650 
> s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  7.095 
> s]
> [INFO] Spark Project Unsafe ... SUCCESS [  1.826 
> s]
> [INFO] Spark Project Launcher . SUCCESS [  1.851 
> s]
> [INFO] Spark Project Core . SUCCESS [24:40 
> min]
> [INFO] Spark Project ML Local Library . SUCCESS [ 17.816 
> s]
> [INFO] Spark Project GraphX ... SUCCESS [01:27 
> min]
> [INFO] Spark Project Streaming  SUCCESS [04:57 
> min]
> [INFO] Spark Project Catalyst . SUCCESS [07:56 
> min]
> [INFO] Spark Project SQL .. SUCCESS [  01:01 
> h]
> [INFO] Spark Project ML Library ... SUCCESS [16:46 
> min]
> [INFO] Spark Project Tools  SUCCESS [  0.748 
> s]
> [INFO] Spark Project Hive . SUCCESS [  01:11 
> h]
> [INFO] Spark Project REPL . SUCCESS [01:26 
> min]
> [INFO] Spark Project YARN Shuffle Service . SUCCESS [  0.967 
> s]
> [INFO] Spark Project YARN . SUCCESS [06:54 
> min]
> [INFO] Spark Project Mesos  SUCCESS [ 46.913 
> s]
> [INFO] Spark Project Kubernetes ... SUCCESS [01:08 
> min]
> [INFO] Spark Project Hive Thrift Server ... SUCCESS [19:12 
> min]
> [INFO] Spark Ganglia Integration .. SUCCESS [  4.610 
> s]
> [INFO] Spark Project Hadoop Cloud Integration . SUCCESS [ 11.400 
> s]
> [INFO] Spark Project Assembly . SUCCESS [  2.496 
> s]
> [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [ 19.870 
> s]
> [INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:20 
> min]
> [INFO] Kafka 0.10+ Source for Structured Streaming  SUCCESS [35:06 
> min]
> [INFO] Spark Kinesis Integration .. SUCCESS [ 29.667 
> s]
> [INFO] Spark Project Examples . SUCCESS [ 32.189 
> s]
> [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [  0.949 
> s]
> [INFO] Spark Avro . SUCCESS [01:55 
> min]
> [INFO] Spark Project Kinesis Assembly . SUCCESS [  1.104 
> s]
> [INFO] 
> 
> [INFO] BUILD SUCCESS
> [INFO] 
> 
> [INFO] Total time:  04:19 h
> [INFO] Finished at: 2021-10-26T20:02:56+08:00
> [INFO] 
> 
> {code}
> So should we add a Jenkins build and test job for Java 17?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37120) Add a Jenkins build and test job for Java 17

2021-10-26 Thread Yang Jie (Jira)
Yang Jie created SPARK-37120:


 Summary: Add a Jenkins build and test job for Java 17
 Key: SPARK-37120
 URL: https://issues.apache.org/jira/browse/SPARK-37120
 Project: Spark
  Issue Type: Sub-task
  Components: jenkins
Affects Versions: 3.3.0
Reporter: Yang Jie


Now run
{code:java}
build/mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
-Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
{code}
to build and test whole project(Head is 
87591c9b22157cfd241af6ea2533359c3fba1bb2) with Java 17. It seems that all the 
UTs have passed.

 
{code:java}
[INFO] 
[INFO] Reactor Summary for Spark Project Parent POM 3.3.0-SNAPSHOT:
[INFO] 
[INFO] Spark Project Parent POM ... SUCCESS [  1.971 s]
[INFO] Spark Project Tags . SUCCESS [  2.170 s]
[INFO] Spark Project Sketch ... SUCCESS [ 14.008 s]
[INFO] Spark Project Local DB . SUCCESS [  2.466 s]
[INFO] Spark Project Networking ... SUCCESS [ 49.650 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  7.095 s]
[INFO] Spark Project Unsafe ... SUCCESS [  1.826 s]
[INFO] Spark Project Launcher . SUCCESS [  1.851 s]
[INFO] Spark Project Core . SUCCESS [24:40 min]
[INFO] Spark Project ML Local Library . SUCCESS [ 17.816 s]
[INFO] Spark Project GraphX ... SUCCESS [01:27 min]
[INFO] Spark Project Streaming  SUCCESS [04:57 min]
[INFO] Spark Project Catalyst . SUCCESS [07:56 min]
[INFO] Spark Project SQL .. SUCCESS [  01:01 h]
[INFO] Spark Project ML Library ... SUCCESS [16:46 min]
[INFO] Spark Project Tools  SUCCESS [  0.748 s]
[INFO] Spark Project Hive . SUCCESS [  01:11 h]
[INFO] Spark Project REPL . SUCCESS [01:26 min]
[INFO] Spark Project YARN Shuffle Service . SUCCESS [  0.967 s]
[INFO] Spark Project YARN . SUCCESS [06:54 min]
[INFO] Spark Project Mesos  SUCCESS [ 46.913 s]
[INFO] Spark Project Kubernetes ... SUCCESS [01:08 min]
[INFO] Spark Project Hive Thrift Server ... SUCCESS [19:12 min]
[INFO] Spark Ganglia Integration .. SUCCESS [  4.610 s]
[INFO] Spark Project Hadoop Cloud Integration . SUCCESS [ 11.400 s]
[INFO] Spark Project Assembly . SUCCESS [  2.496 s]
[INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [ 19.870 s]
[INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:20 min]
[INFO] Kafka 0.10+ Source for Structured Streaming  SUCCESS [35:06 min]
[INFO] Spark Kinesis Integration .. SUCCESS [ 29.667 s]
[INFO] Spark Project Examples . SUCCESS [ 32.189 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [  0.949 s]
[INFO] Spark Avro . SUCCESS [01:55 min]
[INFO] Spark Project Kinesis Assembly . SUCCESS [  1.104 s]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time:  04:19 h
[INFO] Finished at: 2021-10-26T20:02:56+08:00
[INFO] 
{code}
So should we add a Jenkins build and test job for Java 17?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37110) Add Java 17 support for spark pull request builds

2021-10-26 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434307#comment-17434307
 ] 

Hyukjin Kwon commented on SPARK-37110:
--

[~yumwang], when you find some time, feel free to set up JDK 17 and JDK 11. We 
will need some changes like https://github.com/apache/spark/pull/34091 and 
https://github.com/apache/spark/pull/34217. I was planning to do it but I am 
currently stuck with some internal work ...

> Add Java 17 support for spark pull request builds
> -
>
> Key: SPARK-37110
> URL: https://issues.apache.org/jira/browse/SPARK-37110
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37119) parse_url can not handle `{` and `}` correctly

2021-10-26 Thread Liu Shuo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liu Shuo updated SPARK-37119:
-
Description: 
when we execute the follow sql command
{code:java}
select parse_url('http://facebook.com/path/p1.php?query={aa}', 'QUERY')
{code}
the expected result:

    query=\{aa}

the actual result:

    null

  was:
when we execute the follow sql command
{code:java}
select  parse_url('http://facebook.com/path/p1.php?query={aa}', 'QUERY')
{code}
the expected result: query=\{aa}

the actual result: null


> parse_url can not handle `{` and `}` correctly
> --
>
> Key: SPARK-37119
> URL: https://issues.apache.org/jira/browse/SPARK-37119
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.2.0, 3.3.0
>Reporter: Liu Shuo
>Priority: Critical
>
> when we execute the follow sql command
> {code:java}
> select parse_url('http://facebook.com/path/p1.php?query={aa}', 'QUERY')
> {code}
> the expected result:
>     query=\{aa}
> the actual result:
>     null



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37119) parse_url can not handle `{` and `}` correctly

2021-10-26 Thread Liu Shuo (Jira)
Liu Shuo created SPARK-37119:


 Summary: parse_url can not handle `{` and `}` correctly
 Key: SPARK-37119
 URL: https://issues.apache.org/jira/browse/SPARK-37119
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0, 2.4.8, 3.3.0
Reporter: Liu Shuo


when we execute the follow sql command
{code:java}
select  parse_url('http://facebook.com/path/p1.php?query={aa}', 'QUERY')
{code}
the expected result: query=\{aa}

the actual result: null



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37118) Add KMeans distanceMeasure param to PythonMLLibAPI

2021-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37118:


Assignee: (was: Apache Spark)

> Add KMeans distanceMeasure param to PythonMLLibAPI
> --
>
> Key: SPARK-37118
> URL: https://issues.apache.org/jira/browse/SPARK-37118
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 3.2.1
>Reporter: Raimi bin Karim
>Priority: Trivial
> Fix For: 3.2.1
>
>
> SPARK-22119 added KMeans {{distanceMeasure}} to the Python API.
> We should include this parameter too in the 
> {{PythonMLLibAPI.trainKMeansModel}} method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37118) Add KMeans distanceMeasure param to PythonMLLibAPI

2021-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434289#comment-17434289
 ] 

Apache Spark commented on SPARK-37118:
--

User 'remykarem' has created a pull request for this issue:
https://github.com/apache/spark/pull/34393

> Add KMeans distanceMeasure param to PythonMLLibAPI
> --
>
> Key: SPARK-37118
> URL: https://issues.apache.org/jira/browse/SPARK-37118
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 3.2.1
>Reporter: Raimi bin Karim
>Priority: Trivial
> Fix For: 3.2.1
>
>
> SPARK-22119 added KMeans {{distanceMeasure}} to the Python API.
> We should include this parameter too in the 
> {{PythonMLLibAPI.trainKMeansModel}} method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37118) Add KMeans distanceMeasure param to PythonMLLibAPI

2021-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37118:


Assignee: Apache Spark

> Add KMeans distanceMeasure param to PythonMLLibAPI
> --
>
> Key: SPARK-37118
> URL: https://issues.apache.org/jira/browse/SPARK-37118
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 3.2.1
>Reporter: Raimi bin Karim
>Assignee: Apache Spark
>Priority: Trivial
> Fix For: 3.2.1
>
>
> SPARK-22119 added KMeans {{distanceMeasure}} to the Python API.
> We should include this parameter too in the 
> {{PythonMLLibAPI.trainKMeansModel}} method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37118) Add KMeans distanceMeasure param to PythonMLLibAPI

2021-10-26 Thread Raimi bin Karim (Jira)
Raimi bin Karim created SPARK-37118:
---

 Summary: Add KMeans distanceMeasure param to PythonMLLibAPI
 Key: SPARK-37118
 URL: https://issues.apache.org/jira/browse/SPARK-37118
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 3.2.1
Reporter: Raimi bin Karim
 Fix For: 3.2.1


SPARK-22119 added KMeans {{distanceMeasure}} to the Python API.

We should include this parameter too in the 
{{PythonMLLibAPI.trainKMeansModel}} method.
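
For context, a sketch of the Scala-side RDD API that already takes the parameter (hypothetical
toy data; the proposal here is only about threading it through the pyspark.mllib bridge):

{code:scala}
// Sketch: the RDD-based KMeans already supports distanceMeasure on the Scala side.
import org.apache.spark.mllib.clustering.{DistanceMeasure, KMeans}
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0), Vectors.dense(0.0, 1.0), Vectors.dense(1.0, 1.0)))
val model = new KMeans()
  .setK(2)
  .setDistanceMeasure(DistanceMeasure.COSINE)   // the parameter to expose via trainKMeansModel
  .run(data)
{code}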



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35181) Use zstd for spark.io.compression.codec by default

2021-10-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434269#comment-17434269
 ] 

Dongjoon Hyun commented on SPARK-35181:
---

If you want to get some help, please use the official *Apache Spark 3.2.0* 
instead of your production Spark and give us a reproducible example.

> Use zstd for spark.io.compression.codec by default
> --
>
> Key: SPARK-35181
> URL: https://issues.apache.org/jira/browse/SPARK-35181
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35181) Use zstd for spark.io.compression.codec by default

2021-10-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434263#comment-17434263
 ] 

Dongjoon Hyun commented on SPARK-35181:
---

I have no clue about those issues, but since it's a JVM runtime error, why don't 
you try the latest Java 11 or Java 8?
1.8.0_232 looks like a 2019 release.

> Use zstd for spark.io.compression.codec by default
> --
>
> Key: SPARK-35181
> URL: https://issues.apache.org/jira/browse/SPARK-35181
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35181) Use zstd for spark.io.compression.codec by default

2021-10-26 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434258#comment-17434258
 ] 

angerszhu commented on SPARK-35181:
---

[~dongjoon] Yea, this error happens only when using `spark.io.compression.codec=zstd`. 
It does not happen when writing/reading Parquet.

> Use zstd for spark.io.compression.codec by default
> --
>
> Key: SPARK-35181
> URL: https://issues.apache.org/jira/browse/SPARK-35181
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-35181) Use zstd for spark.io.compression.codec by default

2021-10-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434257#comment-17434257
 ] 

Dongjoon Hyun edited comment on SPARK-35181 at 10/26/21, 10:46 AM:
---

BTW, according to the log, if you are trying to use Parquet with ZSTD, it's 
irrelevant with `spark.io.compression.codec`. You had better file an Apache 
Parquet JIRA, not Apache Spark JIRA.
{code}
CodecPool:184 - Got brand-new decompressor [.zst]
{code}


was (Author: dongjoon):
BTW, according to the log, if you are trying to use Parquet with ZSTD, it's 
irrelevant with `spark.io.compression.codec`. You had better file file an 
Apache Parquet JIRA, not Apache Spark JIRA.
{code}
CodecPool:184 - Got brand-new decompressor [.zst]
{code}

> Use zstd for spark.io.compression.codec by default
> --
>
> Key: SPARK-35181
> URL: https://issues.apache.org/jira/browse/SPARK-35181
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35181) Use zstd for spark.io.compression.codec by default

2021-10-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434257#comment-17434257
 ] 

Dongjoon Hyun commented on SPARK-35181:
---

BTW, according to the log, if you are trying to use Parquet with ZSTD, it's 
irrelevant with `spark.io.compression.codec`. You had better file file an 
Apache Parquet JIRA, not Apache Spark JIRA.
{code}
CodecPool:184 - Got brand-new decompressor [.zst]
{code}

> Use zstd for spark.io.compression.codec by default
> --
>
> Key: SPARK-35181
> URL: https://issues.apache.org/jira/browse/SPARK-35181
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35181) Use zstd for spark.io.compression.codec by default

2021-10-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434252#comment-17434252
 ] 

Dongjoon Hyun commented on SPARK-35181:
---

I'm not sure how you build and configure your environment and what you are 
hitting there.
`spark.io.compression.codec=zstd` is not unstable, [~angerszhuuu].
Are you sure that the errors are relevant to `spark.io.compression.codec=zstd`?

> Use zstd for spark.io.compression.codec by default
> --
>
> Key: SPARK-35181
> URL: https://issues.apache.org/jira/browse/SPARK-35181
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35181) Use zstd for spark.io.compression.codec by default

2021-10-26 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434241#comment-17434241
 ] 

angerszhu commented on SPARK-35181:
---

[~dongjoon] This issue was resolved when we upgraded the zstd version, but another 
issue happened.
{code:java}
2021-10-25 13:42:01 WARN  SparkConf:69 - The configuration key 
'spark.blacklist.application.fetchFailure.enabled' has been deprecated as of 
Spark 3.1.0 and may be removed in the future. Please use 
spark.excludeOnFailure.application.fetchFailure.enabled
2021-10-25 13:42:01 WARN  SparkConf:69 - The configuration key 
'spark.blacklist.enabled' has been deprecated as of Spark 3.1.0 and may be 
removed in the future. Please use spark.excludeOnFailure.enabled
2021-10-25 13:42:01 WARN  SparkConf:69 - The configuration key 
'spark.blacklist.killBlacklistedExecutors' has been deprecated as of Spark 
3.1.0 and may be removed in the future. Please use 
spark.excludeOnFailure.killExcludedExecutors
2021-10-25 13:42:02 INFO  EventMetricSparkPlugin:20 - Start to register event 
process metric plugin.
2021-10-25 13:42:10 INFO  deprecation:1398 - No unit for 
dfs.client.datanode-restart.timeout(30) assuming SECONDS
2021-10-25 13:42:10 INFO  deprecation:1398 - No unit for 
dfs.client.datanode-restart.timeout(30) assuming SECONDS
2021-10-25 13:42:23 INFO  CodecPool:184 - Got brand-new decompressor [.zst]
2021-10-25 13:42:23 INFO  CodecPool:184 - Got brand-new decompressor [.zst]
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7f4017bb3112, pid=58809, tid=0x7f402cffe700
#
# JRE version: OpenJDK Runtime Environment (8.0_232-b09) (build 1.8.0_232-b09)
# Java VM: OpenJDK 64-Bit Server VM (25.232-b09 mixed mode linux-amd64 
compressed oops)
# Problematic frame:
# C  [libzstd-jni-1.5.0-28889732549921047792.so+0xc6112]
#
# Core dump written. Default location: 
/mnt/ssd/0/yarn/nm-local-dir/usercache/staging_data_trafficmart/appcache/application_1632999515383_3679724/container_e238_1632999515383_3679724_02_02/core
 or core.58809
#
# An error report file with more information is saved as:
# 
/mnt/ssd/0/yarn/nm-local-dir/usercache/staging_data_trafficmart/appcache/application_1632999515383_3679724/container_e238_1632999515383_3679724_02_02/hs_err_pid58809.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
{code}
Does zstd seem that unstable? Or is it related to a problem in our zstd environment?
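
One way to narrow it down (a sketch with assumed settings, not from this thread): keep Parquet
output on zstd but move the internal I/O codec back to its default, so the crash can be
attributed to either zstd-jni ({{spark.io.compression.codec}}) or the Parquet/Hadoop codec path.

{code:scala}
// Sketch: separate the two zstd code paths to see which one triggers the SIGSEGV.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.io.compression.codec", "lz4")             // default internal codec (shuffle spill, broadcast, etc.)
  .config("spark.sql.parquet.compression.codec", "zstd")   // keep Parquet writes on zstd
  .getOrCreate()
{code}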

> Use zstd for spark.io.compression.codec by default
> --
>
> Key: SPARK-35181
> URL: https://issues.apache.org/jira/browse/SPARK-35181
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37109) Install Java 17 on all of the Jenkins workers

2021-10-26 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-37109.
-
Resolution: Won't Do

OK. I did not know about this plan.

> Install Java 17 on all of the Jenkins workers
> -
>
> Key: SPARK-37109
> URL: https://issues.apache.org/jira/browse/SPARK-37109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37110) Add Java 17 support for spark pull request builds

2021-10-26 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-37110.
-
Resolution: Won't Do

OK. I was not aware of this plan.

> Add Java 17 support for spark pull request builds
> -
>
> Key: SPARK-37110
> URL: https://issues.apache.org/jira/browse/SPARK-37110
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in char type

2021-10-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434214#comment-17434214
 ] 

Dongjoon Hyun commented on SPARK-37051:
---

Got it. If that happens on Parquet too, we had better drop `ORC` from the JIRA 
title. I have removed it.
> This scenario also occurs on Parquet.

> The filter operator gets wrong results in char type
> ---
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1, 3.3.0
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
>Reporter: frankli
>Priority: Critical
>
> When I try the following sample SQL on  the TPCDS data, the filter operator 
> returns an empty row set (shown in web ui).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is char(50) type. 
> Data is inserted by hive, and queried by Spark.
> I guess that the char(50) type keeps redundant trailing blanks after the actual 
> word.
> This affects the result of "x.equals(Y)" and leads to wrong results.
> Luckily, the varchar type is OK. 
>  
> This bug can be reproduced in a few steps.
> >>> desc t2_orc;
> +----------+-----------+---------+
> | col_name | data_type | comment |
> +----------+-----------+---------+
> | a        | string    | NULL    |
> | b        | char(50)  | NULL    |
> | c        | int       | NULL    |
> +----------+-----------+---------+
> >>> select * from t2_orc where a='a';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> | a | b | 1 |
> | a | b | 2 |
> | a | b | 3 |
> | a | b | 4 |
> | a | b | 5 |
> +---+---+---+
> >>> select * from t2_orc where b='b';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> +---+---+---+
>  
> By the way, Spark's tests should add more cases on the char type.
>  
> == Physical Plan ==
>  CollectLimit (3)
>  +- Filter (2)
>  +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
>  Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Batched: false
>  Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
>  PushedFilters: [IsNotNull(i_category), +EqualTo(i_category,+Music         
> )]
>  ReadSchema: 
> struct
> (2) Filter
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Condition : (isnotnull(i_category#12) AND +(i_category#12 = Music         ))+
> (3) CollectLimit
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Arguments: 100
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in char type

2021-10-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37051:
--
Summary: The filter operator gets wrong results in char type  (was: The 
filter operator gets wrong results in ORC's char type)

> The filter operator gets wrong results in char type
> ---
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1, 3.3.0
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
>Reporter: frankli
>Priority: Critical
>
> When I try the following sample SQL on  the TPCDS data, the filter operator 
> returns an empty row set (shown in web ui).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is char(50) type. 
> Data is inserted by hive, and queried by Spark.
> I guess that the char(50) type keeps redundant trailing blanks after the actual 
> word.
> This affects the result of "x.equals(Y)" and leads to wrong results.
> Luckily, the varchar type is OK. 
>  
> This bug can be reproduced in a few steps.
> >>> desc t2_orc;
> +----------+-----------+---------+
> | col_name | data_type | comment |
> +----------+-----------+---------+
> | a        | string    | NULL    |
> | b        | char(50)  | NULL    |
> | c        | int       | NULL    |
> +----------+-----------+---------+
> >>> select * from t2_orc where a='a';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> | a | b | 1 |
> | a | b | 2 |
> | a | b | 3 |
> | a | b | 4 |
> | a | b | 5 |
> +---+---+---+
> >>> select * from t2_orc where b='b';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> +---+---+---+
>  
> By the way, Spark's tests should add more cases on the char type.
>  
> == Physical Plan ==
>  CollectLimit (3)
>  +- Filter (2)
>  +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
>  Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Batched: false
>  Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
>  PushedFilters: [IsNotNull(i_category), +EqualTo(i_category,+Music         
> )]
>  ReadSchema: 
> struct
> (2) Filter
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Condition : (isnotnull(i_category#12) AND +(i_category#12 = Music         ))+
> (3) CollectLimit
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Arguments: 100
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37117) Can't read files in one of Parquet encryption modes (external keymaterial)

2021-10-26 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created SPARK-37117:


 Summary: Can't read files in one of Parquet encryption modes 
(external keymaterial) 
 Key: SPARK-37117
 URL: https://issues.apache.org/jira/browse/SPARK-37117
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Gidon Gershinsky


Parquet encryption has a number of modes. One of them is "external 
keymaterial", which keeps encrypted data keys in a separate file (as opposed to 
inside the Parquet file). Upon reading, the Spark Parquet connector does not 
pass the file path, which causes an NPE. 
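
A minimal sketch of the write/read path being described, with property names as 
used by parquet-mr's PropertiesDrivenCryptoFactory; the KMS client class, key 
names and output path below are illustrative placeholders, not values taken 
from this report:

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # "spark.hadoop." entries are copied into the Hadoop configuration read by parquet-mr.
    .config("spark.hadoop.parquet.crypto.factory.class",
            "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
    .config("spark.hadoop.parquet.encryption.kms.client.class",
            "com.example.MyKmsClient")  # hypothetical KMS client implementation
    # External key material: encrypted data keys go into a small JSON file next
    # to each Parquet file instead of being stored inside the Parquet footer.
    .config("spark.hadoop.parquet.encryption.key.material.store.internally", "false")
    .getOrCreate()
)

df = spark.range(10).withColumnRenamed("id", "secret_col")
(df.write
   .option("parquet.encryption.footer.key", "k1")
   .option("parquet.encryption.column.keys", "k1:secret_col")
   .parquet("/tmp/encrypted_table"))

# Reading the data back is where the reported NPE surfaces on 3.2.0, because
# the connector does not hand the file path to the key-material retrieval.
spark.read.parquet("/tmp/encrypted_table").show()
{code}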



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37051) The filter operator gets wrong results in ORC's char type

2021-10-26 Thread frankli (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434201#comment-17434201
 ] 

frankli edited comment on SPARK-37051 at 10/26/21, 8:48 AM:


This scenario also occurs on Parquet. [~dongjoon]

Spark 3.1 does padding on both the writer and reader side.

So Spark 3.1 cannot read Hive data written without padding, while Spark 2.4 works well.


was (Author: frankli):
This scenario also occurs on Parquet.

Spark 3.1 does padding on both the writer and reader side.

So Spark 3.1 cannot read Hive data written without padding, while Spark 2.4 works well.

> The filter operator gets wrong results in ORC's char type
> -
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1, 3.3.0
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
>Reporter: frankli
>Priority: Critical
>
> When I try the following sample SQL on  the TPCDS data, the filter operator 
> returns an empty row set (shown in web ui).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is char(50) type. 
> Data is inserted by hive, and queried by Spark.
> I guess that the char(50) type keeps redundant trailing blanks after the actual 
> word.
> This affects the result of "x.equals(Y)" and leads to wrong results.
> Luckily, the varchar type is OK. 
>  
> This bug can be reproduced in a few steps.
> >>> desc t2_orc;
> +----------+-----------+---------+
> | col_name | data_type | comment |
> +----------+-----------+---------+
> | a        | string    | NULL    |
> | b        | char(50)  | NULL    |
> | c        | int       | NULL    |
> +----------+-----------+---------+
> >>> select * from t2_orc where a='a';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> | a | b | 1 |
> | a | b | 2 |
> | a | b | 3 |
> | a | b | 4 |
> | a | b | 5 |
> +---+---+---+
> >>> select * from t2_orc where b='b';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> +---+---+---+
>  
> By the way, Spark's tests should add more cases on the char type.
>  
> == Physical Plan ==
>  CollectLimit (3)
>  +- Filter (2)
>  +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
>  Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Batched: false
>  Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
>  PushedFilters: [IsNotNull(i_category), +EqualTo(i_category,+Music         
> )]
>  ReadSchema: 
> struct
> (2) Filter
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Condition : (isnotnull(i_category#12) AND +(i_category#12 = Music         ))+
> (3) CollectLimit
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Arguments: 100
>  



--
This message was 

[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in ORC's char type

2021-10-26 Thread frankli (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434201#comment-17434201
 ] 

frankli commented on SPARK-37051:
-

This scenario also occurs on Parquet.

Spark 3.1 does padding on both the writer and reader side.

So Spark 3.1 cannot read Hive data written without padding, while Spark 2.4 works well.
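
For reference, a small sketch of one commonly suggested way to sidestep the 
padded-literal comparison when reading such data (illustrative only, reusing 
the t2_orc example above; not verified against the TPCDS tables in this 
report):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# On 3.1.x the literal is padded to the declared CHAR(50) width, so rows
# written by Hive without trailing blanks do not match:
spark.sql("SELECT * FROM t2_orc WHERE b = 'b'").show()        # empty result

# Comparing the trimmed column avoids the CHAR padding semantics (and the
# padded filter pushed down to the data source):
spark.sql("SELECT * FROM t2_orc WHERE rtrim(b) = 'b'").show()
{code}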

> The filter operator gets wrong results in ORC's char type
> -
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1, 3.3.0
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
>Reporter: frankli
>Priority: Critical
>
> When I try the following sample SQL on  the TPCDS data, the filter operator 
> returns an empty row set (shown in web ui).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is char(50) type. 
> Data is inserted by hive, and queried by Spark.
> I guess that the char(50) type keeps redundant trailing blanks after the actual 
> word.
> This affects the result of "x.equals(Y)" and leads to wrong results.
> Luckily, the varchar type is OK. 
>  
> This bug can be reproduced in a few steps.
> >>> desc t2_orc;
> +----------+-----------+---------+
> | col_name | data_type | comment |
> +----------+-----------+---------+
> | a        | string    | NULL    |
> | b        | char(50)  | NULL    |
> | c        | int       | NULL    |
> +----------+-----------+---------+
> >>> select * from t2_orc where a='a';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> | a | b | 1 |
> | a | b | 2 |
> | a | b | 3 |
> | a | b | 4 |
> | a | b | 5 |
> +---+---+---+
> >>> select * from t2_orc where b='b';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> +---+---+---+
>  
> By the way, Spark's tests should add more cases on the char type.
>  
> == Physical Plan ==
>  CollectLimit (3)
>  +- Filter (2)
>  +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
>  Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Batched: false
>  Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
>  PushedFilters: [IsNotNull(i_category), +EqualTo(i_category,+Music         
> )]
>  ReadSchema: 
> struct
> (2) Filter
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Condition : (isnotnull(i_category#12) AND +(i_category#12 = Music         ))+
> (3) CollectLimit
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Arguments: 100
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in ORC's char type

2021-10-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434196#comment-17434196
 ] 

Dongjoon Hyun commented on SPARK-37051:
---

Does Parquet work in those scenario, [~frankli] and [~wangzhun]?

> The filter operator gets wrong results in ORC's char type
> -
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1, 3.3.0
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
>Reporter: frankli
>Priority: Critical
>
> When I try the following sample SQL on  the TPCDS data, the filter operator 
> returns an empty row set (shown in web ui).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is char(50) type. 
> Data is inserted by hive, and queried by Spark.
> I guess that the char(50) type keeps redundant trailing blanks after the actual 
> word.
> This affects the result of "x.equals(Y)" and leads to wrong results.
> Luckily, the varchar type is OK. 
>  
> This bug can be reproduced in a few steps.
> >>> desc t2_orc;
> +----------+-----------+---------+
> | col_name | data_type | comment |
> +----------+-----------+---------+
> | a        | string    | NULL    |
> | b        | char(50)  | NULL    |
> | c        | int       | NULL    |
> +----------+-----------+---------+
> >>> select * from t2_orc where a='a';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> | a | b | 1 |
> | a | b | 2 |
> | a | b | 3 |
> | a | b | 4 |
> | a | b | 5 |
> +---+---+---+
> >>> select * from t2_orc where b='b';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> +---+---+---+
>  
> By the way, Spark's tests should add more cases on the char type.
>  
> == Physical Plan ==
>  CollectLimit (3)
>  +- Filter (2)
>  +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
>  Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Batched: false
>  Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
>  PushedFilters: [IsNotNull(i_category), +EqualTo(i_category,+Music         
> )]
>  ReadSchema: 
> struct
> (2) Filter
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Condition : (isnotnull(i_category#12) AND +(i_category#12 = Music         ))+
> (3) CollectLimit
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Arguments: 100
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36989) Migrate type hint data tests

2021-10-26 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-36989.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34296
[https://github.com/apache/spark/pull/34296]

> Migrate type hint data tests
> 
>
> Key: SPARK-36989
> URL: https://issues.apache.org/jira/browse/SPARK-36989
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> Before the migration, {{pyspark-stubs}} contained a set of [data 
> tests|https://github.com/zero323/pyspark-stubs/tree/branch-3.0/test-data/unit],
>  modeled after mypy's own data tests and using its internal test utilities.
> These were omitted during the migration for a few reasons:
>  * Simplicity.
>  * Relative slowness.
>  * Dependence on non public API.
>  
> Data tests are useful for a number of reasons:
>  
>  * Improve test coverage for type hints.
>  * Checking if type checkers infer expected types.
>  * Checking if type checkers reject incorrect code.
>  * Detecting unusual errors with code that otherwise type checks.
>  
> Especially, the last two functions are not fulfilled by simple validation of 
> existing codebase.
>  
> Data tests are not required for all annotations and can be restricted to code 
> that has high possibility of failure:
>  * Complex overloaded signatures.
>  * Complex generics.
>  * Generic {{self}} annotations
>  * Code containing {{type: ignore}}
> The biggest risk is that output matchers have to be updated when signatures 
> change and / or mypy output changes.
> An example of a problem detected with data tests can be found in the 
> SPARK-36894 PR ([https://github.com/apache/spark/pull/34146]).
>  
>  
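
As a rough illustration (not taken from the actual pyspark-stubs test data), 
this is the kind of snippet such a data test would hand to the type checker, 
asserting what mypy infers or rejects rather than executing the code:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# A data test would assert that this assignment type checks:
schema: StructType = df.schema

# ...and that mypy rejects an ill-typed call such as the one below
# (left commented out here because it would also fail at runtime):
# df.join(df, on=42)
{code}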



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36989) Migrate type hint data tests

2021-10-26 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-36989:
--

Assignee: Maciej Szymkiewicz

> Migrate type hint data tests
> 
>
> Key: SPARK-36989
> URL: https://issues.apache.org/jira/browse/SPARK-36989
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> Before the migration, {{pyspark-stubs}} contained a set of [data 
> tests|https://github.com/zero323/pyspark-stubs/tree/branch-3.0/test-data/unit],
>  modeled after mypy's own data tests and using its internal test utilities.
> These were omitted during the migration for a few reasons:
>  * Simplicity.
>  * Relative slowness.
>  * Dependence on non public API.
>  
> Data tests are useful for a number of reasons:
>  
>  * Improve test coverage for type hints.
>  * Checking if type checkers infer expected types.
>  * Checking if type checkers reject incorrect code.
>  * Detecting unusual errors with code that otherwise type checks.
>  
> Especially, the last two functions are not fulfilled by simple validation of 
> existing codebase.
>  
> Data tests are not required for all annotations and can be restricted to code 
> that has high possibility of failure:
>  * Complex overloaded signatures.
>  * Complex generics.
>  * Generic {{self}} annotations
>  * Code containing {{type: ignore}}
> The biggest risk is that output matchers have to be updated when signatures 
> change and / or mypy output changes.
> An example of a problem detected with data tests can be found in the 
> SPARK-36894 PR ([https://github.com/apache/spark/pull/34146]).
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37110) Add Java 17 support for spark pull request builds

2021-10-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434190#comment-17434190
 ] 

Dongjoon Hyun commented on SPARK-37110:
---

+1 for [~hyukjin.kwon]'s comment to save the community resources.

> Add Java 17 support for spark pull request builds
> -
>
> Key: SPARK-37110
> URL: https://issues.apache.org/jira/browse/SPARK-37110
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37109) Install Java 17 on all of the Jenkins workers

2021-10-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434189#comment-17434189
 ] 

Dongjoon Hyun commented on SPARK-37109:
---

+1 for [~hyukjin.kwon]'s comment.

> Install Java 17 on all of the Jenkins workers
> -
>
> Key: SPARK-37109
> URL: https://issues.apache.org/jira/browse/SPARK-37109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37116) Allow sequences (tuples and lists) as pivot values argument in PySpark

2021-10-26 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434183#comment-17434183
 ] 

Maciej Szymkiewicz commented on SPARK-37116:


Sadly, this is not going to work. For example, the snippet below will type 
check, although it is incorrect. {{Tuple | List}} might work, but this is 
probably a more general problem in how we interact with the JVM, not limited 
to typing issues.

 
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(spark.read
    .csv("foo.csv")
    .groupBy("foo")
    .pivot("bar", "baz")
    .sum())
{code}
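
For reference, a rough sketch of the {{Tuple | List}} shape mentioned above; 
the {{LiteralType}} alias and the exact signature are assumptions for 
illustration, not the actual pyspark annotations:

{code:python}
from typing import List, Optional, Tuple, Union

# Stand-in for whatever alias the stubs use for allowed pivot values.
LiteralType = Union[bool, int, float, str]

class GroupedData:
    def pivot(
        self,
        pivot_col: str,
        values: Optional[Union[List[LiteralType], Tuple[LiteralType, ...]]] = None,
    ) -> "GroupedData":
        ...
{code}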

> Allow sequences (tuples and lists) as pivot values argument in PySpark
> --
>
> Key: SPARK-37116
> URL: https://issues.apache.org/jira/browse/SPARK-37116
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Minor
>
> Both tuples and lists are accepted by PySpark at runtime.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s

2021-10-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434176#comment-17434176
 ] 

Dongjoon Hyun commented on SPARK-37049:
---

It's fixed now.

> executorIdleTimeout is not working for pending pods on K8s
> --
>
> Key: SPARK-37049
> URL: https://issues.apache.org/jira/browse/SPARK-37049
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
>
> SPARK-33099 added support for respecting 
> "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator. 
> However, when it checks whether a pending executor pod has timed out, it 
> checks against the pod's "startTime". A pending pod's "startTime" is empty, 
> which causes the function "isExecutorIdleTimedOut()" to always return true 
> for pending pods.
> This causes the issue that pending pods are deleted immediately when a stage 
> finishes, and several new pods get recreated again in the next stage. 
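
A rough sketch of the failure mode in plain Python (illustrative pseudologic 
only, not the actual Scala check in ExecutorPodsAllocator):

{code:python}
import time

def is_executor_idle_timed_out(pod_start_time_seconds, idle_timeout_seconds):
    # Pending pods have no startTime yet; treating the missing value as 0 makes
    # "now - start" huge, so a pending pod always looks timed out.
    start = pod_start_time_seconds or 0
    return time.time() - start > idle_timeout_seconds

def is_executor_idle_timed_out_fixed(pod_start_time_seconds, pod_creation_time_seconds,
                                     idle_timeout_seconds):
    # One possible approach (not necessarily what the linked fix does): fall
    # back to the pod's creation time while startTime is not set yet.
    start = pod_start_time_seconds or pod_creation_time_seconds
    return time.time() - start > idle_timeout_seconds
{code}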



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s

2021-10-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37049:
-

Assignee: Weiwei Yang

> executorIdleTimeout is not working for pending pods on K8s
> --
>
> Key: SPARK-37049
> URL: https://issues.apache.org/jira/browse/SPARK-37049
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
>
> SPARK-33099 added support for respecting 
> "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator. 
> However, when it checks whether a pending executor pod has timed out, it 
> checks against the pod's "startTime". A pending pod's "startTime" is empty, 
> which causes the function "isExecutorIdleTimedOut()" to always return true 
> for pending pods.
> This causes the issue that pending pods are deleted immediately when a stage 
> finishes, and several new pods get recreated again in the next stage. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s

2021-10-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37049:
-

Assignee: (was: wwei)

> executorIdleTimeout is not working for pending pods on K8s
> --
>
> Key: SPARK-37049
> URL: https://issues.apache.org/jira/browse/SPARK-37049
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Priority: Major
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
>
> SPARK-33099 added support for respecting 
> "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator. 
> However, when it checks whether a pending executor pod has timed out, it 
> checks against the pod's "startTime". A pending pod's "startTime" is empty, 
> which causes the function "isExecutorIdleTimedOut()" to always return true 
> for pending pods.
> This causes the issue that pending pods are deleted immediately when a stage 
> finishes, and several new pods get recreated again in the next stage. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s

2021-10-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434173#comment-17434173
 ] 

Dongjoon Hyun commented on SPARK-37049:
---

Oh, sure. Sorry, [~wwei].

> executorIdleTimeout is not working for pending pods on K8s
> --
>
> Key: SPARK-37049
> URL: https://issues.apache.org/jira/browse/SPARK-37049
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: wwei
>Priority: Major
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
>
> SPARK-33099 added support for respecting 
> "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator. 
> However, when it checks whether a pending executor pod has timed out, it 
> checks against the pod's "startTime". A pending pod's "startTime" is empty, 
> which causes the function "isExecutorIdleTimedOut()" to always return true 
> for pending pods.
> This causes the issue that pending pods are deleted immediately when a stage 
> finishes, and several new pods get recreated again in the next stage. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37114) Support Submitting Jobs to Cloud Providers Managed Spark Clusters

2021-10-26 Thread Naga Vijayapuram (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naga Vijayapuram updated SPARK-37114:
-
Description: To be able to submit jobs to prominent cloud providers managed 
spark clusters, "spark-submit" can be enhanced. For example, to submit job to 
"google cloud dataproc", the "spark-submit" can be enhanced to issue "gcloud 
dataproc jobs submit spark ..." when "–master gcd://cluster-name" arg is used. 
Once this feature is accepted and prioritized, then it can be rolled out in 
current and future versions of spark and also back ported to a few previous 
versions. I can raise the pull request.  (was: To be able to submit jobs to 
prominent cloud provider managed spark clusters, "spark-submit" can be 
enhanced. For example, to submit job to "google cloud dataproc", the 
"spark-submit" can be enhanced to issue "gcloud dataproc jobs submit spark ..." 
when "–master gcd://cluster-name" arg is used. Once this feature is accepted 
and prioritized, then it can be rolled out in current and future versions of 
spark and also back ported to a few previous versions. I can raise the pull 
request.)

> Support Submitting Jobs to Cloud Providers Managed Spark Clusters
> -
>
> Key: SPARK-37114
> URL: https://issues.apache.org/jira/browse/SPARK-37114
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 3.2.0
>Reporter: Naga Vijayapuram
>Priority: Trivial
>
> To be able to submit jobs to prominent cloud providers' managed Spark 
> clusters, "spark-submit" can be enhanced. For example, to submit a job to 
> "google cloud dataproc", "spark-submit" can be enhanced to issue "gcloud 
> dataproc jobs submit spark ..." when the "--master gcd://cluster-name" arg is 
> used. Once this feature is accepted and prioritized, it can be rolled 
> out in current and future versions of Spark and also backported to a few 
> previous versions. I can raise the pull request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37114) Support Submitting Jobs to Cloud Providers Managed Spark Clusters

2021-10-26 Thread Naga Vijayapuram (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naga Vijayapuram updated SPARK-37114:
-
Summary: Support Submitting Jobs to Cloud Providers Managed Spark Clusters  
(was: Support Submitting Jobs to Cloud Providers)

> Support Submitting Jobs to Cloud Providers Managed Spark Clusters
> -
>
> Key: SPARK-37114
> URL: https://issues.apache.org/jira/browse/SPARK-37114
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 3.2.0
>Reporter: Naga Vijayapuram
>Priority: Trivial
>
> To be able to submit jobs to prominent cloud-provider-managed Spark clusters, 
> "spark-submit" can be enhanced. For example, to submit a job to "google cloud 
> dataproc", "spark-submit" can be enhanced to issue "gcloud dataproc jobs 
> submit spark ..." when the "--master gcd://cluster-name" arg is used. Once this 
> feature is accepted and prioritized, it can be rolled out in current and 
> future versions of Spark and also backported to a few previous versions. I 
> can raise the pull request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37114) Support Submitting Jobs to Cloud Providers

2021-10-26 Thread Naga Vijayapuram (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naga Vijayapuram updated SPARK-37114:
-
Description: To be able to submit jobs to prominent cloud provider managed 
spark clusters, "spark-submit" can be enhanced. For example, to submit job to 
"google cloud dataproc", the "spark-submit" can be enhanced to issue "gcloud 
dataproc jobs submit spark ..." when "–master gcd://cluster-name" arg is used. 
Once this feature is accepted and prioritized, then it can be rolled out in 
current and future versions of spark and also back ported to a few previous 
versions. I can raise the pull request.  (was: To be able to submit jobs to 
cloud providers, "spark-submit" can be enhanced. For example, to submit job to 
"google cloud dataproc", the "spark-submit" can be enhanced to issue "gcloud 
dataproc jobs submit spark ..." when "–master gcd://cluster-name" arg is used. 
Once this feature is accepted and prioritized, then it can be rolled out in 
current and future versions of spark and also back ported to a few previous 
versions. I can raise the pull request.)

> Support Submitting Jobs to Cloud Providers
> --
>
> Key: SPARK-37114
> URL: https://issues.apache.org/jira/browse/SPARK-37114
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 3.2.0
>Reporter: Naga Vijayapuram
>Priority: Trivial
>
> To be able to submit jobs to prominent cloud-provider-managed Spark clusters, 
> "spark-submit" can be enhanced. For example, to submit a job to "google cloud 
> dataproc", "spark-submit" can be enhanced to issue "gcloud dataproc jobs 
> submit spark ..." when the "--master gcd://cluster-name" arg is used. Once this 
> feature is accepted and prioritized, it can be rolled out in current and 
> future versions of Spark and also backported to a few previous versions. I 
> can raise the pull request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37098) Alter table properties should invalidate cache

2021-10-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37098:
--
Fix Version/s: 3.1.3

> Alter table properties should invalidate cache
> --
>
> Key: SPARK-37098
> URL: https://issues.apache.org/jira/browse/SPARK-37098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
>
> Table properties can change the behavior of writing, e.g. a Parquet 
> table with `parquet.compression`.
> If you execute the following SQL, we get a file with snappy 
> compression rather than zstd.
> {code:java}
> CREATE TABLE t (c int) STORED AS PARQUET;
> // cache table metadata
> SELECT * FROM t;
> ALTER TABLE t SET TBLPROPERTIES('parquet.compression'='zstd');
> INSERT INTO TABLE t values(1);
> {code}
> So we should invalidate the table cache after altering table properties.
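
Until the fix lands, a possible manual workaround sketch (REFRESH TABLE is 
standard Spark SQL, though this issue does not prescribe it; whether it is 
sufficient here is untested):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("ALTER TABLE t SET TBLPROPERTIES('parquet.compression'='zstd')")
# Drop the cached table metadata so the writer picks up the new property:
spark.sql("REFRESH TABLE t")
spark.sql("INSERT INTO TABLE t VALUES (1)")
{code}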



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37114) Support Submitting Jobs to Cloud Providers

2021-10-26 Thread Naga Vijayapuram (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naga Vijayapuram updated SPARK-37114:
-
Description: To be able to submit jobs to cloud providers, "spark-submit" 
can be enhanced. For example, to submit job to "google cloud dataproc", the 
"spark-submit" can be enhanced to issue "gcloud dataproc jobs submit spark ..." 
when "–master gcd://cluster-name" arg is used. Once this feature is accepted 
and prioritized, then it can be rolled out in current and future versions of 
spark and also back ported to a few previous versions. I can raise the pull 
request.  (was: To be able to submit jobs to cloud providers, `spark-submit` 
can be enhanced. For example, to submit job to `google dataproc`, the 
`spark-submit` can be enhanced to do this ...

`gcloud dataproc jobs submit spark ...` when `–master google-cloud-dataproc` 
arg is used.

Once this feature is accepted and prioritized, then it can be rolled out in 
current and future versions of spark and also back ported to previous versions. 
I can raise the pull request.)

> Support Submitting Jobs to Cloud Providers
> --
>
> Key: SPARK-37114
> URL: https://issues.apache.org/jira/browse/SPARK-37114
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 3.2.0
>Reporter: Naga Vijayapuram
>Priority: Trivial
>
> To be able to submit jobs to cloud providers, "spark-submit" can be enhanced. 
> For example, to submit a job to "google cloud dataproc", "spark-submit" can 
> be enhanced to issue "gcloud dataproc jobs submit spark ..." when the 
> "--master gcd://cluster-name" arg is used. Once this feature is accepted and 
> prioritized, it can be rolled out in current and future versions of 
> Spark and also backported to a few previous versions. I can raise the pull 
> request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36890) Use default WebsocketPingInterval for Kubernetes watches

2021-10-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36890:
--
Affects Version/s: (was: 3.0.3)
   (was: 3.1.2)
   (was: 3.1.1)
   (was: 3.0.2)
   (was: 2.4.8)
   (was: 2.4.7)
   (was: 3.0.1)
   (was: 2.4.6)
   (was: 3.1.0)
   (was: 2.4.5)
   (was: 2.4.4)
   (was: 2.4.3)
   (was: 2.4.2)
   (was: 2.3.4)
   (was: 2.4.1)
   (was: 2.3.3)
   (was: 2.3.2)
   (was: 2.3.1)
   (was: 2.4.0)
   (was: 2.3.0)
   (was: 3.0.0)
   3.3.0

> Use default WebsocketPingInterval for Kubernetes watches
> 
>
> Key: SPARK-36890
> URL: https://issues.apache.org/jira/browse/SPARK-36890
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Philipp Dallig
>Assignee: Philipp Dallig
>Priority: Major
> Fix For: 3.3.0
>
>
> If you access the Kubernetes API via a load balancer (e.g. HAProxy) and have 
> set a tunnel timeout, the following error message is thrown exactly after 
> each timeout.
> {code}
> >>> 21/09/27 15:35:19 WARN WatchConnectionManager: Exec Failure
> java.io.EOFException
> at okio.RealBufferedSource.require(RealBufferedSource.java:61)
> at okio.RealBufferedSource.readByte(RealBufferedSource.java:74)
> at 
> okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
> at 
> okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
> at 
> okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
> at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
> at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
> at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> This exception is quite annoying when working interactively with a paused 
> pySpark shell where the driver component runs locally but the executors run 
> in Kubernetes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36890) Use default WebsocketPingInterval for Kubernetes watches

2021-10-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36890:
--
Summary: Use default WebsocketPingInterval for Kubernetes watches  (was: 
Websocket timeouts to K8s-API)

> Use default WebsocketPingInterval for Kubernetes watches
> 
>
> Key: SPARK-36890
> URL: https://issues.apache.org/jira/browse/SPARK-36890
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 2.4.8, 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 
> 3.1.1, 3.1.2
>Reporter: Philipp Dallig
>Assignee: Philipp Dallig
>Priority: Major
> Fix For: 3.3.0
>
>
> If you access the Kubernetes API via a load balancer (e.g. HAProxy) and have 
> set a tunnel timeout, the following error message is thrown exactly after 
> each timeout.
> {code}
> >>> 21/09/27 15:35:19 WARN WatchConnectionManager: Exec Failure
> java.io.EOFException
> at okio.RealBufferedSource.require(RealBufferedSource.java:61)
> at okio.RealBufferedSource.readByte(RealBufferedSource.java:74)
> at 
> okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
> at 
> okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
> at 
> okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
> at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
> at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
> at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> This exception is quite annoying when working interactively with a paused 
> pySpark shell where the driver component runs locally but the executors run 
> in Kubernetes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36890) Websocket timeouts to K8s-API

2021-10-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-36890.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34143
[https://github.com/apache/spark/pull/34143]

> Websocket timeouts to K8s-API
> -
>
> Key: SPARK-36890
> URL: https://issues.apache.org/jira/browse/SPARK-36890
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 2.4.8, 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 
> 3.1.1, 3.1.2
>Reporter: Philipp Dallig
>Assignee: Philipp Dallig
>Priority: Major
> Fix For: 3.3.0
>
>
> If you access the Kubernetes API via a load balancer (e.g. HAProxy) and have 
> set a tunnel timeout, the following error message is thrown exactly after 
> each timeout.
> {code}
> >>> 21/09/27 15:35:19 WARN WatchConnectionManager: Exec Failure
> java.io.EOFException
> at okio.RealBufferedSource.require(RealBufferedSource.java:61)
> at okio.RealBufferedSource.readByte(RealBufferedSource.java:74)
> at 
> okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
> at 
> okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
> at 
> okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
> at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
> at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
> at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> This exception is quite annoying when working interactively with a paused 
> pySpark shell where the driver component runs locally but the executors run 
> in Kubernetes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36890) Websocket timeouts to K8s-API

2021-10-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-36890:
-

Assignee: Philipp Dallig

> Websocket timeouts to K8s-API
> -
>
> Key: SPARK-36890
> URL: https://issues.apache.org/jira/browse/SPARK-36890
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 2.4.8, 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 
> 3.1.1, 3.1.2
>Reporter: Philipp Dallig
>Assignee: Philipp Dallig
>Priority: Major
>
> If you access the Kubernetes API via a load balancer (e.g. HAProxy) and have 
> set a tunnel timeout, the following error message is thrown exactly after 
> each timeout.
> {code}
> >>> 21/09/27 15:35:19 WARN WatchConnectionManager: Exec Failure
> java.io.EOFException
> at okio.RealBufferedSource.require(RealBufferedSource.java:61)
> at okio.RealBufferedSource.readByte(RealBufferedSource.java:74)
> at 
> okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
> at 
> okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
> at 
> okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
> at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
> at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
> at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> This exception is quite annoying when working interactively with a paused 
> pySpark shell where the driver component runs locally but the executors run 
> in Kubernetes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37110) Add Java 17 support for spark pull request builds

2021-10-26 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-37110:

Summary: Add Java 17 support for spark pull request builds  (was: Add 
java17 support for spark pull request builds)

> Add Java 17 support for spark pull request builds
> -
>
> Key: SPARK-37110
> URL: https://issues.apache.org/jira/browse/SPARK-37110
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37087) merge three relation resolutions into one

2021-10-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37087:
---

Assignee: Wenchen Fan

> merge three relation resolutions into one
> -
>
> Key: SPARK-37087
> URL: https://issues.apache.org/jira/browse/SPARK-37087
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37087) merge three relation resolutions into one

2021-10-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37087.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34358
[https://github.com/apache/spark/pull/34358]

> merge three relation resolutions into one
> -
>
> Key: SPARK-37087
> URL: https://issues.apache.org/jira/browse/SPARK-37087
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org