[jira] [Updated] (SPARK-32247) scipy installation fails with PyPy

2020-10-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32247:
-
Fix Version/s: 3.0.2
   2.4.8

> scipy installation fails with PyPy
> --
>
> Key: SPARK-32247
> URL: https://issues.apache.org/jira/browse/SPARK-32247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> scipy can also be installed under PyPy, and a few PySpark test cases depend 
> on it.
> However, the installation currently fails in the GitHub Actions environment. 
> We should install it there and run those tests.
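
A minimal sketch (not the actual PySpark test utilities; the helper and class names are illustrative) of how scipy-dependent test cases can be skipped when the scipy installation is unavailable, e.g. under PyPy:

{code:python}
import unittest

# Probe for scipy once; on interpreters where the installation failed
# (such as the PyPy job described above) the dependent tests are skipped
# instead of erroring out.
try:
    import scipy.sparse  # noqa: F401
    have_scipy = True
except ImportError:
    have_scipy = False


@unittest.skipIf(not have_scipy, "scipy is not installed")
class ScipyDependentTests(unittest.TestCase):
    def test_csr_matrix_roundtrip(self):
        import scipy.sparse as sps
        m = sps.csr_matrix([[0.0, 1.0], [2.0, 0.0]])
        self.assertEqual(m.nnz, 2)


if __name__ == "__main__":
    unittest.main()
{code}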



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33213) Upgrade Apache Arrow to 2.0.0

2020-10-21 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218743#comment-17218743
 ] 

Hyukjin Kwon commented on SPARK-33213:
--

cc [~bryanc] FYI

> Upgrade Apache Arrow to 2.0.0
> -
>
> Key: SPARK-33213
> URL: https://issues.apache.org/jira/browse/SPARK-33213
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Priority: Minor
>
> Apache Arrow 2.0.0 has [just been 
> released|https://cwiki.apache.org/confluence/display/ARROW/Arrow+2.0.0+Release].
> This issue proposes upgrading Spark's Arrow dependency from the current 1.0.1 
> to 2.0.0.






[jira] [Commented] (SPARK-33217) Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4

2020-10-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218734#comment-17218734
 ] 

Apache Spark commented on SPARK-33217:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/30128

> Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4
> -
>
> Key: SPARK-33217
> URL: https://issues.apache.org/jira/browse/SPARK-33217
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.8
>Reporter: Hyukjin Kwon
>Priority: Major
>







[jira] [Assigned] (SPARK-33217) Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33217:


Assignee: Apache Spark

> Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4
> -
>
> Key: SPARK-33217
> URL: https://issues.apache.org/jira/browse/SPARK-33217
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.8
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-33217) Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33217:


Assignee: (was: Apache Spark)

> Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4
> -
>
> Key: SPARK-33217
> URL: https://issues.apache.org/jira/browse/SPARK-33217
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.8
>Reporter: Hyukjin Kwon
>Priority: Major
>







[jira] [Comment Edited] (SPARK-33189) Support PyArrow 2.0.0+

2020-10-21 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218727#comment-17218727
 ] 

Hyukjin Kwon edited comment on SPARK-33189 at 10/22/20, 4:26 AM:
-

This was reverted in branch-2.4 at 
https://github.com/apache/spark/commit/a39a0963cbac0b51388023479a8a60e0a8b924d0 
and 
https://github.com/apache/spark/commit/88a3110c367c89a7b4931a3ab13ec91cdf0bcc41.
 See SPARK-33217


was (Author: hyukjin.kwon):
This was reverted at 
https://github.com/apache/spark/commit/a39a0963cbac0b51388023479a8a60e0a8b924d0 
and 
https://github.com/apache/spark/commit/88a3110c367c89a7b4931a3ab13ec91cdf0bcc41.
 See SPARK-33217

> Support PyArrow 2.0.0+
> --
>
> Key: SPARK-33189
> URL: https://issues.apache.org/jira/browse/SPARK-33189
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> Some tests fail with PyArrow 2.0.0 in PySpark:
> {code}
> ==
> ERROR [0.774s]: test_grouped_over_window_with_key 
> (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
> --
> Traceback (most recent call last):
>   File 
> "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 
> 595, in test_grouped_over_window_with_key
> .select('id', 'result').collect()
>   File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in 
> collect
> sock_info = self._jdf.collectToPython()
>   File 
> "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
> 1305, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco
> raise converted from None
> pyspark.sql.utils.PythonException: 
>   An exception was thrown from the Python worker. Please see the stack trace 
> below.
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, 
> in main
> process()
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, 
> in process
> serializer.dump_stream(out_iter, outfile)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 255, in dump_stream
> return ArrowStreamSerializer.dump_stream(self, 
> init_stream_yield_batches(), stream)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 81, in dump_stream
> for batch in iterator:
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 248, in init_stream_yield_batches
> for series in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, 
> in mapper
> return f(keys, vals)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, 
> in 
> return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, 
> in wrapped
> result = f(key, pd.concat(value_series, axis=1))
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in 
> wrapper
> return f(*args, **kwargs)
>   File 
> "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 
> 590, in f
> "{} != {}".format(expected_key[i][1], window_range)
> AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': 
> datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, 
> 15, 0, 0, tzinfo=), 'end': datetime.datetime(2018, 3, 
> 20, 0, 0, tzinfo=)}
> {code}
> We should verify and support PyArrow 2.0.0+
> See also https://github.com/apache/spark/runs/1278918780
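
The assertion above reduces to comparing a timezone-naive window boundary with a timezone-aware one. A standalone illustration in plain Python (independent of Spark and PyArrow; the dates mirror the failure above):

{code:python}
from datetime import datetime, timezone

naive = datetime(2018, 3, 15, 0, 0)                       # no tzinfo
aware = datetime(2018, 3, 15, 0, 0, tzinfo=timezone.utc)  # tzinfo attached

# The wall-clock fields are identical, but equality between a naive and an
# aware datetime is always False, so the dict comparison in the test fails.
print(naive == aware)                        # False
print({'start': naive} == {'start': aware})  # False
{code}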






[jira] [Updated] (SPARK-33190) Set upperbound of PyArrow version in GitHub Actions

2020-10-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33190:
-
Fix Version/s: (was: 2.4.8)

> Set upperbound of PyArrow version in GitHub Actions
> ---
>
> Key: SPARK-33190
> URL: https://issues.apache.org/jira/browse/SPARK-33190
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> See SPARK-33189. Some tests fail with PyArrow 2.0.0+. We should make the 
> tests pass.
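
For reference, the pin itself would typically be a constrained pip install in the workflow (e.g. installing "pyarrow<2.0.0"); the sketch below is only an illustrative runtime guard with an assumed bound, not the actual GitHub Actions change:

{code:python}
from distutils.version import LooseVersion

import pyarrow

# Hypothetical upper bound for this ticket: versions at or above it are not
# yet supported by the tests.
PYARROW_UPPER_BOUND = "2.0.0"

if LooseVersion(pyarrow.__version__) >= LooseVersion(PYARROW_UPPER_BOUND):
    raise RuntimeError(
        "PyArrow %s is not supported by these tests yet; install a version "
        "below %s" % (pyarrow.__version__, PYARROW_UPPER_BOUND))
{code}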






[jira] [Commented] (SPARK-33190) Set upperbound of PyArrow version in GitHub Actions

2020-10-21 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218730#comment-17218730
 ] 

Hyukjin Kwon commented on SPARK-33190:
--

This was reverted in branch-2.4. See SPARK-33217

> Set upperbound of PyArrow version in GitHub Actions
> ---
>
> Key: SPARK-33190
> URL: https://issues.apache.org/jira/browse/SPARK-33190
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> See SPARK-33189. Some tests fail with PyArrow 2.0.0+. We should make the 
> tests pass.






[jira] [Updated] (SPARK-33189) Support PyArrow 2.0.0+

2020-10-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33189:
-
Fix Version/s: (was: 2.4.8)

> Support PyArrow 2.0.0+
> --
>
> Key: SPARK-33189
> URL: https://issues.apache.org/jira/browse/SPARK-33189
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> Some tests fail with PyArrow 2.0.0 in PySpark:
> {code}
> ==
> ERROR [0.774s]: test_grouped_over_window_with_key 
> (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
> --
> Traceback (most recent call last):
>   File 
> "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 
> 595, in test_grouped_over_window_with_key
> .select('id', 'result').collect()
>   File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in 
> collect
> sock_info = self._jdf.collectToPython()
>   File 
> "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
> 1305, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco
> raise converted from None
> pyspark.sql.utils.PythonException: 
>   An exception was thrown from the Python worker. Please see the stack trace 
> below.
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, 
> in main
> process()
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, 
> in process
> serializer.dump_stream(out_iter, outfile)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 255, in dump_stream
> return ArrowStreamSerializer.dump_stream(self, 
> init_stream_yield_batches(), stream)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 81, in dump_stream
> for batch in iterator:
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 248, in init_stream_yield_batches
> for series in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, 
> in mapper
> return f(keys, vals)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, 
> in 
> return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, 
> in wrapped
> result = f(key, pd.concat(value_series, axis=1))
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in 
> wrapper
> return f(*args, **kwargs)
>   File 
> "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 
> 590, in f
> "{} != {}".format(expected_key[i][1], window_range)
> AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': 
> datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, 
> 15, 0, 0, tzinfo=), 'end': datetime.datetime(2018, 3, 
> 20, 0, 0, tzinfo=)}
> {code}
> We should verify and support PyArrow 2.0.0+
> See also https://github.com/apache/spark/runs/1278918780






[jira] [Commented] (SPARK-33189) Support PyArrow 2.0.0+

2020-10-21 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218727#comment-17218727
 ] 

Hyukjin Kwon commented on SPARK-33189:
--

This was reverted at 
https://github.com/apache/spark/commit/a39a0963cbac0b51388023479a8a60e0a8b924d0 
and 
https://github.com/apache/spark/commit/88a3110c367c89a7b4931a3ab13ec91cdf0bcc41.
 See SPARK-33217

> Support PyArrow 2.0.0+
> --
>
> Key: SPARK-33189
> URL: https://issues.apache.org/jira/browse/SPARK-33189
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> Some tests fail with PyArrow 2.0.0 in PySpark:
> {code}
> ==
> ERROR [0.774s]: test_grouped_over_window_with_key 
> (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
> --
> Traceback (most recent call last):
>   File 
> "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 
> 595, in test_grouped_over_window_with_key
> .select('id', 'result').collect()
>   File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in 
> collect
> sock_info = self._jdf.collectToPython()
>   File 
> "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
> 1305, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco
> raise converted from None
> pyspark.sql.utils.PythonException: 
>   An exception was thrown from the Python worker. Please see the stack trace 
> below.
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, 
> in main
> process()
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, 
> in process
> serializer.dump_stream(out_iter, outfile)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 255, in dump_stream
> return ArrowStreamSerializer.dump_stream(self, 
> init_stream_yield_batches(), stream)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 81, in dump_stream
> for batch in iterator:
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 248, in init_stream_yield_batches
> for series in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, 
> in mapper
> return f(keys, vals)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, 
> in 
> return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, 
> in wrapped
> result = f(key, pd.concat(value_series, axis=1))
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in 
> wrapper
> return f(*args, **kwargs)
>   File 
> "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 
> 590, in f
> "{} != {}".format(expected_key[i][1], window_range)
> AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': 
> datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, 
> 15, 0, 0, tzinfo=), 'end': datetime.datetime(2018, 3, 
> 20, 0, 0, tzinfo=)}
> {code}
> We should verify and support PyArrow 2.0.0+
> See also https://github.com/apache/spark/runs/1278918780






[jira] [Created] (SPARK-33217) Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4

2020-10-21 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-33217:


 Summary: Set upper bound of Pandas and PyArrow version in GitHub 
Actions in branch-2.4
 Key: SPARK-33217
 URL: https://issues.apache.org/jira/browse/SPARK-33217
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 2.4.8
Reporter: Dongjoon Hyun









[jira] [Updated] (SPARK-33217) Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4

2020-10-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33217:
-
Reporter: Hyukjin Kwon  (was: Dongjoon Hyun)

> Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4
> -
>
> Key: SPARK-33217
> URL: https://issues.apache.org/jira/browse/SPARK-33217
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.8
>Reporter: Hyukjin Kwon
>Priority: Major
>







[jira] [Resolved] (SPARK-33212) Move to shaded clients for Hadoop 3.x profile

2020-10-21 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-33212.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29843
[https://github.com/apache/spark/pull/29843]

> Move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.1.0
>
>
> Hadoop 3.x+ offers shaded client jars, hadoop-client-api and 
> hadoop-client-runtime, which shade third-party dependencies such as Guava, 
> protobuf, Jetty, etc. This Jira switches Spark to these jars instead of 
> hadoop-common, hadoop-client, etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer Hadoop 
> versions have migrated to Guava 27.0+, and to resolve the Guava conflicts 
> Spark relies on Hadoop not leaking that dependency.
>  * It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to use 
> only the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark 
> can evolve without worrying about dependencies pulled in from the Hadoop side 
> (which used to be a lot).






[jira] [Assigned] (SPARK-33212) Move to shaded clients for Hadoop 3.x profile

2020-10-21 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-33212:
---

Assignee: Chao Sun

> Move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> Hadoop 3.x+ offers shaded client jars, hadoop-client-api and 
> hadoop-client-runtime, which shade third-party dependencies such as Guava, 
> protobuf, Jetty, etc. This Jira switches Spark to these jars instead of 
> hadoop-common, hadoop-client, etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer Hadoop 
> versions have migrated to Guava 27.0+, and to resolve the Guava conflicts 
> Spark relies on Hadoop not leaking that dependency.
>  * It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to use 
> only the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark 
> can evolve without worrying about dependencies pulled in from the Hadoop side 
> (which used to be a lot).






[jira] [Closed] (SPARK-33216) Set upper bound of Pandas version in GitHub Actions in branch-2.4

2020-10-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-33216.
-

> Set upper bound of Pandas version in GitHub Actions in branch-2.4
> -
>
> Key: SPARK-33216
> URL: https://issues.apache.org/jira/browse/SPARK-33216
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.8
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Resolved] (SPARK-33216) Set upper bound of Pandas version in GitHub Actions in branch-2.4

2020-10-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33216.
---
Resolution: Duplicate

> Set upper bound of Pandas version in GitHub Actions in branch-2.4
> -
>
> Key: SPARK-33216
> URL: https://issues.apache.org/jira/browse/SPARK-33216
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.8
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Resolved] (SPARK-33210) Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default

2020-10-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33210.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30121
[https://github.com/apache/spark/pull/30121]

> Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default
> --
>
> Key: SPARK-33210
> URL: https://issues.apache.org/jira/browse/SPARK-33210
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> The ticket aims to set the following SQL configs:
> - spark.sql.legacy.parquet.int96RebaseModeInWrite
> - spark.sql.legacy.parquet.int96RebaseModeInRead
> to EXCEPTION by default.
> The reason is to let users decide whether Spark should modify loaded/saved 
> timestamps, instead of silently shifting timestamps while rebasing.
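
A minimal PySpark sketch of what opting out of the EXCEPTION default could look like. The config names come from this ticket; the CORRECTED value is assumed to follow the existing datetime rebase modes (EXCEPTION / CORRECTED / LEGACY):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# With EXCEPTION as the default, reading or writing INT96 timestamps that are
# ambiguous between calendars fails fast instead of being silently rebased.
# A user who wants the proleptic Gregorian interpretation opts in explicitly:
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
{code}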






[jira] [Assigned] (SPARK-33210) Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default

2020-10-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33210:
---

Assignee: Maxim Gekk

> Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default
> --
>
> Key: SPARK-33210
> URL: https://issues.apache.org/jira/browse/SPARK-33210
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> The ticket aims to set the following SQL configs:
> - spark.sql.legacy.parquet.int96RebaseModeInWrite
> - spark.sql.legacy.parquet.int96RebaseModeInRead
> to EXCEPTION by default.
> The reason is to let users decide whether Spark should modify loaded/saved 
> timestamps, instead of silently shifting timestamps while rebasing.






[jira] [Assigned] (SPARK-33216) Set upper bound of Pandas version in GitHub Actions in branch-2.4

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33216:


Assignee: (was: Apache Spark)

> Set upper bound of Pandas version in GitHub Actions in branch-2.4
> -
>
> Key: SPARK-33216
> URL: https://issues.apache.org/jira/browse/SPARK-33216
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.8
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Commented] (SPARK-33216) Set upper bound of Pandas version in GitHub Actions in branch-2.4

2020-10-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218705#comment-17218705
 ] 

Apache Spark commented on SPARK-33216:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30127

> Set upper bound of Pandas version in GitHub Actions in branch-2.4
> -
>
> Key: SPARK-33216
> URL: https://issues.apache.org/jira/browse/SPARK-33216
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.8
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-33216) Set upper bound of Pandas version in GitHub Actions in branch-2.4

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33216:


Assignee: Apache Spark

> Set upper bound of Pandas version in GitHub Actions in branch-2.4
> -
>
> Key: SPARK-33216
> URL: https://issues.apache.org/jira/browse/SPARK-33216
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.8
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-33216) Set upper bound of Pandas version in GitHub Actions in branch-2.4

2020-10-21 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-33216:
-

 Summary: Set upper bound of Pandas version in GitHub Actions in 
branch-2.4
 Key: SPARK-33216
 URL: https://issues.apache.org/jira/browse/SPARK-33216
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 2.4.8
Reporter: Dongjoon Hyun









[jira] [Commented] (SPARK-33215) Speed up event log download by skipping UI rebuild

2020-10-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218683#comment-17218683
 ] 

Apache Spark commented on SPARK-33215:
--

User 'baohe-zhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/30126

> Speed up event log download by skipping UI rebuild
> --
>
> Key: SPARK-33215
> URL: https://issues.apache.org/jira/browse/SPARK-33215
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Baohe Zhang
>Priority: Major
>
> Right now, when we want to download event logs from the Spark history 
> server (SHS), the SHS needs to parse the entire event log to rebuild the UI, 
> and this is done only for view permission checks. UI rebuilding is a 
> time-consuming and memory-intensive task, especially for large logs, and it 
> is unnecessary for event log download.
> This patch enables the SHS to check the UI view permissions of a given 
> app/attempt for a given user without rebuilding the UI. This is achieved by 
> adding a method "checkUIViewPermissions(appId: String, attemptId: 
> Option[String], user: String): Boolean" to several layers of history server 
> components.
> With this patch, the UI rebuild can be skipped when downloading event logs 
> from the history server. The time to download a GB-scale event log can thus 
> drop from several minutes to several seconds, and the memory cost of UI 
> rebuilding is avoided.






[jira] [Assigned] (SPARK-33215) Speed up event log download by skipping UI rebuild

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33215:


Assignee: Apache Spark

> Speed up event log download by skipping UI rebuild
> --
>
> Key: SPARK-33215
> URL: https://issues.apache.org/jira/browse/SPARK-33215
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Baohe Zhang
>Assignee: Apache Spark
>Priority: Major
>
> Right now, when we want to download event logs from the Spark history 
> server (SHS), the SHS needs to parse the entire event log to rebuild the UI, 
> and this is done only for view permission checks. UI rebuilding is a 
> time-consuming and memory-intensive task, especially for large logs, and it 
> is unnecessary for event log download.
> This patch enables the SHS to check the UI view permissions of a given 
> app/attempt for a given user without rebuilding the UI. This is achieved by 
> adding a method "checkUIViewPermissions(appId: String, attemptId: 
> Option[String], user: String): Boolean" to several layers of history server 
> components.
> With this patch, the UI rebuild can be skipped when downloading event logs 
> from the history server. The time to download a GB-scale event log can thus 
> drop from several minutes to several seconds, and the memory cost of UI 
> rebuilding is avoided.






[jira] [Assigned] (SPARK-33215) Speed up event log download by skipping UI rebuild

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33215:


Assignee: (was: Apache Spark)

> Speed up event log download by skipping UI rebuild
> --
>
> Key: SPARK-33215
> URL: https://issues.apache.org/jira/browse/SPARK-33215
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Baohe Zhang
>Priority: Major
>
> Right now, when we want to download event logs from the Spark history 
> server (SHS), the SHS needs to parse the entire event log to rebuild the UI, 
> and this is done only for view permission checks. UI rebuilding is a 
> time-consuming and memory-intensive task, especially for large logs, and it 
> is unnecessary for event log download.
> This patch enables the SHS to check the UI view permissions of a given 
> app/attempt for a given user without rebuilding the UI. This is achieved by 
> adding a method "checkUIViewPermissions(appId: String, attemptId: 
> Option[String], user: String): Boolean" to several layers of history server 
> components.
> With this patch, the UI rebuild can be skipped when downloading event logs 
> from the history server. The time to download a GB-scale event log can thus 
> drop from several minutes to several seconds, and the memory cost of UI 
> rebuilding is avoided.






[jira] [Assigned] (SPARK-33203) Pyspark ml tests failing with rounding errors

2020-10-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33203:
-

Assignee: Alessandro Patti  (was: Apache Spark)

> Pyspark ml tests failing with rounding errors
> -
>
> Key: SPARK-33203
> URL: https://issues.apache.org/jira/browse/SPARK-33203
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark
>Affects Versions: 3.0.1
>Reporter: Alessandro Patti
>Assignee: Alessandro Patti
>Priority: Minor
> Fix For: 3.1.0
>
>
> The tests _{{pyspark.ml.recommendation}}_ and 
> _{{pyspark.ml.tests.test_algorithms}}_ occasionally fail (depending on the 
> environment) with
> {code:java}
> File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in 
> test_raw_and_probability_prediction
>  self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, 
> atol=1))
> AssertionError: False is not true{code}
> {code:java}
> File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in 
> __main__.ALS
> Failed example:
>  predictions[0]
> Expected:
>  Row(user=0, item=2, newPrediction=0.6929101347923279)
> Got:
>  Row(user=0, item=2, newPrediction=0.6929104924201965)
> ...{code}
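
One common way to make such assertions robust is to compare with an explicit tolerance rather than exact equality; a standalone sketch using the two predictions from the doctest failure above:

{code:python}
import numpy as np

expected = np.array([0.6929101347923279])
actual = np.array([0.6929104924201965])

# Exact comparison is brittle across platforms/BLAS builds:
print(np.array_equal(expected, actual))          # False

# Comparing within an absolute tolerance absorbs the rounding noise:
print(np.allclose(expected, actual, atol=1e-4))  # True
{code}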






[jira] [Resolved] (SPARK-33203) Pyspark ml tests failing with rounding errors

2020-10-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33203.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30104
[https://github.com/apache/spark/pull/30104]

> Pyspark ml tests failing with rounding errors
> -
>
> Key: SPARK-33203
> URL: https://issues.apache.org/jira/browse/SPARK-33203
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark
>Affects Versions: 3.0.1
>Reporter: Alessandro Patti
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.1.0
>
>
> The tests _{{pyspark.ml.recommendation}}_ and 
> _{{pyspark.ml.tests.test_algorithms}}_ occasionally fail (depending on the 
> environment) with
> {code:java}
> File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in 
> test_raw_and_probability_prediction
>  self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, 
> atol=1))
> AssertionError: False is not true{code}
> {code:java}
> File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in 
> __main__.ALS
> Failed example:
>  predictions[0]
> Expected:
>  Row(user=0, item=2, newPrediction=0.6929101347923279)
> Got:
>  Row(user=0, item=2, newPrediction=0.6929104924201965)
> ...{code}






[jira] [Created] (SPARK-33215) Speed up event log download by skipping UI rebuild

2020-10-21 Thread Baohe Zhang (Jira)
Baohe Zhang created SPARK-33215:
---

 Summary: Speed up event log download by skipping UI rebuild
 Key: SPARK-33215
 URL: https://issues.apache.org/jira/browse/SPARK-33215
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.1, 2.4.7
Reporter: Baohe Zhang


Right now, when we want to download event logs from the Spark history 
server (SHS), the SHS needs to parse the entire event log to rebuild the UI, and 
this is done only for view permission checks. UI rebuilding is a time-consuming 
and memory-intensive task, especially for large logs, and it is unnecessary for 
event log download.

This patch enables the SHS to check the UI view permissions of a given 
app/attempt for a given user without rebuilding the UI. This is achieved by 
adding a method "checkUIViewPermissions(appId: String, attemptId: 
Option[String], user: String): Boolean" to several layers of history server 
components.

With this patch, the UI rebuild can be skipped when downloading event logs from 
the history server. The time to download a GB-scale event log can thus drop from 
several minutes to several seconds, and the memory cost of UI rebuilding is 
avoided.






[jira] [Commented] (SPARK-33189) Support PyArrow 2.0.0+

2020-10-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218675#comment-17218675
 ] 

Apache Spark commented on SPARK-33189:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30125

> Support PyArrow 2.0.0+
> --
>
> Key: SPARK-33189
> URL: https://issues.apache.org/jira/browse/SPARK-33189
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> Some tests fail with PyArrow 2.0.0 in PySpark:
> {code}
> ==
> ERROR [0.774s]: test_grouped_over_window_with_key 
> (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
> --
> Traceback (most recent call last):
>   File 
> "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 
> 595, in test_grouped_over_window_with_key
> .select('id', 'result').collect()
>   File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in 
> collect
> sock_info = self._jdf.collectToPython()
>   File 
> "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
> 1305, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco
> raise converted from None
> pyspark.sql.utils.PythonException: 
>   An exception was thrown from the Python worker. Please see the stack trace 
> below.
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, 
> in main
> process()
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, 
> in process
> serializer.dump_stream(out_iter, outfile)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 255, in dump_stream
> return ArrowStreamSerializer.dump_stream(self, 
> init_stream_yield_batches(), stream)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 81, in dump_stream
> for batch in iterator:
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 248, in init_stream_yield_batches
> for series in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, 
> in mapper
> return f(keys, vals)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, 
> in 
> return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, 
> in wrapped
> result = f(key, pd.concat(value_series, axis=1))
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in 
> wrapper
> return f(*args, **kwargs)
>   File 
> "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 
> 590, in f
> "{} != {}".format(expected_key[i][1], window_range)
> AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': 
> datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, 
> 15, 0, 0, tzinfo=), 'end': datetime.datetime(2018, 3, 
> 20, 0, 0, tzinfo=)}
> {code}
> We should verify and support PyArrow 2.0.0+
> See also https://github.com/apache/spark/runs/1278918780






[jira] [Commented] (SPARK-33189) Support PyArrow 2.0.0+

2020-10-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218674#comment-17218674
 ] 

Apache Spark commented on SPARK-33189:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30124

> Support PyArrow 2.0.0+
> --
>
> Key: SPARK-33189
> URL: https://issues.apache.org/jira/browse/SPARK-33189
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> Some tests fail with PyArrow 2.0.0 in PySpark:
> {code}
> ==
> ERROR [0.774s]: test_grouped_over_window_with_key 
> (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
> --
> Traceback (most recent call last):
>   File 
> "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 
> 595, in test_grouped_over_window_with_key
> .select('id', 'result').collect()
>   File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in 
> collect
> sock_info = self._jdf.collectToPython()
>   File 
> "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
> 1305, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco
> raise converted from None
> pyspark.sql.utils.PythonException: 
>   An exception was thrown from the Python worker. Please see the stack trace 
> below.
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, 
> in main
> process()
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, 
> in process
> serializer.dump_stream(out_iter, outfile)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 255, in dump_stream
> return ArrowStreamSerializer.dump_stream(self, 
> init_stream_yield_batches(), stream)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 81, in dump_stream
> for batch in iterator:
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 248, in init_stream_yield_batches
> for series in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, 
> in mapper
> return f(keys, vals)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, 
> in 
> return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, 
> in wrapped
> result = f(key, pd.concat(value_series, axis=1))
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in 
> wrapper
> return f(*args, **kwargs)
>   File 
> "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 
> 590, in f
> "{} != {}".format(expected_key[i][1], window_range)
> AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': 
> datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, 
> 15, 0, 0, tzinfo=), 'end': datetime.datetime(2018, 3, 
> 20, 0, 0, tzinfo=)}
> {code}
> We should verify and support PyArrow 2.0.0+
> See also https://github.com/apache/spark/runs/1278918780






[jira] [Resolved] (SPARK-19297) Add ability for --packages tag to pull latest version

2020-10-21 Thread Aoyuan Liao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aoyuan Liao resolved SPARK-19297.
-
Target Version/s: 3.0.1
  Resolution: Fixed

> Add ability for --packages tag to pull latest version
> -
>
> Key: SPARK-19297
> URL: https://issues.apache.org/jira/browse/SPARK-19297
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 2.1.0
>Reporter: Steven Landes
>Priority: Minor
>  Labels: features, newbie
> Attachments: packages_latest.txt
>
>
> It would be super-convenient, in a development environment, to be able to use 
> the --packages argument to point spark to the latest version of a package 
> instead of specifying a specific version.
> For example, instead of the following:
> --packages com.databricks:spark-csv_2.11:1.5.0
> I could just put in this:
> --packages com.databricks:spark-csv_2.11:latest






[jira] [Commented] (SPARK-19297) Add ability for --packages tag to pull latest version

2020-10-21 Thread Aoyuan Liao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218660#comment-17218660
 ] 

Aoyuan Liao commented on SPARK-19297:
-

latest.release can be used.
{code:java}
bin/spark-submit --packages com.databricks:spark-csv_2.11:latest.release examples/src/main/python/pi.py 10
:: loading settings :: url = jar:file:/home/eve/repo/spark/assembly/target/scala-2.12/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/eve/.ivy2/cache
The jars for the packages stored in: /home/eve/.ivy2/jars
com.databricks#spark-csv_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-a24e5fe4-814e-48d0-baf4-6ff489520dd4;1.0
	confs: [default]
	found com.databricks#spark-csv_2.11;1.5.0 in central
	[1.5.0] com.databricks#spark-csv_2.11;latest.release
	found org.apache.commons#commons-csv;1.1 in central
	found com.univocity#univocity-parsers;1.5.1 in central
downloading https://repo1.maven.org/maven2/com/databricks/spark-csv_2.11/1.5.0/spark-csv_2.11-1.5.0.jar ...
	[SUCCESSFUL ] com.databricks#spark-csv_2.11;1.5.0!spark-csv_2.11.jar (87ms)
downloading https://repo1.maven.org/maven2/org/apache/commons/commons-csv/1.1/commons-csv-1.1.jar ...
	[SUCCESSFUL ] org.apache.commons#commons-csv;1.1!commons-csv.jar (36ms)
downloading https://repo1.maven.org/maven2/com/univocity/univocity-parsers/1.5.1/univocity-parsers-1.5.1.jar ...
	[SUCCESSFUL ] com.univocity#univocity-parsers;1.5.1!univocity-parsers.jar (127ms)
:: resolution report :: resolve 1729ms :: artifacts dl 257ms
	:: modules in use:
	com.databricks#spark-csv_2.11;1.5.0 from central in [default]
	com.univocity#univocity-parsers;1.5.1 from central in [default]
	org.apache.commons#commons-csv;1.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   3   |   3   |   0   ||   3   |   3   |
	---------------------------------------------------------------------
{code}

> Add ability for --packages tag to pull latest version
> -
>
> Key: SPARK-19297
> URL: https://issues.apache.org/jira/browse/SPARK-19297
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 2.1.0
>Reporter: Steven Landes
>Priority: Minor
>  Labels: features, newbie
> Attachments: packages_latest.txt
>
>
> It would be super-convenient, in a development environment, to be able to use 
> the --packages argument to point spark to the latest version of a package 
> instead of specifying a specific version.
> For example, instead of the following:
> --packages com.databricks:spark-csv_2.11:1.5.0
> I could just put in this:
> --packages com.databricks:spark-csv_2.11:latest






[jira] [Commented] (SPARK-21529) Improve the error message for unsupported Uniontype

2020-10-21 Thread Aoyuan Liao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218653#comment-17218653
 ] 

Aoyuan Liao commented on SPARK-21529:
-

[~teabot] I think Catalyst still doesn't support uniontype. However, the table 
can be read and printed out in Spark by loading the Avro files directly (file 
=> DataFrame). The error message seems reasonably clear to me. Would you mind 
elaborating on how the error message should be improved? Do you suggest 
indicating that Catalyst doesn't support uniontype?
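
For reference, a minimal sketch of the direct Avro load mentioned above. It assumes the external spark-avro data source is on the classpath and uses a placeholder path:

{code:python}
from pyspark.sql import SparkSession

# e.g. started with: spark-submit --packages org.apache.spark:spark-avro_2.12:3.0.1 ...
spark = SparkSession.builder.getOrCreate()

# Bypass the Hive metastore schema (which fails on uniontype) and read the
# underlying Avro files directly; the path below is a placeholder.
df = spark.read.format("avro").load("/path/to/etl/tbl")
df.show(5)
{code}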

> Improve the error message for unsupported Uniontype
> ---
>
> Key: SPARK-21529
> URL: https://issues.apache.org/jira/browse/SPARK-21529
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
> Environment: Qubole, DataBricks
>Reporter: Elliot West
>Priority: Major
>  Labels: hive, starter, uniontype
>
> We encounter errors when attempting to read Hive tables whose schema contains 
> the {{uniontype}}. It appears that Catalyst
> does not support the {{uniontype}}, which renders these tables unreadable by 
> Spark (2.1). Although {{uniontype}} is arguably incomplete in the Hive
> query engine, it is fully supported by the storage engine and also by the Avro 
> data format, which we use for these tables. Therefore, I believe it is
> a valid, usable type construct that should be supported by Spark.
> We've attempted to read the table as follows:
> {code}
> spark.sql("select * from etl.tbl where acquisition_instant='20170706T133545Z' 
> limit 5").show
> val tblread = spark.read.table("etl.tbl")
> {code}
> But this always results in the same error message. The pertinent error 
> messages are as follows (full stack trace below):
> {code}
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> uniontype ...
> Caused by: org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input '<' expecting
> {, '('}
> (line 1, pos 9)
> == SQL ==
> uniontype -^^^
> {code}
> h2. Full stack trace
> {code}
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> uniontype>>,n:boolean,o:string,p:bigint,q:string>,struct,ag:boolean,ah:string,ai:bigint,aj:string>>
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:800)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:377)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:377)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:377)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:373)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:373)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:371)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:290)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:371)
> at 
> org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:79)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> 

[jira] [Updated] (SPARK-33197) Changes to spark.sql.analyzer.maxIterations do not take effect at runtime

2020-10-21 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33197:
-
Affects Version/s: (was: 3.0.0)
   3.0.2

> Changes to spark.sql.analyzer.maxIterations do not take effect at runtime
> -
>
> Key: SPARK-33197
> URL: https://issues.apache.org/jira/browse/SPARK-33197
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Yuning Zhang
>Priority: Major
>
> `spark.sql.analyzer.maxIterations` is not a static conf. However, changes to 
> it do not take effect at runtime.
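
A minimal spark-shell sketch of the reported scenario, for illustration only: the config key comes from the ticket, while the query and the value 500 are arbitrary assumptions, and this is not the fix.
{code:scala}
// Sketch of the report: the conf is not static, so a runtime change is accepted...
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.conf.set("spark.sql.analyzer.maxIterations", "500")  // value is an arbitrary assumption

// ...but per this ticket the analyzer reportedly keeps using the value captured
// when the session was created, so the new setting has no visible effect here.
spark.sql("SELECT 1 + 1").collect()
{code}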



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33214) HiveExternalCatalogVersionsSuite shouldn't use or delete hard-coded /tmp directory

2020-10-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218619#comment-17218619
 ] 

Apache Spark commented on SPARK-33214:
--

User 'xkrogen' has created a pull request for this issue:
https://github.com/apache/spark/pull/30122

> HiveExternalCatalogVersionsSuite shouldn't use or delete hard-coded /tmp 
> directory 
> ---
>
> Key: SPARK-33214
> URL: https://issues.apache.org/jira/browse/SPARK-33214
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Priority: Major
>
> In SPARK-22356, the {{sparkTestingDir}} used by 
> {{HiveExternalCatalogVersionsSuite}} became hard-coded to enable re-use of 
> the downloaded Spark tarball between test executions:
> {code}
>   // For local test, you can set `sparkTestingDir` to a static value like 
> `/tmp/test-spark`, to
>   // avoid downloading Spark of different versions in each run.
>   private val sparkTestingDir = new File("/tmp/test-spark")
> {code}
> However this doesn't work, since it gets deleted every time:
> {code}
>   override def afterAll(): Unit = {
> try {
>   Utils.deleteRecursively(wareHousePath)
>   Utils.deleteRecursively(tmpDataDir)
>   Utils.deleteRecursively(sparkTestingDir)
> } finally {
>   super.afterAll()
> }
>   }
> {code}
> It's bad that we're hard-coding to a {{/tmp}} directory, as in some cases 
> this is not the proper place to store temporary files. We're not currently 
> making any good use of it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33214) HiveExternalCatalogVersionsSuite shouldn't use or delete hard-coded /tmp directory

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33214:


Assignee: (was: Apache Spark)

> HiveExternalCatalogVersionsSuite shouldn't use or delete hard-coded /tmp 
> directory 
> ---
>
> Key: SPARK-33214
> URL: https://issues.apache.org/jira/browse/SPARK-33214
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Priority: Major
>
> In SPARK-22356, the {{sparkTestingDir}} used by 
> {{HiveExternalCatalogVersionsSuite}} became hard-coded to enable re-use of 
> the downloaded Spark tarball between test executions:
> {code}
>   // For local test, you can set `sparkTestingDir` to a static value like 
> `/tmp/test-spark`, to
>   // avoid downloading Spark of different versions in each run.
>   private val sparkTestingDir = new File("/tmp/test-spark")
> {code}
> However this doesn't work, since it gets deleted every time:
> {code}
>   override def afterAll(): Unit = {
> try {
>   Utils.deleteRecursively(wareHousePath)
>   Utils.deleteRecursively(tmpDataDir)
>   Utils.deleteRecursively(sparkTestingDir)
> } finally {
>   super.afterAll()
> }
>   }
> {code}
> It's bad that we're hard-coding to a {{/tmp}} directory, as in some cases 
> this is not the proper place to store temporary files. We're not currently 
> making any good use of it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33214) HiveExternalCatalogVersionsSuite shouldn't use or delete hard-coded /tmp directory

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33214:


Assignee: Apache Spark

> HiveExternalCatalogVersionsSuite shouldn't use or delete hard-coded /tmp 
> directory 
> ---
>
> Key: SPARK-33214
> URL: https://issues.apache.org/jira/browse/SPARK-33214
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Assignee: Apache Spark
>Priority: Major
>
> In SPARK-22356, the {{sparkTestingDir}} used by 
> {{HiveExternalCatalogVersionsSuite}} became hard-coded to enable re-use of 
> the downloaded Spark tarball between test executions:
> {code}
>   // For local test, you can set `sparkTestingDir` to a static value like 
> `/tmp/test-spark`, to
>   // avoid downloading Spark of different versions in each run.
>   private val sparkTestingDir = new File("/tmp/test-spark")
> {code}
> However this doesn't work, since it gets deleted every time:
> {code}
>   override def afterAll(): Unit = {
> try {
>   Utils.deleteRecursively(wareHousePath)
>   Utils.deleteRecursively(tmpDataDir)
>   Utils.deleteRecursively(sparkTestingDir)
> } finally {
>   super.afterAll()
> }
>   }
> {code}
> It's bad that we're hard-coding to a {{/tmp}} directory, as in some cases 
> this is not the proper place to store temporary files. We're not currently 
> making any good use of it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33214) HiveExternalCatalogVersionsSuite shouldn't use or delete hard-coded /tmp directory

2020-10-21 Thread Erik Krogen (Jira)
Erik Krogen created SPARK-33214:
---

 Summary: HiveExternalCatalogVersionsSuite shouldn't use or delete 
hard-coded /tmp directory 
 Key: SPARK-33214
 URL: https://issues.apache.org/jira/browse/SPARK-33214
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 3.0.1
Reporter: Erik Krogen


In SPARK-22356, the {{sparkTestingDir}} used by 
{{HiveExternalCatalogVersionsSuite}} became hard-coded to enable re-use of the 
downloaded Spark tarball between test executions:
{code}
  // For local test, you can set `sparkTestingDir` to a static value like 
`/tmp/test-spark`, to
  // avoid downloading Spark of different versions in each run.
  private val sparkTestingDir = new File("/tmp/test-spark")
{code}
However this doesn't work, since it gets deleted every time:
{code}
  override def afterAll(): Unit = {
try {
  Utils.deleteRecursively(wareHousePath)
  Utils.deleteRecursively(tmpDataDir)
  Utils.deleteRecursively(sparkTestingDir)
} finally {
  super.afterAll()
}
  }
{code}

It's bad that we're hard-coding to a {{/tmp}} directory, as in some cases this 
is not the proper place to store temporary files. We're not currently making 
any good use of it.
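
A hedged sketch of one possible direction, not the actual patch: make the location configurable and fall back to a throwaway per-run temporary directory. The property name "spark.test.cache.dir" is purely illustrative.
{code:scala}
// Sketch only: pick the testing directory instead of hard-coding /tmp/test-spark.
import java.io.File
import java.nio.file.Files

object SparkTestingDir {
  // Reused across runs only when the developer explicitly asks for it.
  // "spark.test.cache.dir" is an assumed, illustrative property name.
  val reusable: Option[File] = sys.props.get("spark.test.cache.dir").map(new File(_))

  // Otherwise a fresh per-run directory under java.io.tmpdir.
  val dir: File = reusable.getOrElse(Files.createTempDirectory("test-spark").toFile)

  // afterAll() would call Utils.deleteRecursively(dir) only when this is true,
  // so a developer-provided cache survives between runs.
  val deleteAfterAll: Boolean = reusable.isEmpty
}
{code}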



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33202) Fix BlockManagerDecommissioner to return the correct migration status

2020-10-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33202.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30116
[https://github.com/apache/spark/pull/30116]

> Fix BlockManagerDecommissioner to return the correct migration status
> -
>
> Key: SPARK-33202
> URL: https://issues.apache.org/jira/browse/SPARK-33202
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33202) Fix BlockManagerDecommissioner to return the correct migration status

2020-10-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33202:
-

Assignee: Dongjoon Hyun

> Fix BlockManagerDecommissioner to return the correct migration status
> -
>
> Key: SPARK-33202
> URL: https://issues.apache.org/jira/browse/SPARK-33202
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33213) Upgrade Apache Arrow to 2.0.0

2020-10-21 Thread Chao Sun (Jira)
Chao Sun created SPARK-33213:


 Summary: Upgrade Apache Arrow to 2.0.0
 Key: SPARK-33213
 URL: https://issues.apache.org/jira/browse/SPARK-33213
 Project: Spark
  Issue Type: Dependency upgrade
  Components: SQL
Affects Versions: 3.0.1
Reporter: Chao Sun


Apache Arrow 2.0.0 has [just been 
released|https://cwiki.apache.org/confluence/display/ARROW/Arrow+2.0.0+Release].
 This proposes to upgrade Spark's Arrow dependency to use 2.0.0, from the 
current 1.0.1.
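
For a downstream build the change is essentially a version bump on the Arrow Java artifacts; a hedged sbt-style sketch follows (Spark itself manages this in its Maven poms, and the exact artifact set below is an assumption for illustration):
{code:scala}
// build.sbt sketch: pin the Arrow Java artifacts to 2.0.0.
// Check the Arrow 2.0.0 release notes and Spark's pom for the authoritative module list.
val arrowVersion = "2.0.0"

libraryDependencies ++= Seq(
  "org.apache.arrow" % "arrow-vector" % arrowVersion,
  "org.apache.arrow" % "arrow-memory-netty" % arrowVersion
)
{code}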



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33212) Move to shaded clients for Hadoop 3.x profile

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33212:


Assignee: (was: Apache Spark)

> Move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Priority: Major
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer versions
> of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava
> conflicts, Spark depends on Hadoop not leaking its dependencies.
>  * It makes the Spark/Hadoop dependency relationship cleaner. Currently Spark uses both
> client-side and server-side Hadoop APIs from modules such as hadoop-common,
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to use only
> the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark can
> evolve without worrying about dependencies pulled in from the Hadoop side
> (which used to be a lot).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33212) Move to shaded clients for Hadoop 3.x profile

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33212:


Assignee: Apache Spark

> Move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer versions
> of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava
> conflicts, Spark depends on Hadoop not leaking its dependencies.
>  * It makes the Spark/Hadoop dependency relationship cleaner. Currently Spark uses both
> client-side and server-side Hadoop APIs from modules such as hadoop-common,
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to use only
> the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark can
> evolve without worrying about dependencies pulled in from the Hadoop side
> (which used to be a lot).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33212) Move to shaded clients for Hadoop 3.x profile

2020-10-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218588#comment-17218588
 ] 

Apache Spark commented on SPARK-33212:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29843

> Move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Priority: Major
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer versions
> of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava
> conflicts, Spark depends on Hadoop not leaking its dependencies.
>  * It makes the Spark/Hadoop dependency relationship cleaner. Currently Spark uses both
> client-side and server-side Hadoop APIs from modules such as hadoop-common,
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to use only
> the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark can
> evolve without worrying about dependencies pulled in from the Hadoop side
> (which used to be a lot).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29250) Upgrade to Hadoop 3.2.2

2020-10-21 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-29250:
-
Summary: Upgrade to Hadoop 3.2.2  (was: Upgrade to Hadoop 3.2.1 and move to 
shaded client)

> Upgrade to Hadoop 3.2.2
> ---
>
> Key: SPARK-29250
> URL: https://issues.apache.org/jira/browse/SPARK-29250
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Chao Sun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33212) Move to shaded clients for Hadoop 3.x profile

2020-10-21 Thread Chao Sun (Jira)
Chao Sun created SPARK-33212:


 Summary: Move to shaded clients for Hadoop 3.x profile
 Key: SPARK-33212
 URL: https://issues.apache.org/jira/browse/SPARK-33212
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Spark Submit, SQL, YARN
Affects Versions: 3.0.1
Reporter: Chao Sun


Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
hadoop-common, hadoop-client etc. Benefits include:
 * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer versions
of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava conflicts,
Spark depends on Hadoop not leaking its dependencies.
 * It makes the Spark/Hadoop dependency relationship cleaner. Currently Spark uses both
client-side and server-side Hadoop APIs from modules such as hadoop-common,
hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to use only
the public/client API from the Hadoop side.
 * It provides better isolation from Hadoop dependencies. In the future Spark can
evolve without worrying about dependencies pulled in from the Hadoop side
(which used to be a lot).
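
For a build that consumes Hadoop directly, the switch looks roughly like the sbt sketch below; Spark itself does this in its Maven hadoop-3.x profile, and the Hadoop version shown is an illustrative assumption.
{code:scala}
// build.sbt sketch: depend on the shaded Hadoop 3.x clients instead of
// hadoop-common / hadoop-client. Version 3.3.0 is an illustrative assumption.
val hadoopVersion = "3.3.0"

libraryDependencies ++= Seq(
  // Compile-time client API, with third-party deps (Guava, protobuf, ...) shaded.
  "org.apache.hadoop" % "hadoop-client-api" % hadoopVersion,
  // Runtime counterpart carrying the shaded implementation classes.
  "org.apache.hadoop" % "hadoop-client-runtime" % hadoopVersion % "runtime"
)
{code}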



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33064) Spark-shell does not display accented chara

2020-10-21 Thread Laurent GUEMAPPE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Laurent GUEMAPPE updated SPARK-33064:
-
Description: 
It seems to be a duplicate of *FLEX-18425*, which is a duplicate of SDK-17398
that no longer exists. But the bug remains.

(1) I create a txt file "café.txt" that contains two lines : 
{quote}Café

Café
{quote}
(2) I type the following command:

*spark.read.csv("café.txt").show()*

It is displayed as follows:

*spark.read.csv("caf.txt").show()*

But it works and it returns this:
{quote}+-+
  |   _c0|
 +-+
  |  Caf|
  |Café|
 +-+
{quote}
We notice a shift after "Caf" and "Café".

(3) The two following commands work. The written text files have the same
content as "café.txt"

*spark.read.csv("café.txt").write.format("text").save("café2")*

*sc.textFile("café.txt").saveAsTextFile("café3")*

 

Once again, the Spark-shell displays this:

*spark.read.csv("caf.txt").write.format("text").save("caf2")*

*sc.textFile("caf.txt").saveAsTextFile("caf3")*

 

(4) If I type 7 "é" and then 7 Backspace, using the "é" key of my French
keyboard, the Scala prompt disappears. I get a new prompt when I type
Return.

 

The issue (4) as well as the shift in (2) seem to be related to the difference 
between counted characters and displayed characters.

 

(5) I notice that I do not have this issue when launching Spark from Ubuntu,
thanks to "Windows Subsystem for Linux" Version 2.

  was:
It seems to be a duplicate of *FLEX-18425*, which is duplicate of SDK-17398 
that does not exist anymore. But the bug remains.

(1) I create a txt file "café.txt" that contains two lines : 
{quote}Café

Café
{quote}
(2) I type the following command :

*spark.read.csv("café.txt").show()*

It is displayed as following :

*spark.read.csv("caf.txt").show()*

But it works and it returns this : 
{quote}+-+
 |   _c0|
+-+
 |  Caf|
 |Café|
+-+
{quote}
We notice a shift after "Caf" and "Café".

(3) The two following commands works. The written textfiles have the same 
content as "café.txt" 

*spark.read.csv("café.txt").write.format("text").save("café2")*

*sc.textFile("café.txt").saveAsTextFile("café3")*

 

Once again, the Spark-shell display this : 

*spark.read.csv("caf.txt").write.format("text").save("caf2")*

*sc.textFile("caf.txt").saveAsTextFile("caf3")*

 

(4)If I type 7 "é" an then 7 Backspace, by using the "é" key of my french 
keyboard, then the scala prompt disappears. I have a new prompt when I type 
Return.

 

The issue (4) as well as the shift in (2) seem to be related to the difference 
between counted characters and displayed characters.


> Spark-shell does not display accented chara
> ---
>
> Key: SPARK-33064
> URL: https://issues.apache.org/jira/browse/SPARK-33064
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.0.1
> Environment: Windows 10
> "Beta: Use Unicode UTF-8 for worldwide language support" has been checked.
>Reporter: Laurent GUEMAPPE
>Priority: Minor
>
> It seems to be a duplicate of *FLEX-18425*, which is a duplicate of SDK-17398
> that no longer exists. But the bug remains.
> (1) I create a txt file "café.txt" that contains two lines : 
> {quote}Café
> Café
> {quote}
> (2) I type the following command:
> *spark.read.csv("café.txt").show()*
> It is displayed as follows:
> *spark.read.csv("caf.txt").show()*
> But it works and it returns this:
> {quote}+-+
>   |   _c0|
>  +-+
>   |  Caf|
>   |Café|
>  +-+
> {quote}
> We notice a shift after "Caf" and "Café".
> (3) The two following commands work. The written text files have the same
> content as "café.txt"
> *spark.read.csv("café.txt").write.format("text").save("café2")*
> *sc.textFile("café.txt").saveAsTextFile("café3")*
>  
> Once again, the Spark-shell displays this:
> *spark.read.csv("caf.txt").write.format("text").save("caf2")*
> *sc.textFile("caf.txt").saveAsTextFile("caf3")*
>  
> (4) If I type 7 "é" and then 7 Backspace, using the "é" key of my French
> keyboard, the Scala prompt disappears. I get a new prompt when I type
> Return.
>  
> The issue (4) as well as the shift in (2) seem to be related to the 
> difference between counted characters and displayed characters.
>  
> (5) I notice that I do not have this issue when launching Spark from Ubuntu,
> thanks to "Windows Subsystem for Linux" Version 2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33211) Early state store eviction for left semi stream-stream join

2020-10-21 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-33211:
-
Parent: SPARK-32883
Issue Type: Sub-task  (was: Improvement)

> Early state store eviction for left semi stream-stream join
> ---
>
> Key: SPARK-33211
> URL: https://issues.apache.org/jira/browse/SPARK-33211
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> As a followup from discussion 
> [https://github.com/apache/spark/pull/30076/files/3918727a08c8d0d4c65ccc8ea902f77051b78b1d#r508926034]
>  and [https://github.com/apache/spark/pull/30076#discussion_r509222802] , for 
> left semi stream-stream join, the matched left-side rows can be evicted from
> the left state store immediately, without waiting for them to fall below the
> watermark. However, it needs more thought on how to implement this efficiently
> so that we do not iterate over all values when the watermark predicate is on the key.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33211) Early state store eviction for left semi stream-stream join

2020-10-21 Thread Cheng Su (Jira)
Cheng Su created SPARK-33211:


 Summary: Early state store eviction for left semi stream-stream 
join
 Key: SPARK-33211
 URL: https://issues.apache.org/jira/browse/SPARK-33211
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.1.0
Reporter: Cheng Su


As a followup from discussion 
[https://github.com/apache/spark/pull/30076/files/3918727a08c8d0d4c65ccc8ea902f77051b78b1d#r508926034]
 and [https://github.com/apache/spark/pull/30076#discussion_r509222802] , for 
left semi stream-stream join, the matched left-side rows can be evicted from
the left state store immediately, without waiting for them to fall below the
watermark. However, it needs more thought on how to implement this efficiently
so that we do not iterate over all values when the watermark predicate is on the key.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33210) Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default

2020-10-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218564#comment-17218564
 ] 

Apache Spark commented on SPARK-33210:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30121

> Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default
> --
>
> Key: SPARK-33210
> URL: https://issues.apache.org/jira/browse/SPARK-33210
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The ticket aims to set the following SQL configs:
> - spark.sql.legacy.parquet.int96RebaseModeInWrite
> - spark.sql.legacy.parquet.int96RebaseModeInRead
> to EXCEPTION by default.
> The reason is to let users decide whether Spark should modify loaded/saved
> timestamps, instead of silently shifting timestamps while rebasing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33210) Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default

2020-10-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218563#comment-17218563
 ] 

Apache Spark commented on SPARK-33210:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30121

> Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default
> --
>
> Key: SPARK-33210
> URL: https://issues.apache.org/jira/browse/SPARK-33210
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The ticket aims to set the following SQL configs:
> - spark.sql.legacy.parquet.int96RebaseModeInWrite
> - spark.sql.legacy.parquet.int96RebaseModeInRead
> to EXCEPTION by default.
> The reason is to let users decide whether Spark should modify loaded/saved
> timestamps, instead of silently shifting timestamps while rebasing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33210) Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33210:


Assignee: Apache Spark

> Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default
> --
>
> Key: SPARK-33210
> URL: https://issues.apache.org/jira/browse/SPARK-33210
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The ticket aims to set the following SQL configs:
> - spark.sql.legacy.parquet.int96RebaseModeInWrite
> - spark.sql.legacy.parquet.int96RebaseModeInRead
> to EXCEPTION by default.
> The reason is to let users decide whether Spark should modify loaded/saved
> timestamps, instead of silently shifting timestamps while rebasing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33210) Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33210:


Assignee: (was: Apache Spark)

> Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default
> --
>
> Key: SPARK-33210
> URL: https://issues.apache.org/jira/browse/SPARK-33210
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The ticket aims to set the following SQL configs:
> - spark.sql.legacy.parquet.int96RebaseModeInWrite
> - spark.sql.legacy.parquet.int96RebaseModeInRead
> to EXCEPTION by default.
> The reason is to let users decide whether Spark should modify loaded/saved
> timestamps, instead of silently shifting timestamps while rebasing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33210) Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default

2020-10-21 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33210:
--

 Summary: Set the rebasing mode for parquet INT96 type to 
`EXCEPTION` by default
 Key: SPARK-33210
 URL: https://issues.apache.org/jira/browse/SPARK-33210
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


The ticket aims to set the following SQL configs:
- spark.sql.legacy.parquet.int96RebaseModeInWrite
- spark.sql.legacy.parquet.int96RebaseModeInRead
to EXCEPTION by default.

The reason is to let users decide whether Spark should modify loaded/saved
timestamps, instead of silently shifting timestamps while rebasing.
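
A hedged sketch of what a user session would do once the default becomes EXCEPTION: pick a mode explicitly per workload. The config keys are from this ticket; the values LEGACY and CORRECTED are assumed to mirror the existing datetime rebase configs.
{code:scala}
// Sketch only: choose INT96 rebasing behaviour explicitly rather than relying
// on a silent default.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Reading old INT96 timestamps written by Spark 2.x / Hive: rebase on read.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")

// Writing data that will only ever be read by Spark 3.x: write as-is, no rebasing.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
{code}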



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33205) Bump snappy-java version to 1.1.8

2020-10-21 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-33205:
---

Assignee: Takeshi Yamamuro

> Bump snappy-java version to 1.1.8
> -
>
> Key: SPARK-33205
> URL: https://issues.apache.org/jira/browse/SPARK-33205
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> This ticket aims at upgrading snappy-java from 1.1.7.5 to 1.1.8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33205) Bump snappy-java version to 1.1.8

2020-10-21 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-33205.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30120
[https://github.com/apache/spark/pull/30120]

> Bump snappy-java version to 1.1.8
> -
>
> Key: SPARK-33205
> URL: https://issues.apache.org/jira/browse/SPARK-33205
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.1.0
>
>
> This ticket aims at upgrading snappy-java from 1.1.7.5 to 1.1.8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33209) Clean up unit test file UnsupportedOperationsSuite.scala

2020-10-21 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-33209:
-
Parent: SPARK-32883
Issue Type: Sub-task  (was: Improvement)

> Clean up unit test file UnsupportedOperationsSuite.scala
> 
>
> Key: SPARK-33209
> URL: https://issues.apache.org/jira/browse/SPARK-33209
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> As a follow-up from [https://github.com/apache/spark/pull/30076], there is
> a lot of copy-pasted code in the unit test file UnsupportedOperationsSuite.scala to
> check different join types (inner, outer, semi) with similar code structure.
> It would be helpful to clean it up and refactor to reuse code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33209) Clean up unit test file UnsupportedOperationsSuite.scala

2020-10-21 Thread Cheng Su (Jira)
Cheng Su created SPARK-33209:


 Summary: Clean up unit test file UnsupportedOperationsSuite.scala
 Key: SPARK-33209
 URL: https://issues.apache.org/jira/browse/SPARK-33209
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Structured Streaming
Affects Versions: 3.1.0
Reporter: Cheng Su


As a follow-up from [https://github.com/apache/spark/pull/30076], there is a lot of
copy-pasted code in the unit test file UnsupportedOperationsSuite.scala to check
different join types (inner, outer, semi) with similar code structure. It would
be helpful to clean it up and refactor to reuse code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33207) Reduce the number of tasks launched after bucket pruning

2020-10-21 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218545#comment-17218545
 ] 

Cheng Su commented on SPARK-33207:
--

Thanks [~yumwang] for bringing up the issue. We don't need to launch
#-of-buckets tasks if bucket filter pruning is taking effect
([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L570]
 ). However, if the query has a join on these bucketed tables, we still need to
launch that many tasks to maintain the bucketed table scan's outputPartitioning
property. So the decision of whether to launch fewer tasks depends on the query
shape. A physical plan rule should resolve the issue, but I am not sure whether
it is worth the effort.

> Reduce the number of tasks launched after bucket pruning
> 
>
> Key: SPARK-33207
> URL: https://issues.apache.org/jira/browse/SPARK-33207
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> We only need to read 1 bucket, but it still launches 200 tasks.
> {code:sql}
> create table test_bucket using parquet clustered by (ID) sorted by (ID) into 
> 200 buckets AS (SELECT id FROM range(1000) cluster by id)
> spark-sql> explain select * from test_bucket where id = 4;
> == Physical Plan ==
> *(1) Project [id#7L]
> +- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4))
>+- *(1) ColumnarToRow
>   +- FileScan parquet default.test_bucket[id#7L] Batched: true, 
> DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket],
>  PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], 
> ReadSchema: struct, SelectedBucketsCount: 1 out of 200
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33197) Changes to spark.sql.analyzer.maxIterations do not take effect at runtime

2020-10-21 Thread Yuning Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuning Zhang updated SPARK-33197:
-
Affects Version/s: 3.0.0

> Changes to spark.sql.analyzer.maxIterations do not take effect at runtime
> -
>
> Key: SPARK-33197
> URL: https://issues.apache.org/jira/browse/SPARK-33197
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Yuning Zhang
>Priority: Major
>
> `spark.sql.analyzer.maxIterations` is not a static conf. However, changes to 
> it do not take effect at runtime.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-13493) json to DataFrame to parquet does not respect case sensitiveness

2020-10-21 Thread Hariharan Karthikeyan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-13493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hariharan Karthikeyan updated SPARK-13493:
--
Comment: was deleted

(was:  !Screen Shot 2020-10-21 at 11.02.34 AM.png! )

> json to DataFrame to parquet does not respect case sensitiveness
> 
>
> Key: SPARK-13493
> URL: https://issues.apache.org/jira/browse/SPARK-13493
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michel Lemay
>Priority: Minor
>  Labels: bulk-closed
> Attachments: Screen Shot 2020-10-21 at 11.02.34 AM.png
>
>
> Not sure where the problem should be fixed exactly but here it is:
> {noformat}
> $ spark-shell --conf spark.sql.caseSensitive=false
> scala> sqlContext.getConf("spark.sql.caseSensitive")
> res2: String = false
> scala> val data = List("""{"field": 1}""","""{"field": 2}""","""{"field": 
> 3}""","""{"field": 4}""","""{"FIELD": 5}""")
> scala> val jsonDF = sqlContext.read.json(sc.parallelize(data))
> scala> jsonDF.printSchema
> root
>  |-- FIELD: long (nullable = true)
>  |-- field: long (nullable = true)
> {noformat}
> And when persisting this as parquet:
> {noformat}
> scala> jsonDF.write.parquet("out")
> org.apache.spark.sql.AnalysisException: Reference 'FIELD' is ambiguous, could 
> be: FIELD#0L, FIELD#1L.;
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471
> at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:471)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:467)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.sc
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at 
> 

[jira] [Commented] (SPARK-26764) [SPIP] Spark Relational Cache

2020-10-21 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218374#comment-17218374
 ] 

Nicholas Chammas commented on SPARK-26764:
--

The SPIP PDF references a design doc, but I'm not clear on where the design doc 
actually is. Is this issue supposed to be linked to some other ones?

Also, appendix B suggests to me that this idea would mesh well with the 
existing proposals to support materialized views. I could actually see this as 
an enhancement to those proposals, like SPARK-29038.

In fact, when I look at the [design 
doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit#]
 for SPARK-29038, I see that goal 3 covers automatic query rewrites, which I 
think subsumes the main benefit of this proposal as compared to "traditional" 
materialized views.
{quote}> 3. A query _rewrite_ capability to transparently rewrite a query to 
use a materialized view[1][2].
 > a. Query rewrite capability is transparent to SQL applications.
 > b. Query rewrite can be disabled at the system level or on individual 
 > materialized view. Also it can be disabled for a specified query via hint.
 > c. Query rewrite as a rule in optimizer should be made sure that it won’t 
 > cause performance regression if it can use other index or cache.
{quote}

> [SPIP] Spark Relational Cache
> -
>
> Key: SPARK-26764
> URL: https://issues.apache.org/jira/browse/SPARK-26764
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Adrian Wang
>Priority: Major
> Attachments: Relational+Cache+SPIP.pdf
>
>
> In modern database systems, relational cache is a common technology to boost 
> ad-hoc queries. While Spark provides caching natively, Spark SQL should be able
> to utilize the relationships between relations to boost all possible queries.
> In this SPIP, we will make Spark able to utilize all defined cached
> relations where possible, without explicit substitution in the user query, as well
> as keep some user-defined caches available across sessions. Materialized
> views in many database systems provide a similar function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13493) json to DataFrame to parquet does not respect case sensitiveness

2020-10-21 Thread Hariharan Karthikeyan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-13493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218342#comment-17218342
 ] 

Hariharan Karthikeyan commented on SPARK-13493:
---

 !Screen Shot 2020-10-21 at 11.02.34 AM.png! 

> json to DataFrame to parquet does not respect case sensitiveness
> 
>
> Key: SPARK-13493
> URL: https://issues.apache.org/jira/browse/SPARK-13493
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michel Lemay
>Priority: Minor
>  Labels: bulk-closed
> Attachments: Screen Shot 2020-10-21 at 11.02.34 AM.png
>
>
> Not sure where the problem should be fixed exactly but here it is:
> {noformat}
> $ spark-shell --conf spark.sql.caseSensitive=false
> scala> sqlContext.getConf("spark.sql.caseSensitive")
> res2: String = false
> scala> val data = List("""{"field": 1}""","""{"field": 2}""","""{"field": 
> 3}""","""{"field": 4}""","""{"FIELD": 5}""")
> scala> val jsonDF = sqlContext.read.json(sc.parallelize(data))
> scala> jsonDF.printSchema
> root
>  |-- FIELD: long (nullable = true)
>  |-- field: long (nullable = true)
> {noformat}
> And when persisting this as parquet:
> {noformat}
> scala> jsonDF.write.parquet("out")
> org.apache.spark.sql.AnalysisException: Reference 'FIELD' is ambiguous, could 
> be: FIELD#0L, FIELD#1L.;
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471
> at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:471)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:467)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.sc
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at 
> 

[jira] [Updated] (SPARK-13493) json to DataFrame to parquet does not respect case sensitiveness

2020-10-21 Thread Hariharan Karthikeyan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-13493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hariharan Karthikeyan updated SPARK-13493:
--
Attachment: Screen Shot 2020-10-21 at 11.02.34 AM.png

> json to DataFrame to parquet does not respect case sensitiveness
> 
>
> Key: SPARK-13493
> URL: https://issues.apache.org/jira/browse/SPARK-13493
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michel Lemay
>Priority: Minor
>  Labels: bulk-closed
> Attachments: Screen Shot 2020-10-21 at 11.02.34 AM.png
>
>
> Not sure where the problem should be fixed exactly but here it is:
> {noformat}
> $ spark-shell --conf spark.sql.caseSensitive=false
> scala> sqlContext.getConf("spark.sql.caseSensitive")
> res2: String = false
> scala> val data = List("""{"field": 1}""","""{"field": 2}""","""{"field": 
> 3}""","""{"field": 4}""","""{"FIELD": 5}""")
> scala> val jsonDF = sqlContext.read.json(sc.parallelize(data))
> scala> jsonDF.printSchema
> root
>  |-- FIELD: long (nullable = true)
>  |-- field: long (nullable = true)
> {noformat}
> And when persisting this as parquet:
> {noformat}
> scala> jsonDF.write.parquet("out")
> org.apache.spark.sql.AnalysisException: Reference 'FIELD' is ambiguous, could 
> be: FIELD#0L, FIELD#1L.;
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471
> at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:471)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:467)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.sc
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at 
> 

[jira] [Created] (SPARK-33208) Update the document of SparkSession#sql

2020-10-21 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-33208:
---

 Summary: Update the document of SparkSession#sql
 Key: SPARK-33208
 URL: https://issues.apache.org/jira/browse/SPARK-33208
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Wenchen Fan


We should mention that this API eagerly runs DDL/DML commands, but not
SELECT queries.
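
A small illustration of the behaviour the documentation should spell out (the table name and query are arbitrary):
{code:scala}
// Illustration only: sql() is eager for commands, lazy for queries.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// DDL/DML: executed eagerly, the table exists as soon as sql() returns.
spark.sql("CREATE TABLE t(id INT) USING parquet")
spark.sql("INSERT INTO t VALUES (1), (2)")

// SELECT: only analyzed and planned here; nothing runs yet.
val df = spark.sql("SELECT id FROM t WHERE id > 0")

// Execution happens when an action is invoked.
df.show()
{code}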



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33206) Spark Shuffle Index Cache calculates memory usage wrong

2020-10-21 Thread Lars Francke (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Francke updated SPARK-33206:
-
Description: 
SPARK-21501 changed the spark shuffle index service to be based on memory 
instead of the number of files.

Unfortunately, there's a problem with the calculation which is based on size 
information provided by `ShuffleIndexInformation`.

It is based purely on the file size of the cached file on disk.

We're running into OOMs with very small index files (byte size ~16 bytes), but the
overhead of the ShuffleIndexInformation around this is much larger (e.g. 184 
bytes, see screenshot). We need to take this into account and should probably 
add a fixed overhead of somewhere between 152 and 180 bytes according to my 
tests. I'm not 100% sure what the correct number is and it'll also depend on 
the architecture etc. so we can't be exact anyway.

If we do that we can maybe get rid of the size field in ShuffleIndexInformation 
to save a few more bytes per entry.

In effect this means that for small files we use up about 70-100 times as much 
memory as we intend to. Our NodeManagers OOM with 4GB and more of 
indexShuffleCache.

 

 

  was:
dSPARK-21501 changed the spark shuffle index service to be based on memory 
instead of the number of files.

Unfortunately, there's a problem with the calculation which is based on size 
information provided by `ShuffleIndexInformation`.

It is based purely on the file size of the cached file on disk.

We're running in OOMs with very small index files (byte size ~16 bytes) but the 
overhead of the ShuffleIndexInformation around this is much larger (e.g. 184 
bytes, see screenshot). We need to take this into account and should probably 
add a fixed overhead of somewhere between 152 and 180 bytes according to my 
tests. I'm not 100% sure what the correct number is and it'll also depend on 
the architecture etc. so we can't be exact anyway.

If we do that we can maybe get rid of the size field in ShuffleIndexInformation 
to save a few more bytes per entry.

In effect this means that for small files we use up about 70-100 times as much 
memory as we intend to. Our NodeManagers OOM with 4GB and more of 
indexShuffleCache.

 

 


> Spark Shuffle Index Cache calculates memory usage wrong
> ---
>
> Key: SPARK-33206
> URL: https://issues.apache.org/jira/browse/SPARK-33206
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.4.0, 3.0.1
>Reporter: Lars Francke
>Priority: Major
> Attachments: image001(1).png
>
>
> SPARK-21501 changed the Spark shuffle index service to be based on memory 
> instead of the number of files.
> Unfortunately, there's a problem with the calculation, which is based on size 
> information provided by `ShuffleIndexInformation`.
> It is based purely on the on-disk file size of the cached file.
> We're running into OOMs with very small index files (~16 bytes), while the 
> overhead of the ShuffleIndexInformation around each of them is much larger 
> (e.g. 184 bytes, see screenshot). We need to take this into account and should 
> probably add a fixed overhead of somewhere between 152 and 180 bytes according 
> to my tests. I'm not 100% sure what the correct number is, and it'll also 
> depend on the architecture etc., so we can't be exact anyway.
> If we do that, we can maybe get rid of the size field in 
> ShuffleIndexInformation to save a few more bytes per entry.
> In effect this means that for small files we use about 70-100 times as much 
> memory as intended. Our NodeManagers OOM with 4 GB or more held in 
> indexShuffleCache.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33206) Spark Shuffle Index Cache calculates memory usage wrong

2020-10-21 Thread Lars Francke (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218306#comment-17218306
 ] 

Lars Francke commented on SPARK-33206:
--

I used YourKit (thank you for the free license!) and it claims that 
ShuffleIndexInformation uses 152 bytes of retained memory when it caches a 
0-byte file.
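For illustration only, a rough sketch (not the actual Spark code) of how a
Guava-style weigher could charge that fixed per-entry overhead on top of the
raw file size; {{IndexInfo}} and {{EntryOverheadBytes}} are made-up names, and
the 176-byte constant is just an assumed value inside the 152-184 byte range
discussed here:
{code:scala}
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache, Weigher}

// Simplified stand-in for ShuffleIndexInformation: only the on-disk size is known.
final case class IndexInfo(fileSizeBytes: Int)

object IndexCacheSketch {
  // Assumed fixed per-entry overhead (bytes), roughly the retained size measured above.
  val EntryOverheadBytes = 176

  def newCache(maxBytes: Long, loadFn: String => IndexInfo): LoadingCache[String, IndexInfo] = {
    val weigher = new Weigher[String, IndexInfo] {
      // Charge an estimate of the retained size, not just the raw file size.
      override def weigh(path: String, info: IndexInfo): Int =
        EntryOverheadBytes + info.fileSizeBytes
    }
    val loader = new CacheLoader[String, IndexInfo] {
      override def load(path: String): IndexInfo = loadFn(path)
    }
    CacheBuilder.newBuilder()
      .maximumWeight(maxBytes)
      .weigher(weigher)
      .build(loader)
  }
}
{code}
With the current scheme the weigher would effectively return just
{{info.fileSizeBytes}}, which is why tiny index files blow far past the
configured budget.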

> Spark Shuffle Index Cache calculates memory usage wrong
> ---
>
> Key: SPARK-33206
> URL: https://issues.apache.org/jira/browse/SPARK-33206
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.4.0, 3.0.1
>Reporter: Lars Francke
>Priority: Major
> Attachments: image001(1).png
>
>
> SPARK-21501 changed the Spark shuffle index service to be based on memory 
> instead of the number of files.
> Unfortunately, there's a problem with the calculation, which is based on size 
> information provided by `ShuffleIndexInformation`.
> It is based purely on the on-disk file size of the cached file.
> We're running into OOMs with very small index files (~16 bytes), while the 
> overhead of the ShuffleIndexInformation around each of them is much larger 
> (e.g. 184 bytes, see screenshot). We need to take this into account and should 
> probably add a fixed overhead of somewhere between 152 and 180 bytes according 
> to my tests. I'm not 100% sure what the correct number is, and it'll also 
> depend on the architecture etc., so we can't be exact anyway.
> If we do that, we can maybe get rid of the size field in 
> ShuffleIndexInformation to save a few more bytes per entry.
> In effect this means that for small files we use about 70-100 times as much 
> memory as intended. Our NodeManagers OOM with 4 GB or more held in 
> indexShuffleCache.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33207) Reduce the number of tasks launched after bucket pruning

2020-10-21 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33207:

Description: 
We only need to read 1 bucket, but it still launches 200 tasks.
{code:sql}
create table test_bucket using parquet clustered by (ID) sorted by (ID) into 
200 buckets AS (SELECT id FROM range(1000) cluster by id)
spark-sql> explain select * from test_bucket where id = 4;
== Physical Plan ==
*(1) Project [id#7L]
+- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4))
   +- *(1) ColumnarToRow
  +- FileScan parquet default.test_bucket[id#7L] Batched: true, 
DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: 
InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket],
 PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], 
ReadSchema: struct, SelectedBucketsCount: 1 out of 200
{code}

  was:
We only need to read 1 bucket, but still launch 200 tasks.

{code:sql}
create table test_bucket using parquet clustered by (ID) sorted by (ID) into 
200 buckets AS (SELECT id FROM range(1000) cluster by id)
spark-sql> explain select * from test_bucket where id = 4;
== Physical Plan ==
*(1) Project [id#7L]
+- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4))
   +- *(1) ColumnarToRow
  +- FileScan parquet default.test_bucket[id#7L] Batched: true, 
DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: 
InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket],
 PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], 
ReadSchema: struct, SelectedBucketsCount: 1 out of 200
{code}




> Reduce the number of tasks launched after bucket pruning
> 
>
> Key: SPARK-33207
> URL: https://issues.apache.org/jira/browse/SPARK-33207
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> We only need to read 1 bucket, but it still launches 200 tasks.
> {code:sql}
> create table test_bucket using parquet clustered by (ID) sorted by (ID) into 
> 200 buckets AS (SELECT id FROM range(1000) cluster by id)
> spark-sql> explain select * from test_bucket where id = 4;
> == Physical Plan ==
> *(1) Project [id#7L]
> +- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4))
>+- *(1) ColumnarToRow
>   +- FileScan parquet default.test_bucket[id#7L] Batched: true, 
> DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket],
>  PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], 
> ReadSchema: struct, SelectedBucketsCount: 1 out of 200
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33207) Reduce the number of tasks launched after bucket pruning

2020-10-21 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218290#comment-17218290
 ] 

Yuming Wang commented on SPARK-33207:
-

cc [~chengsu]

> Reduce the number of tasks launched after bucket pruning
> 
>
> Key: SPARK-33207
> URL: https://issues.apache.org/jira/browse/SPARK-33207
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> We only need to read 1 bucket, but it still launches 200 tasks.
> {code:sql}
> create table test_bucket using parquet clustered by (ID) sorted by (ID) into 
> 200 buckets AS (SELECT id FROM range(1000) cluster by id)
> spark-sql> explain select * from test_bucket where id = 4;
> == Physical Plan ==
> *(1) Project [id#7L]
> +- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4))
>+- *(1) ColumnarToRow
>   +- FileScan parquet default.test_bucket[id#7L] Batched: true, 
> DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket],
>  PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], 
> ReadSchema: struct, SelectedBucketsCount: 1 out of 200
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33207) Reduce the number of tasks launched after bucket pruning

2020-10-21 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33207:
---

 Summary: Reduce the number of tasks launched after bucket pruning
 Key: SPARK-33207
 URL: https://issues.apache.org/jira/browse/SPARK-33207
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang


We only need to read 1 bucket, but it still launches 200 tasks.

{code:sql}
create table test_bucket using parquet clustered by (ID) sorted by (ID) into 
200 buckets AS (SELECT id FROM range(1000) cluster by id)
spark-sql> explain select * from test_bucket where id = 4;
== Physical Plan ==
*(1) Project [id#7L]
+- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4))
   +- *(1) ColumnarToRow
  +- FileScan parquet default.test_bucket[id#7L] Batched: true, 
DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: 
InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket],
 PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], 
ReadSchema: struct, SelectedBucketsCount: 1 out of 200
{code}
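
A quick, hedged way to observe the symptom from spark-shell (assuming the 
test_bucket table above exists); the partition count of the pruned scan roughly 
corresponds to the number of tasks it launches:
{code:scala}
val pruned = spark.sql("SELECT * FROM test_bucket WHERE id = 4")
// Ideally close to 1 after bucket pruning; currently reports 200 (one per bucket).
println(pruned.rdd.getNumPartitions)
{code}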





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33206) Spark Shuffle Index Cache calculates memory usage wrong

2020-10-21 Thread Lars Francke (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Francke updated SPARK-33206:
-
Description: 
dSPARK-21501 changed the spark shuffle index service to be based on memory 
instead of the number of files.

Unfortunately, there's a problem with the calculation which is based on size 
information provided by `ShuffleIndexInformation`.

It is based purely on the file size of the cached file on disk.

We're running in OOMs with very small index files (byte size ~16 bytes) but the 
overhead of the ShuffleIndexInformation around this is much larger (e.g. 184 
bytes, see screenshot). We need to take this into account and should probably 
add a fixed overhead of somewhere between 152 and 180 bytes according to my 
tests. I'm not 100% sure what the correct number is and it'll also depend on 
the architecture etc. so we can't be exact anyway.

If we do that we can maybe get rid of the size field in ShuffleIndexInformation 
to save a few more bytes per entry.

In effect this means that for small files we use up about 70-100 times as much 
memory as we intend to. Our NodeManagers OOM with 4GB and more of 
indexShuffleCache.

 

 

  was:
SPARK-21501 changed the spark shuffle index service to be based on memory 
instead of the number of files.

Unfortunately, there's a problem with the calculation which is based on size 
information provided by `ShuffleIndexInformation`.

It is based purely on the file size of the cached file on disk.

We're running in OOMs with very small index files (byte size ~16 bytes) but the 
overhead of the ShuffleIndexInformation around this is much larger (e.g. 184 
bytes, see screenshot). We need to take this into account and should probably 
add a fixed overhead of somewhere between 152 and 180 bytes according to my 
tests. I'm not 100% sure what the correct number is and it'll also depend on 
the architecture etc. so we can't be exact anyway.

If we do that we can maybe get rid of the size field in ShuffleIndexInformation 
to save a few more bytes per entry.

In effect this means that for small files we use up about 70-100 times as much 
memory as we intend to. Our NodeManagers OOM with 4GB and more of 
indexShuffleCache.

 

 


> Spark Shuffle Index Cache calculates memory usage wrong
> ---
>
> Key: SPARK-33206
> URL: https://issues.apache.org/jira/browse/SPARK-33206
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.4.0, 3.0.1
>Reporter: Lars Francke
>Priority: Major
> Attachments: image001(1).png
>
>
> SPARK-21501 changed the Spark shuffle index service to be based on memory 
> instead of the number of files.
> Unfortunately, there's a problem with the calculation, which is based on size 
> information provided by `ShuffleIndexInformation`.
> It is based purely on the on-disk file size of the cached file.
> We're running into OOMs with very small index files (~16 bytes), while the 
> overhead of the ShuffleIndexInformation around each of them is much larger 
> (e.g. 184 bytes, see screenshot). We need to take this into account and should 
> probably add a fixed overhead of somewhere between 152 and 180 bytes according 
> to my tests. I'm not 100% sure what the correct number is, and it'll also 
> depend on the architecture etc., so we can't be exact anyway.
> If we do that, we can maybe get rid of the size field in 
> ShuffleIndexInformation to save a few more bytes per entry.
> In effect this means that for small files we use about 70-100 times as much 
> memory as intended. Our NodeManagers OOM with 4 GB or more held in 
> indexShuffleCache.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21501) Spark shuffle index cache size should be memory based

2020-10-21 Thread Lars Francke (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218285#comment-17218285
 ] 

Lars Francke commented on SPARK-21501:
--

Just FYI for others stumbling across this: this has a bug in how the memory 
usage is calculated and might use far more than the 100 MB it intends to.

See SPARK-33206 for details.

> Spark shuffle index cache size should be memory based
> -
>
> Key: SPARK-21501
> URL: https://issues.apache.org/jira/browse/SPARK-21501
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Assignee: Sanket Reddy
>Priority: Major
> Fix For: 2.3.0
>
>
> Right now the Spark shuffle service has a cache for index files. It is based 
> on the number of files cached (spark.shuffle.service.index.cache.entries). This 
> can cause issues if people have a lot of reducers, because the size of each 
> entry can fluctuate with the number of reducers.
> We saw an issue with a job that had 17 reducers, and it caused the NM running 
> the Spark shuffle service to use 700-800 MB of memory in the NM by itself.
> We should change this cache to be memory based and only allow a certain amount 
> of memory to be used. When I say memory based, I mean the cache should have a 
> limit of, say, 100 MB.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33206) Spark Shuffle Index Cache calculates memory usage wrong

2020-10-21 Thread Lars Francke (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Francke updated SPARK-33206:
-
Attachment: image001(1).png

> Spark Shuffle Index Cache calculates memory usage wrong
> ---
>
> Key: SPARK-33206
> URL: https://issues.apache.org/jira/browse/SPARK-33206
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.4.0, 3.0.1
>Reporter: Lars Francke
>Priority: Major
> Attachments: image001(1).png
>
>
> SPARK-21501 changed the Spark shuffle index service to be based on memory 
> instead of the number of files.
> Unfortunately, there's a problem with the calculation, which is based on size 
> information provided by `ShuffleIndexInformation`.
> It is based purely on the on-disk file size of the cached file.
> We're running into OOMs with very small index files (~16 bytes), while the 
> overhead of the ShuffleIndexInformation around each of them is much larger 
> (e.g. 184 bytes, see screenshot). We need to take this into account and should 
> probably add a fixed overhead of somewhere between 152 and 180 bytes according 
> to my tests. I'm not 100% sure what the correct number is, and it'll also 
> depend on the architecture etc., so we can't be exact anyway.
> If we do that, we can maybe get rid of the size field in 
> ShuffleIndexInformation to save a few more bytes per entry.
> In effect this means that for small files we use about 70-100 times as much 
> memory as intended. Our NodeManagers OOM with 4 GB or more held in 
> indexShuffleCache.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33206) Spark Shuffle Index Cache calculates memory usage wrong

2020-10-21 Thread Lars Francke (Jira)
Lars Francke created SPARK-33206:


 Summary: Spark Shuffle Index Cache calculates memory usage wrong
 Key: SPARK-33206
 URL: https://issues.apache.org/jira/browse/SPARK-33206
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 3.0.1, 2.4.0
Reporter: Lars Francke
 Attachments: image001(1).png

SPARK-21501 changed the Spark shuffle index service to be based on memory 
instead of the number of files.

Unfortunately, there's a problem with the calculation, which is based on size 
information provided by `ShuffleIndexInformation`.

It is based purely on the on-disk file size of the cached file.

We're running into OOMs with very small index files (~16 bytes), while the 
overhead of the ShuffleIndexInformation around each of them is much larger 
(e.g. 184 bytes, see screenshot). We need to take this into account and should 
probably add a fixed overhead of somewhere between 152 and 180 bytes according 
to my tests. I'm not 100% sure what the correct number is, and it'll also 
depend on the architecture etc., so we can't be exact anyway.

If we do that, we can maybe get rid of the size field in ShuffleIndexInformation 
to save a few more bytes per entry.

In effect this means that for small files we use about 70-100 times as much 
memory as intended. Our NodeManagers OOM with 4 GB or more held in 
indexShuffleCache.
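
To make the overshoot concrete, a back-of-the-envelope sketch with assumed 
numbers (a 100 MB budget as in SPARK-21501, 16-byte index files, and ~168 bytes 
actually retained per entry; the exact ratio depends on the real file sizes and 
overhead):
{code:scala}
object IndexCacheOvershoot {
  def main(args: Array[String]): Unit = {
    val budgetBytes      = 100L * 1024 * 1024 // assumed memory-based cache limit
    val chargedPerEntry  = 16L                // weight charged today: on-disk file size only
    val retainedPerEntry = 168L               // assumed real retained size per entry

    val admittedEntries = budgetBytes / chargedPerEntry
    val actualBytes     = admittedEntries * retainedPerEntry

    println(f"entries admitted:     $admittedEntries%,d")                     // 6,553,600
    println(f"memory actually used: ${actualBytes / 1024.0 / 1024.0}%.0f MB") // ~1050 MB vs 100 MB intended
  }
}
{code}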

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33205) Bump snappy-java version to 1.1.8

2020-10-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218267#comment-17218267
 ] 

Apache Spark commented on SPARK-33205:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/30120

> Bump snappy-java version to 1.1.8
> -
>
> Key: SPARK-33205
> URL: https://issues.apache.org/jira/browse/SPARK-33205
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket aims at upgrading snappy-java from 1.1.7.5 to 1.1.8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33205) Bump snappy-java version to 1.1.8

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33205:


Assignee: Apache Spark

> Bump snappy-java version to 1.1.8
> -
>
> Key: SPARK-33205
> URL: https://issues.apache.org/jira/browse/SPARK-33205
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Major
>
> This ticket aims at upgrading snappy-java from 1.1.7.5 to 1.1.8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33205) Bump snappy-java version to 1.1.8

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33205:


Assignee: (was: Apache Spark)

> Bump snappy-java version to 1.1.8
> -
>
> Key: SPARK-33205
> URL: https://issues.apache.org/jira/browse/SPARK-33205
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket aims at upgrading snappy-java from 1.1.7.5 to 1.1.8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33205) Bump snappy-java version to 1.1.8

2020-10-21 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-33205:


 Summary: Bump snappy-java version to 1.1.8
 Key: SPARK-33205
 URL: https://issues.apache.org/jira/browse/SPARK-33205
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.1.0
Reporter: Takeshi Yamamuro


This ticket aims at upgrading snappy-java from 1.1.7.5 to 1.1.8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33204) `Event Timeline` in Spark Job UI sometimes cannot be opened

2020-10-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218251#comment-17218251
 ] 

Apache Spark commented on SPARK-33204:
--

User 'akiyamaneko' has created a pull request for this issue:
https://github.com/apache/spark/pull/30119

> `Event Timeline`  in Spark Job UI sometimes cannot be opened
> 
>
> Key: SPARK-33204
> URL: https://issues.apache.org/jira/browse/SPARK-33204
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: akiyamaneko
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: reproduce.gif
>
>
> The Event Timeline area cannot be expanded when a Spark application has some 
> failed jobs, as shown in the attachment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33204) `Event Timeline` in Spark Job UI sometimes cannot be opened

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33204:


Assignee: (was: Apache Spark)

> `Event Timeline`  in Spark Job UI sometimes cannot be opened
> 
>
> Key: SPARK-33204
> URL: https://issues.apache.org/jira/browse/SPARK-33204
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: akiyamaneko
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: reproduce.gif
>
>
> The Event Timeline area cannot be expanded when a Spark application has some 
> failed jobs, as shown in the attachment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33204) `Event Timeline` in Spark Job UI sometimes cannot be opened

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33204:


Assignee: Apache Spark

> `Event Timeline`  in Spark Job UI sometimes cannot be opened
> 
>
> Key: SPARK-33204
> URL: https://issues.apache.org/jira/browse/SPARK-33204
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: akiyamaneko
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: reproduce.gif
>
>
> The Event Timeline area cannot be expanded when a Spark application has some 
> failed jobs, as shown in the attachment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33204) `Event Timeline` in Spark Job UI sometimes cannot be opened

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33204:


Assignee: Apache Spark

> `Event Timeline`  in Spark Job UI sometimes cannot be opened
> 
>
> Key: SPARK-33204
> URL: https://issues.apache.org/jira/browse/SPARK-33204
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: akiyamaneko
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: reproduce.gif
>
>
> The Event Timeline area cannot be expanded when a Spark application has some 
> failed jobs, as shown in the attachment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33204) `Event Timeline` in Spark Job UI sometimes cannot be opened

2020-10-21 Thread akiyamaneko (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

akiyamaneko updated SPARK-33204:

Summary: `Event Timeline`  in Spark Job UI sometimes cannot be opened  
(was: `Event Timeline`  in Spark Job UI sometimes cannot open)

> `Event Timeline`  in Spark Job UI sometimes cannot be opened
> 
>
> Key: SPARK-33204
> URL: https://issues.apache.org/jira/browse/SPARK-33204
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: akiyamaneko
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: reproduce.gif
>
>
> The Event Timeline area cannot be expanded when a Spark application has some 
> failed jobs, as shown in the attachment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33204) `Event Timeline` in Spark Job UI sometimes cannot open

2020-10-21 Thread akiyamaneko (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

akiyamaneko updated SPARK-33204:

Attachment: reproduce.gif

> `Event Timeline`  in Spark Job UI sometimes cannot open
> ---
>
> Key: SPARK-33204
> URL: https://issues.apache.org/jira/browse/SPARK-33204
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: akiyamaneko
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: reproduce.gif
>
>
> The Event Timeline area cannot be expanded when a Spark application has some 
> failed jobs, as shown in the attachment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33204) `Event Timeline` in Spark Job UI sometimes cannot open

2020-10-21 Thread akiyamaneko (Jira)
akiyamaneko created SPARK-33204:
---

 Summary: `Event Timeline`  in Spark Job UI sometimes cannot open
 Key: SPARK-33204
 URL: https://issues.apache.org/jira/browse/SPARK-33204
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.0.1
Reporter: akiyamaneko
 Fix For: 3.1.0


The Event Timeline area cannot be expanded when a Spark application has some 
failed jobs, as shown in the attachment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33160) Allow saving/loading INT96 in parquet w/o rebasing

2020-10-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218216#comment-17218216
 ] 

Apache Spark commented on SPARK-33160:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30118

> Allow saving/loading INT96 in parquet w/o rebasing
> --
>
> Key: SPARK-33160
> URL: https://issues.apache.org/jira/browse/SPARK-33160
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, Spark always performs rebasing of INT96 columns in the Parquet 
> datasource, but this is not required by the Parquet spec. This ticket aims to 
> allow users to turn off rebasing via a SQL config.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33196) Expose filtered aggregation API

2020-10-21 Thread Erwan Guyomarc'h (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erwan Guyomarc'h resolved SPARK-33196.
--
Resolution: Won't Do

> Expose filtered aggregation API
> ---
>
> Key: SPARK-33196
> URL: https://issues.apache.org/jira/browse/SPARK-33196
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Erwan Guyomarc'h
>Priority: Minor
>
> Spark currently supports filtered aggregations but does not expose an API to 
> use them from the `spark.sql.functions` package.
> It is possible to use them when writing SQL directly:
> {code:scala}
> scala> val df = spark.range(100)
> scala> df.registerTempTable("df")
> scala> spark.sql("select count(1) as classic_cnt, count(1) FILTER (WHERE id < 
> 50) from df").show()
> +-----------+-------------------------------------------------+
> |classic_cnt|count(1) FILTER (WHERE (id < CAST(50 AS BIGINT)))|
> +-----------+-------------------------------------------------+
> |        100|                                               50|
> +-----------+-------------------------------------------------+{code}
> These aggregations are especially useful when filtering on overlapping 
> datasets (where a pivot would not work):
> {code:sql}
> SELECT 
>  AVG(revenue) FILTER (WHERE age < 25),
>  AVG(revenue) FILTER (WHERE age < 35),
>  AVG(revenue) FILTER (WHERE age < 45)
> FROM people;{code}
> I did not find an issue tracking this, hence I am creating this one and I 
> will attach a PR to illustrate a possible implementation.
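
Until such an API exists, one possible workaround sketch is to go through 
{{expr}}, assuming the SQL expression parser accepts the FILTER clause; the 
{{people}} dataset and column names below are illustrative (run in spark-shell, 
or after importing {{spark.implicits._}}):
{code:scala}
import org.apache.spark.sql.functions.expr

val people = Seq((22, 100.0), (30, 200.0), (41, 300.0)).toDF("age", "revenue")

people.agg(
  expr("avg(revenue) FILTER (WHERE age < 25)").as("avg_rev_lt_25"),
  expr("avg(revenue) FILTER (WHERE age < 35)").as("avg_rev_lt_35"),
  expr("avg(revenue) FILTER (WHERE age < 45)").as("avg_rev_lt_45")
).show()
{code}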



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33203) Pyspark ml tests failing with rounding errors

2020-10-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218163#comment-17218163
 ] 

Apache Spark commented on SPARK-33203:
--

User 'AlessandroPatti' has created a pull request for this issue:
https://github.com/apache/spark/pull/30104

> Pyspark ml tests failing with rounding errors
> -
>
> Key: SPARK-33203
> URL: https://issues.apache.org/jira/browse/SPARK-33203
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark
>Affects Versions: 3.0.1
>Reporter: Alessandro Patti
>Priority: Minor
>
> The tests _{{pyspark.ml.recommendation}}_ and 
> _{{pyspark.ml.tests.test_algorithms}}_ occasionally fail (depends on 
> environment) with
> {code:java}
> File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in 
> test_raw_and_probability_prediction
>  self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, 
> atol=1))
> AssertionError: False is not true{code}
> {code:java}
> File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in 
> _main_.ALS
> Failed example:
>  predictions[0]
> Expected:
>  Row(user=0, item=2, newPrediction=0.6929101347923279)
> Got:
>  Row(user=0, item=2, newPrediction=0.6929104924201965)
> ...{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33203) Pyspark ml tests failing with rounding errors

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33203:


Assignee: Apache Spark

> Pyspark ml tests failing with rounding errors
> -
>
> Key: SPARK-33203
> URL: https://issues.apache.org/jira/browse/SPARK-33203
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark
>Affects Versions: 3.0.1
>Reporter: Alessandro Patti
>Assignee: Apache Spark
>Priority: Minor
>
> The tests _{{pyspark.ml.recommendation}}_ and 
> _{{pyspark.ml.tests.test_algorithms}}_ occasionally fail (depends on 
> environment) with
> {code:java}
> File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in 
> test_raw_and_probability_prediction
>  self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, 
> atol=1))
> AssertionError: False is not true{code}
> {code:java}
> File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in 
> _main_.ALS
> Failed example:
>  predictions[0]
> Expected:
>  Row(user=0, item=2, newPrediction=0.6929101347923279)
> Got:
>  Row(user=0, item=2, newPrediction=0.6929104924201965)
> ...{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33203) Pyspark ml tests failing with rounding errors

2020-10-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218162#comment-17218162
 ] 

Apache Spark commented on SPARK-33203:
--

User 'AlessandroPatti' has created a pull request for this issue:
https://github.com/apache/spark/pull/30104

> Pyspark ml tests failing with rounding errors
> -
>
> Key: SPARK-33203
> URL: https://issues.apache.org/jira/browse/SPARK-33203
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark
>Affects Versions: 3.0.1
>Reporter: Alessandro Patti
>Priority: Minor
>
> The tests _{{pyspark.ml.recommendation}}_ and 
> _{{pyspark.ml.tests.test_algorithms}}_ occasionally fail (depends on 
> environment) with
> {code:java}
> File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in 
> test_raw_and_probability_prediction
>  self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, 
> atol=1))
> AssertionError: False is not true{code}
> {code:java}
> File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in 
> _main_.ALS
> Failed example:
>  predictions[0]
> Expected:
>  Row(user=0, item=2, newPrediction=0.6929101347923279)
> Got:
>  Row(user=0, item=2, newPrediction=0.6929104924201965)
> ...{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33203) Pyspark ml tests failing with rounding errors

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33203:


Assignee: Apache Spark

> Pyspark ml tests failing with rounding errors
> -
>
> Key: SPARK-33203
> URL: https://issues.apache.org/jira/browse/SPARK-33203
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark
>Affects Versions: 3.0.1
>Reporter: Alessandro Patti
>Assignee: Apache Spark
>Priority: Minor
>
> The tests _{{pyspark.ml.recommendation}}_ and 
> _{{pyspark.ml.tests.test_algorithms}}_ occasionally fail (depends on 
> environment) with
> {code:java}
> File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in 
> test_raw_and_probability_prediction
>  self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, 
> atol=1))
> AssertionError: False is not true{code}
> {code:java}
> File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in 
> _main_.ALS
> Failed example:
>  predictions[0]
> Expected:
>  Row(user=0, item=2, newPrediction=0.6929101347923279)
> Got:
>  Row(user=0, item=2, newPrediction=0.6929104924201965)
> ...{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33203) Pyspark ml tests failing with rounding errors

2020-10-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33203:


Assignee: (was: Apache Spark)

> Pyspark ml tests failing with rounding errors
> -
>
> Key: SPARK-33203
> URL: https://issues.apache.org/jira/browse/SPARK-33203
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark
>Affects Versions: 3.0.1
>Reporter: Alessandro Patti
>Priority: Minor
>
> The tests _{{pyspark.ml.recommendation}}_ and 
> _{{pyspark.ml.tests.test_algorithms}}_ occasionally fail (depends on 
> environment) with
> {code:java}
> File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in 
> test_raw_and_probability_prediction
>  self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, 
> atol=1))
> AssertionError: False is not true{code}
> {code:java}
> File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in 
> _main_.ALS
> Failed example:
>  predictions[0]
> Expected:
>  Row(user=0, item=2, newPrediction=0.6929101347923279)
> Got:
>  Row(user=0, item=2, newPrediction=0.6929104924201965)
> ...{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33203) Pyspark ml tests failing with rounding errors

2020-10-21 Thread Alessandro Patti (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Patti updated SPARK-33203:
-
Description: 
The tests _{{pyspark.ml.recommendation}}_ and 
_{{pyspark.ml.tests.test_algorithms}}_ occasionally fail (depends on 
environment) with
{code:java}
File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in 
test_raw_and_probability_prediction
 self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, 
atol=1))
AssertionError: False is not true{code}
{code:java}
File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in 
_main_.ALS
Failed example:
 predictions[0]
Expected:
 Row(user=0, item=2, newPrediction=0.6929101347923279)
Got:
 Row(user=0, item=2, newPrediction=0.6929104924201965)
...{code}

  was:
The tests `pyspark.ml.recommendation` and `pyspark.ml.tests.test_algorithms` 
occasionally fail (depends on environment) with 
{code:java}
File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in 
test_raw_and_probability_prediction
 self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, 
atol=1))
AssertionError: False is not true{code}
{code:java}
File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in 
_main_.ALS
Failed example:
 predictions[0]
Expected:
 Row(user=0, item=2, newPrediction=0.6929101347923279)
Got:
 Row(user=0, item=2, newPrediction=0.6929104924201965)
...{code}


> Pyspark ml tests failing with rounding errors
> -
>
> Key: SPARK-33203
> URL: https://issues.apache.org/jira/browse/SPARK-33203
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark
>Affects Versions: 3.0.1
>Reporter: Alessandro Patti
>Priority: Minor
>
> The tests _{{pyspark.ml.recommendation}}_ and 
> _{{pyspark.ml.tests.test_algorithms}}_ occasionally fail (depends on 
> environment) with
> {code:java}
> File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in 
> test_raw_and_probability_prediction
>  self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, 
> atol=1))
> AssertionError: False is not true{code}
> {code:java}
> File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in 
> _main_.ALS
> Failed example:
>  predictions[0]
> Expected:
>  Row(user=0, item=2, newPrediction=0.6929101347923279)
> Got:
>  Row(user=0, item=2, newPrediction=0.6929104924201965)
> ...{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33203) Pyspark ml tests failing with rounding errors

2020-10-21 Thread Alessandro Patti (Jira)
Alessandro Patti created SPARK-33203:


 Summary: Pyspark ml tests failing with rounding errors
 Key: SPARK-33203
 URL: https://issues.apache.org/jira/browse/SPARK-33203
 Project: Spark
  Issue Type: Test
  Components: ML, PySpark
Affects Versions: 3.0.1
Reporter: Alessandro Patti


The tests `pyspark.ml.recommendation` and `pyspark.ml.tests.test_algorithms` 
occasionally fail (depending on the environment) with 
{code:java}
File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in 
test_raw_and_probability_prediction
 self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, 
atol=1))
AssertionError: False is not true{code}
{code:java}
File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in 
_main_.ALS
Failed example:
 predictions[0]
Expected:
 Row(user=0, item=2, newPrediction=0.6929101347923279)
Got:
 Row(user=0, item=2, newPrediction=0.6929104924201965)
...{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32785) interval with dangling part should not results null

2020-10-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218155#comment-17218155
 ] 

Apache Spark commented on SPARK-32785:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/30117

> interval with dangling part should not results null
> ---
>
> Key: SPARK-32785
> URL: https://issues.apache.org/jira/browse/SPARK-32785
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> bin/spark-sql -S -e "select interval '1', interval '+', interval '1 day -'"
> NULL    NULL    NULL
> We should fail these cases with a proper error instead of silently returning NULL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32185) User Guide - Monitoring

2020-10-21 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218143#comment-17218143
 ] 

Hyukjin Kwon commented on SPARK-32185:
--

[~a7prasad] is there any update on this :-)?

> User Guide - Monitoring
> ---
>
> Key: SPARK-32185
> URL: https://issues.apache.org/jira/browse/SPARK-32185
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Abhijeet Prasad
>Priority: Major
>
> Monitoring. We should focus on how to monitor PySpark jobs.
> - Custom worker, see also 
> https://github.com/apache/spark/tree/master/python/test_coverage to enable 
> test coverage that includes the worker side too.
> - Sentry support (?) 
> https://blog.sentry.io/2019/11/12/sentry-for-data-error-monitoring-with-pyspark
> - Link back to https://spark.apache.org/docs/latest/monitoring.html . 
> - ...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32881) NoSuchElementException occurs during decommissioning

2020-10-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32881.
---
Fix Version/s: 3.1.0
 Assignee: Holden Karau
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/29992

> NoSuchElementException occurs during decommissioning
> 
>
> Key: SPARK-32881
> URL: https://issues.apache.org/jira/browse/SPARK-32881
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.1.0
>
>
> `BlockManagerMasterEndpoint` seems to fail at `getReplicateInfoForRDDBlocks` 
> due to `java.util.NoSuchElementException`. This happens during K8s integration 
> testing, but the main code seems to need graceful handling of 
> `NoSuchElementException` instead of surfacing a bare error message.
> {code}
> private def getReplicateInfoForRDDBlocks(blockManagerId: BlockManagerId): 
> Seq[ReplicateBlock] = {
> val info = blockManagerInfo(blockManagerId)
>...
> }
> {code}
> {code}
>   20/09/14 18:56:54 INFO ExecutorPodsAllocator: Going to request 1 executors 
> from Kubernetes.
>   20/09/14 18:56:54 INFO BasicExecutorFeatureStep: Adding decommission script 
> to lifecycle
>   20/09/14 18:56:55 ERROR TaskSchedulerImpl: Lost executor 1 on 172.17.0.4: 
> Executor decommission.
>   20/09/14 18:56:55 INFO BlockManagerMaster: Removal of executor 1 requested
>   20/09/14 18:56:55 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove 
> non-existent executor 1
>   20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Trying to remove 
> executor 1 from BlockManagerMaster.
>   20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Removing block manager 
> BlockManagerId(1, 172.17.0.4, 41235, None)
>   20/09/14 18:56:55 INFO DAGScheduler: Executor lost: 1 (epoch 1)
>   20/09/14 18:56:55 ERROR Inbox: Ignoring error
>   java.util.NoSuchElementException
>   at scala.collection.concurrent.TrieMap.apply(TrieMap.scala:833)
>   at 
> org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$getReplicateInfoForRDDBlocks(BlockManagerMasterEndpoint.scala:383)
>   at 
> org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:171)
>   at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>   at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>   20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Trying to remove 
> executor 1 from BlockManagerMaster.
>   20/09/14 18:56:55 INFO BlockManagerMaster: Removed 1 successfully in 
> removeExecutor
>   20/09/14 18:56:55 INFO DAGScheduler: Shuffle files lost for executor: 1 
> (epoch 1)
>   20/09/14 18:56:58 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered 
> executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.7:46674) with 
> ID 4,  ResourceProfileId 0
>   20/09/14 18:56:58 INFO BlockManagerMasterEndpoint: Registering block 
> manager 172.17.0.7:40495 with 593.9 MiB RAM, BlockManagerId(4, 172.17.0.7, 
> 40495, None)
>   20/09/14 18:57:23 INFO SparkContext: Starting job: count at 
> /opt/spark/tests/decommissioning.py:49
> {code}
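
Not the actual fix, just a sketch of the kind of graceful handling the 
description asks for; {{BlockManagerId}} and {{ReplicateBlock}} below are 
simplified stand-ins for the real Spark classes, and the point is only the 
Option-based lookup instead of {{blockManagerInfo(blockManagerId)}}:
{code:scala}
import scala.collection.concurrent.TrieMap

// Hypothetical stand-ins, just to show the lookup pattern.
final case class BlockManagerId(host: String, port: Int)
final case class ReplicateBlock(blockName: String)

val blockManagerInfo = TrieMap.empty[BlockManagerId, Seq[ReplicateBlock]]

def getReplicateInfoForRDDBlocks(blockManagerId: BlockManagerId): Seq[ReplicateBlock] =
  blockManagerInfo.get(blockManagerId) match {
    case Some(blocks) => blocks      // executor still registered
    case None         => Seq.empty   // already removed during decommissioning: nothing to replicate
  }
{code}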



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org