[jira] [Updated] (SPARK-32247) scipy installation fails with PyPy
[ https://issues.apache.org/jira/browse/SPARK-32247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32247: - Fix Version/s: 3.0.2 2.4.8 > scipy installation fails with PyPy > -- > > Key: SPARK-32247 > URL: https://issues.apache.org/jira/browse/SPARK-32247 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.6, 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > scipy can also be installed under PyPy, and a few PySpark test > cases depend on it. > However, the installation fails in the GitHub Actions environment. We should > install it and > test it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
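For context, PySpark-style test suites typically gate scipy-dependent cases behind a try-import guard so the suite still runs when the package is missing. Below is a minimal, runnable sketch of that pattern; the {{have_scipy}} name and the sample test are illustrative, not the exact code in Spark's test suite.

{code:python}
import unittest

# Detect whether scipy is importable (it may be absent, e.g. when the
# PyPy installation step fails in CI).
try:
    import scipy.sparse  # noqa: F401
    have_scipy = True
except ImportError:
    have_scipy = False


@unittest.skipIf(not have_scipy, "scipy is not installed")
class SciPyDependentTests(unittest.TestCase):
    def test_sparse_matrix(self):
        from scipy.sparse import csr_matrix
        m = csr_matrix([[1.0, 0.0], [0.0, 2.0]])
        self.assertEqual(m.nnz, 2)


if __name__ == "__main__":
    unittest.main()
{code}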
[jira] [Commented] (SPARK-33213) Upgrade Apache Arrow to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-33213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218743#comment-17218743 ] Hyukjin Kwon commented on SPARK-33213: -- cc [~bryanc] FYI > Upgrade Apache Arrow to 2.0.0 > - > > Key: SPARK-33213 > URL: https://issues.apache.org/jira/browse/SPARK-33213 > Project: Spark > Issue Type: Dependency upgrade > Components: SQL >Affects Versions: 3.0.1 >Reporter: Chao Sun >Priority: Minor > > Apache Arrow 2.0.0 has [just been > released|https://cwiki.apache.org/confluence/display/ARROW/Arrow+2.0.0+Release]. > This proposes to upgrade Spark's Arrow dependency to use 2.0.0, from the > current 1.0.1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33217) Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4
[ https://issues.apache.org/jira/browse/SPARK-33217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218734#comment-17218734 ] Apache Spark commented on SPARK-33217: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/30128 > Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4 > - > > Key: SPARK-33217 > URL: https://issues.apache.org/jira/browse/SPARK-33217 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.8 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
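The pinning itself belongs in the CI workflow's install step, but a related guard sometimes used alongside it is an explicit version-bound check at test setup. A minimal sketch follows; the bounds shown (pyarrow < 2.0.0, pandas < 1.0.0) are assumptions for illustration, not necessarily the exact limits chosen for branch-2.4.

{code:python}
# Illustrative version-bound guard for a test environment.
from distutils.version import LooseVersion

import pandas
import pyarrow


def check_dependency_upper_bounds():
    if LooseVersion(pyarrow.__version__) >= LooseVersion("2.0.0"):
        raise RuntimeError(
            "pyarrow %s exceeds the supported upper bound" % pyarrow.__version__)
    if LooseVersion(pandas.__version__) >= LooseVersion("1.0.0"):
        raise RuntimeError(
            "pandas %s exceeds the supported upper bound" % pandas.__version__)


check_dependency_upper_bounds()
{code}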
[jira] [Assigned] (SPARK-33217) Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4
[ https://issues.apache.org/jira/browse/SPARK-33217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33217: Assignee: Apache Spark > Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4 > - > > Key: SPARK-33217 > URL: https://issues.apache.org/jira/browse/SPARK-33217 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.8 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33217) Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4
[ https://issues.apache.org/jira/browse/SPARK-33217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33217: Assignee: (was: Apache Spark) > Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4 > - > > Key: SPARK-33217 > URL: https://issues.apache.org/jira/browse/SPARK-33217 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.8 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33189) Support PyArrow 2.0.0+
[ https://issues.apache.org/jira/browse/SPARK-33189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218727#comment-17218727 ] Hyukjin Kwon edited comment on SPARK-33189 at 10/22/20, 4:26 AM: - This was reverted in branch-2.4 at https://github.com/apache/spark/commit/a39a0963cbac0b51388023479a8a60e0a8b924d0 and https://github.com/apache/spark/commit/88a3110c367c89a7b4931a3ab13ec91cdf0bcc41. See SPARK-33217 was (Author: hyukjin.kwon): This was reverted at https://github.com/apache/spark/commit/a39a0963cbac0b51388023479a8a60e0a8b924d0 and https://github.com/apache/spark/commit/88a3110c367c89a7b4931a3ab13ec91cdf0bcc41. See SPARK-33217 > Support PyArrow 2.0.0+ > -- > > Key: SPARK-33189 > URL: https://issues.apache.org/jira/browse/SPARK-33189 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Bryan Cutler >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > Some tests fail with PyArrow 2.0.0 in PySpark: > {code} > == > ERROR [0.774s]: test_grouped_over_window_with_key > (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests) > -- > Traceback (most recent call last): > File > "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line > 595, in test_grouped_over_window_with_key > .select('id', 'result').collect() > File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in > collect > sock_info = self._jdf.collectToPython() > File > "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line > 1305, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco > raise converted from None > pyspark.sql.utils.PythonException: > An exception was thrown from the Python worker. Please see the stack trace > below. 
> Traceback (most recent call last): > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, > in main > process() > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, > in process > serializer.dump_stream(out_iter, outfile) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 255, in dump_stream > return ArrowStreamSerializer.dump_stream(self, > init_stream_yield_batches(), stream) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 81, in dump_stream > for batch in iterator: > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 248, in init_stream_yield_batches > for series in iterator: > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, > in mapper > return f(keys, vals) > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, > in > return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))] > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, > in wrapped > result = f(key, pd.concat(value_series, axis=1)) > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in > wrapper > return f(*args, **kwargs) > File > "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line > 590, in f > "{} != {}".format(expected_key[i][1], window_range) > AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': > datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, > 15, 0, 0, tzinfo=), 'end': datetime.datetime(2018, 3, > 20, 0, 0, tzinfo=)} > {code} > We should verify and support PyArrow 2.0.0+ > See also https://github.com/apache/spark/runs/1278918780 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
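The assertion failure above boils down to naive versus timezone-aware datetimes: the expected window keys are naive, while the values handed back to the UDF carry a tzinfo (reading the truncated {{tzinfo=}} fields in the trace as timezone objects is an inference). Plain Python reproduces the mismatch:

{code:python}
from datetime import datetime, timezone

expected = {'start': datetime(2018, 3, 15), 'end': datetime(2018, 3, 20)}
got = {'start': datetime(2018, 3, 15, tzinfo=timezone.utc),
       'end': datetime(2018, 3, 20, tzinfo=timezone.utc)}

# Equality between a naive and a timezone-aware datetime is always False,
# so the dict comparison in the test fails even if the instants agree.
print(expected == got)  # False
{code}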
[jira] [Updated] (SPARK-33190) Set upperbound of PyArrow version in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-33190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33190: - Fix Version/s: (was: 2.4.8) > Set upperbound of PyArrow version in GitHub Actions > --- > > Key: SPARK-33190 > URL: https://issues.apache.org/jira/browse/SPARK-33190 > Project: Spark > Issue Type: Test > Components: PySpark, Tests >Affects Versions: 2.4.7, 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > See SPARK-33189. Some tests appear to fail with PyArrow 2.0.0+. We should > make the tests pass. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33190) Set upperbound of PyArrow version in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-33190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218730#comment-17218730 ] Hyukjin Kwon commented on SPARK-33190: -- This was reverted in branch-2.4. See SPARK-33217 > Set upperbound of PyArrow version in GitHub Actions > --- > > Key: SPARK-33190 > URL: https://issues.apache.org/jira/browse/SPARK-33190 > Project: Spark > Issue Type: Test > Components: PySpark, Tests >Affects Versions: 2.4.7, 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > See SPARK-33189. Some tests appear to fail with PyArrow 2.0.0+. We should > make the tests pass. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33189) Support PyArrow 2.0.0+
[ https://issues.apache.org/jira/browse/SPARK-33189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33189: - Fix Version/s: (was: 2.4.8) > Support PyArrow 2.0.0+ > -- > > Key: SPARK-33189 > URL: https://issues.apache.org/jira/browse/SPARK-33189 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Bryan Cutler >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > Some tests fail with PyArrow 2.0.0 in PySpark: > {code} > == > ERROR [0.774s]: test_grouped_over_window_with_key > (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests) > -- > Traceback (most recent call last): > File > "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line > 595, in test_grouped_over_window_with_key > .select('id', 'result').collect() > File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in > collect > sock_info = self._jdf.collectToPython() > File > "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line > 1305, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco > raise converted from None > pyspark.sql.utils.PythonException: > An exception was thrown from the Python worker. Please see the stack trace > below. > Traceback (most recent call last): > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, > in main > process() > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, > in process > serializer.dump_stream(out_iter, outfile) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 255, in dump_stream > return ArrowStreamSerializer.dump_stream(self, > init_stream_yield_batches(), stream) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 81, in dump_stream > for batch in iterator: > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 248, in init_stream_yield_batches > for series in iterator: > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, > in mapper > return f(keys, vals) > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, > in > return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))] > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, > in wrapped > result = f(key, pd.concat(value_series, axis=1)) > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in > wrapper > return f(*args, **kwargs) > File > "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line > 590, in f > "{} != {}".format(expected_key[i][1], window_range) > AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': > datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, > 15, 0, 0, tzinfo=), 'end': datetime.datetime(2018, 3, > 20, 0, 0, tzinfo=)} > {code} > We should verify and support PyArrow 2.0.0+ > See also https://github.com/apache/spark/runs/1278918780 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33189) Support PyArrow 2.0.0+
[ https://issues.apache.org/jira/browse/SPARK-33189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218727#comment-17218727 ] Hyukjin Kwon commented on SPARK-33189: -- This was reverted at https://github.com/apache/spark/commit/a39a0963cbac0b51388023479a8a60e0a8b924d0 and https://github.com/apache/spark/commit/88a3110c367c89a7b4931a3ab13ec91cdf0bcc41. See SPARK-33217 > Support PyArrow 2.0.0+ > -- > > Key: SPARK-33189 > URL: https://issues.apache.org/jira/browse/SPARK-33189 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > Some tests fail with PyArrow 2.0.0 in PySpark: > {code} > == > ERROR [0.774s]: test_grouped_over_window_with_key > (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests) > -- > Traceback (most recent call last): > File > "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line > 595, in test_grouped_over_window_with_key > .select('id', 'result').collect() > File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in > collect > sock_info = self._jdf.collectToPython() > File > "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line > 1305, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco > raise converted from None > pyspark.sql.utils.PythonException: > An exception was thrown from the Python worker. Please see the stack trace > below. > Traceback (most recent call last): > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, > in main > process() > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, > in process > serializer.dump_stream(out_iter, outfile) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 255, in dump_stream > return ArrowStreamSerializer.dump_stream(self, > init_stream_yield_batches(), stream) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 81, in dump_stream > for batch in iterator: > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 248, in init_stream_yield_batches > for series in iterator: > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, > in mapper > return f(keys, vals) > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, > in > return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))] > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, > in wrapped > result = f(key, pd.concat(value_series, axis=1)) > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in > wrapper > return f(*args, **kwargs) > File > "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line > 590, in f > "{} != {}".format(expected_key[i][1], window_range) > AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': > datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, > 15, 0, 0, tzinfo=), 'end': datetime.datetime(2018, 3, > 20, 0, 0, tzinfo=)} > {code} > We should verify and support PyArrow 2.0.0+ > See also https://github.com/apache/spark/runs/1278918780 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
[jira] [Created] (SPARK-33217) Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4
Hyukjin Kwon created SPARK-33217: Summary: Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4 Key: SPARK-33217 URL: https://issues.apache.org/jira/browse/SPARK-33217 Project: Spark Issue Type: Bug Components: Project Infra Affects Versions: 2.4.8 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33217) Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4
[ https://issues.apache.org/jira/browse/SPARK-33217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33217: - Reporter: Hyukjin Kwon (was: Dongjoon Hyun) > Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4 > - > > Key: SPARK-33217 > URL: https://issues.apache.org/jira/browse/SPARK-33217 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.8 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33212) Move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-33212. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29843 [https://github.com/apache/spark/pull/29843] > Move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.1.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd party dependencies such as Guava, > protobuf, jetty etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+ and in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependency graph cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to > use only the public/client API from the Hadoop side. > * Provides better isolation from Hadoop dependencies. In the future, Spark can > evolve without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33212) Move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-33212: --- Assignee: Chao Sun > Move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd party dependencies such as Guava, > protobuf, jetty etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+ and in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependency graph cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to > use only the public/client API from the Hadoop side. > * Provides better isolation from Hadoop dependencies. In the future, Spark can > evolve without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-33216) Set upper bound of Pandas version in GitHub Actions in branch-2.4
[ https://issues.apache.org/jira/browse/SPARK-33216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-33216. - > Set upper bound of Pandas version in GitHub Actions in branch-2.4 > - > > Key: SPARK-33216 > URL: https://issues.apache.org/jira/browse/SPARK-33216 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.8 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33216) Set upper bound of Pandas version in GitHub Actions in branch-2.4
[ https://issues.apache.org/jira/browse/SPARK-33216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33216. --- Resolution: Duplicate > Set upper bound of Pandas version in GitHub Actions in branch-2.4 > - > > Key: SPARK-33216 > URL: https://issues.apache.org/jira/browse/SPARK-33216 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.8 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33210) Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default
[ https://issues.apache.org/jira/browse/SPARK-33210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33210. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30121 [https://github.com/apache/spark/pull/30121] > Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default > -- > > Key: SPARK-33210 > URL: https://issues.apache.org/jira/browse/SPARK-33210 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0 > > > The ticket aims to set the following SQL configs: > - spark.sql.legacy.parquet.int96RebaseModeInWrite > - spark.sql.legacy.parquet.int96RebaseModeInRead > to EXCEPTION by default. > The reason is to let users decide whether Spark should modify loaded/saved timestamps, > instead of silently shifting timestamps while rebasing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
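For reference, the two settings above are ordinary runtime SQL configs. Here is a minimal PySpark sketch of choosing a mode explicitly once {{EXCEPTION}} is the default; {{CORRECTED}} is shown under the assumption that it is one of the accepted values, by analogy with the datetime rebase configs.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With EXCEPTION as the default, INT96 reads/writes that would require
# rebasing fail fast; a user who knows the data's provenance can opt in
# to a concrete behavior instead.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
{code}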
[jira] [Assigned] (SPARK-33210) Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default
[ https://issues.apache.org/jira/browse/SPARK-33210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33210: --- Assignee: Maxim Gekk > Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default > -- > > Key: SPARK-33210 > URL: https://issues.apache.org/jira/browse/SPARK-33210 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > The ticket aims to set the following SQL configs: > - spark.sql.legacy.parquet.int96RebaseModeInWrite > - spark.sql.legacy.parquet.int96RebaseModeInRead > to EXCEPTION by default. > The reason is to let users decide whether Spark should modify loaded/saved timestamps, > instead of silently shifting timestamps while rebasing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33216) Set upper bound of Pandas version in GitHub Actions in branch-2.4
[ https://issues.apache.org/jira/browse/SPARK-33216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33216: Assignee: (was: Apache Spark) > Set upper bound of Pandas version in GitHub Actions in branch-2.4 > - > > Key: SPARK-33216 > URL: https://issues.apache.org/jira/browse/SPARK-33216 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.8 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33216) Set upper bound of Pandas version in GitHub Actions in branch-2.4
[ https://issues.apache.org/jira/browse/SPARK-33216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218705#comment-17218705 ] Apache Spark commented on SPARK-33216: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/30127 > Set upper bound of Pandas version in GitHub Actions in branch-2.4 > - > > Key: SPARK-33216 > URL: https://issues.apache.org/jira/browse/SPARK-33216 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.8 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33216) Set upper bound of Pandas version in GitHub Actions in branch-2.4
[ https://issues.apache.org/jira/browse/SPARK-33216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33216: Assignee: Apache Spark > Set upper bound of Pandas version in GitHub Actions in branch-2.4 > - > > Key: SPARK-33216 > URL: https://issues.apache.org/jira/browse/SPARK-33216 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.8 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33216) Set upper bound of Pandas version in GitHub Actions in branch-2.4
Dongjoon Hyun created SPARK-33216: - Summary: Set upper bound of Pandas version in GitHub Actions in branch-2.4 Key: SPARK-33216 URL: https://issues.apache.org/jira/browse/SPARK-33216 Project: Spark Issue Type: Bug Components: Project Infra Affects Versions: 2.4.8 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33215) Speed up event log download by skipping UI rebuild
[ https://issues.apache.org/jira/browse/SPARK-33215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218683#comment-17218683 ] Apache Spark commented on SPARK-33215: -- User 'baohe-zhang' has created a pull request for this issue: https://github.com/apache/spark/pull/30126 > Speed up event log download by skipping UI rebuild > -- > > Key: SPARK-33215 > URL: https://issues.apache.org/jira/browse/SPARK-33215 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.7, 3.0.1 >Reporter: Baohe Zhang >Priority: Major > > Right now, when we want to download the event logs from the Spark history > server (SHS), SHS needs to parse the entire event log to rebuild the UI, and > this is just for view permission checks. UI rebuilding is a time-consuming > and memory-intensive task, especially for large logs. However, this process > is unnecessary for event log download. > This patch enables SHS to check UI view permissions of a given app/attempt > for a given user, without rebuilding the UI. This is achieved by adding a > method "checkUIViewPermissions(appId: String, attemptId: Option[String], > user: String): Boolean" to many layers of history server components. > With this patch, UI rebuild can be skipped when downloading event logs from > the history server. Thus the time to download a GB-scale event log can be > reduced from several minutes to several seconds, and the memory consumption > of UI rebuilding can be avoided. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33215) Speed up event log download by skipping UI rebuild
[ https://issues.apache.org/jira/browse/SPARK-33215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33215: Assignee: Apache Spark > Speed up event log download by skipping UI rebuild > -- > > Key: SPARK-33215 > URL: https://issues.apache.org/jira/browse/SPARK-33215 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.7, 3.0.1 >Reporter: Baohe Zhang >Assignee: Apache Spark >Priority: Major > > Right now, when we want to download the event logs from the Spark history > server (SHS), SHS needs to parse the entire event log to rebuild the UI, and > this is just for view permission checks. UI rebuilding is a time-consuming > and memory-intensive task, especially for large logs. However, this process > is unnecessary for event log download. > This patch enables SHS to check UI view permissions of a given app/attempt > for a given user, without rebuilding the UI. This is achieved by adding a > method "checkUIViewPermissions(appId: String, attemptId: Option[String], > user: String): Boolean" to many layers of history server components. > With this patch, UI rebuild can be skipped when downloading event logs from > the history server. Thus the time to download a GB-scale event log can be > reduced from several minutes to several seconds, and the memory consumption > of UI rebuilding can be avoided. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33215) Speed up event log download by skipping UI rebuild
[ https://issues.apache.org/jira/browse/SPARK-33215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33215: Assignee: (was: Apache Spark) > Speed up event log download by skipping UI rebuild > -- > > Key: SPARK-33215 > URL: https://issues.apache.org/jira/browse/SPARK-33215 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.7, 3.0.1 >Reporter: Baohe Zhang >Priority: Major > > Right now, when we want to download the event logs from the Spark history > server (SHS), SHS needs to parse the entire event log to rebuild the UI, and > this is just for view permission checks. UI rebuilding is a time-consuming > and memory-intensive task, especially for large logs. However, this process > is unnecessary for event log download. > This patch enables SHS to check UI view permissions of a given app/attempt > for a given user, without rebuilding the UI. This is achieved by adding a > method "checkUIViewPermissions(appId: String, attemptId: Option[String], > user: String): Boolean" to many layers of history server components. > With this patch, UI rebuild can be skipped when downloading event logs from > the history server. Thus the time to download a GB-scale event log can be > reduced from several minutes to several seconds, and the memory consumption > of UI rebuilding can be avoided. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33203) Pyspark ml tests failing with rounding errors
[ https://issues.apache.org/jira/browse/SPARK-33203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33203: - Assignee: Alessandro Patti (was: Apache Spark) > Pyspark ml tests failing with rounding errors > - > > Key: SPARK-33203 > URL: https://issues.apache.org/jira/browse/SPARK-33203 > Project: Spark > Issue Type: Test > Components: ML, PySpark >Affects Versions: 3.0.1 >Reporter: Alessandro Patti >Assignee: Alessandro Patti >Priority: Minor > Fix For: 3.1.0 > > > The tests _{{pyspark.ml.recommendation}}_ and > _{{pyspark.ml.tests.test_algorithms}}_ occasionally fail (depending on the > environment) with > {code:java} > File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in > test_raw_and_probability_prediction > self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, > atol=1)) > AssertionError: False is not true{code} > {code:java} > File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in > __main__.ALS > Failed example: > predictions[0] > Expected: > Row(user=0, item=2, newPrediction=0.6929101347923279) > Got: > Row(user=0, item=2, newPrediction=0.6929104924201965) > ...{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33203) Pyspark ml tests failing with rounding errors
[ https://issues.apache.org/jira/browse/SPARK-33203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33203. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30104 [https://github.com/apache/spark/pull/30104] > Pyspark ml tests failing with rounding errors > - > > Key: SPARK-33203 > URL: https://issues.apache.org/jira/browse/SPARK-33203 > Project: Spark > Issue Type: Test > Components: ML, PySpark >Affects Versions: 3.0.1 >Reporter: Alessandro Patti >Assignee: Apache Spark >Priority: Minor > Fix For: 3.1.0 > > > The tests _{{pyspark.ml.recommendation}}_ and > _{{pyspark.ml.tests.test_algorithms}}_ occasionally fail (depending on the > environment) with > {code:java} > File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in > test_raw_and_probability_prediction > self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, > atol=1)) > AssertionError: False is not true{code} > {code:java} > File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in > __main__.ALS > Failed example: > predictions[0] > Expected: > Row(user=0, item=2, newPrediction=0.6929101347923279) > Got: > Row(user=0, item=2, newPrediction=0.6929104924201965) > ...{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
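The flakiness reported above is a floating-point tolerance issue: whether the quoted ALS predictions "match" depends entirely on the tolerance passed to {{np.allclose}}. A small self-contained illustration using the values from the report:

{code:python}
import numpy as np

expected = np.array([0.6929101347923279])
got = np.array([0.6929104924201965])

# The absolute difference is about 3.6e-7. rtol is zeroed out here so
# that atol alone decides the comparison.
print(np.allclose(got, expected, rtol=0, atol=1e-6))  # True
print(np.allclose(got, expected, rtol=0, atol=1e-7))  # False
{code}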
[jira] [Created] (SPARK-33215) Speed up event log download by skipping UI rebuild
Baohe Zhang created SPARK-33215: --- Summary: Speed up event log download by skipping UI rebuild Key: SPARK-33215 URL: https://issues.apache.org/jira/browse/SPARK-33215 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.1, 2.4.7 Reporter: Baohe Zhang Right now, when we want to download the event logs from the Spark history server (SHS), SHS needs to parse the entire event log to rebuild the UI, and this is just for view permission checks. UI rebuilding is a time-consuming and memory-intensive task, especially for large logs. However, this process is unnecessary for event log download. This patch enables SHS to check UI view permissions of a given app/attempt for a given user, without rebuilding the UI. This is achieved by adding a method "checkUIViewPermissions(appId: String, attemptId: Option[String], user: String): Boolean" to many layers of history server components. With this patch, UI rebuild can be skipped when downloading event logs from the history server. Thus the time to download a GB-scale event log can be reduced from several minutes to several seconds, and the memory consumption of UI rebuilding can be avoided. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33189) Support PyArrow 2.0.0+
[ https://issues.apache.org/jira/browse/SPARK-33189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218675#comment-17218675 ] Apache Spark commented on SPARK-33189: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/30125 > Support PyArrow 2.0.0+ > -- > > Key: SPARK-33189 > URL: https://issues.apache.org/jira/browse/SPARK-33189 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > Some tests fail with PyArrow 2.0.0 in PySpark: > {code} > == > ERROR [0.774s]: test_grouped_over_window_with_key > (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests) > -- > Traceback (most recent call last): > File > "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line > 595, in test_grouped_over_window_with_key > .select('id', 'result').collect() > File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in > collect > sock_info = self._jdf.collectToPython() > File > "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line > 1305, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco > raise converted from None > pyspark.sql.utils.PythonException: > An exception was thrown from the Python worker. Please see the stack trace > below. > Traceback (most recent call last): > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, > in main > process() > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, > in process > serializer.dump_stream(out_iter, outfile) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 255, in dump_stream > return ArrowStreamSerializer.dump_stream(self, > init_stream_yield_batches(), stream) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 81, in dump_stream > for batch in iterator: > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 248, in init_stream_yield_batches > for series in iterator: > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, > in mapper > return f(keys, vals) > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, > in > return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))] > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, > in wrapped > result = f(key, pd.concat(value_series, axis=1)) > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in > wrapper > return f(*args, **kwargs) > File > "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line > 590, in f > "{} != {}".format(expected_key[i][1], window_range) > AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': > datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, > 15, 0, 0, tzinfo=), 'end': datetime.datetime(2018, 3, > 20, 0, 0, tzinfo=)} > {code} > We should verify and support PyArrow 2.0.0+ > See also https://github.com/apache/spark/runs/1278918780 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33189) Support PyArrow 2.0.0+
[ https://issues.apache.org/jira/browse/SPARK-33189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218674#comment-17218674 ] Apache Spark commented on SPARK-33189: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/30124 > Support PyArrow 2.0.0+ > -- > > Key: SPARK-33189 > URL: https://issues.apache.org/jira/browse/SPARK-33189 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > Some tests fail with PyArrow 2.0.0 in PySpark: > {code} > == > ERROR [0.774s]: test_grouped_over_window_with_key > (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests) > -- > Traceback (most recent call last): > File > "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line > 595, in test_grouped_over_window_with_key > .select('id', 'result').collect() > File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in > collect > sock_info = self._jdf.collectToPython() > File > "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line > 1305, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco > raise converted from None > pyspark.sql.utils.PythonException: > An exception was thrown from the Python worker. Please see the stack trace > below. > Traceback (most recent call last): > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, > in main > process() > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, > in process > serializer.dump_stream(out_iter, outfile) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 255, in dump_stream > return ArrowStreamSerializer.dump_stream(self, > init_stream_yield_batches(), stream) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 81, in dump_stream > for batch in iterator: > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 248, in init_stream_yield_batches > for series in iterator: > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, > in mapper > return f(keys, vals) > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, > in > return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))] > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, > in wrapped > result = f(key, pd.concat(value_series, axis=1)) > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in > wrapper > return f(*args, **kwargs) > File > "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line > 590, in f > "{} != {}".format(expected_key[i][1], window_range) > AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': > datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, > 15, 0, 0, tzinfo=), 'end': datetime.datetime(2018, 3, > 20, 0, 0, tzinfo=)} > {code} > We should verify and support PyArrow 2.0.0+ > See also https://github.com/apache/spark/runs/1278918780 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19297) Add ability for --packages tag to pull latest version
[ https://issues.apache.org/jira/browse/SPARK-19297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aoyuan Liao resolved SPARK-19297. - Target Version/s: 3.0.1 Resolution: Fixed > Add ability for --packages tag to pull latest version > - > > Key: SPARK-19297 > URL: https://issues.apache.org/jira/browse/SPARK-19297 > Project: Spark > Issue Type: New Feature >Affects Versions: 2.1.0 >Reporter: Steven Landes >Priority: Minor > Labels: features, newbie > Attachments: packages_latest.txt > > > It would be super-convenient, in a development environment, to be able to use > the --packages argument to point spark to the latest version of a package > instead of specifying a specific version. > For example, instead of the following: > --packages com.databricks:spark-csv_2.11:1.5.0 > I could just put in this: > --packages com.databricks:spark-csv_2.11:latest -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19297) Add ability for --packages tag to pull latest version
[ https://issues.apache.org/jira/browse/SPARK-19297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218660#comment-17218660 ] Aoyuan Liao commented on SPARK-19297: - latest.release can be used.
{code:java}
bin/spark-submit --packages com.databricks:spark-csv_2.11:latest.release examples/src/main/python/pi.py 10
:: loading settings :: url = jar:file:/home/eve/repo/spark/assembly/target/scala-2.12/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/eve/.ivy2/cache
The jars for the packages stored in: /home/eve/.ivy2/jars
com.databricks#spark-csv_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-a24e5fe4-814e-48d0-baf4-6ff489520dd4;1.0
	confs: [default]
	found com.databricks#spark-csv_2.11;1.5.0 in central
	[1.5.0] com.databricks#spark-csv_2.11;latest.release
	found org.apache.commons#commons-csv;1.1 in central
	found com.univocity#univocity-parsers;1.5.1 in central
downloading https://repo1.maven.org/maven2/com/databricks/spark-csv_2.11/1.5.0/spark-csv_2.11-1.5.0.jar ...
	[SUCCESSFUL ] com.databricks#spark-csv_2.11;1.5.0!spark-csv_2.11.jar (87ms)
downloading https://repo1.maven.org/maven2/org/apache/commons/commons-csv/1.1/commons-csv-1.1.jar ...
	[SUCCESSFUL ] org.apache.commons#commons-csv;1.1!commons-csv.jar (36ms)
downloading https://repo1.maven.org/maven2/com/univocity/univocity-parsers/1.5.1/univocity-parsers-1.5.1.jar ...
	[SUCCESSFUL ] com.univocity#univocity-parsers;1.5.1!univocity-parsers.jar (127ms)
:: resolution report :: resolve 1729ms :: artifacts dl 257ms
	:: modules in use:
	com.databricks#spark-csv_2.11;1.5.0 from central in [default]
	com.univocity#univocity-parsers;1.5.1 from central in [default]
	org.apache.commons#commons-csv;1.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   3   |   3   |   0   ||   3   |   3   |
	---------------------------------------------------------------------
{code}
> Add ability for --packages tag to pull latest version > - > > Key: SPARK-19297 > URL: https://issues.apache.org/jira/browse/SPARK-19297 > Project: Spark > Issue Type: New Feature >Affects Versions: 2.1.0 >Reporter: Steven Landes >Priority: Minor > Labels: features, newbie > Attachments: packages_latest.txt > > > It would be super-convenient, in a development environment, to be able to use > the --packages argument to point spark to the latest version of a package > instead of specifying a specific version. > For example, instead of the following: > --packages com.databricks:spark-csv_2.11:1.5.0 > I could just put in this: > --packages com.databricks:spark-csv_2.11:latest -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21529) Improve the error message for unsupported Uniontype
[ https://issues.apache.org/jira/browse/SPARK-21529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218653#comment-17218653 ] Aoyuan Liao commented on SPARK-21529: - [~teabot] I think Catalyst still doesn't support uniontype. But the table can be read and printed out in Spark by loading the Avro file directly (file => dataframe). The error message seems at least clear to me. Would you mind elaborating on how the error message should be improved? Do you suggest indicating that Catalyst doesn't support uniontype? > Improve the error message for unsupported Uniontype > --- > > Key: SPARK-21529 > URL: https://issues.apache.org/jira/browse/SPARK-21529 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 > Environment: Qubole, DataBricks >Reporter: Elliot West >Priority: Major > Labels: hive, starter, uniontype > > We encounter errors when attempting to read Hive tables whose schema contains > the {{uniontype}}. It appears that Catalyst > does not support the {{uniontype}}, which renders this table unreadable by > Spark (2.1). Although {{uniontype}} is arguably incomplete in the Hive > query engine, it is fully supported by the storage engine and also the Avro > data format, which we use for these tables. Therefore, I believe it is > a valid, usable type construct that should be supported by Spark. > We've attempted to read the table as follows: > {code} > spark.sql("select * from etl.tbl where acquisition_instant='20170706T133545Z' > limit 5").show > val tblread = spark.read.table("etl.tbl") > {code} > But this always results in the same error message. The pertinent error > messages are as follows (full stack trace below): > {code} > org.apache.spark.SparkException: Cannot recognize hive type string: > uniontype ... > Caused by: org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '<' expecting > {, '('} > (line 1, pos 9) > == SQL == > uniontype -^^^ > {code} > h2. 
Full stack trace > {code} > org.apache.spark.SparkException: Cannot recognize hive type string: > uniontype>>,n:boolean,o:string,p:bigint,q:string>,struct,ag:boolean,ah:string,ai:bigint,aj:string>> > at > org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:800) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:377) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:377) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:377) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:373) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:373) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:371) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:290) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:371) > at > org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:79) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118) > at >
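A minimal sketch of the direct-load route mentioned in the comment, assuming the spark-avro package is available and using a hypothetical file path with a union-typed column; loading the file directly sidesteps the Hive metastore type string where the parse error is raised:

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Hypothetical coordinates; match the artifact to your Spark/Scala version.
         .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.0.1")
         .getOrCreate())

# Reading the Avro file directly bypasses the Hive type-string parser
# in Catalyst that rejects uniontype.
df = spark.read.format("avro").load("/path/to/union_data.avro")
df.show(5)
{code}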
[jira] [Updated] (SPARK-33197) Changes to spark.sql.analyzer.maxIterations do not take effect at runtime
[ https://issues.apache.org/jira/browse/SPARK-33197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-33197: - Affects Version/s: (was: 3.0.0) 3.0.2 > Changes to spark.sql.analyzer.maxIterations do not take effect at runtime > - > > Key: SPARK-33197 > URL: https://issues.apache.org/jira/browse/SPARK-33197 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Yuning Zhang >Priority: Major > > `spark.sql.analyzer.maxIterations` is not a static conf. However, changes to > it do not take effect at runtime. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
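To make the report concrete: since the conf is not static, a runtime {{conf.set}} like the sketch below would be expected to affect subsequent analysis runs, and the bug is that it does not:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A non-static SQL conf should be settable on a live session.
spark.conf.set("spark.sql.analyzer.maxIterations", "200")
# The getter reflects the new value, but per this report the analyzer
# keeps using the value captured earlier instead.
print(spark.conf.get("spark.sql.analyzer.maxIterations"))
{code}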
[jira] [Commented] (SPARK-33214) HiveExternalCatalogVersionsSuite shouldn't use or delete hard-coded /tmp directory
[ https://issues.apache.org/jira/browse/SPARK-33214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218619#comment-17218619 ] Apache Spark commented on SPARK-33214: -- User 'xkrogen' has created a pull request for this issue: https://github.com/apache/spark/pull/30122 > HiveExternalCatalogVersionsSuite shouldn't use or delete hard-coded /tmp > directory > --- > > Key: SPARK-33214 > URL: https://issues.apache.org/jira/browse/SPARK-33214 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Priority: Major > > In SPARK-22356, the {{sparkTestingDir}} used by > {{HiveExternalCatalogVersionsSuite}} became hard-coded to enable re-use of > the downloaded Spark tarball between test executions: > {code} > // For local test, you can set `sparkTestingDir` to a static value like > `/tmp/test-spark`, to > // avoid downloading Spark of different versions in each run. > private val sparkTestingDir = new File("/tmp/test-spark") > {code} > However this doesn't work, since it gets deleted every time: > {code} > override def afterAll(): Unit = { > try { > Utils.deleteRecursively(wareHousePath) > Utils.deleteRecursively(tmpDataDir) > Utils.deleteRecursively(sparkTestingDir) > } finally { > super.afterAll() > } > } > {code} > It's bad that we're hard-coding to a {{/tmp}} directory, as in some cases > this is not the proper place to store temporary files. We're not currently > making any good use of it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33214) HiveExternalCatalogVersionsSuite shouldn't use or delete hard-coded /tmp directory
[ https://issues.apache.org/jira/browse/SPARK-33214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33214: Assignee: (was: Apache Spark) > HiveExternalCatalogVersionsSuite shouldn't use or delete hard-coded /tmp > directory > --- > > Key: SPARK-33214 > URL: https://issues.apache.org/jira/browse/SPARK-33214 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Priority: Major > > In SPARK-22356, the {{sparkTestingDir}} used by > {{HiveExternalCatalogVersionsSuite}} became hard-coded to enable re-use of > the downloaded Spark tarball between test executions: > {code} > // For local test, you can set `sparkTestingDir` to a static value like > `/tmp/test-spark`, to > // avoid downloading Spark of different versions in each run. > private val sparkTestingDir = new File("/tmp/test-spark") > {code} > However this doesn't work, since it gets deleted every time: > {code} > override def afterAll(): Unit = { > try { > Utils.deleteRecursively(wareHousePath) > Utils.deleteRecursively(tmpDataDir) > Utils.deleteRecursively(sparkTestingDir) > } finally { > super.afterAll() > } > } > {code} > It's bad that we're hard-coding to a {{/tmp}} directory, as in some cases > this is not the proper place to store temporary files. We're not currently > making any good use of it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33214) HiveExternalCatalogVersionsSuite shouldn't use or delete hard-coded /tmp directory
[ https://issues.apache.org/jira/browse/SPARK-33214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33214: Assignee: Apache Spark > HiveExternalCatalogVersionsSuite shouldn't use or delete hard-coded /tmp > directory > --- > > Key: SPARK-33214 > URL: https://issues.apache.org/jira/browse/SPARK-33214 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Assignee: Apache Spark >Priority: Major > > In SPARK-22356, the {{sparkTestingDir}} used by > {{HiveExternalCatalogVersionsSuite}} became hard-coded to enable re-use of > the downloaded Spark tarball between test executions: > {code} > // For local test, you can set `sparkTestingDir` to a static value like > `/tmp/test-spark`, to > // avoid downloading Spark of different versions in each run. > private val sparkTestingDir = new File("/tmp/test-spark") > {code} > However this doesn't work, since it gets deleted every time: > {code} > override def afterAll(): Unit = { > try { > Utils.deleteRecursively(wareHousePath) > Utils.deleteRecursively(tmpDataDir) > Utils.deleteRecursively(sparkTestingDir) > } finally { > super.afterAll() > } > } > {code} > It's bad that we're hard-coding to a {{/tmp}} directory, as in some cases > this is not the proper place to store temporary files. We're not currently > making any good use of it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33214) HiveExternalCatalogVersionsSuite shouldn't use or delete hard-coded /tmp directory
Erik Krogen created SPARK-33214: --- Summary: HiveExternalCatalogVersionsSuite shouldn't use or delete hard-coded /tmp directory Key: SPARK-33214 URL: https://issues.apache.org/jira/browse/SPARK-33214 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 3.0.1 Reporter: Erik Krogen In SPARK-22356, the {{sparkTestingDir}} used by {{HiveExternalCatalogVersionsSuite}} became hard-coded to enable re-use of the downloaded Spark tarball between test executions: {code} // For local test, you can set `sparkTestingDir` to a static value like `/tmp/test-spark`, to // avoid downloading Spark of different versions in each run. private val sparkTestingDir = new File("/tmp/test-spark") {code} However this doesn't work, since it gets deleted every time: {code} override def afterAll(): Unit = { try { Utils.deleteRecursively(wareHousePath) Utils.deleteRecursively(tmpDataDir) Utils.deleteRecursively(sparkTestingDir) } finally { super.afterAll() } } {code} It's bad that we're hard-coding to a {{/tmp}} directory, as in some cases this is not the proper place to store temporary files. We're not currently making any good use of it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
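One possible shape for a fix to SPARK-33214, sketched under the assumption that tarball reuse becomes opt-in via a system property (the property name below is hypothetical, not an existing Spark flag):

{code:scala}
import java.io.File
import org.apache.spark.util.Utils

// Hypothetical sketch: reuse a fixed directory only when explicitly requested;
// otherwise use a per-run temp dir that is safe to delete in afterAll().
val sparkTestingDir: File =
  sys.props.get("spark.test.cached-spark-dir")   // hypothetical property
    .map(new File(_))
    .getOrElse(Utils.createTempDir(namePrefix = "test-spark"))
{code}

With this shape, afterAll() would delete the directory only when it was not user-provided, so the downloaded tarball survives between local runs.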
[jira] [Resolved] (SPARK-33202) Fix BlockManagerDecommissioner to return the correct migration status
[ https://issues.apache.org/jira/browse/SPARK-33202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33202. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30116 [https://github.com/apache/spark/pull/30116] > Fix BlockManagerDecommissioner to return the correct migration status > - > > Key: SPARK-33202 > URL: https://issues.apache.org/jira/browse/SPARK-33202 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33202) Fix BlockManagerDecommissioner to return the correct migration status
[ https://issues.apache.org/jira/browse/SPARK-33202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33202: - Assignee: Dongjoon Hyun > Fix BlockManagerDecommissioner to return the correct migration status > - > > Key: SPARK-33202 > URL: https://issues.apache.org/jira/browse/SPARK-33202 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33213) Upgrade Apache Arrow to 2.0.0
Chao Sun created SPARK-33213: Summary: Upgrade Apache Arrow to 2.0.0 Key: SPARK-33213 URL: https://issues.apache.org/jira/browse/SPARK-33213 Project: Spark Issue Type: Dependency upgrade Components: SQL Affects Versions: 3.0.1 Reporter: Chao Sun Apache Arrow 2.0.0 has [just been released|https://cwiki.apache.org/confluence/display/ARROW/Arrow+2.0.0+Release]. This proposes to upgrade Spark's Arrow dependency to use 2.0.0, from the current 1.0.1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33212) Move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33212: Assignee: (was: Apache Spark) > Move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Priority: Major > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade third-party dependencies such as Guava, > protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client, etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking its dependencies. > * It makes Spark's Hadoop dependencies cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to use > only the public/client API from the Hadoop side. > * It provides better isolation from Hadoop dependencies. In the future, Spark can > evolve without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33212) Move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33212: Assignee: Apache Spark > Move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Apache Spark >Priority: Major > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade third-party dependencies such as Guava, > protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client, etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking its dependencies. > * It makes Spark's Hadoop dependencies cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to use > only the public/client API from the Hadoop side. > * It provides better isolation from Hadoop dependencies. In the future, Spark can > evolve without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33212) Move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218588#comment-17218588 ] Apache Spark commented on SPARK-33212: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/29843 > Move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Priority: Major > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade third-party dependencies such as Guava, > protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client, etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking its dependencies. > * It makes Spark's Hadoop dependencies cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to use > only the public/client API from the Hadoop side. > * It provides better isolation from Hadoop dependencies. In the future, Spark can > evolve without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29250) Upgrade to Hadoop 3.2.2
[ https://issues.apache.org/jira/browse/SPARK-29250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-29250: - Summary: Upgrade to Hadoop 3.2.2 (was: Upgrade to Hadoop 3.2.1 and move to shaded client) > Upgrade to Hadoop 3.2.2 > --- > > Key: SPARK-29250 > URL: https://issues.apache.org/jira/browse/SPARK-29250 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Chao Sun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33212) Move to shaded clients for Hadoop 3.x profile
Chao Sun created SPARK-33212: Summary: Move to shaded clients for Hadoop 3.x profile Key: SPARK-33212 URL: https://issues.apache.org/jira/browse/SPARK-33212 Project: Spark Issue Type: Improvement Components: Spark Core, Spark Submit, SQL, YARN Affects Versions: 3.0.1 Reporter: Chao Sun Hadoop 3.x+ offers shaded client jars: hadoop-client-api and hadoop-client-runtime, which shade third-party dependencies such as Guava, protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of hadoop-common, hadoop-client, etc. Benefits include: * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava conflicts, Spark depends on Hadoop not leaking its dependencies. * It makes Spark's Hadoop dependencies cleaner. Currently Spark uses both client-side and server-side Hadoop APIs from modules such as hadoop-common, hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to use only the public/client API from the Hadoop side. * It provides better isolation from Hadoop dependencies. In the future, Spark can evolve without worrying about dependencies pulled in from the Hadoop side (which used to be a lot). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
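For concreteness, the switch proposed in SPARK-33212 amounts to depending on the two shaded artifacts instead of the server-side modules; a hedged build sketch (sbt syntax; 3.2.2 matches the Hadoop version targeted by SPARK-29250):

{code:scala}
// build.sbt sketch: the shaded client jars relocate Guava, protobuf, Jetty, etc.
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-client-api"     % "3.2.2",
  "org.apache.hadoop" % "hadoop-client-runtime" % "3.2.2" % "runtime"
)
{code}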
[jira] [Updated] (SPARK-33064) Spark-shell does not display accented characters
[ https://issues.apache.org/jira/browse/SPARK-33064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Laurent GUEMAPPE updated SPARK-33064: - Description: It seems to be a duplicate of *FLEX-18425*, which is a duplicate of SDK-17398, which does not exist anymore. But the bug remains. (1) I create a txt file "café.txt" that contains two lines: {quote}Café Café {quote} (2) I type the following command: *spark.read.csv("café.txt").show()* It is displayed as follows: *spark.read.csv("caf.txt").show()* But it works and returns this: {quote}+----+ | _c0| +----+ | Caf| |Café| +----+ {quote} We notice a shift after "Caf" and "Café". (3) The two following commands work. The written text files have the same content as "café.txt": *spark.read.csv("café.txt").write.format("text").save("café2")* *sc.textFile("café.txt").saveAsTextFile("café3")* Once again, the Spark-shell displays this: *spark.read.csv("caf.txt").write.format("text").save("caf2")* *sc.textFile("caf.txt").saveAsTextFile("caf3")* (4) If I type 7 "é" and then 7 Backspaces, using the "é" key of my French keyboard, the Scala prompt disappears. I get a new prompt when I type Return. Issue (4), as well as the shift in (2), seems to be related to the difference between counted characters and displayed characters. (5) I notice that I don't have this issue when launching Spark from Ubuntu via "Windows Subsystem for Linux" Version 2. was: It seems to be a duplicate of *FLEX-18425*, which is a duplicate of SDK-17398, which does not exist anymore. But the bug remains. (1) I create a txt file "café.txt" that contains two lines: {quote}Café Café {quote} (2) I type the following command: *spark.read.csv("café.txt").show()* It is displayed as follows: *spark.read.csv("caf.txt").show()* But it works and returns this: {quote}+----+ | _c0| +----+ | Caf| |Café| +----+ {quote} We notice a shift after "Caf" and "Café". (3) The two following commands work. The written text files have the same content as "café.txt": *spark.read.csv("café.txt").write.format("text").save("café2")* *sc.textFile("café.txt").saveAsTextFile("café3")* Once again, the Spark-shell displays this: *spark.read.csv("caf.txt").write.format("text").save("caf2")* *sc.textFile("caf.txt").saveAsTextFile("caf3")* (4) If I type 7 "é" and then 7 Backspaces, using the "é" key of my French keyboard, the Scala prompt disappears. I get a new prompt when I type Return. Issue (4), as well as the shift in (2), seems to be related to the difference between counted characters and displayed characters. > Spark-shell does not display accented characters > --- > > Key: SPARK-33064 > URL: https://issues.apache.org/jira/browse/SPARK-33064 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 3.0.1 > Environment: Windows 10 > "Beta: Use Unicode UTF-8 for worldwide language support" has been checked. >Reporter: Laurent GUEMAPPE >Priority: Minor > > It seems to be a duplicate of *FLEX-18425*, which is a duplicate of SDK-17398, > which does not exist anymore. But the bug remains. > (1) I create a txt file "café.txt" that contains two lines: > {quote}Café > Café > {quote} > (2) I type the following command: > *spark.read.csv("café.txt").show()* > It is displayed as follows: > *spark.read.csv("caf.txt").show()* > But it works and returns this: > {quote}+----+ > | _c0| > +----+ > | Caf| > |Café| > +----+ > {quote} > We notice a shift after "Caf" and "Café". > (3) The two following commands work. The written text files have the same > content as "café.txt": > *spark.read.csv("café.txt").write.format("text").save("café2")* > *sc.textFile("café.txt").saveAsTextFile("café3")* > > Once again, the Spark-shell displays this: > *spark.read.csv("caf.txt").write.format("text").save("caf2")* > *sc.textFile("caf.txt").saveAsTextFile("caf3")* > > (4) If I type 7 "é" and then 7 Backspaces, using the "é" key of my French > keyboard, the Scala prompt disappears. I get a new prompt when I type > Return. > > Issue (4), as well as the shift in (2), seems to be related to the > difference between counted characters and displayed characters. > > (5) I notice that I don't have this issue when launching Spark from Ubuntu, > via "Windows Subsystem for Linux" Version 2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33211) Early state store eviction for left semi stream-stream join
[ https://issues.apache.org/jira/browse/SPARK-33211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-33211: - Parent: SPARK-32883 Issue Type: Sub-task (was: Improvement) > Early state store eviction for left semi stream-stream join > --- > > Key: SPARK-33211 > URL: https://issues.apache.org/jira/browse/SPARK-33211 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Minor > > As a follow-up from the discussion in > [https://github.com/apache/spark/pull/30076/files/3918727a08c8d0d4c65ccc8ea902f77051b78b1d#r508926034] > and [https://github.com/apache/spark/pull/30076#discussion_r509222802], for > left semi stream-stream join, the matched left-side rows can be evicted from > the left state store immediately, without waiting for them to fall below the > watermark. However, it needs more thought on how to implement this efficiently > without iterating over all values when the watermark predicate is on the key. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33211) Early state store eviction for left semi stream-stream join
Cheng Su created SPARK-33211: Summary: Early state store eviction for left semi stream-stream join Key: SPARK-33211 URL: https://issues.apache.org/jira/browse/SPARK-33211 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.1.0 Reporter: Cheng Su As a follow-up from the discussion in [https://github.com/apache/spark/pull/30076/files/3918727a08c8d0d4c65ccc8ea902f77051b78b1d#r508926034] and [https://github.com/apache/spark/pull/30076#discussion_r509222802], for left semi stream-stream join, the matched left-side rows can be evicted from the left state store immediately, without waiting for them to fall below the watermark. However, it needs more thought on how to implement this efficiently without iterating over all values when the watermark predicate is on the key. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33210) Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default
[ https://issues.apache.org/jira/browse/SPARK-33210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218564#comment-17218564 ] Apache Spark commented on SPARK-33210: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30121 > Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default > -- > > Key: SPARK-33210 > URL: https://issues.apache.org/jira/browse/SPARK-33210 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The ticket aims to set the following SQL configs: > - spark.sql.legacy.parquet.int96RebaseModeInWrite > - spark.sql.legacy.parquet.int96RebaseModeInRead > to EXCEPTION by default. > The reason is to let users decide whether Spark should modify loaded/saved > timestamps, instead of silently shifting timestamps while rebasing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33210) Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default
[ https://issues.apache.org/jira/browse/SPARK-33210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218563#comment-17218563 ] Apache Spark commented on SPARK-33210: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30121 > Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default > -- > > Key: SPARK-33210 > URL: https://issues.apache.org/jira/browse/SPARK-33210 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The ticket aims to set the following SQL configs: > - spark.sql.legacy.parquet.int96RebaseModeInWrite > - spark.sql.legacy.parquet.int96RebaseModeInRead > to EXCEPTION by default. > The reason is to let users decide whether Spark should modify loaded/saved > timestamps, instead of silently shifting timestamps while rebasing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33210) Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default
[ https://issues.apache.org/jira/browse/SPARK-33210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33210: Assignee: Apache Spark > Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default > -- > > Key: SPARK-33210 > URL: https://issues.apache.org/jira/browse/SPARK-33210 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > The ticket aims to set the following SQL configs: > - spark.sql.legacy.parquet.int96RebaseModeInWrite > - spark.sql.legacy.parquet.int96RebaseModeInRead > to EXCEPTION by default. > The reason is to let users decide whether Spark should modify loaded/saved > timestamps, instead of silently shifting timestamps while rebasing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33210) Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default
[ https://issues.apache.org/jira/browse/SPARK-33210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33210: Assignee: (was: Apache Spark) > Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default > -- > > Key: SPARK-33210 > URL: https://issues.apache.org/jira/browse/SPARK-33210 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The ticket aims to set the following SQL configs: > - spark.sql.legacy.parquet.int96RebaseModeInWrite > - spark.sql.legacy.parquet.int96RebaseModeInRead > to EXCEPTION by default. > The reason is to let users decide whether Spark should modify loaded/saved > timestamps, instead of silently shifting timestamps while rebasing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33210) Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default
Maxim Gekk created SPARK-33210: -- Summary: Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default Key: SPARK-33210 URL: https://issues.apache.org/jira/browse/SPARK-33210 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk The ticket aims to set the following SQL configs: - spark.sql.legacy.parquet.int96RebaseModeInWrite - spark.sql.legacy.parquet.int96RebaseModeInRead to EXCEPTION by default. The reason is to let users decide whether Spark should modify loaded/saved timestamps, instead of silently shifting timestamps while rebasing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
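For context on SPARK-33210, these configs appear to follow the same LEGACY/CORRECTED/EXCEPTION scheme as the existing datetime rebase confs; a sketch of the explicit opt-in users would make once the default becomes EXCEPTION and a read or write fails:

{code:scala}
// Explicitly pick a rebasing behavior instead of relying on the default.
// CORRECTED reads/writes timestamps as-is (Proleptic Gregorian calendar);
// LEGACY rebases to the hybrid Julian/Gregorian calendar of Spark 2.x.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
{code}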
[jira] [Assigned] (SPARK-33205) Bump snappy-java version to 1.1.8
[ https://issues.apache.org/jira/browse/SPARK-33205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-33205: --- Assignee: Takeshi Yamamuro > Bump snappy-java version to 1.1.8 > - > > Key: SPARK-33205 > URL: https://issues.apache.org/jira/browse/SPARK-33205 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > > This ticket aims at upgrading snappy-java from 1.1.7.5 to 1.1.8. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33205) Bump snappy-java version to 1.1.8
[ https://issues.apache.org/jira/browse/SPARK-33205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-33205. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30120 [https://github.com/apache/spark/pull/30120] > Bump snappy-java version to 1.1.8 > - > > Key: SPARK-33205 > URL: https://issues.apache.org/jira/browse/SPARK-33205 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.1.0 > > > This ticket aims at upgrading snappy-java from 1.1.7.5 to 1.1.8. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33209) Clean up unit test file UnsupportedOperationsSuite.scala
[ https://issues.apache.org/jira/browse/SPARK-33209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-33209: - Parent: SPARK-32883 Issue Type: Sub-task (was: Improvement) > Clean up unit test file UnsupportedOperationsSuite.scala > > > Key: SPARK-33209 > URL: https://issues.apache.org/jira/browse/SPARK-33209 > Project: Spark > Issue Type: Sub-task > Components: SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Trivial > > As a follow-up from [https://github.com/apache/spark/pull/30076], there is > a lot of copy-pasted code in the unit test file UnsupportedOperationsSuite.scala > for checking different join types (inner, outer, semi) with a similar code structure. > It would be helpful to clean it up and refactor to reuse code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33209) Clean up unit test file UnsupportedOperationsSuite.scala
Cheng Su created SPARK-33209: Summary: Clean up unit test file UnsupportedOperationsSuite.scala Key: SPARK-33209 URL: https://issues.apache.org/jira/browse/SPARK-33209 Project: Spark Issue Type: Improvement Components: SQL, Structured Streaming Affects Versions: 3.1.0 Reporter: Cheng Su As a follow-up from [https://github.com/apache/spark/pull/30076], there is a lot of copy-pasted code in the unit test file UnsupportedOperationsSuite.scala for checking different join types (inner, outer, semi) with a similar code structure. It would be helpful to clean it up and refactor to reuse code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
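As an illustration of the kind of refactoring SPARK-33209 proposes (a purely hypothetical sketch, not the suite's actual helpers), the per-join-type copies could collapse inside the suite into a parameterized loop:

{code:scala}
// Hypothetical sketch: generate one test per join type instead of
// copy-pasting near-identical test bodies.
Seq("inner", "left_outer", "left_semi").foreach { joinType =>
  test(s"$joinType stream-stream join: unsupported operation checks") {
    // the shared assertSupported / assertNotSupported logic would live here,
    // parameterized only by joinType
  }
}
{code}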
[jira] [Commented] (SPARK-33207) Reduce the number of tasks launched after bucket pruning
[ https://issues.apache.org/jira/browse/SPARK-33207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218545#comment-17218545 ] Cheng Su commented on SPARK-33207: -- Thanks [~yumwang] for bringing up the issue. We don't need to launch #-of-buckets tasks if bucket filter pruning is taking effect ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L570]). However, if the query has a join on these bucketed tables, we still need to launch that many tasks to maintain the bucketed table scan's outputPartitioning property. So the decision of whether to launch fewer tasks depends on the query shape. A physical plan rule should resolve the issue, but I am not sure whether it is worth the effort. > Reduce the number of tasks launched after bucket pruning > > > Key: SPARK-33207 > URL: https://issues.apache.org/jira/browse/SPARK-33207 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > We only need to read 1 bucket, but it still launches 200 tasks. > {code:sql} > create table test_bucket using parquet clustered by (ID) sorted by (ID) into > 200 buckets AS (SELECT id FROM range(1000) cluster by id) > spark-sql> explain select * from test_bucket where id = 4; > == Physical Plan == > *(1) Project [id#7L] > +- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4)) >+- *(1) ColumnarToRow > +- FileScan parquet default.test_bucket[id#7L] Batched: true, > DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: > InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket], > PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], > ReadSchema: struct, SelectedBucketsCount: 1 out of 200 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33197) Changes to spark.sql.analyzer.maxIterations do not take effect at runtime
[ https://issues.apache.org/jira/browse/SPARK-33197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuning Zhang updated SPARK-33197: - Affects Version/s: 3.0.0 > Changes to spark.sql.analyzer.maxIterations do not take effect at runtime > - > > Key: SPARK-33197 > URL: https://issues.apache.org/jira/browse/SPARK-33197 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Yuning Zhang >Priority: Major > > `spark.sql.analyzer.maxIterations` is not a static conf. However, changes to > it do not take effect at runtime. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-13493) json to DataFrame to parquet does not respect case sensitiveness
[ https://issues.apache.org/jira/browse/SPARK-13493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hariharan Karthikeyan updated SPARK-13493: -- Comment: was deleted (was: !Screen Shot 2020-10-21 at 11.02.34 AM.png! ) > json to DataFrame to parquet does not respect case sensitiveness > > > Key: SPARK-13493 > URL: https://issues.apache.org/jira/browse/SPARK-13493 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michel Lemay >Priority: Minor > Labels: bulk-closed > Attachments: Screen Shot 2020-10-21 at 11.02.34 AM.png > > > Not sure where the problem should be fixed exactly but here it is: > {noformat} > $ spark-shell --conf spark.sql.caseSensitive=false > scala> sqlContext.getConf("spark.sql.caseSensitive") > res2: String = false > scala> val data = List("""{"field": 1}""","""{"field": 2}""","""{"field": > 3}""","""{"field": 4}""","""{"FIELD": 5}""") > scala> val jsonDF = sqlContext.read.json(sc.parallelize(data)) > scala> jsonDF.printSchema > root > |-- FIELD: long (nullable = true) > |-- field: long (nullable = true) > {noformat} > And when persisting this as parquet: > {noformat} > scala> jsonDF.write.parquet("out") > org.apache.spark.sql.AnalysisException: Reference 'FIELD' is ambiguous, could > be: FIELD#0L, FIELD#1L.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471 > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471 > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:471) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:467) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.sc > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at >
[jira] [Commented] (SPARK-26764) [SPIP] Spark Relational Cache
[ https://issues.apache.org/jira/browse/SPARK-26764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218374#comment-17218374 ] Nicholas Chammas commented on SPARK-26764: -- The SPIP PDF references a design doc, but I'm not clear on where the design doc actually is. Is this issue supposed to be linked to some other ones? Also, appendix B suggests to me that this idea would mesh well with the existing proposals to support materialized views. I could actually see this as an enhancement to those proposals, like SPARK-29038. In fact, when I look at the [design doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit#] for SPARK-29038, I see that goal 3 covers automatic query rewrites, which I think subsumes the main benefit of this proposal as compared to "traditional" materialized views. {quote}> 3. A query _rewrite_ capability to transparently rewrite a query to use a materialized view[1][2]. > a. Query rewrite capability is transparent to SQL applications. > b. Query rewrite can be disabled at the system level or on individual > materialized view. Also it can be disabled for a specified query via hint. > c. Query rewrite as a rule in optimizer should be made sure that it won’t > cause performance regression if it can use other index or cache. {quote} > [SPIP] Spark Relational Cache > - > > Key: SPARK-26764 > URL: https://issues.apache.org/jira/browse/SPARK-26764 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Adrian Wang >Priority: Major > Attachments: Relational+Cache+SPIP.pdf > > > In modern database systems, relational cache is a common technology to boost > ad-hoc queries. While Spark provides cache natively, Spark SQL should be able > to utilize the relationship between relations to boost all possible queries. > In this SPIP, we will make Spark be able to utilize all defined cached > relations if possible, without explicit substitution in user query, as well > as keep some user defined cache available in different sessions. Materialized > views in many database systems provide similar function. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13493) json to DataFrame to parquet does not respect case sensitiveness
[ https://issues.apache.org/jira/browse/SPARK-13493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218342#comment-17218342 ] Hariharan Karthikeyan commented on SPARK-13493: --- !Screen Shot 2020-10-21 at 11.02.34 AM.png! > json to DataFrame to parquet does not respect case sensitiveness > > > Key: SPARK-13493 > URL: https://issues.apache.org/jira/browse/SPARK-13493 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michel Lemay >Priority: Minor > Labels: bulk-closed > Attachments: Screen Shot 2020-10-21 at 11.02.34 AM.png > > > Not sure where the problem should be fixed exactly but here it is: > {noformat} > $ spark-shell --conf spark.sql.caseSensitive=false > scala> sqlContext.getConf("spark.sql.caseSensitive") > res2: String = false > scala> val data = List("""{"field": 1}""","""{"field": 2}""","""{"field": > 3}""","""{"field": 4}""","""{"FIELD": 5}""") > scala> val jsonDF = sqlContext.read.json(sc.parallelize(data)) > scala> jsonDF.printSchema > root > |-- FIELD: long (nullable = true) > |-- field: long (nullable = true) > {noformat} > And when persisting this as parquet: > {noformat} > scala> jsonDF.write.parquet("out") > org.apache.spark.sql.AnalysisException: Reference 'FIELD' is ambiguous, could > be: FIELD#0L, FIELD#1L.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471 > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471 > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:471) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:467) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.sc > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at >
[jira] [Updated] (SPARK-13493) json to DataFrame to parquet does not respect case sensitiveness
[ https://issues.apache.org/jira/browse/SPARK-13493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hariharan Karthikeyan updated SPARK-13493: -- Attachment: Screen Shot 2020-10-21 at 11.02.34 AM.png > json to DataFrame to parquet does not respect case sensitiveness > > > Key: SPARK-13493 > URL: https://issues.apache.org/jira/browse/SPARK-13493 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michel Lemay >Priority: Minor > Labels: bulk-closed > Attachments: Screen Shot 2020-10-21 at 11.02.34 AM.png > > > Not sure where the problem should be fixed exactly but here it is: > {noformat} > $ spark-shell --conf spark.sql.caseSensitive=false > scala> sqlContext.getConf("spark.sql.caseSensitive") > res2: String = false > scala> val data = List("""{"field": 1}""","""{"field": 2}""","""{"field": > 3}""","""{"field": 4}""","""{"FIELD": 5}""") > scala> val jsonDF = sqlContext.read.json(sc.parallelize(data)) > scala> jsonDF.printSchema > root > |-- FIELD: long (nullable = true) > |-- field: long (nullable = true) > {noformat} > And when persisting this as parquet: > {noformat} > scala> jsonDF.write.parquet("out") > org.apache.spark.sql.AnalysisException: Reference 'FIELD' is ambiguous, could > be: FIELD#0L, FIELD#1L.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471 > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471 > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:471) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:467) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.sc > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125) > at 
scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at >
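Until the resolution behavior in SPARK-13493 changes, one workaround is to rename the colliding columns positionally (avoiding any ambiguous name lookup) and merge them before writing; a hedged sketch against the jsonDF from the report:

{code:scala}
import org.apache.spark.sql.functions.{coalesce, col}

// Rename by position so we never resolve the ambiguous name "FIELD",
// then merge the case-variant columns into a single one before writing.
val renamed = jsonDF.toDF("field_upper", "field_lower")
renamed
  .select(coalesce(col("field_upper"), col("field_lower")).as("field"))
  .write.parquet("out")
{code}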
[jira] [Created] (SPARK-33208) Update the document of SparkSession#sql
Wenchen Fan created SPARK-33208: --- Summary: Update the document of SparkSession#sql Key: SPARK-33208 URL: https://issues.apache.org/jira/browse/SPARK-33208 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Wenchen Fan We should mention that this API eagerly runs DDL/DML commands, but not SELECT queries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
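The distinction SPARK-33208 wants documented is easy to demonstrate; a small sketch (the table name is illustrative):

{code:scala}
// DDL/DML commands run eagerly: the table exists and contains the row
// right after these calls, even though no action was invoked on the
// returned DataFrames.
spark.sql("CREATE TABLE t (id INT) USING parquet")
spark.sql("INSERT INTO t VALUES (1)")

// A SELECT is lazy: nothing runs until an action such as show().
val df = spark.sql("SELECT id FROM t")
df.show()
{code}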
[jira] [Updated] (SPARK-33206) Spark Shuffle Index Cache calculates memory usage wrong
[ https://issues.apache.org/jira/browse/SPARK-33206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars Francke updated SPARK-33206: - Description: SPARK-21501 changed the spark shuffle index service to be based on memory instead of the number of files. Unfortunately, there's a problem with the calculation which is based on size information provided by `ShuffleIndexInformation`. It is based purely on the file size of the cached file on disk. We're running into OOMs with very small index files (byte size ~16 bytes) but the overhead of the ShuffleIndexInformation around this is much larger (e.g. 184 bytes, see screenshot). We need to take this into account and should probably add a fixed overhead of somewhere between 152 and 180 bytes according to my tests. I'm not 100% sure what the correct number is and it'll also depend on the architecture etc., so we can't be exact anyway. If we do that we can maybe get rid of the size field in ShuffleIndexInformation to save a few more bytes per entry. In effect this means that for small files we use up about 70-100 times as much memory as we intend to. Our NodeManagers OOM with 4GB and more of indexShuffleCache. was: dSPARK-21501 changed the spark shuffle index service to be based on memory instead of the number of files. Unfortunately, there's a problem with the calculation which is based on size information provided by `ShuffleIndexInformation`. It is based purely on the file size of the cached file on disk. We're running into OOMs with very small index files (byte size ~16 bytes) but the overhead of the ShuffleIndexInformation around this is much larger (e.g. 184 bytes, see screenshot). We need to take this into account and should probably add a fixed overhead of somewhere between 152 and 180 bytes according to my tests. I'm not 100% sure what the correct number is and it'll also depend on the architecture etc., so we can't be exact anyway. If we do that we can maybe get rid of the size field in ShuffleIndexInformation to save a few more bytes per entry. In effect this means that for small files we use up about 70-100 times as much memory as we intend to. Our NodeManagers OOM with 4GB and more of indexShuffleCache. > Spark Shuffle Index Cache calculates memory usage wrong > --- > > Key: SPARK-33206 > URL: https://issues.apache.org/jira/browse/SPARK-33206 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.4.0, 3.0.1 >Reporter: Lars Francke >Priority: Major > Attachments: image001(1).png > > > SPARK-21501 changed the spark shuffle index service to be based on memory > instead of the number of files. > Unfortunately, there's a problem with the calculation which is based on size > information provided by `ShuffleIndexInformation`. > It is based purely on the file size of the cached file on disk. > We're running into OOMs with very small index files (byte size ~16 bytes) but > the overhead of the ShuffleIndexInformation around this is much larger (e.g. > 184 bytes, see screenshot). We need to take this into account and should > probably add a fixed overhead of somewhere between 152 and 180 bytes > according to my tests. I'm not 100% sure what the correct number is and it'll > also depend on the architecture etc., so we can't be exact anyway. > If we do that we can maybe get rid of the size field in > ShuffleIndexInformation to save a few more bytes per entry. > In effect this means that for small files we use up about 70-100 times as > much memory as we intend to. 
Our NodeManagers OOM with 4GB and more of > indexShuffleCache. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33206) Spark Shuffle Index Cache calculates memory usage wrong
[ https://issues.apache.org/jira/browse/SPARK-33206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218306#comment-17218306 ] Lars Francke commented on SPARK-33206: -- I used YourKit (thank you for the free license!) and it claims that ShuffleIndexInformation uses 152 bytes of retained memory when it caches a 0-byte file. > Spark Shuffle Index Cache calculates memory usage wrong > --- > > Key: SPARK-33206 > URL: https://issues.apache.org/jira/browse/SPARK-33206 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.4.0, 3.0.1 >Reporter: Lars Francke >Priority: Major > Attachments: image001(1).png > > > dSPARK-21501 changed the spark shuffle index service to be based on memory > instead of the number of files. > Unfortunately, there's a problem with the calculation which is based on size > information provided by `ShuffleIndexInformation`. > It is based purely on the file size of the cached file on disk. > We're running into OOMs with very small index files (byte size ~16 bytes) but > the overhead of the ShuffleIndexInformation around this is much larger (e.g. > 184 bytes, see screenshot). We need to take this into account and should > probably add a fixed overhead of somewhere between 152 and 180 bytes > according to my tests. I'm not 100% sure what the correct number is and it'll > also depend on the architecture etc. so we can't be exact anyway. > If we do that we can maybe get rid of the size field in > ShuffleIndexInformation to save a few more bytes per entry. > In effect this means that for small files we use up about 70-100 times as > much memory as we intend to. Our NodeManagers OOM with 4GB and more of > indexShuffleCache. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
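The accounting fix suggested in SPARK-33206 could take roughly the following shape, assuming a Guava weigher over ShuffleIndexInformation similar to what the shuffle service uses today (the overhead constant is the 152-180 byte estimate from this report, not a vetted value, and the actual cache wiring may differ):

{code:scala}
import java.io.File
import com.google.common.cache.{Cache, CacheBuilder, Weigher}
import org.apache.spark.network.shuffle.ShuffleIndexInformation

// Estimated fixed per-entry object overhead (see the 152-180 byte
// measurements above); an assumption, not a constant valid for every JVM.
val entryOverheadBytes = 176

// Charge each entry for the on-disk index size PLUS the fixed overhead,
// instead of the file size alone.
val weigher: Weigher[File, ShuffleIndexInformation] =
  (_: File, info: ShuffleIndexInformation) => info.getSize + entryOverheadBytes

val indexCache: Cache[File, ShuffleIndexInformation] =
  CacheBuilder.newBuilder()
    .maximumWeight(100L * 1024 * 1024) // e.g. spark.shuffle.service.index.cache.size
    .weigher(weigher)
    .build()
{code}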
[jira] [Updated] (SPARK-33207) Reduce the number of tasks launched after bucket pruning
[ https://issues.apache.org/jira/browse/SPARK-33207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33207: Description: We only need to read 1 bucket, but it still launches 200 tasks. {code:sql} create table test_bucket using parquet clustered by (ID) sorted by (ID) into 200 buckets AS (SELECT id FROM range(1000) cluster by id) spark-sql> explain select * from test_bucket where id = 4; == Physical Plan == *(1) Project [id#7L] +- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4)) +- *(1) ColumnarToRow +- FileScan parquet default.test_bucket[id#7L] Batched: true, DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket], PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], ReadSchema: struct, SelectedBucketsCount: 1 out of 200 {code} was: We only need to read 1 bucket, but still launch 200 tasks. {code:sql} create table test_bucket using parquet clustered by (ID) sorted by (ID) into 200 buckets AS (SELECT id FROM range(1000) cluster by id) spark-sql> explain select * from test_bucket where id = 4; == Physical Plan == *(1) Project [id#7L] +- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4)) +- *(1) ColumnarToRow +- FileScan parquet default.test_bucket[id#7L] Batched: true, DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket], PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], ReadSchema: struct, SelectedBucketsCount: 1 out of 200 {code} > Reduce the number of tasks launched after bucket pruning > > > Key: SPARK-33207 > URL: https://issues.apache.org/jira/browse/SPARK-33207 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > We only need to read 1 bucket, but it still launches 200 tasks. > {code:sql} > create table test_bucket using parquet clustered by (ID) sorted by (ID) into > 200 buckets AS (SELECT id FROM range(1000) cluster by id) > spark-sql> explain select * from test_bucket where id = 4; > == Physical Plan == > *(1) Project [id#7L] > +- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4)) >+- *(1) ColumnarToRow > +- FileScan parquet default.test_bucket[id#7L] Batched: true, > DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: > InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket], > PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], > ReadSchema: struct, SelectedBucketsCount: 1 out of 200 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33207) Reduce the number of tasks launched after bucket pruning
[ https://issues.apache.org/jira/browse/SPARK-33207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218290#comment-17218290 ] Yuming Wang commented on SPARK-33207: - cc [~chengsu] > Reduce the number of tasks launched after bucket pruning > > > Key: SPARK-33207 > URL: https://issues.apache.org/jira/browse/SPARK-33207 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > We only need to read 1 bucket, but it still launches 200 tasks. > {code:sql} > create table test_bucket using parquet clustered by (ID) sorted by (ID) into > 200 buckets AS (SELECT id FROM range(1000) cluster by id) > spark-sql> explain select * from test_bucket where id = 4; > == Physical Plan == > *(1) Project [id#7L] > +- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4)) >+- *(1) ColumnarToRow > +- FileScan parquet default.test_bucket[id#7L] Batched: true, > DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: > InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket], > PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], > ReadSchema: struct<id:bigint>, SelectedBucketsCount: 1 out of 200 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33207) Reduce the number of tasks launched after bucket pruning
Yuming Wang created SPARK-33207: --- Summary: Reduce the number of tasks launched after bucket pruning Key: SPARK-33207 URL: https://issues.apache.org/jira/browse/SPARK-33207 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Yuming Wang We only need to read 1 bucket, but it still launches 200 tasks. {code:sql} create table test_bucket using parquet clustered by (ID) sorted by (ID) into 200 buckets AS (SELECT id FROM range(1000) cluster by id) spark-sql> explain select * from test_bucket where id = 4; == Physical Plan == *(1) Project [id#7L] +- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4)) +- *(1) ColumnarToRow +- FileScan parquet default.test_bucket[id#7L] Batched: true, DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket], PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 1 out of 200 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
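A quick way to observe the behavior reported above from spark-shell (a sketch assuming the test_bucket table from the ticket already exists):
{code:scala}
// Bucket pruning selects 1 of 200 buckets, but the scan still produces
// one partition (and hence one scheduled task) per bucket.
val pruned = spark.table("test_bucket").where("id = 4")
println(pruned.queryExecution.executedPlan) // SelectedBucketsCount: 1 out of 200
println(pruned.rdd.getNumPartitions)        // 200, i.e. one task per bucket
{code}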
[jira] [Updated] (SPARK-33206) Spark Shuffle Index Cache calculates memory usage wrong
[ https://issues.apache.org/jira/browse/SPARK-33206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars Francke updated SPARK-33206: - Description: SPARK-21501 changed the Spark shuffle index service to be based on memory instead of the number of files. Unfortunately, there's a problem with the calculation, which is based on size information provided by `ShuffleIndexInformation`. It is based purely on the file size of the cached file on disk. We're running into OOMs with very small index files (~16 bytes), but the overhead of the ShuffleIndexInformation around this is much larger (e.g. 184 bytes, see screenshot). We need to take this into account and should probably add a fixed overhead of somewhere between 152 and 180 bytes according to my tests. I'm not 100% sure what the correct number is, and it'll also depend on the architecture etc., so we can't be exact anyway. If we do that, we can maybe get rid of the size field in ShuffleIndexInformation to save a few more bytes per entry. In effect this means that for small files we use up about 70-100 times as much memory as we intend to. Our NodeManagers OOM with 4GB and more of indexShuffleCache. was: SPARK-21501 changed the Spark shuffle index service to be based on memory instead of the number of files. Unfortunately, there's a problem with the calculation, which is based on size information provided by `ShuffleIndexInformation`. It is based purely on the file size of the cached file on disk. We're running into OOMs with very small index files (~16 bytes), but the overhead of the ShuffleIndexInformation around this is much larger (e.g. 184 bytes, see screenshot). We need to take this into account and should probably add a fixed overhead of somewhere between 152 and 180 bytes according to my tests. I'm not 100% sure what the correct number is, and it'll also depend on the architecture etc., so we can't be exact anyway. If we do that, we can maybe get rid of the size field in ShuffleIndexInformation to save a few more bytes per entry. In effect this means that for small files we use up about 70-100 times as much memory as we intend to. Our NodeManagers OOM with 4GB and more of indexShuffleCache. > Spark Shuffle Index Cache calculates memory usage wrong > --- > > Key: SPARK-33206 > URL: https://issues.apache.org/jira/browse/SPARK-33206 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.4.0, 3.0.1 >Reporter: Lars Francke >Priority: Major > Attachments: image001(1).png > > > SPARK-21501 changed the Spark shuffle index service to be based on memory > instead of the number of files. > Unfortunately, there's a problem with the calculation, which is based on size > information provided by `ShuffleIndexInformation`. > It is based purely on the file size of the cached file on disk. > We're running into OOMs with very small index files (~16 bytes), but > the overhead of the ShuffleIndexInformation around this is much larger (e.g. > 184 bytes, see screenshot). We need to take this into account and should > probably add a fixed overhead of somewhere between 152 and 180 bytes > according to my tests. I'm not 100% sure what the correct number is, and it'll > also depend on the architecture etc., so we can't be exact anyway. > If we do that, we can maybe get rid of the size field in > ShuffleIndexInformation to save a few more bytes per entry. > In effect this means that for small files we use up about 70-100 times as > much memory as we intend to. 
Our NodeManagers OOM with 4GB and more of > indexShuffleCache. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21501) Spark shuffle index cache size should be memory based
[ https://issues.apache.org/jira/browse/SPARK-21501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218285#comment-17218285 ] Lars Francke commented on SPARK-21501: -- Just FYI for others stumbling across this: This has a bug in how the memory is calculated and might use way more than the 100MB it intends to. See SPARK-33206 for details. > Spark shuffle index cache size should be memory based > - > > Key: SPARK-21501 > URL: https://issues.apache.org/jira/browse/SPARK-21501 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.1.0 >Reporter: Thomas Graves >Assignee: Sanket Reddy >Priority: Major > Fix For: 2.3.0 > > > Right now the Spark shuffle service has a cache for index files. It is based > on a # of files cached (spark.shuffle.service.index.cache.entries). This can > cause issues if people have a lot of reducers because the size of each entry > can fluctuate based on the # of reducers. > We saw an issue with a job that had 17 reducers, and it caused the NM running the > Spark shuffle service to use 700-800MB of memory by itself. > We should change this cache to be memory based and only allow a certain > amount of memory to be used. When I say memory based, I mean the cache should have a > limit of, say, 100MB. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
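For reference, the memory-based limit introduced by SPARK-21501 is a shuffle-service-side setting; a typical configuration looks like the line below (key name as documented for the external shuffle service; verify it against the Spark version you run):
{code}
# External shuffle service (e.g. in the NodeManager's Spark config)
spark.shuffle.service.index.cache.size=100m
{code}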
[jira] [Updated] (SPARK-33206) Spark Shuffle Index Cache calculates memory usage wrong
[ https://issues.apache.org/jira/browse/SPARK-33206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars Francke updated SPARK-33206: - Attachment: image001(1).png > Spark Shuffle Index Cache calculates memory usage wrong > --- > > Key: SPARK-33206 > URL: https://issues.apache.org/jira/browse/SPARK-33206 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.4.0, 3.0.1 >Reporter: Lars Francke >Priority: Major > Attachments: image001(1).png > > > SPARK-21501 changed the Spark shuffle index service to be based on memory > instead of the number of files. > Unfortunately, there's a problem with the calculation, which is based on size > information provided by `ShuffleIndexInformation`. > It is based purely on the file size of the cached file on disk. > We're running into OOMs with very small index files (~16 bytes), but > the overhead of the ShuffleIndexInformation around this is much larger (e.g. > 184 bytes, see screenshot). We need to take this into account and should > probably add a fixed overhead of somewhere between 152 and 180 bytes > according to my tests. I'm not 100% sure what the correct number is, and it'll > also depend on the architecture etc., so we can't be exact anyway. > If we do that, we can maybe get rid of the size field in > ShuffleIndexInformation to save a few more bytes per entry. > In effect this means that for small files we use up about 70-100 times as > much memory as we intend to. Our NodeManagers OOM with 4GB and more of > indexShuffleCache. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33206) Spark Shuffle Index Cache calculates memory usage wrong
Lars Francke created SPARK-33206: Summary: Spark Shuffle Index Cache calculates memory usage wrong Key: SPARK-33206 URL: https://issues.apache.org/jira/browse/SPARK-33206 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 3.0.1, 2.4.0 Reporter: Lars Francke Attachments: image001(1).png SPARK-21501 changed the Spark shuffle index service to be based on memory instead of the number of files. Unfortunately, there's a problem with the calculation, which is based on size information provided by `ShuffleIndexInformation`. It is based purely on the file size of the cached file on disk. We're running into OOMs with very small index files (~16 bytes), but the overhead of the ShuffleIndexInformation around this is much larger (e.g. 184 bytes, see screenshot). We need to take this into account and should probably add a fixed overhead of somewhere between 152 and 180 bytes according to my tests. I'm not 100% sure what the correct number is, and it'll also depend on the architecture etc., so we can't be exact anyway. If we do that, we can maybe get rid of the size field in ShuffleIndexInformation to save a few more bytes per entry. In effect this means that for small files we use up about 70-100 times as much memory as we intend to. Our NodeManagers OOM with 4GB and more of indexShuffleCache. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33205) Bump snappy-java version to 1.1.8
[ https://issues.apache.org/jira/browse/SPARK-33205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218267#comment-17218267 ] Apache Spark commented on SPARK-33205: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/30120 > Bump snappy-java version to 1.1.8 > - > > Key: SPARK-33205 > URL: https://issues.apache.org/jira/browse/SPARK-33205 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Priority: Major > > This ticket aims at upgrading snappy-java from 1.1.7.5 to 1.1.8. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33205) Bump snappy-java version to 1.1.8
[ https://issues.apache.org/jira/browse/SPARK-33205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33205: Assignee: Apache Spark > Bump snappy-java version to 1.1.8 > - > > Key: SPARK-33205 > URL: https://issues.apache.org/jira/browse/SPARK-33205 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: Apache Spark >Priority: Major > > This ticket aims at upgrading snappy-java from 1.1.7.5 to 1.1.8. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33205) Bump snappy-java version to 1.1.8
[ https://issues.apache.org/jira/browse/SPARK-33205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33205: Assignee: (was: Apache Spark) > Bump snappy-java version to 1.1.8 > - > > Key: SPARK-33205 > URL: https://issues.apache.org/jira/browse/SPARK-33205 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Priority: Major > > This ticket aims at upgrading snappy-java from 1.1.7.5 to 1.1.8. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33205) Bump snappy-java version to 1.1.8
Takeshi Yamamuro created SPARK-33205: Summary: Bump snappy-java version to 1.1.8 Key: SPARK-33205 URL: https://issues.apache.org/jira/browse/SPARK-33205 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.1.0 Reporter: Takeshi Yamamuro This ticket aims at upgrading snappy-java from 1.1.7.5 to 1.1.8. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33204) `Event Timeline` in Spark Job UI sometimes cannot be opened
[ https://issues.apache.org/jira/browse/SPARK-33204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218251#comment-17218251 ] Apache Spark commented on SPARK-33204: -- User 'akiyamaneko' has created a pull request for this issue: https://github.com/apache/spark/pull/30119 > `Event Timeline` in Spark Job UI sometimes cannot be opened > > > Key: SPARK-33204 > URL: https://issues.apache.org/jira/browse/SPARK-33204 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Priority: Minor > Fix For: 3.1.0 > > Attachments: reproduce.gif > > > The Event Timeline area cannot be expanded when a Spark application has some > failed jobs, as shown in the attachment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33204) `Event Timeline` in Spark Job UI sometimes cannot be opened
[ https://issues.apache.org/jira/browse/SPARK-33204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33204: Assignee: (was: Apache Spark) > `Event Timeline` in Spark Job UI sometimes cannot be opened > > > Key: SPARK-33204 > URL: https://issues.apache.org/jira/browse/SPARK-33204 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Priority: Minor > Fix For: 3.1.0 > > Attachments: reproduce.gif > > > The Event Timeline area cannot be expanded when a Spark application has some > failed jobs, as shown in the attachment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33204) `Event Timeline` in Spark Job UI sometimes cannot be opened
[ https://issues.apache.org/jira/browse/SPARK-33204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33204: Assignee: Apache Spark > `Event Timeline` in Spark Job UI sometimes cannot be opened > > > Key: SPARK-33204 > URL: https://issues.apache.org/jira/browse/SPARK-33204 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Assignee: Apache Spark >Priority: Minor > Fix For: 3.1.0 > > Attachments: reproduce.gif > > > The Event Timeline area cannot be expanded when a Spark application has some > failed jobs, as shown in the attachment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33204) `Event Timeline` in Spark Job UI sometimes cannot be opened
[ https://issues.apache.org/jira/browse/SPARK-33204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] akiyamaneko updated SPARK-33204: Summary: `Event Timeline` in Spark Job UI sometimes cannot be opened (was: `Event Timeline` in Spark Job UI sometimes cannot open) > `Event Timeline` in Spark Job UI sometimes cannot be opened > > > Key: SPARK-33204 > URL: https://issues.apache.org/jira/browse/SPARK-33204 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Priority: Minor > Fix For: 3.1.0 > > Attachments: reproduce.gif > > > The Event Timeline area cannot be expanded when a Spark application has some > failed jobs, as shown in the attachment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33204) `Event Timeline` in Spark Job UI sometimes cannot open
[ https://issues.apache.org/jira/browse/SPARK-33204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] akiyamaneko updated SPARK-33204: Attachment: reproduce.gif > `Event Timeline` in Spark Job UI sometimes cannot open > --- > > Key: SPARK-33204 > URL: https://issues.apache.org/jira/browse/SPARK-33204 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Priority: Minor > Fix For: 3.1.0 > > Attachments: reproduce.gif > > > The Event Timeline area cannot be expanded when a Spark application has some > failed jobs, as shown in the attachment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33204) `Event Timeline` in Spark Job UI sometimes cannot open
akiyamaneko created SPARK-33204: --- Summary: `Event Timeline` in Spark Job UI sometimes cannot open Key: SPARK-33204 URL: https://issues.apache.org/jira/browse/SPARK-33204 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.0.1 Reporter: akiyamaneko Fix For: 3.1.0 The Event Timeline area cannot be expanded when a Spark application has some failed jobs, as shown in the attachment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33160) Allow saving/loading INT96 in parquet w/o rebasing
[ https://issues.apache.org/jira/browse/SPARK-33160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218216#comment-17218216 ] Apache Spark commented on SPARK-33160: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30118 > Allow saving/loading INT96 in parquet w/o rebasing > -- > > Key: SPARK-33160 > URL: https://issues.apache.org/jira/browse/SPARK-33160 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0 > > > Currently, Spark always performs rebasing of INT96 columns in the Parquet > datasource, but this is not required by the Parquet spec. This ticket aims to > allow users to turn off rebasing via a SQL config. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
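As a usage sketch of what the ticket describes: writing INT96 timestamps with rebasing turned off via SQL configs. The exact key names and values here are assumptions modeled on the related datetime rebase options; check the documentation of the version you run.
{code:scala}
// Assumed config keys, for illustration only.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")

spark.range(1)
  .selectExpr("timestamp'2020-10-22 01:02:03' AS ts")
  .write.mode("overwrite").parquet("/tmp/int96_no_rebase")
{code}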
[jira] [Resolved] (SPARK-33196) Expose filtered aggregation API
[ https://issues.apache.org/jira/browse/SPARK-33196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erwan Guyomarc'h resolved SPARK-33196. -- Resolution: Won't Do > Expose filtered aggregation API > --- > > Key: SPARK-33196 > URL: https://issues.apache.org/jira/browse/SPARK-33196 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Erwan Guyomarc'h >Priority: Minor > > Spark currently supports filtered aggregations but does not expose an API > for using them from the `spark.sql.functions` package. > It is possible to use them when writing SQL directly: > {code:scala} > scala> val df = spark.range(100) > scala> df.registerTempTable("df") > scala> spark.sql("select count(1) as classic_cnt, count(1) FILTER (WHERE id < > 50) from df").show() > +-----------+-------------------------------------------------+ > |classic_cnt|count(1) FILTER (WHERE (id < CAST(50 AS BIGINT)))| > +-----------+-------------------------------------------------+ > |        100|                                               50| > +-----------+-------------------------------------------------+{code} > These aggregations are especially useful when filtering on overlapping > datasets (where a pivot would not work): > {code:sql} > SELECT > AVG(revenue) FILTER (WHERE age < 25), > AVG(revenue) FILTER (WHERE age < 35), > AVG(revenue) FILTER (WHERE age < 45) > FROM people;{code} > I did not find an issue tracking this, hence I am creating this one and I > will attach a PR to illustrate a possible implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
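Since this was closed Won't Do, the usual DataFrame-side workaround is conditional aggregation, which is equivalent to a FILTER clause for counts like the one above:
{code:scala}
import org.apache.spark.sql.functions._

val df = spark.range(100)
// count(1) FILTER (WHERE id < 50) expressed as a conditional aggregate:
// when() yields null where the predicate is false, and count() skips nulls.
df.agg(
  count(lit(1)).as("classic_cnt"),
  count(when(col("id") < 50, 1)).as("cnt_lt_50")
).show()
{code}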
[jira] [Commented] (SPARK-33203) Pyspark ml tests failing with rounding errors
[ https://issues.apache.org/jira/browse/SPARK-33203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218163#comment-17218163 ] Apache Spark commented on SPARK-33203: -- User 'AlessandroPatti' has created a pull request for this issue: https://github.com/apache/spark/pull/30104 > Pyspark ml tests failing with rounding errors > - > > Key: SPARK-33203 > URL: https://issues.apache.org/jira/browse/SPARK-33203 > Project: Spark > Issue Type: Test > Components: ML, PySpark >Affects Versions: 3.0.1 >Reporter: Alessandro Patti >Priority: Minor > > The tests _{{pyspark.ml.recommendation}}_ and > _{{pyspark.ml.tests.test_algorithms}}_ occasionally fail (depends on > environment) with > {code:java} > File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in > test_raw_and_probability_prediction > self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, > atol=1)) > AssertionError: False is not true{code} > {code:java} > File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in > __main__.ALS > Failed example: > predictions[0] > Expected: > Row(user=0, item=2, newPrediction=0.6929101347923279) > Got: > Row(user=0, item=2, newPrediction=0.6929104924201965) > ...{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33203) Pyspark ml tests failing with rounding errors
[ https://issues.apache.org/jira/browse/SPARK-33203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33203: Assignee: Apache Spark > Pyspark ml tests failing with rounding errors > - > > Key: SPARK-33203 > URL: https://issues.apache.org/jira/browse/SPARK-33203 > Project: Spark > Issue Type: Test > Components: ML, PySpark >Affects Versions: 3.0.1 >Reporter: Alessandro Patti >Assignee: Apache Spark >Priority: Minor > > The tests _{{pyspark.ml.recommendation}}_ and > _{{pyspark.ml.tests.test_algorithms}}_ occasionally fail (depends on > environment) with > {code:java} > File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in > test_raw_and_probability_prediction > self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, > atol=1)) > AssertionError: False is not true{code} > {code:java} > File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in > __main__.ALS > Failed example: > predictions[0] > Expected: > Row(user=0, item=2, newPrediction=0.6929101347923279) > Got: > Row(user=0, item=2, newPrediction=0.6929104924201965) > ...{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33203) Pyspark ml tests failing with rounding errors
[ https://issues.apache.org/jira/browse/SPARK-33203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218162#comment-17218162 ] Apache Spark commented on SPARK-33203: -- User 'AlessandroPatti' has created a pull request for this issue: https://github.com/apache/spark/pull/30104 > Pyspark ml tests failing with rounding errors > - > > Key: SPARK-33203 > URL: https://issues.apache.org/jira/browse/SPARK-33203 > Project: Spark > Issue Type: Test > Components: ML, PySpark >Affects Versions: 3.0.1 >Reporter: Alessandro Patti >Priority: Minor > > The tests _{{pyspark.ml.recommendation}}_ and > _{{pyspark.ml.tests.test_algorithms}}_ occasionally fail (depends on > environment) with > {code:java} > File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in > test_raw_and_probability_prediction > self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, > atol=1)) > AssertionError: False is not true{code} > {code:java} > File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in > __main__.ALS > Failed example: > predictions[0] > Expected: > Row(user=0, item=2, newPrediction=0.6929101347923279) > Got: > Row(user=0, item=2, newPrediction=0.6929104924201965) > ...{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33203) Pyspark ml tests failing with rounding errors
[ https://issues.apache.org/jira/browse/SPARK-33203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33203: Assignee: (was: Apache Spark) > Pyspark ml tests failing with rounding errors > - > > Key: SPARK-33203 > URL: https://issues.apache.org/jira/browse/SPARK-33203 > Project: Spark > Issue Type: Test > Components: ML, PySpark >Affects Versions: 3.0.1 >Reporter: Alessandro Patti >Priority: Minor > > The tests _{{pyspark.ml.recommendation}}_ and > _{{pyspark.ml.tests.test_algorithms}}_ occasionally fail (depends on > environment) with > {code:java} > File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in > test_raw_and_probability_prediction > self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, > atol=1)) > AssertionError: False is not true{code} > {code:java} > File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in > __main__.ALS > Failed example: > predictions[0] > Expected: > Row(user=0, item=2, newPrediction=0.6929101347923279) > Got: > Row(user=0, item=2, newPrediction=0.6929104924201965) > ...{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33203) Pyspark ml tests failing with rounding errors
[ https://issues.apache.org/jira/browse/SPARK-33203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alessandro Patti updated SPARK-33203: - Description: The tests _{{pyspark.ml.recommendation}}_ and _{{pyspark.ml.tests.test_algorithms}}_ occasionally fail (depends on environment) with {code:java} File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in test_raw_and_probability_prediction self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1)) AssertionError: False is not true{code} {code:java} File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in __main__.ALS Failed example: predictions[0] Expected: Row(user=0, item=2, newPrediction=0.6929101347923279) Got: Row(user=0, item=2, newPrediction=0.6929104924201965) ...{code} was: The tests `pyspark.ml.recommendation` and `pyspark.ml.tests.test_algorithms` occasionally fail (depends on environment) with {code:java} File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in test_raw_and_probability_prediction self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1)) AssertionError: False is not true{code} {code:java} File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in __main__.ALS Failed example: predictions[0] Expected: Row(user=0, item=2, newPrediction=0.6929101347923279) Got: Row(user=0, item=2, newPrediction=0.6929104924201965) ...{code} > Pyspark ml tests failing with rounding errors > - > > Key: SPARK-33203 > URL: https://issues.apache.org/jira/browse/SPARK-33203 > Project: Spark > Issue Type: Test > Components: ML, PySpark >Affects Versions: 3.0.1 >Reporter: Alessandro Patti >Priority: Minor > > The tests _{{pyspark.ml.recommendation}}_ and > _{{pyspark.ml.tests.test_algorithms}}_ occasionally fail (depends on > environment) with > {code:java} > File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in > test_raw_and_probability_prediction > self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, > atol=1)) > AssertionError: False is not true{code} > {code:java} > File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in > __main__.ALS > Failed example: > predictions[0] > Expected: > Row(user=0, item=2, newPrediction=0.6929101347923279) > Got: > Row(user=0, item=2, newPrediction=0.6929104924201965) > ...{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33203) Pyspark ml tests failing with rounding errors
Alessandro Patti created SPARK-33203: Summary: Pyspark ml tests failing with rounding errors Key: SPARK-33203 URL: https://issues.apache.org/jira/browse/SPARK-33203 Project: Spark Issue Type: Test Components: ML, PySpark Affects Versions: 3.0.1 Reporter: Alessandro Patti The tests `pyspark.ml.recommendation` and `pyspark.ml.tests.test_algorithms` occasionally fail (depends on environment) with {code:java} File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in test_raw_and_probability_prediction self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1)) AssertionError: False is not true{code} {code:java} File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in __main__.ALS Failed example: predictions[0] Expected: Row(user=0, item=2, newPrediction=0.6929101347923279) Got: Row(user=0, item=2, newPrediction=0.6929104924201965) ...{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
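The gist of the flakiness above is exact comparison of floating-point results that legitimately vary across environments; the remedy is a tolerance-based check. A minimal sketch of the numpy-style allclose test involved (tolerances are illustrative):
{code:scala}
// Element-wise allclose, mirroring numpy's |a - b| <= atol + rtol * |b|.
def allClose(a: Array[Double], b: Array[Double],
             rtol: Double = 1e-5, atol: Double = 1e-8): Boolean =
  a.length == b.length && a.zip(b).forall { case (x, y) =>
    math.abs(x - y) <= atol + rtol * math.abs(y)
  }

// The two ALS predictions above differ only in the 7th decimal place:
allClose(Array(0.6929101347923279), Array(0.6929104924201965)) // true
{code}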
[jira] [Commented] (SPARK-32785) interval with dangling part should not result in null
[ https://issues.apache.org/jira/browse/SPARK-32785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218155#comment-17218155 ] Apache Spark commented on SPARK-32785: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/30117 > interval with dangling part should not result in null > --- > > Key: SPARK-32785 > URL: https://issues.apache.org/jira/browse/SPARK-32785 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > bin/spark-sql -S -e "select interval '1', interval '+', interval '1 day -'" > NULL NULL NULL > We should fail these cases correctly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
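A minimal sketch of the strictness being asked for: reject a dangling sign or a value with no unit instead of silently yielding null. This uses a deliberately simplified grammar for illustration and is not Spark's actual interval parser.
{code:scala}
// Hypothetical strict parser: every token must be a signed number plus a
// unit; any leftover text (a trailing '+', '-', bare number, ...) is an error.
def parseIntervalStrict(s: String): Map[String, Long] = {
  val token = raw"([+-]?\d+)\s+(day|hour|minute|second)s?".r
  val units = token.findAllMatchIn(s).map(m => m.group(2) -> m.group(1).toLong).toMap
  val leftover = token.replaceAllIn(s, "").trim
  if (units.isEmpty || leftover.nonEmpty)
    throw new IllegalArgumentException(s"Cannot parse interval: '$s'")
  units
}

parseIntervalStrict("1 day")   // Map(day -> 1)
parseIntervalStrict("1 day -") // throws instead of returning NULL
{code}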
[jira] [Commented] (SPARK-32185) User Guide - Monitoring
[ https://issues.apache.org/jira/browse/SPARK-32185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218143#comment-17218143 ] Hyukjin Kwon commented on SPARK-32185: -- [~a7prasad] is there any update on this :-)? > User Guide - Monitoring > --- > > Key: SPARK-32185 > URL: https://issues.apache.org/jira/browse/SPARK-32185 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Abhijeet Prasad >Priority: Major > > Monitoring. We should focus on how to monitor PySpark jobs. > - Custom Worker, see also > https://github.com/apache/spark/tree/master/python/test_coverage to enable > test coverage that includes the worker side too. > - Sentry Support \(?\) > https://blog.sentry.io/2019/11/12/sentry-for-data-error-monitoring-with-pyspark > - Link back https://spark.apache.org/docs/latest/monitoring.html . > - ... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32881) NoSuchElementException occurs during decommissioning
[ https://issues.apache.org/jira/browse/SPARK-32881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32881. --- Fix Version/s: 3.1.0 Assignee: Holden Karau Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/29992 > NoSuchElementException occurs during decommissioning > > > Key: SPARK-32881 > URL: https://issues.apache.org/jira/browse/SPARK-32881 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Holden Karau >Priority: Major > Fix For: 3.1.0 > > > `BlockManagerMasterEndpoint` seems to fail at `getReplicateInfoForRDDBlocks` > due to `java.util.NoSuchElementException`. This happens in K8s IT testing, > but the main code needs to handle > `NoSuchElementException` gracefully instead of surfacing a naive error message. > {code} > private def getReplicateInfoForRDDBlocks(blockManagerId: BlockManagerId): > Seq[ReplicateBlock] = { > val info = blockManagerInfo(blockManagerId) >... > } > {code} > {code} > 20/09/14 18:56:54 INFO ExecutorPodsAllocator: Going to request 1 executors > from Kubernetes. > 20/09/14 18:56:54 INFO BasicExecutorFeatureStep: Adding decommission script > to lifecycle > 20/09/14 18:56:55 ERROR TaskSchedulerImpl: Lost executor 1 on 172.17.0.4: > Executor decommission. > 20/09/14 18:56:55 INFO BlockManagerMaster: Removal of executor 1 requested > 20/09/14 18:56:55 INFO > KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove > non-existent executor 1 > 20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Trying to remove > executor 1 from BlockManagerMaster. > 20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Removing block manager > BlockManagerId(1, 172.17.0.4, 41235, None) > 20/09/14 18:56:55 INFO DAGScheduler: Executor lost: 1 (epoch 1) > 20/09/14 18:56:55 ERROR Inbox: Ignoring error > java.util.NoSuchElementException > at scala.collection.concurrent.TrieMap.apply(TrieMap.scala:833) > at > org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$getReplicateInfoForRDDBlocks(BlockManagerMasterEndpoint.scala:383) > at > org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:171) > at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Trying to remove > executor 1 from BlockManagerMaster. 
> 20/09/14 18:56:55 INFO BlockManagerMaster: Removed 1 successfully in > removeExecutor > 20/09/14 18:56:55 INFO DAGScheduler: Shuffle files lost for executor: 1 > (epoch 1) > 20/09/14 18:56:58 INFO > KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered > executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.7:46674) with > ID 4, ResourceProfileId 0 > 20/09/14 18:56:58 INFO BlockManagerMasterEndpoint: Registering block > manager 172.17.0.7:40495 with 593.9 MiB RAM, BlockManagerId(4, 172.17.0.7, > 40495, None) > 20/09/14 18:57:23 INFO SparkContext: Starting job: count at > /opt/spark/tests/decommissioning.py:49 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
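The shape of the fix here is to replace the throwing map lookup with an Option-based one, treating an already-removed executor as "nothing to replicate". A simplified sketch with stand-in types (the real change is in PR 29992):
{code:scala}
import scala.collection.concurrent.TrieMap

// Stand-ins for the real Spark types, for illustration only.
final case class BlockManagerId(host: String)
final case class ReplicateBlock(blockId: String)

val blockManagerInfo = TrieMap.empty[BlockManagerId, Seq[ReplicateBlock]]

// Before: blockManagerInfo(id) throws NoSuchElementException when the
// executor was already removed. After: a missing entry is not an error.
def getReplicateInfoForRDDBlocks(id: BlockManagerId): Seq[ReplicateBlock] =
  blockManagerInfo.get(id).getOrElse(Seq.empty)
{code}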