[jira] [Commented] (SPARK-28121) String Functions: decode/encode can not accept 'escape' and 'hex' as charset
[ https://issues.apache.org/jira/browse/SPARK-28121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935586#comment-16935586 ] jiaan.geng commented on SPARK-28121: I wrote a test, TransformTest, as follows:
{code:java}
import java.math.BigInteger;

public class TransformTest {
    public static void main(String[] args) {
        String s = "1234567890";
        byte[] bytes = s.getBytes();
        System.out.println("hex: " + binary(bytes, 16));
        System.exit(0);
    }

    public static String binary(byte[] bytes, int radix) {
        // the signum argument 1 means the bytes are treated as a positive number
        return new BigInteger(1, bytes).toString(radix);
    }
}
// the result is 31323334353637383930
{code}
The Java code transforms the string to hex; the result is the same as that of PostgreSQL's decode function. > String Functions: decode/encode can not accept 'escape' and 'hex' as charset > > > Key: SPARK-28121 > URL: https://issues.apache.org/jira/browse/SPARK-28121 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > postgres=# select decode('1234567890','escape'); > decode > > \x31323334353637383930 > (1 row) > {noformat} > {noformat} > spark-sql> select decode('1234567890','escape'); > 19/06/20 01:57:33 ERROR SparkSQLDriver: Failed in [select > decode('1234567890','escape')] > java.io.UnsupportedEncodingException: escape > at java.lang.StringCoding.decode(StringCoding.java:190) > at java.lang.String.(String.java:426) > at java.lang.String.(String.java:491) > ... > spark-sql> select decode('ff','hex'); > 19/08/16 21:44:55 ERROR SparkSQLDriver: Failed in [select decode('ff','hex')] > java.io.UnsupportedEncodingException: hex > at java.lang.StringCoding.decode(StringCoding.java:190) > at java.lang.String.(String.java:426) > at java.lang.String.(String.java:491) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
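For the reverse direction, a 'hex' decode just parses each pair of hex digits back into a byte. A minimal JDK-only sketch of that behavior (the class and method names are mine, not Spark's or PostgreSQL's implementation):

```java
import java.nio.charset.StandardCharsets;

public class HexDecodeSketch {
    // Parse a hex string such as "31323334353637383930" back into raw bytes.
    static byte[] decodeHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = decodeHex("31323334353637383930");
        System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // prints 1234567890
    }
}
```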
[jira] [Commented] (SPARK-28121) String Functions: decode/encode can not accept 'escape' and 'hex' as charset
[ https://issues.apache.org/jira/browse/SPARK-28121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935582#comment-16935582 ] jiaan.geng commented on SPARK-28121: [~smilegator] I found that PostgreSQL's encode function has some issues. First, according to the description of the encode function, it encodes binary data into a textual representation. Supported formats are: {{base64}}, {{hex}}, {{escape}}. {{escape}} converts zero bytes and high-bit-set bytes to octal sequences ({{\}}_{{nnn}}_) and doubles backslashes. However, I observe different behavior:
{code:java}
select encode(E'123//000456'::bytea, 'escape');
select encode('123//000456'::bytea, 'escape');
// all the results are '123//000456'
{code}
It seems to have no effect. MySQL's encode function is used to encrypt and its decode to decrypt. Vertica's encode function is used for comparison. No other mainstream database implements this function. > String Functions: decode/encode can not accept 'escape' and 'hex' as charset > > > Key: SPARK-28121 > URL: https://issues.apache.org/jira/browse/SPARK-28121 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > postgres=# select decode('1234567890','escape'); > decode > > \x31323334353637383930 > (1 row) > {noformat} > {noformat} > spark-sql> select decode('1234567890','escape'); > 19/06/20 01:57:33 ERROR SparkSQLDriver: Failed in [select > decode('1234567890','escape')] > java.io.UnsupportedEncodingException: escape > at java.lang.StringCoding.decode(StringCoding.java:190) > at java.lang.String.(String.java:426) > at java.lang.String.(String.java:491) > ... 
> spark-sql> select decode('ff','hex'); > 19/08/16 21:44:55 ERROR SparkSQLDriver: Failed in [select decode('ff','hex')] > java.io.UnsupportedEncodingException: hex > at java.lang.StringCoding.decode(StringCoding.java:190) > at java.lang.String.(String.java:426) > at java.lang.String.(String.java:491) > {noformat}
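The escape rules quoted above (zero bytes and high-bit-set bytes become octal \nnn sequences, backslashes are doubled) are straightforward to mirror in plain Java. This is a sketch of the documented behavior only, not PostgreSQL's actual code; the names are mine:

```java
public class EscapeEncodeSketch {
    // Mirror PostgreSQL's documented encode(..., 'escape') rules:
    // zero bytes and high-bit-set bytes -> \nnn (octal), backslash -> \\
    static String encodeEscape(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int v = b & 0xFF;
            if (v == 0 || v >= 128) {
                sb.append(String.format("\\%03o", v));
            } else if (v == '\\') {
                sb.append("\\\\");
            } else {
                sb.append((char) v);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A zero byte embedded between "123" and "456", as in the bytea example above
        System.out.println(encodeEscape(new byte[] {'1', '2', '3', 0, '4', '5', '6'})); // prints 123\000456
    }
}
```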
[jira] [Commented] (SPARK-29205) Pyspark tests failed for suspected performance problem on ARM
[ https://issues.apache.org/jira/browse/SPARK-29205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935578#comment-16935578 ] huangtianhua commented on SPARK-29205: -- We also found a similar issue the community faced before: [https://github.com/apache/spark/commit/ab76900fedc05df7080c9b6c81d65a3f260c1c26#diff-f7e50078760ce2d40f35e4c3b9112227]. If we increase the timeout, the tests pass. > Pyspark tests failed for suspected performance problem on ARM > - > > Key: SPARK-29205 > URL: https://issues.apache.org/jira/browse/SPARK-29205 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 > Environment: OS: Ubuntu16.04 > Arch: aarch64 > Host: Virtual Machine >Reporter: zhao bo >Priority: Major > > We test the pyspark on ARM VM. But found some test fails, once we change the > source code to extend the wait time for making sure those test tasks had > finished, then the test will pass. > > The affected test cases including: > pyspark.mllib.tests.test_streaming_algorithms:StreamingLinearRegressionWithTests.test_parameter_convergence > pyspark.mllib.tests.test_streaming_algorithms:StreamingLogisticRegressionWithSGDTests.test_convergence > pyspark.mllib.tests.test_streaming_algorithms:StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy > pyspark.mllib.tests.test_streaming_algorithms:StreamingLogisticRegressionWithSGDTests.test_training_and_prediction > The error message about above test fails: > == > FAIL: test_parameter_convergence > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests) > Test that the model parameters improve with streaming data. 
> -- > Traceback (most recent call last): > File > "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 429, in test_parameter_convergen ce > self._eventually(condition, catch_assertions=True) > File > "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 74, in _eventually > raise lastValue > File > "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 65, in _eventually > lastValue = condition() > File > "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 425, in condition > self.assertEqual(len(model_weights), len(batches)) > AssertionError: 6 != 10 > > > == > FAIL: test_convergence > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > -- > Traceback (most recent call last): > File > "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 292, in test_convergence > self._eventually(condition, 60.0, catch_assertions=True) > File > "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 74, in _eventually > raise lastValue > File > "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 65, in _eventually > lastValue = condition() > File > "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 288, in condition > self.assertEqual(len(models), len(input_batches)) > AssertionError: 19 != 20 > > == > FAIL: test_parameter_accuracy > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > -- > Traceback (most recent call last): > File > "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 266, in test_parameter_accuracy > self._eventually(condition, catch_assertions=True) > File > "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 74, 
in _eventually > raise lastValue > File > "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 65, in _eventually > lastValue = condition() > File > "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 263, in condition > self.assertAlmostEqual(rel, 0.1, 1) > AssertionError: 0.21309223935797794 != 0.1 within 1 places > > == > FAIL: test_training_and_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressio
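All of the failing tests go through the suite's `_eventually` helper, which re-checks a condition until a deadline; the fix under discussion is simply a larger deadline for slower hosts. A generic Java sketch of that retry-until-timeout pattern (names and defaults are mine, not the PySpark code):

```java
import java.util.function.BooleanSupplier;

public class EventuallySketch {
    // Re-evaluate `condition` until it holds or `timeoutMs` elapses,
    // sleeping `intervalMs` between attempts; returns the final outcome.
    static boolean eventually(BooleanSupplier condition, long timeoutMs, long intervalMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(intervalMs);
        }
        return condition.getAsBoolean(); // one last check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // The condition only becomes true after ~50 ms; a generous timeout
        // lets it pass on slow machines instead of failing immediately.
        boolean ok = eventually(() -> System.currentTimeMillis() - start >= 50, 2000, 10);
        System.out.println(ok); // prints true
    }
}
```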
[jira] [Commented] (SPARK-29210) Spark 3.0 Migration guide should contain the details of to_utc_timestamp function and from_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-29210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935577#comment-16935577 ] Sharanabasappa G Keriwaddi commented on SPARK-29210: I will work on this > Spark 3.0 Migration guide should contain the details of to_utc_timestamp > function and from_utc_timestamp > > > Key: SPARK-29210 > URL: https://issues.apache.org/jira/browse/SPARK-29210 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > to_utc_timestamp and from_utc_timestamp are deprecated in Spark 3.0. Details > should be contained in the Spark 3.0 migration guide. > > 0: jdbc:hive2://10.18.19.208:23040/default> SELECT > from_utc_timestamp('2016-08-31', 'Asia/Seoul'); > Error: org.apache.spark.sql.AnalysisException: The from_utc_timestamp > function has been disabled since Spark 3.0. Set > spark.sql.legacy.utcTimestampFunc.enabled to true to enable this function.;; > line 1 pos 7 (state=,code=0) > 0: jdbc:hive2://10.18.19.208:23040/default> SELECT > to_utc_timestamp('2016-08-31', 'Asia/Seoul'); > Error: org.apache.spark.sql.AnalysisException: The to_utc_timestamp function > has been disabled since Spark 3.0. Set > spark.sql.legacy.utcTimestampFunc.enabled to true to enable this function.;; > line 1 pos 7 (state=,code=0)
[jira] [Updated] (SPARK-29209) Print build environment variables to Github
[ https://issues.apache.org/jira/browse/SPARK-29209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29209: Description: This PR makes it support print AMPLAB_JENKINS_BUILD_TOOL, AMPLAB_JENKINS_BUILD_PROFILE and JAVA_HOME environment to Github once this test finished. (was: When running tests for a pull request on Jenkins, you can add special phrases to the title of your pull request to change testing behavior. This includes: [test-maven] - signals to test the pull request using maven [test-hadoop2.7] - signals to test using Spark’s Hadoop 2.7 profile [test-hadoop3.2] - signals to test using Spark’s Hadoop 3.2 profile [test-hadoop3.2][test-java11] - signals to test using Spark’s Hadoop 3.2 profile with JDK 11) > Print build environment variables to Github > --- > > Key: SPARK-29209 > URL: https://issues.apache.org/jira/browse/SPARK-29209 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > This PR makes it support print AMPLAB_JENKINS_BUILD_TOOL, > AMPLAB_JENKINS_BUILD_PROFILE and JAVA_HOME environment to Github once this > test finished.
[jira] [Updated] (SPARK-29209) Print build environment variables to Github
[ https://issues.apache.org/jira/browse/SPARK-29209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29209: Description: Makes it support print AMPLAB_JENKINS_BUILD_TOOL, AMPLAB_JENKINS_BUILD_PROFILE and JAVA_HOME environment to Github once this test finished. (was: This PR makes it support print AMPLAB_JENKINS_BUILD_TOOL, AMPLAB_JENKINS_BUILD_PROFILE and JAVA_HOME environment to Github once this test finished.) > Print build environment variables to Github > --- > > Key: SPARK-29209 > URL: https://issues.apache.org/jira/browse/SPARK-29209 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Makes it support print AMPLAB_JENKINS_BUILD_TOOL, > AMPLAB_JENKINS_BUILD_PROFILE and JAVA_HOME environment to Github once this > test finished.
[jira] [Created] (SPARK-29210) Spark 3.0 Migration guide should contain the details of to_utc_timestamp function and from_utc_timestamp
ABHISHEK KUMAR GUPTA created SPARK-29210: Summary: Spark 3.0 Migration guide should contain the details of to_utc_timestamp function and from_utc_timestamp Key: SPARK-29210 URL: https://issues.apache.org/jira/browse/SPARK-29210 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA to_utc_timestamp and from_utc_timestamp are deprecated in Spark 3.0. Details should be contained in the Spark 3.0 migration guide. 0: jdbc:hive2://10.18.19.208:23040/default> SELECT from_utc_timestamp('2016-08-31', 'Asia/Seoul'); Error: org.apache.spark.sql.AnalysisException: The from_utc_timestamp function has been disabled since Spark 3.0. Set spark.sql.legacy.utcTimestampFunc.enabled to true to enable this function.;; line 1 pos 7 (state=,code=0) 0: jdbc:hive2://10.18.19.208:23040/default> SELECT to_utc_timestamp('2016-08-31', 'Asia/Seoul'); Error: org.apache.spark.sql.AnalysisException: The to_utc_timestamp function has been disabled since Spark 3.0. Set spark.sql.legacy.utcTimestampFunc.enabled to true to enable this function.;; line 1 pos 7 (state=,code=0)
[jira] [Created] (SPARK-29209) Print build environment variables to Github
Yuming Wang created SPARK-29209: --- Summary: Print build environment variables to Github Key: SPARK-29209 URL: https://issues.apache.org/jira/browse/SPARK-29209 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: Yuming Wang When running tests for a pull request on Jenkins, you can add special phrases to the title of your pull request to change testing behavior. This includes: [test-maven] - signals to test the pull request using maven [test-hadoop2.7] - signals to test using Spark’s Hadoop 2.7 profile [test-hadoop3.2] - signals to test using Spark’s Hadoop 3.2 profile [test-hadoop3.2][test-java11] - signals to test using Spark’s Hadoop 3.2 profile with JDK 11
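Reading and printing those build variables needs nothing beyond `System.getenv`. A minimal JDK-only sketch of the idea (the "(unset)" fallback is my own, not part of the proposal):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BuildEnvSketch {
    // Collect the build-related variables named in the issue, with a
    // placeholder for any that are not set in the current environment.
    static Map<String, String> buildEnv() {
        Map<String, String> env = new LinkedHashMap<>();
        for (String name : new String[] {
                "AMPLAB_JENKINS_BUILD_TOOL", "AMPLAB_JENKINS_BUILD_PROFILE", "JAVA_HOME"}) {
            env.put(name, System.getenv().getOrDefault(name, "(unset)"));
        }
        return env;
    }

    public static void main(String[] args) {
        buildEnv().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```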
[jira] [Commented] (SPARK-29207) Document LIST JAR in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-29207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935574#comment-16935574 ] Sandeep Katta commented on SPARK-29207: --- LIST JAR and LIST FILE are both part of the LIST command; you can use either JIRA to fix this. > Document LIST JAR in SQL Reference > -- > > Key: SPARK-29207 > URL: https://issues.apache.org/jira/browse/SPARK-29207 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major >
[jira] [Commented] (SPARK-29208) Document LIST FILE in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-29208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935568#comment-16935568 ] Huaxin Gao commented on SPARK-29208: I will work on this > Document LIST FILE in SQL Reference > --- > > Key: SPARK-29208 > URL: https://issues.apache.org/jira/browse/SPARK-29208 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major >
[jira] [Created] (SPARK-29208) Document LIST FILE in SQL Reference
Huaxin Gao created SPARK-29208: -- Summary: Document LIST FILE in SQL Reference Key: SPARK-29208 URL: https://issues.apache.org/jira/browse/SPARK-29208 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 3.0.0 Reporter: Huaxin Gao
[jira] [Commented] (SPARK-29207) Document LIST JAR in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-29207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935567#comment-16935567 ] Huaxin Gao commented on SPARK-29207: I will work on this > Document LIST JAR in SQL Reference > -- > > Key: SPARK-29207 > URL: https://issues.apache.org/jira/browse/SPARK-29207 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major >
[jira] [Created] (SPARK-29207) Document LIST JAR in SQL Reference
Huaxin Gao created SPARK-29207: -- Summary: Document LIST JAR in SQL Reference Key: SPARK-29207 URL: https://issues.apache.org/jira/browse/SPARK-29207 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 3.0.0 Reporter: Huaxin Gao
[jira] [Commented] (SPARK-29206) Number of shuffle Netty server threads should be a multiple of number of chunk fetch handler threads
[ https://issues.apache.org/jira/browse/SPARK-29206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935556#comment-16935556 ] Min Shen commented on SPARK-29206: -- We initially tried an alternative approach to resolve this issue by implementing a custom Netty EventExecutorChooserFactory, so Spark shuffle Netty server can be a bit more intelligent at choosing a thread among an EventLoopGroup to be associated with a new channel. In latest version of Netty 4.1, each (Nio|Epoll)EventLoop exposes information about its number of pending tasks and registered channels. We initially thought we could use these metrics to do better at load balancing so to avoid registering a channel with a busy EventLoop. However, as we implemented this approach, we realized that the state of an EventLoop at channel registration time could be very different from when an RPC request from this channel is placed in the task queue of this EventLoop later. Since there is no way to precisely tell the state of an EventLoop in the future, we gave up on this approach. > Number of shuffle Netty server threads should be a multiple of number of > chunk fetch handler threads > > > Key: SPARK-29206 > URL: https://issues.apache.org/jira/browse/SPARK-29206 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Min Shen >Priority: Major > > In SPARK-24355, we proposed to use a separate chunk fetch handler thread pool > to handle the slow-to-process chunk fetch requests in order to improve the > responsiveness of shuffle service for RPC requests. > Initially, we thought by making the number of Netty server threads larger > than the number of chunk fetch handler threads, it would reserve some threads > for RPC requests thus resolving the various RPC request timeout issues we > experienced previously. The solution worked in our cluster initially. 
> However, as the number of Spark applications in our cluster continues to > increase, we saw the RPC request (SASL authentication specifically) timeout > issue again: > {noformat} > java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout > waiting for task. > at > org.spark-project.guava.base.Throwables.propagate(Throwables.java:160) > at > org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278) > at > org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) > at > org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181) > at > org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141) > at > org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218) > {noformat} > After further investigation, we realized that as the number of concurrent > clients connecting to a shuffle service increases, it becomes _VERY_ > important to configure the number of Netty server threads and number of chunk > fetch handler threads correctly. Specifically, the number of Netty server > threads needs to be a multiple of the number of chunk fetch handler threads. > The reason is explained in details below: > When a channel is established on the Netty server, it is registered with both > the Netty server default EventLoopGroup and the chunk fetch handler > EventLoopGroup. Once registered, this channel sticks with a given thread in > both EventLoopGroups, i.e. all requests from this channel is going to be > handled by the same thread. 
Right now, Spark shuffle Netty server uses the > default Netty strategy to select a thread from a EventLoopGroup to be > associated with a new channel, which is simply round-robin (Netty's > DefaultEventExecutorChooserFactory). > In SPARK-24355, with the introduced chunk fetch handler thread pool, all > chunk fetch requests from a given channel will be first added to the task > queue of the chunk fetch handler thread associated with that channel. When > the requests get processed, the chunk fetch request handler thread will > submit a task to the task queue of the Netty server thread that's also > associated with this channel. If the number of Netty server threads is not a > multiple of the number of chunk fetch handler threads, it would become a > problem when the server has a large number of concurrent connections. > Assume we configure the number of Netty server threads as 40 and the > percentage of chunk fetch handler threads as 87, which leads
[jira] [Commented] (SPARK-29206) Number of shuffle Netty server threads should be a multiple of number of chunk fetch handler threads
[ https://issues.apache.org/jira/browse/SPARK-29206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935552#comment-16935552 ] Min Shen commented on SPARK-29206: -- [~redsanket], [~tgraves], Since you worked on committing the original patch, would appreciate your comments here. > Number of shuffle Netty server threads should be a multiple of number of > chunk fetch handler threads > > > Key: SPARK-29206 > URL: https://issues.apache.org/jira/browse/SPARK-29206 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Min Shen >Priority: Major > > In SPARK-24355, we proposed to use a separate chunk fetch handler thread pool > to handle the slow-to-process chunk fetch requests in order to improve the > responsiveness of shuffle service for RPC requests. > Initially, we thought by making the number of Netty server threads larger > than the number of chunk fetch handler threads, it would reserve some threads > for RPC requests thus resolving the various RPC request timeout issues we > experienced previously. The solution worked in our cluster initially. > However, as the number of Spark applications in our cluster continues to > increase, we saw the RPC request (SASL authentication specifically) timeout > issue again: > {noformat} > java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout > waiting for task. 
> at > org.spark-project.guava.base.Throwables.propagate(Throwables.java:160) > at > org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278) > at > org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) > at > org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181) > at > org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141) > at > org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218) > {noformat} > After further investigation, we realized that as the number of concurrent > clients connecting to a shuffle service increases, it becomes _VERY_ > important to configure the number of Netty server threads and number of chunk > fetch handler threads correctly. Specifically, the number of Netty server > threads needs to be a multiple of the number of chunk fetch handler threads. > The reason is explained in details below: > When a channel is established on the Netty server, it is registered with both > the Netty server default EventLoopGroup and the chunk fetch handler > EventLoopGroup. Once registered, this channel sticks with a given thread in > both EventLoopGroups, i.e. all requests from this channel is going to be > handled by the same thread. Right now, Spark shuffle Netty server uses the > default Netty strategy to select a thread from a EventLoopGroup to be > associated with a new channel, which is simply round-robin (Netty's > DefaultEventExecutorChooserFactory). > In SPARK-24355, with the introduced chunk fetch handler thread pool, all > chunk fetch requests from a given channel will be first added to the task > queue of the chunk fetch handler thread associated with that channel. 
When > the requests get processed, the chunk fetch request handler thread will > submit a task to the task queue of the Netty server thread that's also > associated with this channel. If the number of Netty server threads is not a > multiple of the number of chunk fetch handler threads, it would become a > problem when the server has a large number of concurrent connections. > Assume we configure the number of Netty server threads as 40 and the > percentage of chunk fetch handler threads as 87, which leads to 35 chunk > fetch handler threads. Then according to the round-robin policy, channel 0, > 40, 80, 120, 160, 200, 240, and 280 will all be associated with the 1st Netty > server thread in the default EventLoopGroup. However, since the chunk fetch > handler thread pool only has 35 threads, out of these 8 channels, only > channel 0 and 280 will be associated with the same chunk fetch handler > thread. Thus, channel 0, 40, 80, 120, 160, 200, 240 will all be associated > with different chunk fetch handler threads but associated with the same Netty > server thread. This means, the 7 different chunk fetch handler threads > associated with these channels could potentially submit tasks to the task > queue of the same Netty server thread at the same time. This would
[jira] [Created] (SPARK-29206) Number of shuffle Netty server threads should be a multiple of number of chunk fetch handler threads
Min Shen created SPARK-29206: Summary: Number of shuffle Netty server threads should be a multiple of number of chunk fetch handler threads Key: SPARK-29206 URL: https://issues.apache.org/jira/browse/SPARK-29206 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.0.0 Reporter: Min Shen In SPARK-24355, we proposed to use a separate chunk fetch handler thread pool to handle the slow-to-process chunk fetch requests in order to improve the responsiveness of shuffle service for RPC requests. Initially, we thought by making the number of Netty server threads larger than the number of chunk fetch handler threads, it would reserve some threads for RPC requests thus resolving the various RPC request timeout issues we experienced previously. The solution worked in our cluster initially. However, as the number of Spark applications in our cluster continues to increase, we saw the RPC request (SASL authentication specifically) timeout issue again: {noformat} java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task. 
at org.spark-project.guava.base.Throwables.propagate(Throwables.java:160) at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278) at org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181) at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141) at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218) {noformat} After further investigation, we realized that as the number of concurrent clients connecting to a shuffle service increases, it becomes _VERY_ important to configure the number of Netty server threads and the number of chunk fetch handler threads correctly. Specifically, the number of Netty server threads needs to be a multiple of the number of chunk fetch handler threads. The reason is explained in detail below: When a channel is established on the Netty server, it is registered with both the Netty server default EventLoopGroup and the chunk fetch handler EventLoopGroup. Once registered, this channel sticks with a given thread in both EventLoopGroups, i.e. all requests from this channel are going to be handled by the same thread. Right now, Spark shuffle Netty server uses the default Netty strategy to select a thread from an EventLoopGroup to be associated with a new channel, which is simply round-robin (Netty's DefaultEventExecutorChooserFactory). In SPARK-24355, with the introduced chunk fetch handler thread pool, all chunk fetch requests from a given channel will be first added to the task queue of the chunk fetch handler thread associated with that channel. 
When the requests get processed, the chunk fetch request handler thread will submit a task to the task queue of the Netty server thread that's also associated with this channel. If the number of Netty server threads is not a multiple of the number of chunk fetch handler threads, it becomes a problem when the server has a large number of concurrent connections. Assume we configure the number of Netty server threads as 40 and the percentage of chunk fetch handler threads as 87, which leads to 35 chunk fetch handler threads. Then according to the round-robin policy, channels 0, 40, 80, 120, 160, 200, 240, and 280 will all be associated with the 1st Netty server thread in the default EventLoopGroup. However, since the chunk fetch handler thread pool only has 35 threads, out of these 8 channels, only channels 0 and 280 will be associated with the same chunk fetch handler thread. Thus, channels 0, 40, 80, 120, 160, 200, and 240 will all be associated with different chunk fetch handler threads but with the same Netty server thread. This means the 7 different chunk fetch handler threads associated with these channels could potentially submit tasks to the task queue of the same Netty server thread at the same time. This would lead to 7 slow-to-process requests sitting in the task queue. If an RPC request is put in the task queue after these 7 requests, it is very likely to time out. In our cluster, the number of concurrent active connections to a shuffle service could go as high as 6K+ during peak. If the sizes of these thread pools are not configured correctly, our Spark applications are guaranteed to see SASL timeout issues when a shuffle service is dealing with a lot of incoming chunk fetch requests from many distinct clients, which lea
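The channel arithmetic in the description is plain modular arithmetic and can be checked directly: with round-robin assignment, channel c lands on server thread c % 40 and handler thread c % 35. A small sketch (the channel numbering is illustrative, not Netty code):

```java
import java.util.TreeSet;

public class RoundRobinSketch {
    public static void main(String[] args) {
        int serverThreads = 40;   // Netty server EventLoopGroup size
        int handlerThreads = 35;  // 87% of 40, per the example above

        // Channels 0, 40, 80, ..., 280 all round-robin onto server thread 0.
        TreeSet<Integer> handlers = new TreeSet<>();
        int channels = 0;
        for (int channel = 0; channel <= 280; channel += serverThreads) {
            channels++;
            handlers.add(channel % handlerThreads);
        }
        // 8 channels share one server thread but spread over 7 distinct handler
        // threads (only 0 and 280 collide, since 280 % 35 == 0), so up to 7
        // handlers can enqueue slow chunk fetch work on that one server thread
        // at the same time. With handlerThreads = 20 (40 is a multiple of it),
        // the set would collapse to a single handler thread.
        System.out.println(channels + " channels, " + handlers.size() + " handler threads");
    }
}
```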
[jira] [Created] (SPARK-29205) Pyspark tests failed for suspected performance problem on ARM
zhao bo created SPARK-29205: --- Summary: Pyspark tests failed for suspected performance problem on ARM Key: SPARK-29205 URL: https://issues.apache.org/jira/browse/SPARK-29205 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.0.0 Environment: OS: Ubuntu16.04 Arch: aarch64 Host: Virtual Machine Reporter: zhao bo We tested PySpark on an ARM VM and found some test failures. Once we changed the source code to extend the wait time, so that those test tasks had time to finish, the tests passed. The affected test cases include: pyspark.mllib.tests.test_streaming_algorithms:StreamingLinearRegressionWithTests.test_parameter_convergence pyspark.mllib.tests.test_streaming_algorithms:StreamingLogisticRegressionWithSGDTests.test_convergence pyspark.mllib.tests.test_streaming_algorithms:StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy pyspark.mllib.tests.test_streaming_algorithms:StreamingLogisticRegressionWithSGDTests.test_training_and_prediction The error messages for the above test failures: == FAIL: test_parameter_convergence (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests) Test that the model parameters improve with streaming data. 
-- Traceback (most recent call last): File "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 429, in test_parameter_convergence self._eventually(condition, catch_assertions=True) File "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 74, in _eventually raise lastValue File "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 65, in _eventually lastValue = condition() File "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 425, in condition self.assertEqual(len(model_weights), len(batches)) AssertionError: 6 != 10 == FAIL: test_convergence (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) -- Traceback (most recent call last): File "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 292, in test_convergence self._eventually(condition, 60.0, catch_assertions=True) File "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 74, in _eventually raise lastValue File "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 65, in _eventually lastValue = condition() File "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 288, in condition self.assertEqual(len(models), len(input_batches)) AssertionError: 19 != 20 == FAIL: test_parameter_accuracy (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) -- Traceback (most recent call last): File "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 266, in test_parameter_accuracy self._eventually(condition, catch_assertions=True) File "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 74, in _eventually raise lastValue File 
"/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 65, in _eventually lastValue = condition() File "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 263, in condition self.assertAlmostEqual(rel, 0.1, 1) AssertionError: 0.21309223935797794 != 0.1 within 1 places == FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) Test that the model improves on toy data with no. of batches -- Traceback (most recent call last): File "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 367, in test_training_and_prediction self._eventually(condition, timeout=60.0) File "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 78, in _eventually % (timeout, lastValue)) AssertionError: Test failed due to timeout after 60 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71,
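The `_eventually` helper that all of these tests rely on polls a condition until a timeout expires. A minimal sketch of the pattern (names and timings are illustrative, close to but not exactly PySpark's implementation) shows why a fixed timeout is fragile on slow hosts:

```python
import time

def eventually(condition, timeout=30.0, catch_assertions=False):
    """Poll condition() until it returns True or `timeout` seconds elapse.

    Sketch of the retry helper used by the streaming tests above: on a
    slow host the streaming batches may not all be processed before the
    timeout, so the last failing value is raised, as in the tracebacks.
    """
    start = time.time()
    last_value = None
    while time.time() - start < timeout:
        if catch_assertions:
            try:
                last_value = condition()
            except AssertionError as e:
                last_value = e
        else:
            last_value = condition()
        if last_value is True:
            return
        time.sleep(0.01)
    if isinstance(last_value, AssertionError):
        raise last_value
    raise AssertionError(
        "Test failed due to timeout after %g sec, with last condition "
        "returning: %s" % (timeout, last_value))

# A condition that only becomes true after a few polls succeeds in time:
state = {"n": 0}
def cond():
    state["n"] += 1
    return state["n"] >= 3

eventually(cond, timeout=5.0)
```

Extending the timeout, as the reporter did, simply gives the condition more polls before the final AssertionError is raised.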
[jira] [Updated] (SPARK-29016) Update LICENSE and NOTICE for Hive 2.3
[ https://issues.apache.org/jira/browse/SPARK-29016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29016: Summary: Update LICENSE and NOTICE for Hive 2.3 (was: Update, fix LICENSE and NOTICE for Hive 2.3) > Update LICENSE and NOTICE for Hive 2.3 > -- > > Key: SPARK-29016 > URL: https://issues.apache.org/jira/browse/SPARK-29016 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Hive 2.3 newly added jars: > {noformat} > dropwizard-metrics-hadoop-metrics2-reporter.jar > HikariCP-2.5.1.jar > hive-common-2.3.6.jar > hive-llap-common-2.3.6.jar > hive-serde-2.3.6.jar > hive-service-rpc-2.3.6.jar > hive-shims-0.23-2.3.6.jar > hive-shims-2.3.6.jar > hive-shims-common-2.3.6.jar > hive-shims-scheduler-2.3.6.jar > hive-storage-api-2.6.0.jar > hive-vector-code-gen-2.3.6.jar > javax.jdo-3.2.0-m3.jar > json-1.8.jar > transaction-api-1.1.jar > velocity-1.5.jar > {noformat} > More details: https://github.com/apache/spark/pull/21640#discussion_r321777658 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29197) Remove saveModeForDSV2 in DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-29197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935498#comment-16935498 ] Burak Yavuz commented on SPARK-29197: - Hi Ido, Thanks for your interest. I already have a PR open for it. It's not going to be a straightforward task, and it requires some context and future plans around DataSource V2 as well. > Remove saveModeForDSV2 in DataFrameWriter > - > > Key: SPARK-29197 > URL: https://issues.apache.org/jira/browse/SPARK-29197 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Blocker > > It is very confusing that the default save mode is different between the > internal implementation of a Data source. The reason that we had to have > saveModeForDSV2 was that there was no easy way to check the existence of a > Table in DataSource v2. Now, we have catalogs for that. Therefore we should > be able to remove the different save modes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29204) Remove `Spark Release` Jenkins tab and its four jobs
[ https://issues.apache.org/jira/browse/SPARK-29204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935433#comment-16935433 ] Dongjoon Hyun commented on SPARK-29204: --- Thanks. Specifically, it's about deleting the view and the four jobs together; the following. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/job/spark-release-docs/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/job/spark-release-package/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/job/spark-release-publish/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/job/spark-release-tag/ > Remove `Spark Release` Jenkins tab and its four jobs > > > Key: SPARK-29204 > URL: https://issues.apache.org/jira/browse/SPARK-29204 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Attachments: Spark Release Jobs.png > > > Since last two years, we didn't use `Spark Release` Jenkins jobs. Although we > keep them until now, it already became outdated because we are using Docker > `spark-rm` image. > !Spark Release Jobs.png! > - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/ > We had better remove them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29204) Remove `Spark Release` Jenkins tab and its four jobs
[ https://issues.apache.org/jira/browse/SPARK-29204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935432#comment-16935432 ] Sean Owen commented on SPARK-29204: --- Is this just a matter of deleting the view? I have permissions to delete it. I'm OK with doing so. > Remove `Spark Release` Jenkins tab and its four jobs > > > Key: SPARK-29204 > URL: https://issues.apache.org/jira/browse/SPARK-29204 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Attachments: Spark Release Jobs.png > > > Since last two years, we didn't use `Spark Release` Jenkins jobs. Although we > keep them until now, it already became outdated because we are using Docker > `spark-rm` image. > !Spark Release Jobs.png! > - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/ > We had better remove them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29084) Check method bytecode size in BenchmarkQueryTest
[ https://issues.apache.org/jira/browse/SPARK-29084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29084. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25788 [https://github.com/apache/spark/pull/25788] > Check method bytecode size in BenchmarkQueryTest > > > Key: SPARK-29084 > URL: https://issues.apache.org/jira/browse/SPARK-29084 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29084) Check method bytecode size in BenchmarkQueryTest
[ https://issues.apache.org/jira/browse/SPARK-29084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29084: - Assignee: Takeshi Yamamuro > Check method bytecode size in BenchmarkQueryTest > > > Key: SPARK-29084 > URL: https://issues.apache.org/jira/browse/SPARK-29084 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29204) Remove `Spark Release` Jenkins tab and its four jobs
[ https://issues.apache.org/jira/browse/SPARK-29204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935423#comment-16935423 ] Dongjoon Hyun commented on SPARK-29204: --- cc [~shaneknapp] and [~srowen] > Remove `Spark Release` Jenkins tab and its four jobs > > > Key: SPARK-29204 > URL: https://issues.apache.org/jira/browse/SPARK-29204 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Attachments: Spark Release Jobs.png > > > Since last two years, we didn't use `Spark Release` Jenkins jobs. Although we > keep them until now, it already became outdated because we are using Docker > `spark-rm` image. > !Spark Release Jobs.png! > - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/ > We had better remove them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29204) Remove `Spark Release` Jenkins tab and its four jobs
[ https://issues.apache.org/jira/browse/SPARK-29204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29204: -- Description: Since last two years, we didn't use `Spark Release` Jenkins jobs. Although we keep them until now, it already became outdated because we are using Docker `spark-rm` image. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/ We had better remove them. was: Since last two years, we didn't use `Spark Release` Jenkins jobs. It already became outdated because we are using Docker `spark-rm` image. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/ > Remove `Spark Release` Jenkins tab and its four jobs > > > Key: SPARK-29204 > URL: https://issues.apache.org/jira/browse/SPARK-29204 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Attachments: Spark Release Jobs.png > > > Since last two years, we didn't use `Spark Release` Jenkins jobs. Although we > keep them until now, it already became outdated because we are using Docker > `spark-rm` image. > - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/ > We had better remove them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29204) Remove `Spark Release` Jenkins tab and its four jobs
[ https://issues.apache.org/jira/browse/SPARK-29204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29204: -- Description: Since last two years, we didn't use `Spark Release` Jenkins jobs. Although we keep them until now, it already became outdated because we are using Docker `spark-rm` image. !Spark Release Jobs.png! - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/ We had better remove them. was: Since last two years, we didn't use `Spark Release` Jenkins jobs. Although we keep them until now, it already became outdated because we are using Docker `spark-rm` image. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/ We had better remove them. > Remove `Spark Release` Jenkins tab and its four jobs > > > Key: SPARK-29204 > URL: https://issues.apache.org/jira/browse/SPARK-29204 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Attachments: Spark Release Jobs.png > > > Since last two years, we didn't use `Spark Release` Jenkins jobs. Although we > keep them until now, it already became outdated because we are using Docker > `spark-rm` image. > !Spark Release Jobs.png! > - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/ > We had better remove them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29204) Remove `Spark Release` Jenkins tab and its four jobs
Dongjoon Hyun created SPARK-29204: - Summary: Remove `Spark Release` Jenkins tab and its four jobs Key: SPARK-29204 URL: https://issues.apache.org/jira/browse/SPARK-29204 Project: Spark Issue Type: Task Components: Project Infra Affects Versions: 3.0.0 Reporter: Dongjoon Hyun Attachments: Spark Release Jobs.png For the last two years, we haven't used the `Spark Release` Jenkins jobs. They have already become outdated because we are using the Docker `spark-rm` image. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29204) Remove `Spark Release` Jenkins tab and its four jobs
[ https://issues.apache.org/jira/browse/SPARK-29204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29204: -- Attachment: Spark Release Jobs.png > Remove `Spark Release` Jenkins tab and its four jobs > > > Key: SPARK-29204 > URL: https://issues.apache.org/jira/browse/SPARK-29204 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Attachments: Spark Release Jobs.png > > > Since last two years, we didn't use `Spark Release` Jenkins jobs. It already > became outdated because we are using Docker `spark-rm` image. > - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28599) Fix `Execution Time` and `Duration` column sorting for ThriftServerSessionPage
[ https://issues.apache.org/jira/browse/SPARK-28599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28599. --- Fix Version/s: 3.0.0 Assignee: Yuming Wang Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/25892 > Fix `Execution Time` and `Duration` column sorting for ThriftServerSessionPage > -- > > Key: SPARK-28599 > URL: https://issues.apache.org/jira/browse/SPARK-28599 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Minor > Fix For: 3.0.0 > > > Support sorting `Execution Time` and `Duration` columns for > ThriftServerSessionPage -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28599) Fix wrong column sorting in ThriftServerSessionPage
[ https://issues.apache.org/jira/browse/SPARK-28599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28599: -- Summary: Fix wrong column sorting in ThriftServerSessionPage (was: Support sorting for ThriftServerSessionPage) > Fix wrong column sorting in ThriftServerSessionPage > --- > > Key: SPARK-28599 > URL: https://issues.apache.org/jira/browse/SPARK-28599 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Minor > > Support sorting `Execution Time` and `Duration` columns for > ThriftServerSessionPage -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28599) Fix `Execution Time` and `Duration` column sorting for ThriftServerSessionPage
[ https://issues.apache.org/jira/browse/SPARK-28599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28599: -- Summary: Fix `Execution Time` and `Duration` column sorting for ThriftServerSessionPage (was: Fix wrong column sorting in ThriftServerSessionPage) > Fix `Execution Time` and `Duration` column sorting for ThriftServerSessionPage > -- > > Key: SPARK-28599 > URL: https://issues.apache.org/jira/browse/SPARK-28599 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Minor > > Support sorting `Execution Time` and `Duration` columns for > ThriftServerSessionPage -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28599) Support sorting for ThriftServerSessionPage
[ https://issues.apache.org/jira/browse/SPARK-28599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28599: -- Issue Type: Bug (was: Improvement) > Support sorting for ThriftServerSessionPage > --- > > Key: SPARK-28599 > URL: https://issues.apache.org/jira/browse/SPARK-28599 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Support sorting `Execution Time` and `Duration` columns for > ThriftServerSessionPage -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28599) Support sorting for ThriftServerSessionPage
[ https://issues.apache.org/jira/browse/SPARK-28599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28599: -- Priority: Minor (was: Major) > Support sorting for ThriftServerSessionPage > --- > > Key: SPARK-28599 > URL: https://issues.apache.org/jira/browse/SPARK-28599 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Minor > > Support sorting `Execution Time` and `Duration` columns for > ThriftServerSessionPage -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29191) Add tag ExtendedSQLTest for SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-29191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29191: - Assignee: Dongjoon Hyun > Add tag ExtendedSQLTest for SQLQueryTestSuite > - > > Key: SPARK-29191 > URL: https://issues.apache.org/jira/browse/SPARK-29191 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29191) Add tag ExtendedSQLTest for SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-29191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29191. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25872 [https://github.com/apache/spark/pull/25872] > Add tag ExtendedSQLTest for SQLQueryTestSuite > - > > Key: SPARK-29191 > URL: https://issues.apache.org/jira/browse/SPARK-29191 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29162) Simplify NOT(isnull(x)) and NOT(isnotnull(x))
[ https://issues.apache.org/jira/browse/SPARK-29162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29162. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25878 [https://github.com/apache/spark/pull/25878] > Simplify NOT(isnull(x)) and NOT(isnotnull(x)) > - > > Key: SPARK-29162 > URL: https://issues.apache.org/jira/browse/SPARK-29162 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Assignee: angerszhu >Priority: Major > Fix For: 3.0.0 > > > I propose the following expression rewrite optimizations: > {code} > NOT isnull(x) -> isnotnull(x) > NOT isnotnull(x) -> isnull(x) > {code} > This might seem contrived, but I saw negated versions of these expressions > appear in a user-written query after that query had undergone optimization. > For example: > {code} > spark.createDataset(Seq[(String, java.lang.Boolean)](("true", true), > ("false", false), ("null", null))).write.parquet("/tmp/bools") > spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == > false)").explain > spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == > false)").explain(true) > == Parsed Logical Plan == > 'Filter NOT ('isnull('_2) OR ('_2 = false)) > +- RelationV2[_1#4, _2#5] parquet file:/tmp/bools > == Analyzed Logical Plan == > _1: string, _2: boolean > Filter NOT (isnull(_2#5) OR (_2#5 = false)) > +- RelationV2[_1#4, _2#5] parquet file:/tmp/bools > == Optimized Logical Plan == > Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false)) > +- RelationV2[_1#4, _2#5] parquet file:/tmp/bools > == Physical Plan == > *(1) Project [_1#4, _2#5] > +- *(1) Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false)) >+- *(1) ColumnarToRow > +- BatchScan[_1#4, _2#5] ParquetScan Location: > InMemoryFileIndex[file:/tmp/bools], ReadSchema: struct<_1:string,_2:boolean> > {code} > This rewrite is also useful for query canonicalization. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
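The proposed rewrite is easy to prototype outside Catalyst. The sketch below applies the two rules to nested tuples standing in for expression trees; it is illustrative only, and the names are not Spark's (the real rules live in the optimizer):

```python
# Toy version of the proposed rewrite:
#   NOT isnull(x)    -> isnotnull(x)
#   NOT isnotnull(x) -> isnull(x)
# Expressions are nested tuples like ("not", ("isnull", "_2")).

def simplify_not(expr):
    """Recursively rewrite NOT(isnull(x)) and NOT(isnotnull(x))."""
    if isinstance(expr, tuple) and expr[0] == "not":
        child = simplify_not(expr[1])
        if isinstance(child, tuple):
            if child[0] == "isnull":
                return ("isnotnull", child[1])
            if child[0] == "isnotnull":
                return ("isnull", child[1])
        return ("not", child)
    if isinstance(expr, tuple):
        # Recurse into children of other operators (AND, OR, ...).
        return (expr[0],) + tuple(simplify_not(a) for a in expr[1:])
    return expr

assert simplify_not(("not", ("isnull", "_2"))) == ("isnotnull", "_2")
assert simplify_not(("not", ("isnotnull", "_2"))) == ("isnull", "_2")
```

Applied to the optimized plan above, the filter `isnotnull(_2) AND NOT isnull(_2)` would collapse to two identical `isnotnull(_2)` conjuncts, which existing simplification rules can then deduplicate.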
[jira] [Assigned] (SPARK-29162) Simplify NOT(isnull(x)) and NOT(isnotnull(x))
[ https://issues.apache.org/jira/browse/SPARK-29162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29162: - Assignee: angerszhu > Simplify NOT(isnull(x)) and NOT(isnotnull(x)) > - > > Key: SPARK-29162 > URL: https://issues.apache.org/jira/browse/SPARK-29162 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Assignee: angerszhu >Priority: Major > > I propose the following expression rewrite optimizations: > {code} > NOT isnull(x) -> isnotnull(x) > NOT isnotnull(x) -> isnull(x) > {code} > This might seem contrived, but I saw negated versions of these expressions > appear in a user-written query after that query had undergone optimization. > For example: > {code} > spark.createDataset(Seq[(String, java.lang.Boolean)](("true", true), > ("false", false), ("null", null))).write.parquet("/tmp/bools") > spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == > false)").explain > spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == > false)").explain(true) > == Parsed Logical Plan == > 'Filter NOT ('isnull('_2) OR ('_2 = false)) > +- RelationV2[_1#4, _2#5] parquet file:/tmp/bools > == Analyzed Logical Plan == > _1: string, _2: boolean > Filter NOT (isnull(_2#5) OR (_2#5 = false)) > +- RelationV2[_1#4, _2#5] parquet file:/tmp/bools > == Optimized Logical Plan == > Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false)) > +- RelationV2[_1#4, _2#5] parquet file:/tmp/bools > == Physical Plan == > *(1) Project [_1#4, _2#5] > +- *(1) Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false)) >+- *(1) ColumnarToRow > +- BatchScan[_1#4, _2#5] ParquetScan Location: > InMemoryFileIndex[file:/tmp/bools], ReadSchema: struct<_1:string,_2:boolean> > {code} > This rewrite is also useful for query canonicalization. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29016) Update, fix LICENSE and NOTICE for Hive 2.3
[ https://issues.apache.org/jira/browse/SPARK-29016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29016: Description: Hive 2.3 newly added jars: {noformat} dropwizard-metrics-hadoop-metrics2-reporter.jar HikariCP-2.5.1.jar hive-common-2.3.6.jar hive-llap-common-2.3.6.jar hive-serde-2.3.6.jar hive-service-rpc-2.3.6.jar hive-shims-0.23-2.3.6.jar hive-shims-2.3.6.jar hive-shims-common-2.3.6.jar hive-shims-scheduler-2.3.6.jar hive-storage-api-2.6.0.jar hive-vector-code-gen-2.3.6.jar javax.jdo-3.2.0-m3.jar json-1.8.jar transaction-api-1.1.jar velocity-1.5.jar {noformat} More details: https://github.com/apache/spark/pull/21640#discussion_r321777658 was:More details: https://github.com/apache/spark/pull/21640#discussion_r321777658 > Update, fix LICENSE and NOTICE for Hive 2.3 > --- > > Key: SPARK-29016 > URL: https://issues.apache.org/jira/browse/SPARK-29016 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Hive 2.3 newly added jars: > {noformat} > dropwizard-metrics-hadoop-metrics2-reporter.jar > HikariCP-2.5.1.jar > hive-common-2.3.6.jar > hive-llap-common-2.3.6.jar > hive-serde-2.3.6.jar > hive-service-rpc-2.3.6.jar > hive-shims-0.23-2.3.6.jar > hive-shims-2.3.6.jar > hive-shims-common-2.3.6.jar > hive-shims-scheduler-2.3.6.jar > hive-storage-api-2.6.0.jar > hive-vector-code-gen-2.3.6.jar > javax.jdo-3.2.0-m3.jar > json-1.8.jar > transaction-api-1.1.jar > velocity-1.5.jar > {noformat} > More details: https://github.com/apache/spark/pull/21640#discussion_r321777658 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28001) Dataframe throws 'socket.timeout: timed out' exception
[ https://issues.apache.org/jira/browse/SPARK-28001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935321#comment-16935321 ] Ido Michael commented on SPARK-28001: - I can take a look. Can you please also post the dataset? Ido > Dataframe throws 'socket.timeout: timed out' exception > -- > > Key: SPARK-28001 > URL: https://issues.apache.org/jira/browse/SPARK-28001 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 > Environment: Processor: Intel Core i7-7700 CPU @ 3.60Ghz > RAM: 16 GB > OS: Windows 10 Enterprise 64-bit > Python: 3.7.2 > PySpark: 3.4.3 > Cluster manager: Spark Standalone >Reporter: Marius Stanescu >Priority: Critical > > I load data from Azure Table Storage, create a DataFrame and perform a couple > of operations via two user-defined functions, then call show() to display the > results. If I load a very small batch of items, like 5, everything works > fine, but if I load a batch greater than 10 items from Azure Table Storage > then I get the 'socket.timeout: timed out' exception. 
> Here is the code:
> {code}
> import time
> import json
> import requests
> from requests.auth import HTTPBasicAuth
> from azure.cosmosdb.table.tableservice import TableService
> from azure.cosmosdb.table.models import Entity
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import udf, struct
> from pyspark.sql.types import BooleanType
>
> def main():
>     batch_size = 25
>     azure_table_account_name = '***'
>     azure_table_account_key = '***'
>     azure_table_name = '***'
>
>     spark = SparkSession \
>         .builder \
>         .appName(agent_name) \
>         .config("spark.sql.crossJoin.enabled", "true") \
>         .getOrCreate()
>
>     table_service = TableService(account_name=azure_table_account_name, account_key=azure_table_account_key)
>     continuation_token = None
>
>     while True:
>         messages = table_service.query_entities(
>             azure_table_name,
>             select="RowKey, PartitionKey, messageId, ownerSmtp, Timestamp",
>             num_results=batch_size,
>             marker=continuation_token,
>             timeout=60)
>         continuation_token = messages.next_marker
>         messages_list = list(messages)
>
>         if not len(messages_list):
>             time.sleep(5)
>             continue  # no new messages yet; poll again
>
>         messages_df = spark.createDataFrame(messages_list)
>
>         register_records_df = messages_df \
>             .withColumn('Registered', register_record('RowKey', 'PartitionKey', 'messageId', 'ownerSmtp', 'Timestamp'))
>
>         only_registered_records_df = register_records_df \
>             .filter(register_records_df.Registered == True) \
>             .drop(register_records_df.Registered)
>
>         update_message_status_df = only_registered_records_df \
>             .withColumn('TableEntryDeleted', delete_table_entity('RowKey', 'PartitionKey'))
>
>         results_df = update_message_status_df.select(
>             update_message_status_df.RowKey,
>             update_message_status_df.PartitionKey,
>             update_message_status_df.TableEntryDeleted)
>
>         #results_df.explain()
>         results_df.show(n=batch_size, truncate=False)
>
> @udf(returnType=BooleanType())
> def register_record(rowKey, partitionKey, messageId, ownerSmtp, timestamp):
>     # call an API
>     try:
>         url = '{}/data/record/{}'.format('***', rowKey)
>         headers = { 'Content-type': 'application/json' }
>         response = requests.post(
>             url,
>             headers=headers,
>             auth=HTTPBasicAuth('***', '***'),
>             data=prepare_record_data(rowKey, partitionKey, messageId, ownerSmtp, timestamp))
>
>         return bool(response)
>     except:
>         return False
>
> def prepare_record_data(rowKey, partitionKey, messageId, ownerSmtp, timestamp):
>     record_data = {
>         "Title": messageId,
>         "Type": '***',
>         "Source": '***',
>         "Creator": ownerSmtp,
>         "Publisher": '***',
>         "Date": timestamp.strftime('%Y-%m-%dT%H:%M:%SZ')
>     }
>     return json.dumps(record_data)
>
> @udf(returnType=BooleanType())
> def delete_table_entity(row_key, partition_key):
>     azure_table_account_name = '***'
>     azure_table_account_key = '***'
>     azure_table_name = '***'
>     try:
>         table_service = TableService(account_name=azure_table_account_name, account_key=azure_table_account_key)
>         table_service.delete_entity(azure_table_name, partition_key, row_key)
>         return True
>     except:
>         return False
>
> if __name__ == "__main__":
>     main()
> {code}
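The delete_table_entity UDF in the report constructs a new TableService for every row, and register_record issues one HTTP request per row, so larger batches multiply per-row connection overhead, which is one plausible contributor to the worker timing out. A minimal sketch of the usual mitigation, creating the client once per partition rather than once per row (the client here is a stub standing in for TableService, so this runs without Spark or Azure):

```python
# Hypothetical sketch, not from the ticket: amortize expensive client setup
# across a whole partition of rows instead of paying it on every row.

def make_client():
    # stand-in for an expensive connection, e.g. TableService(...)
    return {"calls": 0}

def delete_entity(client, row):
    # stand-in for table_service.delete_entity(...)
    client["calls"] += 1
    return True

def process_partition(rows):
    client = make_client()  # one client per partition, not per row
    for row in rows:
        yield (row, delete_entity(client, row))

# With Spark this would be df.rdd.mapPartitions(process_partition);
# here we drive it with a plain iterator.
results = list(process_partition(iter(["r1", "r2", "r3"])))
```

The same pattern applies to the HTTP calls: a requests.Session created per partition reuses connections instead of opening one per row.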
[jira] [Commented] (SPARK-29197) Remove saveModeForDSV2 in DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-29197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935320#comment-16935320 ] Ido Michael commented on SPARK-29197: - I can take it. I could find the final class DataFrameWriter in spark-sql, but I couldn't find saveModeForDSV2 in it. Where is it? Do I need to remove all of the save modes? Ido > Remove saveModeForDSV2 in DataFrameWriter > - > > Key: SPARK-29197 > URL: https://issues.apache.org/jira/browse/SPARK-29197 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Blocker > > It is very confusing that the default save mode differs between the > internal implementations of a data source. The reason that we had to have > saveModeForDSV2 was that there was no easy way to check the existence of a > table in DataSource v2. Now, we have catalogs for that. Therefore we should > be able to remove the different save modes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
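The ticket's reasoning, that save-mode handling reduces to a table-existence check once a catalog can answer it, can be sketched as a pure function. This is an illustration of the four classic SaveMode semantics, not Spark's actual resolution code:

```python
# Sketch (assumption, not Spark source): SaveMode semantics expressed as a
# decision over table existence, the check DataSource v2 catalogs now allow.

def resolve_save(mode, table_exists):
    """Return the physical action for a save, or raise like ErrorIfExists."""
    if mode == "append":
        return "append" if table_exists else "create"
    if mode == "overwrite":
        return "replace" if table_exists else "create"
    if mode == "ignore":
        return "noop" if table_exists else "create"
    if mode == "errorifexists":
        if table_exists:
            raise ValueError("table already exists")
        return "create"
    raise ValueError("unknown mode: " + mode)
```

With this view there is no room for a separate DSV2 default: both code paths can share one mode and one existence check.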
[jira] [Commented] (SPARK-29157) DataSourceV2: Add DataFrameWriterV2 to Python API
[ https://issues.apache.org/jira/browse/SPARK-29157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935307#comment-16935307 ] Ido Michael commented on SPARK-29157: - It looks like https://issues.apache.org/jira/browse/SPARK-28612 was resolved. What is there to do here? Ido > DataSourceV2: Add DataFrameWriterV2 to Python API > - > > Key: SPARK-29157 > URL: https://issues.apache.org/jira/browse/SPARK-29157 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > > After SPARK-28612 is committed, we need to add the corresponding PySpark API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
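SPARK-28612 added the Scala DataFrameWriterV2, a fluent builder reached via writeTo; this ticket asks for the PySpark mirror. A rough stub of what that builder surface looks like (method names follow the Scala API, but the Python shape here is an assumption, and a real implementation would delegate to the JVM writer rather than record calls):

```python
# Hypothetical stub of the DataFrameWriterV2 builder surface
# (writeTo(...).using(...).partitionedBy(...).create() / .append()).

class DataFrameWriterV2Stub:
    def __init__(self, table):
        self.table = table
        self.provider = None
        self.partitioning = []

    def using(self, provider):
        self.provider = provider
        return self  # fluent: each setter returns the builder

    def partitionedBy(self, *cols):
        self.partitioning = list(cols)
        return self

    def create(self):
        return ("create", self.table, self.provider, self.partitioning)

    def append(self):
        return ("append", self.table)

w = DataFrameWriterV2Stub("catalog.db.events").using("parquet").partitionedBy("day")
action = w.create()
```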
[jira] [Commented] (SPARK-29166) Add parameters to limit the number of dynamic partitions for data source table
[ https://issues.apache.org/jira/browse/SPARK-29166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935306#comment-16935306 ] Ido Michael commented on SPARK-29166: - I can take this if no one has started working on it. Ido > Add parameters to limit the number of dynamic partitions for data source table > -- > > Key: SPARK-29166 > URL: https://issues.apache.org/jira/browse/SPARK-29166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Dynamic partitioning in Hive tables has some restrictions that limit the max number > of partitions. See > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-DynamicPartitionInserts > This is very useful for preventing mistaken partitions (for example, partitioning by an ID column), and it also > protects the NameNode from mass partition-creation RPC calls. > Data source tables also need a similar limitation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
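Hive enforces this through settings such as hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. The guard the ticket proposes for data source tables amounts to counting distinct partition values before writing and failing fast; a sketch with a hypothetical limit name, runnable without Spark:

```python
# Sketch of a dynamic-partition guard (config name is hypothetical; the Hive
# analogues are hive.exec.max.dynamic.partitions[.pernode]): fail before the
# write instead of flooding the NameNode with partition-creation calls.

MAX_DYNAMIC_PARTITIONS = 1000  # hypothetical configured limit

def check_dynamic_partitions(partition_values, limit=MAX_DYNAMIC_PARTITIONS):
    n = len(set(partition_values))
    if n > limit:
        raise RuntimeError(
            "dynamic partition count %d exceeds limit %d" % (n, limit))
    return n
```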
[jira] [Updated] (SPARK-28599) Support sorting for ThriftServerSessionPage
[ https://issues.apache.org/jira/browse/SPARK-28599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28599: Description: Support sorting `Execution Time` and `Duration` columns for ThriftServerSessionPage (was: SQLTab support pagination and sorting, but ThriftServerPage missing this feature.) > Support sorting for ThriftServerSessionPage > --- > > Key: SPARK-28599 > URL: https://issues.apache.org/jira/browse/SPARK-28599 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Support sorting `Execution Time` and `Duration` columns for > ThriftServerSessionPage -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29201) Add Hadoop 2.6 combination to GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-29201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29201. -- Fix Version/s: 2.4.5 Resolution: Fixed Issue resolved by pull request 25886 [https://github.com/apache/spark/pull/25886] > Add Hadoop 2.6 combination to GitHub Action > --- > > Key: SPARK-29201 > URL: https://issues.apache.org/jira/browse/SPARK-29201 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 2.4.5 >Reporter: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.5 > > > This adds `Hadoop 2.6` combination to `branch-2.4` GitHub Action -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28599) Support sorting for ThriftServerSessionPage
[ https://issues.apache.org/jira/browse/SPARK-28599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28599: Summary: Support sorting for ThriftServerSessionPage (was: Support pagination and sorting for ThriftServerPage) > Support sorting for ThriftServerSessionPage > --- > > Key: SPARK-28599 > URL: https://issues.apache.org/jira/browse/SPARK-28599 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > SQLTab supports pagination and sorting, but ThriftServerPage is missing this > feature. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29203) Reduce shuffle partitions in SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-29203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29203: Description: spark.sql.shuffle.partitions=200(default): {noformat} [info] - subquery/in-subquery/in-joins.sql (6 minutes, 19 seconds) [info] - subquery/in-subquery/not-in-joins.sql (2 minutes, 17 seconds) [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (45 seconds, 763 milliseconds) {noformat} spark.sql.shuffle.partitions=5: {noformat} [info] - subquery/in-subquery/in-joins.sql (1 minute, 12 seconds) [info] - subquery/in-subquery/not-in-joins.sql (27 seconds, 541 milliseconds) [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (17 seconds, 360 milliseconds) {noformat} > Reduce shuffle partitions in SQLQueryTestSuite > -- > > Key: SPARK-29203 > URL: https://issues.apache.org/jira/browse/SPARK-29203 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > spark.sql.shuffle.partitions=200(default): > {noformat} > [info] - subquery/in-subquery/in-joins.sql (6 minutes, 19 seconds) > [info] - subquery/in-subquery/not-in-joins.sql (2 minutes, 17 seconds) > [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (45 seconds, > 763 milliseconds) > {noformat} > spark.sql.shuffle.partitions=5: > {noformat} > [info] - subquery/in-subquery/in-joins.sql (1 minute, 12 seconds) > [info] - subquery/in-subquery/not-in-joins.sql (27 seconds, 541 milliseconds) > [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (17 seconds, > 360 milliseconds) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
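Quick arithmetic on the timings quoted in the ticket shows why the change is worthwhile: lowering spark.sql.shuffle.partitions from 200 to 5 gives roughly a 2.5x to 5x speedup across the three suites.

```python
# Speedup ratios computed from the benchmark numbers quoted above.

timings = {
    # suite: (seconds at 200 partitions, seconds at 5 partitions)
    "in-joins": (6 * 60 + 19, 1 * 60 + 12),
    "not-in-joins": (2 * 60 + 17, 27.541),
    "scalar-subquery-predicate": (45.763, 17.360),
}

speedups = {name: before / after for name, (before, after) in timings.items()}
```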
[jira] [Created] (SPARK-29203) Reduce shuffle partitions in SQLQueryTestSuite
Yuming Wang created SPARK-29203: --- Summary: Reduce shuffle partitions in SQLQueryTestSuite Key: SPARK-29203 URL: https://issues.apache.org/jira/browse/SPARK-29203 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 3.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28141) Date type can not accept special values
[ https://issues.apache.org/jira/browse/SPARK-28141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28141: Assignee: Maxim Gekk > Date type can not accept special values > --- > > Key: SPARK-28141 > URL: https://issues.apache.org/jira/browse/SPARK-28141 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Maxim Gekk >Priority: Major > > ||Input String||Valid Types||Description|| > |{{epoch}}|{{date}}|1970-01-01 00:00:00+00 (Unix system time zero)| > |{{now}}|{{date}}|current transaction's start time| > |{{today}}|{{date}}|midnight today| > |{{tomorrow}}|{{date}}|midnight tomorrow| > |{{yesterday}}|{{date}}|midnight yesterday| > https://www.postgresql.org/docs/12/datatype-datetime.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
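The special values in the table can be resolved with plain date arithmetic. A sketch, not Spark's DateTimeUtils implementation; in particular, 'now' here is simply the current date rather than the transaction start time PostgreSQL pins it to:

```python
from datetime import date, timedelta

# Illustrative resolver for PostgreSQL's special date input strings.

def special_date(s, today=None):
    today = today or date.today()
    return {
        "epoch": date(1970, 1, 1),   # Unix system time zero
        "now": today,                 # simplification: current date, not txn start
        "today": today,
        "tomorrow": today + timedelta(days=1),
        "yesterday": today - timedelta(days=1),
    }[s.strip().lower()]
```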
[jira] [Resolved] (SPARK-28141) Date type can not accept special values
[ https://issues.apache.org/jira/browse/SPARK-28141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28141. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25708 [https://github.com/apache/spark/pull/25708] > Date type can not accept special values > --- > > Key: SPARK-28141 > URL: https://issues.apache.org/jira/browse/SPARK-28141 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > ||Input String||Valid Types||Description|| > |{{epoch}}|{{date}}|1970-01-01 00:00:00+00 (Unix system time zero)| > |{{now}}|{{date}}|current transaction's start time| > |{{today}}|{{date}}|midnight today| > |{{tomorrow}}|{{date}}|midnight tomorrow| > |{{yesterday}}|{{date}}|midnight yesterday| > https://www.postgresql.org/docs/12/datatype-datetime.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29200) Optimize `extract`/`date_part` for epoch
[ https://issues.apache.org/jira/browse/SPARK-29200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29200. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25881 [https://github.com/apache/spark/pull/25881] > Optimize `extract`/`date_part` for epoch > > > Key: SPARK-29200 > URL: https://issues.apache.org/jira/browse/SPARK-29200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > The method `DateTimeUtils.getEpoch()` can be sped up by avoiding decimal > operations and converting the time-zone-shifted timestamp to decimal at the > end. Results of `ExtractBenchmark` should be regenerated as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
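The optimization described, staying in integer microseconds for the time-zone shift and converting to decimal only once at the end, can be sketched in plain Python (names and the exact shift are illustrative, not the Scala DateTimeUtils code):

```python
from decimal import Decimal

# Illustrative version of the getEpoch optimization: all arithmetic in
# integer microseconds, with a single Decimal conversion at the end.

MICROS_PER_SECOND = 1_000_000

def get_epoch(micros_since_epoch, zone_offset_seconds=0):
    # apply the time-zone shift while still in integer micros
    shifted = micros_since_epoch + zone_offset_seconds * MICROS_PER_SECOND
    # one conversion to Decimal at the end preserves the fractional seconds
    return Decimal(shifted) / MICROS_PER_SECOND
```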
[jira] [Assigned] (SPARK-29200) Optimize `extract`/`date_part` for epoch
[ https://issues.apache.org/jira/browse/SPARK-29200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-29200: Assignee: Maxim Gekk > Optimize `extract`/`date_part` for epoch > > > Key: SPARK-29200 > URL: https://issues.apache.org/jira/browse/SPARK-29200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > The method `DateTimeUtils.getEpoch()` can be sped up by avoiding decimal > operations and converting the time-zone-shifted timestamp to decimal at the > end. Results of `ExtractBenchmark` should be regenerated as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org