[jira] [Commented] (SPARK-28121) String Functions: decode/encode can not accept 'escape' and 'hex' as charset

2019-09-22 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935586#comment-16935586
 ] 

jiaan.geng commented on SPARK-28121:


I wrote a test, TransformTest, as follows:
{code:java}
import java.math.BigInteger;

public class TransformTest {
    public static void main(String[] args) {
        String s = "1234567890";
        byte[] bytes = s.getBytes();
        System.out.println("hex: " + binary(bytes, 16));
        System.exit(0);
    }

    public static String binary(byte[] bytes, int radix) {
        // The 1 here means the number is treated as positive.
        return new BigInteger(1, bytes).toString(radix);
    }
}

// the result is 31323334353637383930{code}
The Java code transforms a string to hex; the result is the same as what the encode 
function of PostgreSQL produces.
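
For the reverse direction (what decode(..., 'hex') would need to do), here is a 
minimal sketch. The JVM has no "hex" charset, so the conversion has to be done by 
hand rather than via new String(bytes, "hex"), which throws the 
UnsupportedEncodingException shown in the issue description; the class and helper 
names below are illustrative only:
{code:java}
import java.nio.charset.StandardCharsets;

public class HexDecodeTest {
    // Illustrative helper: parse a hex string back into raw bytes.
    public static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = fromHex("31323334353637383930");
        // prints: 1234567890
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
{code}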

 

> String Functions: decode/encode can not accept 'escape' and 'hex' as charset
> 
>
> Key: SPARK-28121
> URL: https://issues.apache.org/jira/browse/SPARK-28121
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> postgres=# select decode('1234567890','escape');
> decode
> 
> \x31323334353637383930
> (1 row)
> {noformat}
> {noformat}
> spark-sql> select decode('1234567890','escape');
> 19/06/20 01:57:33 ERROR SparkSQLDriver: Failed in [select 
> decode('1234567890','escape')]
> java.io.UnsupportedEncodingException: escape
>   at java.lang.StringCoding.decode(StringCoding.java:190)
>   at java.lang.String.(String.java:426)
>   at java.lang.String.(String.java:491)
> ...
> spark-sql> select decode('ff','hex');
> 19/08/16 21:44:55 ERROR SparkSQLDriver: Failed in [select decode('ff','hex')]
> java.io.UnsupportedEncodingException: hex
>   at java.lang.StringCoding.decode(StringCoding.java:190)
>   at java.lang.String.(String.java:426)
>   at java.lang.String.(String.java:491)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28121) String Functions: decode/encode can not accept 'escape' and 'hex' as charset

2019-09-22 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935582#comment-16935582
 ] 

jiaan.geng commented on SPARK-28121:


[~smilegator] I found that the encode function of PostgreSQL has some issues.

First, according to the description of the encode function, we know this function 
encodes binary data into a textual representation. Supported formats are: 
{{base64}}, {{hex}}, {{escape}}. {{escape}} converts zero bytes and 
high-bit-set bytes to octal sequences ({{\}}_{{nnn}}_) and doubles backslashes.
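
As an illustration of that rule (not PostgreSQL's actual implementation), a minimal 
Java sketch of the described escape encoding could look like this:
{code:java}
// Illustrative sketch of the documented 'escape' rule: zero bytes and
// high-bit-set bytes become \nnn octal sequences, backslashes are doubled,
// and everything else passes through unchanged.
public static String escapeEncode(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
        int u = b & 0xFF;
        if (u == 0 || u >= 0x80) {
            sb.append(String.format("\\%03o", u));
        } else if (u == '\\') {
            sb.append("\\\\");
        } else {
            sb.append((char) u);
        }
    }
    return sb.toString();
}
{code}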

But I find the actual behavior is different:

 
{code:sql}
select encode(E'123//000456'::bytea, 'escape');
select encode('123//000456'::bytea, 'escape');
-- in both cases the result is '123//000456'
{code}
 

It seems to have no effect.

The encode function of MySQL is used to encrypt, and decode is used to decrypt.

The encode function of Vertica is used for comparison.

No other mainstream database implements this function.

 

> String Functions: decode/encode can not accept 'escape' and 'hex' as charset
> 
>
> Key: SPARK-28121
> URL: https://issues.apache.org/jira/browse/SPARK-28121
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> postgres=# select decode('1234567890','escape');
> decode
> 
> \x31323334353637383930
> (1 row)
> {noformat}
> {noformat}
> spark-sql> select decode('1234567890','escape');
> 19/06/20 01:57:33 ERROR SparkSQLDriver: Failed in [select 
> decode('1234567890','escape')]
> java.io.UnsupportedEncodingException: escape
>   at java.lang.StringCoding.decode(StringCoding.java:190)
>   at java.lang.String.(String.java:426)
>   at java.lang.String.(String.java:491)
> ...
> spark-sql> select decode('ff','hex');
> 19/08/16 21:44:55 ERROR SparkSQLDriver: Failed in [select decode('ff','hex')]
> java.io.UnsupportedEncodingException: hex
>   at java.lang.StringCoding.decode(StringCoding.java:190)
>   at java.lang.String.(String.java:426)
>   at java.lang.String.(String.java:491)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29205) Pyspark tests failed for suspected performance problem on ARM

2019-09-22 Thread huangtianhua (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935578#comment-16935578
 ] 

huangtianhua commented on SPARK-29205:
--

We also found a similar issue that the community faced before: 
[https://github.com/apache/spark/commit/ab76900fedc05df7080c9b6c81d65a3f260c1c26#diff-f7e50078760ce2d40f35e4c3b9112227]. 
If we increase the timeout, the tests pass.

> Pyspark tests failed for suspected performance problem on ARM
> -
>
> Key: SPARK-29205
> URL: https://issues.apache.org/jira/browse/SPARK-29205
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: OS: Ubuntu16.04
> Arch: aarch64
> Host: Virtual Machine
>Reporter: zhao bo
>Priority: Major
>
> We tested PySpark on an ARM VM and found some test failures. Once we changed the 
> source code to extend the wait time, to make sure those test tasks had finished, 
> the tests passed.
>  
> The affected test cases include:
> pyspark.mllib.tests.test_streaming_algorithms:StreamingLinearRegressionWithTests.test_parameter_convergence
> pyspark.mllib.tests.test_streaming_algorithms:StreamingLogisticRegressionWithSGDTests.test_convergence
> pyspark.mllib.tests.test_streaming_algorithms:StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy
> pyspark.mllib.tests.test_streaming_algorithms:StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
> The error messages for the above test failures:
> ==
> FAIL: test_parameter_convergence 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
> Test that the model parameters improve with streaming data.
> --
> Traceback (most recent call last):
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 429, in test_parameter_convergence
>     self._eventually(condition, catch_assertions=True)
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 74, in _eventually
>     raise lastValue
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 65, in _eventually
>     lastValue = condition()
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 425, in condition
>     self.assertEqual(len(model_weights), len(batches))
> AssertionError: 6 != 10
>  
>  
> ==
> FAIL: test_convergence 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> --
> Traceback (most recent call last):
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 292, in test_convergence
>     self._eventually(condition, 60.0, catch_assertions=True)
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 74, in _eventually
>     raise lastValue
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 65, in _eventually
>     lastValue = condition()
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 288, in condition
>     self.assertEqual(len(models), len(input_batches))
> AssertionError: 19 != 20
>  
> ==
> FAIL: test_parameter_accuracy 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> --
> Traceback (most recent call last):
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 266, in test_parameter_accuracy
>     self._eventually(condition, catch_assertions=True)
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 74, in _eventually
>     raise lastValue
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 65, in _eventually
>     lastValue = condition()
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 263, in condition
>     self.assertAlmostEqual(rel, 0.1, 1)
> AssertionError: 0.21309223935797794 != 0.1 within 1 places
>  
> ==
> FAIL: test_training_and_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressio

[jira] [Commented] (SPARK-29210) Spark 3.0 Migration guide should contain the details of to_utc_timestamp function and from_utc_timestamp

2019-09-22 Thread Sharanabasappa G Keriwaddi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935577#comment-16935577
 ] 

Sharanabasappa G Keriwaddi commented on SPARK-29210:


I will work on this

> Spark 3.0 Migration guide should contain the details of to_utc_timestamp 
> function and from_utc_timestamp
> 
>
> Key: SPARK-29210
> URL: https://issues.apache.org/jira/browse/SPARK-29210
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> to_utc_timestamp and from_utc_timestamp are deprecated in Spark 3.0. Details 
> should be included in the Spark 3.0 migration guide.
>  
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
> from_utc_timestamp('2016-08-31', 'Asia/Seoul');
> Error: org.apache.spark.sql.AnalysisException: The from_utc_timestamp 
> function has been disabled since Spark 3.0. Set 
> spark.sql.legacy.utcTimestampFunc.enabled to true to enable this function.;; 
> line 1 pos 7 (state=,code=0)
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
> to_utc_timestamp('2016-08-31', 'Asia/Seoul');
> Error: org.apache.spark.sql.AnalysisException: The to_utc_timestamp function 
> has been disabled since Spark 3.0. Set 
> spark.sql.legacy.utcTimestampFunc.enabled to true to enable this function.;; 
> line 1 pos 7 (state=,code=0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29209) Print build environment variables to Github

2019-09-22 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29209:

Description: This PR makes it possible to print the AMPLAB_JENKINS_BUILD_TOOL, 
AMPLAB_JENKINS_BUILD_PROFILE and JAVA_HOME environment variables to Github once this 
test finishes.  (was: When running tests for a pull request on Jenkins, you can add 
special phrases to the title of your pull request to change testing behavior. 
This includes:

[test-maven] - signals to test the pull request using maven
[test-hadoop2.7] - signals to test using Spark’s Hadoop 2.7 profile
[test-hadoop3.2] - signals to test using Spark’s Hadoop 3.2 profile
[test-hadoop3.2][test-java11] - signals to test using Spark’s Hadoop 3.2 
profile with JDK 11)

> Print build environment variables to Github
> ---
>
> Key: SPARK-29209
> URL: https://issues.apache.org/jira/browse/SPARK-29209
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> This PR makes it possible to print the AMPLAB_JENKINS_BUILD_TOOL, 
> AMPLAB_JENKINS_BUILD_PROFILE and JAVA_HOME environment variables to Github once 
> this test finishes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29209) Print build environment variables to Github

2019-09-22 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29209:

Description: Make it possible to print the AMPLAB_JENKINS_BUILD_TOOL, 
AMPLAB_JENKINS_BUILD_PROFILE and JAVA_HOME environment variables to Github once this 
test finishes.  (was: This PR makes it possible to print the AMPLAB_JENKINS_BUILD_TOOL, 
AMPLAB_JENKINS_BUILD_PROFILE and JAVA_HOME environment variables to Github once this 
test finishes.)

> Print build environment variables to Github
> ---
>
> Key: SPARK-29209
> URL: https://issues.apache.org/jira/browse/SPARK-29209
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Make it possible to print the AMPLAB_JENKINS_BUILD_TOOL, 
> AMPLAB_JENKINS_BUILD_PROFILE and JAVA_HOME environment variables to Github once 
> this test finishes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29210) Spark 3.0 Migration guide should contain the details of to_utc_timestamp function and from_utc_timestamp

2019-09-22 Thread ABHISHEK KUMAR GUPTA (Jira)
ABHISHEK KUMAR GUPTA created SPARK-29210:


 Summary: Spark 3.0 Migration guide should contain the details of 
to_utc_timestamp function and from_utc_timestamp
 Key: SPARK-29210
 URL: https://issues.apache.org/jira/browse/SPARK-29210
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 3.0.0
Reporter: ABHISHEK KUMAR GUPTA


to_utc_timestamp and from_utc_timestamp are deprecated in Spark 3.0. Details 
should be included in the Spark 3.0 migration guide.

 

0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
from_utc_timestamp('2016-08-31', 'Asia/Seoul');
Error: org.apache.spark.sql.AnalysisException: The from_utc_timestamp function 
has been disabled since Spark 3.0. Set spark.sql.legacy.utcTimestampFunc.enabled 
to true to enable this function.;; line 1 pos 7 (state=,code=0)
0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
to_utc_timestamp('2016-08-31', 'Asia/Seoul');
Error: org.apache.spark.sql.AnalysisException: The to_utc_timestamp function 
has been disabled since Spark 3.0. Set 
spark.sql.legacy.utcTimestampFunc.enabled to true to enable this function.;; 
line 1 pos 7 (state=,code=0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29209) Print build environment variables to Github

2019-09-22 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-29209:
---

 Summary: Print build environment variables to Github
 Key: SPARK-29209
 URL: https://issues.apache.org/jira/browse/SPARK-29209
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Yuming Wang


When running tests for a pull request on Jenkins, you can add special phrases 
to the title of your pull request to change testing behavior. This includes:

[test-maven] - signals to test the pull request using maven
[test-hadoop2.7] - signals to test using Spark’s Hadoop 2.7 profile
[test-hadoop3.2] - signals to test using Spark’s Hadoop 3.2 profile
[test-hadoop3.2][test-java11] - signals to test using Spark’s Hadoop 3.2 
profile with JDK 11



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29207) Document LIST JAR in SQL Reference

2019-09-22 Thread Sandeep Katta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935574#comment-16935574
 ] 

Sandeep Katta commented on SPARK-29207:
---

LIST JAR and LIST FILE are both part of the LIST command, so you can use either 
one of these JIRAs to fix this

> Document LIST JAR in SQL Reference
> --
>
> Key: SPARK-29207
> URL: https://issues.apache.org/jira/browse/SPARK-29207
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29208) Document LIST FILE in SQL Reference

2019-09-22 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935568#comment-16935568
 ] 

Huaxin Gao commented on SPARK-29208:


I will work on this

> Document LIST FILE in SQL Reference
> ---
>
> Key: SPARK-29208
> URL: https://issues.apache.org/jira/browse/SPARK-29208
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29208) Document LIST FILE in SQL Reference

2019-09-22 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-29208:
--

 Summary: Document LIST FILE in SQL Reference
 Key: SPARK-29208
 URL: https://issues.apache.org/jira/browse/SPARK-29208
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Huaxin Gao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29207) Document LIST JAR in SQL Reference

2019-09-22 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935567#comment-16935567
 ] 

Huaxin Gao commented on SPARK-29207:


I will work on this

> Document LIST JAR in SQL Reference
> --
>
> Key: SPARK-29207
> URL: https://issues.apache.org/jira/browse/SPARK-29207
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29207) Document LIST JAR in SQL Reference

2019-09-22 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-29207:
--

 Summary: Document LIST JAR in SQL Reference
 Key: SPARK-29207
 URL: https://issues.apache.org/jira/browse/SPARK-29207
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Huaxin Gao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29206) Number of shuffle Netty server threads should be a multiple of number of chunk fetch handler threads

2019-09-22 Thread Min Shen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935556#comment-16935556
 ] 

Min Shen commented on SPARK-29206:
--

We initially tried an alternative approach to resolve this issue by 
implementing a custom Netty EventExecutorChooserFactory, so that the Spark shuffle 
Netty server could be a bit more intelligent about choosing which thread in an 
EventLoopGroup to associate with a new channel.

In the latest version of Netty 4.1, each (Nio|Epoll)EventLoop exposes information 
about its number of pending tasks and registered channels. We initially thought 
we could use these metrics to do better load balancing and avoid 
registering a channel with a busy EventLoop.

However, as we implemented this approach, we realized that the state of an 
EventLoop at channel registration time could be very different from when an RPC 
request from this channel is placed in the task queue of this EventLoop later. 
Since there is no way to precisely tell the state of an EventLoop in the 
future, we gave up on this approach.
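
For reference, a rough sketch of what such a chooser could have looked like (the 
class name and the pendingTasks-based heuristic are illustrative assumptions, not 
the exact patch we tried):
{code:java}
import io.netty.util.concurrent.EventExecutor;
import io.netty.util.concurrent.EventExecutorChooserFactory;
import io.netty.util.concurrent.SingleThreadEventExecutor;

// Illustrative chooser that picks the executor with the fewest pending tasks
// at channel registration time. As noted above, this snapshot can be stale by
// the time requests from the channel are actually queued, which is why the
// approach was abandoned.
public class LeastPendingTasksChooserFactory implements EventExecutorChooserFactory {
    @Override
    public EventExecutorChooser newChooser(EventExecutor[] executors) {
        return () -> {
            EventExecutor chosen = executors[0];
            int fewest = Integer.MAX_VALUE;
            for (EventExecutor e : executors) {
                int pending = (e instanceof SingleThreadEventExecutor)
                        ? ((SingleThreadEventExecutor) e).pendingTasks()
                        : 0;
                if (pending < fewest) {
                    fewest = pending;
                    chosen = e;
                }
            }
            return chosen;
        };
    }
}
{code}
Such a factory would then be passed to an EventLoopGroup constructor that accepts 
an EventExecutorChooserFactory.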

> Number of shuffle Netty server threads should be a multiple of number of 
> chunk fetch handler threads
> 
>
> Key: SPARK-29206
> URL: https://issues.apache.org/jira/browse/SPARK-29206
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Min Shen
>Priority: Major
>
> In SPARK-24355, we proposed to use a separate chunk fetch handler thread pool 
> to handle the slow-to-process chunk fetch requests in order to improve the 
> responsiveness of shuffle service for RPC requests.
> Initially, we thought by making the number of Netty server threads larger 
> than the number of chunk fetch handler threads, it would reserve some threads 
> for RPC requests thus resolving the various RPC request timeout issues we 
> experienced previously. The solution worked in our cluster initially. 
> However, as the number of Spark applications in our cluster continues to 
> increase, we saw the RPC request (SASL authentication specifically) timeout 
> issue again:
> {noformat}
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
>  {noformat}
> After further investigation, we realized that as the number of concurrent 
> clients connecting to a shuffle service increases, it becomes _VERY_ 
> important to configure the number of Netty server threads and number of chunk 
> fetch handler threads correctly. Specifically, the number of Netty server 
> threads needs to be a multiple of the number of chunk fetch handler threads. 
> The reason is explained in detail below:
> When a channel is established on the Netty server, it is registered with both 
> the Netty server default EventLoopGroup and the chunk fetch handler 
> EventLoopGroup. Once registered, this channel sticks with a given thread in 
> both EventLoopGroups, i.e. all requests from this channel are going to be 
> handled by the same thread. Right now, the Spark shuffle Netty server uses the 
> default Netty strategy to select a thread from an EventLoopGroup to be 
> associated with a new channel, which is simply round-robin (Netty's 
> DefaultEventExecutorChooserFactory).
> In SPARK-24355, with the introduced chunk fetch handler thread pool, all 
> chunk fetch requests from a given channel will be first added to the task 
> queue of the chunk fetch handler thread associated with that channel. When 
> the requests get processed, the chunk fetch request handler thread will 
> submit a task to the task queue of the Netty server thread that's also 
> associated with this channel. If the number of Netty server threads is not a 
> multiple of the number of chunk fetch handler threads, it would become a 
> problem when the server has a large number of concurrent connections.
> Assume we configure the number of Netty server threads as 40 and the 
> percentage of chunk fetch handler threads as 87, which leads

[jira] [Commented] (SPARK-29206) Number of shuffle Netty server threads should be a multiple of number of chunk fetch handler threads

2019-09-22 Thread Min Shen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935552#comment-16935552
 ] 

Min Shen commented on SPARK-29206:
--

[~redsanket], [~tgraves],

Since you worked on committing the original patch, would appreciate your 
comments here.

> Number of shuffle Netty server threads should be a multiple of number of 
> chunk fetch handler threads
> 
>
> Key: SPARK-29206
> URL: https://issues.apache.org/jira/browse/SPARK-29206
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Min Shen
>Priority: Major
>
> In SPARK-24355, we proposed to use a separate chunk fetch handler thread pool 
> to handle the slow-to-process chunk fetch requests in order to improve the 
> responsiveness of shuffle service for RPC requests.
> Initially, we thought by making the number of Netty server threads larger 
> than the number of chunk fetch handler threads, it would reserve some threads 
> for RPC requests thus resolving the various RPC request timeout issues we 
> experienced previously. The solution worked in our cluster initially. 
> However, as the number of Spark applications in our cluster continues to 
> increase, we saw the RPC request (SASL authentication specifically) timeout 
> issue again:
> {noformat}
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
>  {noformat}
> After further investigation, we realized that as the number of concurrent 
> clients connecting to a shuffle service increases, it becomes _VERY_ 
> important to configure the number of Netty server threads and number of chunk 
> fetch handler threads correctly. Specifically, the number of Netty server 
> threads needs to be a multiple of the number of chunk fetch handler threads. 
> The reason is explained in detail below:
> When a channel is established on the Netty server, it is registered with both 
> the Netty server default EventLoopGroup and the chunk fetch handler 
> EventLoopGroup. Once registered, this channel sticks with a given thread in 
> both EventLoopGroups, i.e. all requests from this channel are going to be 
> handled by the same thread. Right now, the Spark shuffle Netty server uses the 
> default Netty strategy to select a thread from an EventLoopGroup to be 
> associated with a new channel, which is simply round-robin (Netty's 
> DefaultEventExecutorChooserFactory).
> In SPARK-24355, with the introduced chunk fetch handler thread pool, all 
> chunk fetch requests from a given channel will be first added to the task 
> queue of the chunk fetch handler thread associated with that channel. When 
> the requests get processed, the chunk fetch request handler thread will 
> submit a task to the task queue of the Netty server thread that's also 
> associated with this channel. If the number of Netty server threads is not a 
> multiple of the number of chunk fetch handler threads, it would become a 
> problem when the server has a large number of concurrent connections.
> Assume we configure the number of Netty server threads as 40 and the 
> percentage of chunk fetch handler threads as 87, which leads to 35 chunk 
> fetch handler threads. Then according to the round-robin policy, channel 0, 
> 40, 80, 120, 160, 200, 240, and 280 will all be associated with the 1st Netty 
> server thread in the default EventLoopGroup. However, since the chunk fetch 
> handler thread pool only has 35 threads, out of these 8 channels, only 
> channel 0 and 280 will be associated with the same chunk fetch handler 
> thread. Thus, channel 0, 40, 80, 120, 160, 200, 240 will all be associated 
> with different chunk fetch handler threads but associated with the same Netty 
> server thread. This means, the 7 different chunk fetch handler threads 
> associated with these channels could potentially submit tasks to the task 
> queue of the same Netty server thread at the same time. This would 

[jira] [Created] (SPARK-29206) Number of shuffle Netty server threads should be a multiple of number of chunk fetch handler threads

2019-09-22 Thread Min Shen (Jira)
Min Shen created SPARK-29206:


 Summary: Number of shuffle Netty server threads should be a 
multiple of number of chunk fetch handler threads
 Key: SPARK-29206
 URL: https://issues.apache.org/jira/browse/SPARK-29206
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 3.0.0
Reporter: Min Shen


In SPARK-24355, we proposed to use a separate chunk fetch handler thread pool 
to handle the slow-to-process chunk fetch requests in order to improve the 
responsiveness of shuffle service for RPC requests.

Initially, we thought by making the number of Netty server threads larger than 
the number of chunk fetch handler threads, it would reserve some threads for 
RPC requests thus resolving the various RPC request timeout issues we 
experienced previously. The solution worked in our cluster initially. However, 
as the number of Spark applications in our cluster continues to increase, we 
saw the RPC request (SASL authentication specifically) timeout issue again:
{noformat}
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
waiting for task.
at 
org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
at 
org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
at 
org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at 
org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
at 
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
at 
org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
 {noformat}
After further investigation, we realized that as the number of concurrent 
clients connecting to a shuffle service increases, it becomes _VERY_ important 
to configure the number of Netty server threads and number of chunk fetch 
handler threads correctly. Specifically, the number of Netty server threads 
needs to be a multiple of the number of chunk fetch handler threads. The reason 
is explained in detail below:

When a channel is established on the Netty server, it is registered with both 
the Netty server default EventLoopGroup and the chunk fetch handler 
EventLoopGroup. Once registered, this channel sticks with a given thread in 
both EventLoopGroups, i.e. all requests from this channel are going to be 
handled by the same thread. Right now, the Spark shuffle Netty server uses the 
default Netty strategy to select a thread from an EventLoopGroup to be 
associated with a new channel, which is simply round-robin (Netty's 
DefaultEventExecutorChooserFactory).

In SPARK-24355, with the introduced chunk fetch handler thread pool, all chunk 
fetch requests from a given channel will be first added to the task queue of 
the chunk fetch handler thread associated with that channel. When the requests 
get processed, the chunk fetch request handler thread will submit a task to the 
task queue of the Netty server thread that's also associated with this channel. 
If the number of Netty server threads is not a multiple of the number of chunk 
fetch handler threads, it would become a problem when the server has a large 
number of concurrent connections.

Assume we configure the number of Netty server threads as 40 and the percentage 
of chunk fetch handler threads as 87, which leads to 35 chunk fetch handler 
threads. Then according to the round-robin policy, channel 0, 40, 80, 120, 160, 
200, 240, and 280 will all be associated with the 1st Netty server thread in 
the default EventLoopGroup. However, since the chunk fetch handler thread pool 
only has 35 threads, out of these 8 channels, only channel 0 and 280 will be 
associated with the same chunk fetch handler thread. Thus, channel 0, 40, 80, 
120, 160, 200, 240 will all be associated with different chunk fetch handler 
threads but associated with the same Netty server thread. This means, the 7 
different chunk fetch handler threads associated with these channels could 
potentially submit tasks to the task queue of the same Netty server thread at 
the same time. This would lead to 7 slow-to-process requests sitting in the 
task queue. If an RPC request is put in the task queue after these 7 requests, 
it is very likely to timeout.
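
A small standalone sketch of the round-robin arithmetic behind this example (a 
hypothetical snippet using the 40/35 numbers above, not code from Spark itself):
{code:java}
public class ChannelThreadMapping {
    public static void main(String[] args) {
        int serverThreads = 40;  // Netty server EventLoopGroup size
        int fetchThreads = 35;   // 87% of 40, rounded down
        // Channels that round-robin onto Netty server thread 0:
        for (int channel = 0; channel <= 280; channel += serverThreads) {
            System.out.println("channel " + channel
                    + " -> server thread " + (channel % serverThreads)
                    + ", chunk fetch thread " + (channel % fetchThreads));
        }
        // Only channels 0 and 280 share chunk fetch thread 0; the other six
        // channels land on distinct chunk fetch threads while sharing the
        // same Netty server thread.
    }
}
{code}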

In our cluster, the number of concurrent active connections to a shuffle 
service could go as high as 6K+ during peak. If the numbers of these thread 
pools are not configured correctly, our Spark applications are guaranteed to 
see SASL timeout issues when a shuffle service is dealing with a lot of 
incoming chunk fetch requests from many distinct clients, which lea

[jira] [Created] (SPARK-29205) Pyspark tests failed for suspected performance problem on ARM

2019-09-22 Thread zhao bo (Jira)
zhao bo created SPARK-29205:
---

 Summary: Pyspark tests failed for suspected performance problem on 
ARM
 Key: SPARK-29205
 URL: https://issues.apache.org/jira/browse/SPARK-29205
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
 Environment: OS: Ubuntu16.04

Arch: aarch64

Host: Virtual Machine
Reporter: zhao bo


We tested PySpark on an ARM VM and found some test failures. Once we changed the 
source code to extend the wait time, to make sure those test tasks had finished, 
the tests passed.

 

The affected test cases include:

pyspark.mllib.tests.test_streaming_algorithms:StreamingLinearRegressionWithTests.test_parameter_convergence

pyspark.mllib.tests.test_streaming_algorithms:StreamingLogisticRegressionWithSGDTests.test_convergence

pyspark.mllib.tests.test_streaming_algorithms:StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy

pyspark.mllib.tests.test_streaming_algorithms:StreamingLogisticRegressionWithSGDTests.test_training_and_prediction

The error messages for the above test failures:
==
FAIL: test_parameter_convergence 
(pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
Test that the model parameters improve with streaming data.
--
Traceback (most recent call last):
  File 
"/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 429, in test_parameter_convergence
    self._eventually(condition, catch_assertions=True)
  File 
"/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 74, in _eventually
    raise lastValue
  File 
"/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 65, in _eventually
    lastValue = condition()
  File 
"/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 425, in condition
    self.assertEqual(len(model_weights), len(batches))
AssertionError: 6 != 10
 
 
==
FAIL: test_convergence 
(pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
--
Traceback (most recent call last):
  File 
"/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 292, in test_convergence
    self._eventually(condition, 60.0, catch_assertions=True)
  File 
"/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 74, in _eventually
    raise lastValue
  File 
"/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 65, in _eventually
    lastValue = condition()
  File 
"/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 288, in condition
    self.assertEqual(len(models), len(input_batches))
AssertionError: 19 != 20
 
==
FAIL: test_parameter_accuracy 
(pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
--
Traceback (most recent call last):
  File 
"/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 266, in test_parameter_accuracy
    self._eventually(condition, catch_assertions=True)
  File 
"/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 74, in _eventually
    raise lastValue
  File 
"/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 65, in _eventually
    lastValue = condition()
  File 
"/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 263, in condition
    self.assertAlmostEqual(rel, 0.1, 1)
AssertionError: 0.21309223935797794 != 0.1 within 1 places
 
==
FAIL: test_training_and_prediction 
(pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
Test that the model improves on toy data with no. of batches
--
Traceback (most recent call last):
  File 
"/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 367, in test_training_and_prediction
    self._eventually(condition, timeout=60.0)
  File 
"/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 78, in _eventually
    % (timeout, lastValue))
AssertionError: Test failed due to timeout after 60 sec, with last condition 
returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 
0.71,

[jira] [Updated] (SPARK-29016) Update LICENSE and NOTICE for Hive 2.3

2019-09-22 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29016:

Summary: Update LICENSE and NOTICE for Hive 2.3  (was: Update, fix LICENSE 
and NOTICE for Hive 2.3)

> Update LICENSE and NOTICE for Hive 2.3
> --
>
> Key: SPARK-29016
> URL: https://issues.apache.org/jira/browse/SPARK-29016
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Hive 2.3 newly added jars:
> {noformat}
> dropwizard-metrics-hadoop-metrics2-reporter.jar
> HikariCP-2.5.1.jar
> hive-common-2.3.6.jar
> hive-llap-common-2.3.6.jar
> hive-serde-2.3.6.jar
> hive-service-rpc-2.3.6.jar
> hive-shims-0.23-2.3.6.jar
> hive-shims-2.3.6.jar
> hive-shims-common-2.3.6.jar
> hive-shims-scheduler-2.3.6.jar
> hive-storage-api-2.6.0.jar
> hive-vector-code-gen-2.3.6.jar
> javax.jdo-3.2.0-m3.jar
> json-1.8.jar
> transaction-api-1.1.jar
> velocity-1.5.jar
> {noformat}
> More details: https://github.com/apache/spark/pull/21640#discussion_r321777658



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29197) Remove saveModeForDSV2 in DataFrameWriter

2019-09-22 Thread Burak Yavuz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935498#comment-16935498
 ] 

Burak Yavuz commented on SPARK-29197:
-

Hi Ido,

Thanks for your interest. I already have a PR up for it. It's not going 
to be a straightforward task, and it requires some context and knowledge of future 
plans around DataSource V2 as well.

> Remove saveModeForDSV2 in DataFrameWriter
> -
>
> Key: SPARK-29197
> URL: https://issues.apache.org/jira/browse/SPARK-29197
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> It is very confusing that the default save mode differs depending on the 
> internal implementation of a data source. The reason that we had to have 
> saveModeForDSV2 was that there was no easy way to check the existence of a 
> Table in DataSource v2. Now, we have catalogs for that. Therefore we should 
> be able to remove the different save modes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29204) Remove `Spark Release` Jenkins tab and its four jobs

2019-09-22 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935433#comment-16935433
 ] 

Dongjoon Hyun commented on SPARK-29204:
---

Thanks. Specifically, it's about deleting the view and the four jobs together, 
namely the following:
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/job/spark-release-docs/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/job/spark-release-package/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/job/spark-release-publish/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/job/spark-release-tag/

> Remove `Spark Release` Jenkins tab and its four jobs
> 
>
> Key: SPARK-29204
> URL: https://issues.apache.org/jira/browse/SPARK-29204
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Attachments: Spark Release Jobs.png
>
>
> For the last two years, we haven't used the `Spark Release` Jenkins jobs. Although 
> we have kept them until now, they have become outdated because we are using the 
> Docker `spark-rm` image.
>  !Spark Release Jobs.png! 
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/
> We had better remove them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29204) Remove `Spark Release` Jenkins tab and its four jobs

2019-09-22 Thread Sean Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935432#comment-16935432
 ] 

Sean Owen commented on SPARK-29204:
---

Is this just a matter of deleting the view? I have permissions to delete it. 
I'm OK with doing so.

> Remove `Spark Release` Jenkins tab and its four jobs
> 
>
> Key: SPARK-29204
> URL: https://issues.apache.org/jira/browse/SPARK-29204
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Attachments: Spark Release Jobs.png
>
>
> For the last two years, we haven't used the `Spark Release` Jenkins jobs. Although 
> we have kept them until now, they have become outdated because we are using the 
> Docker `spark-rm` image.
>  !Spark Release Jobs.png! 
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/
> We had better remove them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29084) Check method bytecode size in BenchmarkQueryTest

2019-09-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29084.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25788
[https://github.com/apache/spark/pull/25788]

> Check method bytecode size in BenchmarkQueryTest
> 
>
> Key: SPARK-29084
> URL: https://issues.apache.org/jira/browse/SPARK-29084
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29084) Check method bytecode size in BenchmarkQueryTest

2019-09-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29084:
-

Assignee: Takeshi Yamamuro

> Check method bytecode size in BenchmarkQueryTest
> 
>
> Key: SPARK-29084
> URL: https://issues.apache.org/jira/browse/SPARK-29084
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29204) Remove `Spark Release` Jenkins tab and its four jobs

2019-09-22 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935423#comment-16935423
 ] 

Dongjoon Hyun commented on SPARK-29204:
---

cc [~shaneknapp] and [~srowen]

> Remove `Spark Release` Jenkins tab and its four jobs
> 
>
> Key: SPARK-29204
> URL: https://issues.apache.org/jira/browse/SPARK-29204
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Attachments: Spark Release Jobs.png
>
>
> For the last two years, we haven't used the `Spark Release` Jenkins jobs. Although 
> we have kept them until now, they have become outdated because we are using the 
> Docker `spark-rm` image.
>  !Spark Release Jobs.png! 
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/
> We had better remove them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29204) Remove `Spark Release` Jenkins tab and its four jobs

2019-09-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29204:
--
Description: 
For the last two years, we haven't used the `Spark Release` Jenkins jobs. Although 
we have kept them until now, they have become outdated because we are using the 
Docker `spark-rm` image.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/

We had better remove them.

  was:
For the last two years, we haven't used the `Spark Release` Jenkins jobs. They have 
become outdated because we are using the Docker `spark-rm` image.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/



> Remove `Spark Release` Jenkins tab and its four jobs
> 
>
> Key: SPARK-29204
> URL: https://issues.apache.org/jira/browse/SPARK-29204
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Attachments: Spark Release Jobs.png
>
>
> For the last two years, we haven't used the `Spark Release` Jenkins jobs. Although 
> we have kept them until now, they have become outdated because we are using the 
> Docker `spark-rm` image.
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/
> We had better remove them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29204) Remove `Spark Release` Jenkins tab and its four jobs

2019-09-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29204:
--
Description: 
For the last two years, we haven't used the `Spark Release` Jenkins jobs. Although 
we have kept them until now, they have become outdated because we are using the 
Docker `spark-rm` image.

 !Spark Release Jobs.png! 

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/

We had better remove them.

  was:
For the last two years, we haven't used the `Spark Release` Jenkins jobs. Although 
we have kept them until now, they have become outdated because we are using the 
Docker `spark-rm` image.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/

We had better remove them.


> Remove `Spark Release` Jenkins tab and its four jobs
> 
>
> Key: SPARK-29204
> URL: https://issues.apache.org/jira/browse/SPARK-29204
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Attachments: Spark Release Jobs.png
>
>
> For the last two years, we haven't used the `Spark Release` Jenkins jobs. Although 
> we have kept them until now, they have become outdated because we are using the 
> Docker `spark-rm` image.
>  !Spark Release Jobs.png! 
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/
> We had better remove them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29204) Remove `Spark Release` Jenkins tab and its four jobs

2019-09-22 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-29204:
-

 Summary: Remove `Spark Release` Jenkins tab and its four jobs
 Key: SPARK-29204
 URL: https://issues.apache.org/jira/browse/SPARK-29204
 Project: Spark
  Issue Type: Task
  Components: Project Infra
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun
 Attachments: Spark Release Jobs.png

For the last two years, we haven't used the `Spark Release` Jenkins jobs. They have 
become outdated because we are using the Docker `spark-rm` image.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29204) Remove `Spark Release` Jenkins tab and its four jobs

2019-09-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29204:
--
Attachment: Spark Release Jobs.png

> Remove `Spark Release` Jenkins tab and its four jobs
> 
>
> Key: SPARK-29204
> URL: https://issues.apache.org/jira/browse/SPARK-29204
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Attachments: Spark Release Jobs.png
>
>
> For the last two years, we haven't used the `Spark Release` Jenkins jobs. They have 
> become outdated because we are using the Docker `spark-rm` image.
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28599) Fix `Execution Time` and `Duration` column sorting for ThriftServerSessionPage

2019-09-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28599.
---
Fix Version/s: 3.0.0
 Assignee: Yuming Wang
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/25892

> Fix `Execution Time` and `Duration` column sorting for ThriftServerSessionPage
> --
>
> Key: SPARK-28599
> URL: https://issues.apache.org/jira/browse/SPARK-28599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> Support sorting `Execution Time` and `Duration` columns for 
> ThriftServerSessionPage



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28599) Fix wrong column sorting in ThriftServerSessionPage

2019-09-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28599:
--
Summary: Fix wrong column sorting in ThriftServerSessionPage  (was: Support 
sorting for ThriftServerSessionPage)

> Fix wrong column sorting in ThriftServerSessionPage
> ---
>
> Key: SPARK-28599
> URL: https://issues.apache.org/jira/browse/SPARK-28599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Minor
>
> Support sorting `Execution Time` and `Duration` columns for 
> ThriftServerSessionPage



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28599) Fix `Execution Time` and `Duration` column sorting for ThriftServerSessionPage

2019-09-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28599:
--
Summary: Fix `Execution Time` and `Duration` column sorting for 
ThriftServerSessionPage  (was: Fix wrong column sorting in 
ThriftServerSessionPage)

> Fix `Execution Time` and `Duration` column sorting for ThriftServerSessionPage
> --
>
> Key: SPARK-28599
> URL: https://issues.apache.org/jira/browse/SPARK-28599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Minor
>
> Support sorting `Execution Time` and `Duration` columns for 
> ThriftServerSessionPage



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28599) Support sorting for ThriftServerSessionPage

2019-09-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28599:
--
Issue Type: Bug  (was: Improvement)

> Support sorting for ThriftServerSessionPage
> ---
>
> Key: SPARK-28599
> URL: https://issues.apache.org/jira/browse/SPARK-28599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Support sorting `Execution Time` and `Duration` columns for 
> ThriftServerSessionPage



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28599) Support sorting for ThriftServerSessionPage

2019-09-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28599:
--
Priority: Minor  (was: Major)

> Support sorting for ThriftServerSessionPage
> ---
>
> Key: SPARK-28599
> URL: https://issues.apache.org/jira/browse/SPARK-28599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Minor
>
> Support sorting `Execution Time` and `Duration` columns for 
> ThriftServerSessionPage



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29191) Add tag ExtendedSQLTest for SQLQueryTestSuite

2019-09-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29191:
-

Assignee: Dongjoon Hyun

> Add tag ExtendedSQLTest for SQLQueryTestSuite
> -
>
> Key: SPARK-29191
> URL: https://issues.apache.org/jira/browse/SPARK-29191
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29191) Add tag ExtendedSQLTest for SQLQueryTestSuite

2019-09-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29191.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25872
[https://github.com/apache/spark/pull/25872]

> Add tag ExtendedSQLTest for SQLQueryTestSuite
> -
>
> Key: SPARK-29191
> URL: https://issues.apache.org/jira/browse/SPARK-29191
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29162) Simplify NOT(isnull(x)) and NOT(isnotnull(x))

2019-09-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29162.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25878
[https://github.com/apache/spark/pull/25878]

> Simplify NOT(isnull(x)) and NOT(isnotnull(x))
> -
>
> Key: SPARK-29162
> URL: https://issues.apache.org/jira/browse/SPARK-29162
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.0.0
>
>
> I propose the following expression rewrite optimizations:
> {code}
> NOT isnull(x) -> isnotnull(x)
> NOT isnotnull(x)  -> isnull(x)
> {code}
> This might seem contrived, but I saw negated versions of these expressions 
> appear in a user-written query after that query had undergone optimization. 
> For example:
> {code}
> spark.createDataset(Seq[(String, java.lang.Boolean)](("true", true), 
> ("false", false), ("null", null))).write.parquet("/tmp/bools")
> spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == 
> false)").explain
> spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == 
> false)").explain(true)
> == Parsed Logical Plan ==
> 'Filter NOT ('isnull('_2) OR ('_2 = false))
> +- RelationV2[_1#4, _2#5] parquet file:/tmp/bools
> == Analyzed Logical Plan ==
> _1: string, _2: boolean
> Filter NOT (isnull(_2#5) OR (_2#5 = false))
> +- RelationV2[_1#4, _2#5] parquet file:/tmp/bools
> == Optimized Logical Plan ==
> Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
> +- RelationV2[_1#4, _2#5] parquet file:/tmp/bools
> == Physical Plan ==
> *(1) Project [_1#4, _2#5]
> +- *(1) Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
>+- *(1) ColumnarToRow
>   +- BatchScan[_1#4, _2#5] ParquetScan Location: 
> InMemoryFileIndex[file:/tmp/bools], ReadSchema: struct<_1:string,_2:boolean>
> {code}
> This rewrite is also useful for query canonicalization.
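
A minimal sketch (assuming the Catalyst API; the rule name SimplifyNotNullChecks is 
illustrative, not the actual Spark patch) of how this rewrite can be expressed as an 
optimizer rule:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{IsNotNull, IsNull, Not}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Rewrites NOT(isnull(x)) to isnotnull(x) and NOT(isnotnull(x)) to isnull(x)
// everywhere in a logical plan.
object SimplifyNotNullChecks extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Not(IsNull(child))    => IsNotNull(child)
    case Not(IsNotNull(child)) => IsNull(child)
  }
}
{code}

Combined with the existing boolean simplification, the duplicated null check in the 
optimized plan above could then collapse into a single isnotnull(_2#5).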



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29162) Simplify NOT(isnull(x)) and NOT(isnotnull(x))

2019-09-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29162:
-

Assignee: angerszhu

> Simplify NOT(isnull(x)) and NOT(isnotnull(x))
> -
>
> Key: SPARK-29162
> URL: https://issues.apache.org/jira/browse/SPARK-29162
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Assignee: angerszhu
>Priority: Major
>
> I propose the following expression rewrite optimizations:
> {code}
> NOT isnull(x) -> isnotnull(x)
> NOT isnotnull(x)  -> isnull(x)
> {code}
> This might seem contrived, but I saw negated versions of these expressions 
> appear in a user-written query after that query had undergone optimization. 
> For example:
> {code}
> spark.createDataset(Seq[(String, java.lang.Boolean)](("true", true), 
> ("false", false), ("null", null))).write.parquet("/tmp/bools")
> spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == 
> false)").explain
> spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == 
> false)").explain(true)
> == Parsed Logical Plan ==
> 'Filter NOT ('isnull('_2) OR ('_2 = false))
> +- RelationV2[_1#4, _2#5] parquet file:/tmp/bools
> == Analyzed Logical Plan ==
> _1: string, _2: boolean
> Filter NOT (isnull(_2#5) OR (_2#5 = false))
> +- RelationV2[_1#4, _2#5] parquet file:/tmp/bools
> == Optimized Logical Plan ==
> Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
> +- RelationV2[_1#4, _2#5] parquet file:/tmp/bools
> == Physical Plan ==
> *(1) Project [_1#4, _2#5]
> +- *(1) Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
>+- *(1) ColumnarToRow
>   +- BatchScan[_1#4, _2#5] ParquetScan Location: 
> InMemoryFileIndex[file:/tmp/bools], ReadSchema: struct<_1:string,_2:boolean>
> {code}
> This rewrite is also useful for query canonicalization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29016) Update, fix LICENSE and NOTICE for Hive 2.3

2019-09-22 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29016:

Description: 
Hive 2.3 newly added jars:
{noformat}
dropwizard-metrics-hadoop-metrics2-reporter.jar
HikariCP-2.5.1.jar
hive-common-2.3.6.jar
hive-llap-common-2.3.6.jar
hive-serde-2.3.6.jar
hive-service-rpc-2.3.6.jar
hive-shims-0.23-2.3.6.jar
hive-shims-2.3.6.jar
hive-shims-common-2.3.6.jar
hive-shims-scheduler-2.3.6.jar
hive-storage-api-2.6.0.jar
hive-vector-code-gen-2.3.6.jar
javax.jdo-3.2.0-m3.jar
json-1.8.jar
transaction-api-1.1.jar
velocity-1.5.jar
{noformat}


More details: https://github.com/apache/spark/pull/21640#discussion_r321777658


  was:More details: 
https://github.com/apache/spark/pull/21640#discussion_r321777658


> Update, fix LICENSE and NOTICE for Hive 2.3
> ---
>
> Key: SPARK-29016
> URL: https://issues.apache.org/jira/browse/SPARK-29016
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Hive 2.3 newly added jars:
> {noformat}
> dropwizard-metrics-hadoop-metrics2-reporter.jar
> HikariCP-2.5.1.jar
> hive-common-2.3.6.jar
> hive-llap-common-2.3.6.jar
> hive-serde-2.3.6.jar
> hive-service-rpc-2.3.6.jar
> hive-shims-0.23-2.3.6.jar
> hive-shims-2.3.6.jar
> hive-shims-common-2.3.6.jar
> hive-shims-scheduler-2.3.6.jar
> hive-storage-api-2.6.0.jar
> hive-vector-code-gen-2.3.6.jar
> javax.jdo-3.2.0-m3.jar
> json-1.8.jar
> transaction-api-1.1.jar
> velocity-1.5.jar
> {noformat}
> More details: https://github.com/apache/spark/pull/21640#discussion_r321777658



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28001) Dataframe throws 'socket.timeout: timed out' exception

2019-09-22 Thread Ido Michael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935321#comment-16935321
 ] 

Ido Michael commented on SPARK-28001:
-

I can take a look.

Can you please also post the dataset?

Ido

> Dataframe throws 'socket.timeout: timed out' exception
> --
>
> Key: SPARK-28001
> URL: https://issues.apache.org/jira/browse/SPARK-28001
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
> Environment: Processor: Intel Core i7-7700 CPU @ 3.60Ghz
> RAM: 16 GB
> OS: Windows 10 Enterprise 64-bit
> Python: 3.7.2
> PySpark: 3.4.3
> Cluster manager: Spark Standalone
>Reporter: Marius Stanescu
>Priority: Critical
>
> I load data from Azure Table Storage, create a DataFrame, perform a couple of 
> operations via two user-defined functions, and then call show() to display the 
> results. If I load a very small batch of items, like 5, everything works fine, 
> but if I load a batch greater than 10 items from Azure Table Storage, I get the 
> 'socket.timeout: timed out' exception.
> Here is the code:
>  
> {code}
> import time
> import json
> import requests
> from requests.auth import HTTPBasicAuth
> from azure.cosmosdb.table.tableservice import TableService
> from azure.cosmosdb.table.models import Entity
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import udf, struct
> from pyspark.sql.types import BooleanType
> def main():
> batch_size = 25
> azure_table_account_name = '***'
> azure_table_account_key = '***'
> azure_table_name = '***'
> spark = SparkSession \
> .builder \
> .appName(agent_name) \
> .config("spark.sql.crossJoin.enabled", "true") \
> .getOrCreate()
> table_service = TableService(account_name=azure_table_account_name, 
> account_key=azure_table_account_key)
> continuation_token = None
> while True:
> messages = table_service.query_entities(
> azure_table_name,
> select="RowKey, PartitionKey, messageId, ownerSmtp, Timestamp",
> num_results=batch_size,
> marker=continuation_token,
> timeout=60)
> continuation_token = messages.next_marker
> messages_list = list(messages)
> 
> if not len(messages_list):
> time.sleep(5)
> pass
> 
> messages_df = spark.createDataFrame(messages_list)
> 
> register_records_df = messages_df \
> .withColumn('Registered', register_record('RowKey', 
> 'PartitionKey', 'messageId', 'ownerSmtp', 'Timestamp'))
> 
> only_registered_records_df = register_records_df \
> .filter(register_records_df.Registered == True) \
> .drop(register_records_df.Registered)
> 
> update_message_status_df = only_registered_records_df \
> .withColumn('TableEntryDeleted', delete_table_entity('RowKey', 
> 'PartitionKey'))
> 
> results_df = update_message_status_df.select(
> update_message_status_df.RowKey,
> update_message_status_df.PartitionKey,
> update_message_status_df.TableEntryDeleted)
> #results_df.explain()
> results_df.show(n=batch_size, truncate=False)
> @udf(returnType=BooleanType())
> def register_record(rowKey, partitionKey, messageId, ownerSmtp, timestamp):
>   # call an API
> try:
>   url = '{}/data/record/{}'.format('***', rowKey)
>   headers = { 'Content-type': 'application/json' }
>   response = requests.post(
>   url,
>   headers=headers,
>   auth=HTTPBasicAuth('***', '***'),
>   data=prepare_record_data(rowKey, partitionKey, 
> messageId, ownerSmtp, timestamp))
> 
> return bool(response)
> except:
> return False
> def prepare_record_data(rowKey, partitionKey, messageId, ownerSmtp, 
> timestamp):
> record_data = {
> "Title": messageId,
> "Type": '***',
> "Source": '***',
> "Creator": ownerSmtp,
> "Publisher": '***',
> "Date": timestamp.strftime('%Y-%m-%dT%H:%M:%SZ')
> }
> return json.dumps(record_data)
> @udf(returnType=BooleanType())
> def delete_table_entity(row_key, partition_key):
> azure_table_account_name = '***'
> azure_table_account_key = '***'
> azure_table_name = '***'
> try:
> table_service = TableService(account_name=azure_table_account_name, 
> account_key=azure_table_account_key)
> table_service.delete_entity(azure_table_name, partition_key, row_key)
> return True
> except:
> return False
> if __name__ == "__main__":
> main()
> {code}

[jira] [Commented] (SPARK-29197) Remove saveModeForDSV2 in DataFrameWriter

2019-09-22 Thread Ido Michael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935320#comment-16935320
 ] 

Ido Michael commented on SPARK-29197:
-

I can take it.

I could find the final class DataFrameWriter in spark-sql, but couldn't find 
saveModeForDSV2 in it. Where is it?

Do I need to remove all of the save modes?

Ido

> Remove saveModeForDSV2 in DataFrameWriter
> -
>
> Key: SPARK-29197
> URL: https://issues.apache.org/jira/browse/SPARK-29197
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> It is very confusing that the default save mode differs depending on the 
> internal implementation of a data source. The reason we had to have 
> saveModeForDSV2 was that there was no easy way to check the existence of a 
> table in DataSource V2. Now we have catalogs for that, so we should be able 
> to remove the different save modes.
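
A hedged sketch of the point about catalogs (the helper below is illustrative, not 
Spark code): with a TableCatalog available, table existence can be checked directly 
instead of being encoded in a DSv2-specific default save mode.

{code:scala}
import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog}

// Illustrative only: decide write behaviour from an explicit existence check
// rather than from a separate DSv2 default SaveMode.
def describeWrite(catalog: TableCatalog, ident: Identifier): String =
  if (catalog.tableExists(ident)) "table exists: append, overwrite, or fail per the user's mode"
  else "table missing: create it, or fail, per the user's mode"
{code}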



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29157) DataSourceV2: Add DataFrameWriterV2 to Python API

2019-09-22 Thread Ido Michael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935307#comment-16935307
 ] 

Ido Michael commented on SPARK-29157:
-

It looks like https://issues.apache.org/jira/browse/SPARK-28612 was resolved.

What is there left to do here?

Ido

> DataSourceV2: Add DataFrameWriterV2 to Python API
> -
>
> Key: SPARK-29157
> URL: https://issues.apache.org/jira/browse/SPARK-29157
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>
> After SPARK-28612 is committed, we need to add the corresponding PySpark API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29166) Add parameters to limit the number of dynamic partitions for data source table

2019-09-22 Thread Ido Michael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935306#comment-16935306
 ] 

Ido Michael commented on SPARK-29166:
-

I can take this if no one has started working on it.

Ido

> Add parameters to limit the number of dynamic partitions for data source table
> --
>
> Key: SPARK-29166
> URL: https://issues.apache.org/jira/browse/SPARK-29166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Dynamic partition inserts into Hive tables have restrictions that limit the max 
> number of partitions. See 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-DynamicPartitionInserts
> This is very useful for preventing mistaken partition columns (such as an ID 
> column) from creating huge numbers of partitions, and it protects the NameNode 
> from a flood of partition-creation RPC calls.
> Data source tables need a similar limitation.
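
A hedged sketch of what such limits could look like for data source tables, assuming 
an existing SparkSession `spark` and DataFrame `df`; the configuration keys below 
are hypothetical, not existing Spark options (Hive's analogues are 
hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode):

{code:scala}
// Hypothetical configuration keys, shown only to illustrate the proposal.
spark.conf.set("spark.sql.sources.maxDynamicPartitions", 1000)
spark.conf.set("spark.sql.sources.maxDynamicPartitionsPerTask", 100)

// An insert that would create more partitions than the limit should then fail fast
// instead of flooding the NameNode with create calls.
df.write.mode("overwrite").partitionBy("dt").saveAsTable("events")
{code}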



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28599) Support sorting for ThriftServerSessionPage

2019-09-22 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28599:

Description: Support sorting `Execution Time` and `Duration` columns for 
ThriftServerSessionPage  (was: SQLTab support pagination and sorting, but 
ThriftServerPage missing this feature.)

> Support sorting for ThriftServerSessionPage
> ---
>
> Key: SPARK-28599
> URL: https://issues.apache.org/jira/browse/SPARK-28599
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Support sorting `Execution Time` and `Duration` columns for 
> ThriftServerSessionPage



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29201) Add Hadoop 2.6 combination to GitHub Action

2019-09-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29201.
--
Fix Version/s: 2.4.5
   Resolution: Fixed

Issue resolved by pull request 25886
[https://github.com/apache/spark/pull/25886]

> Add Hadoop 2.6 combination to GitHub Action
> ---
>
> Key: SPARK-29201
> URL: https://issues.apache.org/jira/browse/SPARK-29201
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 2.4.5
>Reporter: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.5
>
>
> This adds a `Hadoop 2.6` combination to the `branch-2.4` GitHub Action.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28599) Support sorting for ThriftServerSessionPage

2019-09-22 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28599:

Summary: Support sorting for ThriftServerSessionPage  (was: Support 
pagination and sorting for ThriftServerPage)

> Support sorting for ThriftServerSessionPage
> ---
>
> Key: SPARK-28599
> URL: https://issues.apache.org/jira/browse/SPARK-28599
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> SQLTab supports pagination and sorting, but ThriftServerPage is missing this 
> feature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29203) Reduce shuffle partitions in SQLQueryTestSuite

2019-09-22 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29203:

Description: 
spark.sql.shuffle.partitions=200(default):
{noformat}
[info] - subquery/in-subquery/in-joins.sql (6 minutes, 19 seconds)
[info] - subquery/in-subquery/not-in-joins.sql (2 minutes, 17 seconds)
[info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (45 seconds, 
763 milliseconds)
{noformat}


spark.sql.shuffle.partitions=5:
{noformat}
[info] - subquery/in-subquery/in-joins.sql (1 minute, 12 seconds)
[info] - subquery/in-subquery/not-in-joins.sql (27 seconds, 541 milliseconds)
[info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (17 seconds, 
360 milliseconds)
{noformat}

> Reduce shuffle partitions in SQLQueryTestSuite
> --
>
> Key: SPARK-29203
> URL: https://issues.apache.org/jira/browse/SPARK-29203
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> spark.sql.shuffle.partitions=200(default):
> {noformat}
> [info] - subquery/in-subquery/in-joins.sql (6 minutes, 19 seconds)
> [info] - subquery/in-subquery/not-in-joins.sql (2 minutes, 17 seconds)
> [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (45 seconds, 
> 763 milliseconds)
> {noformat}
> spark.sql.shuffle.partitions=5:
> {noformat}
> [info] - subquery/in-subquery/in-joins.sql (1 minute, 12 seconds)
> [info] - subquery/in-subquery/not-in-joins.sql (27 seconds, 541 milliseconds)
> [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (17 seconds, 
> 360 milliseconds)
> {noformat}
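
For context, a minimal sketch (assuming a local test session; not the exact 
SQLQueryTestSuite change) of lowering the shuffle partition count:

{code:scala}
import org.apache.spark.sql.SparkSession

// Fewer shuffle partitions means far fewer tiny tasks per exchange, which is
// what makes the subquery test files above run several times faster.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sql-query-tests")
  .config("spark.sql.shuffle.partitions", "5")  // default is 200
  .getOrCreate()
{code}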



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29203) Reduce shuffle partitions in SQLQueryTestSuite

2019-09-22 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-29203:
---

 Summary: Reduce shuffle partitions in SQLQueryTestSuite
 Key: SPARK-29203
 URL: https://issues.apache.org/jira/browse/SPARK-29203
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28141) Date type can not accept special values

2019-09-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28141:


Assignee: Maxim Gekk

> Date type can not accept special values
> ---
>
> Key: SPARK-28141
> URL: https://issues.apache.org/jira/browse/SPARK-28141
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Maxim Gekk
>Priority: Major
>
> ||Input String||Valid Types||Description||
> |{{epoch}}|{{date}}|1970-01-01 00:00:00+00 (Unix system time zero)|
> |{{now}}|{{date}}|current transaction's start time|
> |{{today}}|{{date}}|midnight today|
> |{{tomorrow}}|{{date}}|midnight tomorrow|
> |{{yesterday}}|{{date}}|midnight yesterday|
> https://www.postgresql.org/docs/12/datatype-datetime.html
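
For illustration, a hedged example of the expected behaviour once the special values 
are accepted (assuming they are exposed through string-to-date casts as in 
PostgreSQL, and an existing SparkSession `spark`; actual dates depend on when the 
query runs and on the session time zone):

{code:scala}
// Expected shape of the result only; the values shown are examples.
spark.sql("SELECT CAST('epoch' AS DATE) AS epoch, CAST('today' AS DATE) AS today").show()
// +----------+----------+
// |     epoch|     today|
// +----------+----------+
// |1970-01-01|2019-09-22|
// +----------+----------+
{code}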



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28141) Date type can not accept special values

2019-09-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28141.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25708
[https://github.com/apache/spark/pull/25708]

> Date type can not accept special values
> ---
>
> Key: SPARK-28141
> URL: https://issues.apache.org/jira/browse/SPARK-28141
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> ||Input String||Valid Types||Description||
> |{{epoch}}|{{date}}|1970-01-01 00:00:00+00 (Unix system time zero)|
> |{{now}}|{{date}}|current transaction's start time|
> |{{today}}|{{date}}|midnight today|
> |{{tomorrow}}|{{date}}|midnight tomorrow|
> |{{yesterday}}|{{date}}|midnight yesterday|
> https://www.postgresql.org/docs/12/datatype-datetime.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29200) Optimize `extract`/`date_part` for epoch

2019-09-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29200.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25881
[https://github.com/apache/spark/pull/25881]

> Optimize `extract`/`date_part` for epoch
> 
>
> Key: SPARK-29200
> URL: https://issues.apache.org/jira/browse/SPARK-29200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> The method `DateTimeUtils.getEpoch()` can be sped up by avoiding decimal 
> operations and converting the time-zone-shifted timestamp to decimal only at 
> the end. Results of `ExtractBenchmark` should be regenerated as well.
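
A hedged sketch of the optimisation idea only (the helper below is hypothetical, not 
the actual DateTimeUtils.getEpoch code): keep the arithmetic in Long microseconds 
and build a single Decimal at the very end.

{code:scala}
import java.time.{Instant, ZoneId}
import org.apache.spark.sql.types.Decimal

// Hypothetical helper: shift the timestamp by the zone offset using Long maths,
// then convert to a Decimal number of seconds exactly once.
def epochSeconds(timestampMicros: Long, zoneId: ZoneId): Decimal = {
  val instant = Instant.EPOCH.plusNanos(timestampMicros * 1000L)
  val offsetMicros = zoneId.getRules.getOffset(instant).getTotalSeconds * 1000000L
  Decimal(timestampMicros + offsetMicros, 20, 6)  // unscaled micros, scale 6 => seconds
}
{code}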



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29200) Optimize `extract`/`date_part` for epoch

2019-09-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-29200:


Assignee: Maxim Gekk

> Optimize `extract`/`date_part` for epoch
> 
>
> Key: SPARK-29200
> URL: https://issues.apache.org/jira/browse/SPARK-29200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> The method `DateTimeUtils.getEpoch()` can be sped up by avoiding decimal 
> operations and converting the time-zone-shifted timestamp to decimal only at 
> the end. Results of `ExtractBenchmark` should be regenerated as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org