[jira] [Commented] (SPARK-28759) Upgrade scala-maven-plugin to 4.2.0
[ https://issues.apache.org/jira/browse/SPARK-28759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920028#comment-16920028 ] Hyukjin Kwon commented on SPARK-28759: -- (let me move this back under SPARK-24417, since 4.2.0 now has the https://github.com/davidB/scala-maven-plugin/pull/358 fix) > Upgrade scala-maven-plugin to 4.2.0 > --- > > Key: SPARK-28759 > URL: https://issues.apache.org/jira/browse/SPARK-28759 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28759) Upgrade scala-maven-plugin to 4.2.0
[ https://issues.apache.org/jira/browse/SPARK-28759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28759: - Parent: SPARK-24417 Issue Type: Sub-task (was: Improvement) > Upgrade scala-maven-plugin to 4.2.0 > --- > > Key: SPARK-28759 > URL: https://issues.apache.org/jira/browse/SPARK-28759 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > >
[jira] [Comment Edited] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920018#comment-16920018 ] Jungtaek Lim edited comment on SPARK-28025 at 8/31/19 4:30 AM: --- FYI, I just submitted a patch for HADOOP-16255. Hope we can get rid of workaround sooner. was (Author: kabhwan): FYI, I just submitted a patch for HADOOP-16255--. Hope we can get rid of workaround sooner. > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. 
Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat} >
[jira] [Comment Edited] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920018#comment-16920018 ] Jungtaek Lim edited comment on SPARK-28025 at 8/31/19 4:30 AM: --- FYI, I just submitted a patch for HADOOP-16255--. Hope we can get rid of workaround sooner. was (Author: kabhwan): FYI, I just submitted a patch for HADOOP-16225. Hope we can get rid of workaround sooner. > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. 
Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat} >
[jira] [Commented] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920018#comment-16920018 ] Jungtaek Lim commented on SPARK-28025: -- FYI, I just submitted a patch for HADOOP-16225. Hope we can get rid of workaround sooner. > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. 
Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat} >
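The quoted description counts the leaked files with `find`. As a cross-check, the same tally can be sketched in plain Python; the directory layout and file names below are invented for illustration and are not the actual checkpoint structure.

```python
import os
import tempfile

def count_crc_files(root):
    """Walk a checkpoint/state-store directory and count Hadoop .crc
    sidecar files versus all files (the Python analogue of the two
    `find` commands quoted in the issue)."""
    total = crc = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += 1
            if name.endswith(".crc"):
                crc += 1
    return total, crc

# Build a tiny fake checkpoint tree (hypothetical layout) and count it.
base = tempfile.mkdtemp()
state_dir = os.path.join(base, "state", "0")
os.makedirs(state_dir)
for name in ("1.delta", ".1.delta.crc", "2.delta", ".2.delta.crc"):
    open(os.path.join(state_dir, name), "w").close()

print(count_crc_files(base))  # (4, 2): four files, two of them .crc
```

With a ratio like 418053 out of 431796 in the report, nearly every file on the volume is a checksum sidecar, which matches the leak described.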
[jira] [Commented] (SPARK-28770) Flaky Tests: Test ReplayListenerSuite.End-to-end replay with compression failed
[ https://issues.apache.org/jira/browse/SPARK-28770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920009#comment-16920009 ] zhao bo commented on SPARK-28770: - Thanks Lim. Yeah, we also found that most test jobs pass these tests. For us, it's hard to say whether this issue truly exists on x86, but on ARM it fails every time. Hope the team can look into what happened. What we did on ARM was just revert the commit [https://github.com/apache/spark/pull/23767], and then everything passed. > Flaky Tests: Test ReplayListenerSuite.End-to-end replay with compression > failed > --- > > Key: SPARK-28770 > URL: https://issues.apache.org/jira/browse/SPARK-28770 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 2.4.3 > Environment: Community jenkins and our arm testing instance. >Reporter: huangtianhua >Priority: Major > > Test > org.apache.spark.scheduler.ReplayListenerSuite.End-to-end replay with > compression fails; see > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2/267/testReport/junit/org.apache.spark.scheduler/ReplayListenerSuite/End_to_end_replay_with_compression/] > > The test also fails on our ARM instance. I sent an email to spark-dev > before; we suspect it is related to the commit > [https://github.com/apache/spark/pull/23767]. We tried reverting it and the > tests passed: > ReplayListenerSuite: > - ... > - End-to-end replay *** FAILED *** > "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622) > - End-to-end replay with compression *** FAILED *** > "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622) > > Not sure what's wrong; hope someone can help figure it out, thanks very > much.
[jira] [Commented] (SPARK-28937) Improve error reporting in Spark Secrets Test Suite
[ https://issues.apache.org/jira/browse/SPARK-28937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919991#comment-16919991 ] holdenk commented on SPARK-28937: - I'm working on this. > Improve error reporting in Spark Secrets Test Suite > --- > > Key: SPARK-28937 > URL: https://issues.apache.org/jira/browse/SPARK-28937 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.0.0 >Reporter: holdenk >Assignee: holdenk >Priority: Trivial > > Right now most of the checks in the Secrets test suite are done inside an > eventually condition, meaning that when they fail, they fail with a last exception > saying they cannot connect to the pod; this can mask the actual failure.
[jira] [Created] (SPARK-28937) Improve error reporting in Spark Secrets Test Suite
holdenk created SPARK-28937: --- Summary: Improve error reporting in Spark Secrets Test Suite Key: SPARK-28937 URL: https://issues.apache.org/jira/browse/SPARK-28937 Project: Spark Issue Type: Improvement Components: Kubernetes, Tests Affects Versions: 3.0.0 Reporter: holdenk Assignee: holdenk Right now most of the checks in the Secrets test suite are done inside an eventually condition, meaning that when they fail, they fail with a last exception saying they cannot connect to the pod; this can mask the actual failure.
[jira] [Commented] (SPARK-28936) Simplify Spark K8s tests by replacing race condition during command execution
[ https://issues.apache.org/jira/browse/SPARK-28936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919990#comment-16919990 ] holdenk commented on SPARK-28936: - I'm working on this. > Simplify Spark K8s tests by replacing race condition during command execution > - > > Key: SPARK-28936 > URL: https://issues.apache.org/jira/browse/SPARK-28936 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.0.0 >Reporter: holdenk >Assignee: holdenk >Priority: Major > > Currently our command execution for Spark Kubernetes integration tests > depends on a Thread.sleep that sometimes doesn't wait long enough. This > normally doesn't show up because we automatically retry the commands > inside an eventually, but on some machines it may result in flaky tests.
[jira] [Created] (SPARK-28936) Simplify Spark K8s tests by replacing race condition during command execution
holdenk created SPARK-28936: --- Summary: Simplify Spark K8s tests by replacing race condition during command execution Key: SPARK-28936 URL: https://issues.apache.org/jira/browse/SPARK-28936 Project: Spark Issue Type: Improvement Components: Kubernetes, Tests Affects Versions: 3.0.0 Reporter: holdenk Assignee: holdenk Currently our command execution for Spark Kubernetes integration tests depends on a Thread.sleep that sometimes doesn't wait long enough. This normally doesn't show up because we automatically retry the commands inside an eventually, but on some machines it may result in flaky tests.
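The change described above (waiting on a condition with a bounded retry rather than a single fixed Thread.sleep) can be sketched in plain Python; the real integration tests are in Scala, and `wait_until` is a hypothetical helper name, not Spark code.

```python
import time

def wait_until(condition, timeout=5.0, interval=0.1):
    """Poll `condition` until it returns True or `timeout` seconds pass.
    Unlike one fixed sleep, this returns as soon as the condition holds
    and only gives up after the whole deadline has elapsed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()  # one final check at the deadline

# Simulate command output that becomes ready after a short delay.
ready_at = time.monotonic() + 0.3
assert wait_until(lambda: time.monotonic() >= ready_at, timeout=2.0)
```

The key property is that slow machines get the full timeout while fast machines are not forced to wait it out, which removes the race without slowing the suite down.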
[jira] [Comment Edited] (SPARK-28770) Flaky Tests: Test ReplayListenerSuite.End-to-end replay with compression failed
[ https://issues.apache.org/jira/browse/SPARK-28770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919982#comment-16919982 ] Jungtaek Lim edited comment on SPARK-28770 at 8/31/19 12:44 AM: Just hit again. [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109965/testReport] > we suspect there is something related with the commit >[https://github.com/apache/spark/pull/23767], we tried to revert it and the >tests are passed: It's not occurred frequently, so you may need to run at least 100 times to make sure reverting would help. was (Author: kabhwan): Just hit again. [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109965/testReport] > we suspect there is something related with the commit >[https://github.com/apache/spark/pull/23767], we tried to revert it and the >tests are passed: It's not occurred frequently, so you may need to run 100 times to make sure reverting would help. > Flaky Tests: Test ReplayListenerSuite.End-to-end replay with compression > failed > --- > > Key: SPARK-28770 > URL: https://issues.apache.org/jira/browse/SPARK-28770 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 2.4.3 > Environment: Community jenkins and our arm testing instance. >Reporter: huangtianhua >Priority: Major > > Test > org.apache.spark.scheduler.ReplayListenerSuite.End-to-end replay with > compression is failed see > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2/267/testReport/junit/org.apache.spark.scheduler/ReplayListenerSuite/End_to_end_replay_with_compression/] > > And also the test is failed on arm instance, I sent email to spark-dev > before, and we suspect there is something related with the commit > [https://github.com/apache/spark/pull/23767], we tried to revert it and the > tests are passed: > ReplayListenerSuite: > - ... 
> - End-to-end replay *** FAILED *** > "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622) > - End-to-end replay with compression *** FAILED *** > "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622) > > Not sure what's wrong, hope someone can help to figure it out, thanks very > much.
[jira] [Commented] (SPARK-28770) Flaky Tests: Test ReplayListenerSuite.End-to-end replay with compression failed
[ https://issues.apache.org/jira/browse/SPARK-28770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919982#comment-16919982 ] Jungtaek Lim commented on SPARK-28770: -- Just hit again. [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109965/testReport] > we suspect there is something related with the commit >[https://github.com/apache/spark/pull/23767], we tried to revert it and the >tests are passed: It doesn't occur frequently, so you may need to run the suite 100 times to make sure reverting would help. > Flaky Tests: Test ReplayListenerSuite.End-to-end replay with compression > failed > --- > > Key: SPARK-28770 > URL: https://issues.apache.org/jira/browse/SPARK-28770 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 2.4.3 > Environment: Community jenkins and our arm testing instance. >Reporter: huangtianhua >Priority: Major > > Test > org.apache.spark.scheduler.ReplayListenerSuite.End-to-end replay with > compression fails; see > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2/267/testReport/junit/org.apache.spark.scheduler/ReplayListenerSuite/End_to_end_replay_with_compression/] > > The test also fails on our ARM instance. I sent an email to spark-dev > before; we suspect it is related to the commit > [https://github.com/apache/spark/pull/23767]. We tried reverting it and the > tests passed: > ReplayListenerSuite: > - ... > - End-to-end replay *** FAILED *** > "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622) > - End-to-end replay with compression *** FAILED *** > "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622) > > Not sure what's wrong; hope someone can help figure it out, thanks very > much.
[jira] [Commented] (SPARK-28935) Document SQL metrics for Details for Query Plan
[ https://issues.apache.org/jira/browse/SPARK-28935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919975#comment-16919975 ] Liang-Chi Hsieh commented on SPARK-28935: - Thanks for pinging me! I will look into this. > Document SQL metrics for Details for Query Plan > --- > > Key: SPARK-28935 > URL: https://issues.apache.org/jira/browse/SPARK-28935 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > [https://github.com/apache/spark/pull/25349] shows the query plans but it > does not describe the meaning of each metric in the plan. For end users, they > might not understand the meaning of the metrics we output. > > !https://user-images.githubusercontent.com/7322292/62421634-9d9c4980-b6d7-11e9-8e31-1e6ba9b402e8.png!
[jira] [Resolved] (SPARK-28926) CLONE - ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh resolved SPARK-28926. - Resolution: Duplicate I think this is a duplicate of SPARK-28927. > CLONE - ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for > datasets with 12 billion instances > > > Key: SPARK-28926 > URL: https://issues.apache.org/jira/browse/SPARK-28926 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Qiang Wang >Assignee: Xiangrui Meng >Priority: Major > > The stack trace is below: > {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 > BlockManager: Block rdd_10916_493 could not be removed as it was not found on > disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for > task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) > java.lang.ArrayIndexOutOfBoundsException: 6741 at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) > at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at >
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at scala.collection.immutable.List.foreach(List.scala:381) at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at > org.apache.spark.scheduler.Task.run(Task.scala:108) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > {quote} > This exception happened sometimes. And we also found that the AUC metric was > not stable when evaluating the inner product of the user factors and the item > factors with the same dataset and configuration. AUC varied from 0.60 to 0.67 > which was not stable for production environment. > Dataset capacity: ~12 billion ratings > Here is the our code: > {code:java} > val hivedata = sc.sql(sqltext).select(id,dpid,score).coalesce(numPartitions) > val predataItem = hivedata.rdd.map(r=>(r._1._1,(r._1._2,r._2.sum))) > .groupByKey().zipWithIndex() > .persist(StorageLevel.MEMORY_AND_DISK_SER) > val predataUser = > predataItem.flatMap(r=>r._1._2.map(y=>(y._1,(r._2.toInt,y._2 >
[jira] [Commented] (SPARK-28935) Document SQL metrics for Details for Query Plan
[ https://issues.apache.org/jira/browse/SPARK-28935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919972#comment-16919972 ] Xiao Li commented on SPARK-28935: - cc [~viirya] Are you interested in this? You added a few metrics before. Maybe you are the best person to deliver this. > Document SQL metrics for Details for Query Plan > --- > > Key: SPARK-28935 > URL: https://issues.apache.org/jira/browse/SPARK-28935 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > [https://github.com/apache/spark/pull/25349] shows the query plans but it > does not describe the meaning of each metric in the plan. For end users, they > might not understand the meaning of the metrics we output.
[jira] [Updated] (SPARK-28935) Document SQL metrics for Details for Query Plan
[ https://issues.apache.org/jira/browse/SPARK-28935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28935: Description: [https://github.com/apache/spark/pull/25349] shows the query plans but it does not describe the meaning of each metric in the plan. For end users, they might not understand the meaning of the metrics we output. !https://user-images.githubusercontent.com/7322292/62421634-9d9c4980-b6d7-11e9-8e31-1e6ba9b402e8.png! was:[https://github.com/apache/spark/pull/25349] shows the query plans but it does not describe the meaning of each metric in the plan. For end users, they might not understand the meaning of the metrics we output. > Document SQL metrics for Details for Query Plan > --- > > Key: SPARK-28935 > URL: https://issues.apache.org/jira/browse/SPARK-28935 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > [https://github.com/apache/spark/pull/25349] shows the query plans but it > does not describe the meaning of each metric in the plan. For end users, they > might not understand the meaning of the metrics we output. > > !https://user-images.githubusercontent.com/7322292/62421634-9d9c4980-b6d7-11e9-8e31-1e6ba9b402e8.png!
[jira] [Created] (SPARK-28935) Document SQL metrics for Details for Query Plan
Xiao Li created SPARK-28935: --- Summary: Document SQL metrics for Details for Query Plan Key: SPARK-28935 URL: https://issues.apache.org/jira/browse/SPARK-28935 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 3.0.0 Reporter: Xiao Li [https://github.com/apache/spark/pull/25349] shows the query plans but it does not describe the meaning of each metric in the plan. For end users, they might not understand the meaning of the metrics we output.
[jira] [Updated] (SPARK-28934) Add `spark.sql.compatiblity.mode`
[ https://issues.apache.org/jira/browse/SPARK-28934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28934: -- Reporter: Xiao Li (was: Dongjoon Hyun) > Add `spark.sql.compatiblity.mode` > - > > Key: SPARK-28934 > URL: https://issues.apache.org/jira/browse/SPARK-28934 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > This issue aims to add `spark.sql.compatiblity.mode` whose values are `spark` > or `pgSQL` case-insensitively to control PostgreSQL compatibility features. > > Apache Spark 3.0.0 can start with `spark.sql.parser.ansi.enabled=false` and > `spark.sql.compatiblity.mode=spark`.
[jira] [Created] (SPARK-28934) Add `spark.sql.compatiblity.mode`
Dongjoon Hyun created SPARK-28934: - Summary: Add `spark.sql.compatiblity.mode` Key: SPARK-28934 URL: https://issues.apache.org/jira/browse/SPARK-28934 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Dongjoon Hyun This issue aims to add `spark.sql.compatiblity.mode` whose values are `spark` or `pgSQL` case-insensitively to control PostgreSQL compatibility features. Apache Spark 3.0.0 can start with `spark.sql.parser.ansi.enabled=false` and `spark.sql.compatiblity.mode=spark`.
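As a rough illustration of the case-insensitive values proposed above, here is a hedged Python sketch of how such a conf value might be normalized; `parse_compatibility_mode` is a made-up helper, not Spark code, and the ticket only specifies the two values.

```python
def parse_compatibility_mode(value):
    """Normalize a proposed `spark.sql.compatiblity.mode` value
    case-insensitively, accepting only `spark` or `pgSQL`.
    (Hypothetical helper; only the two values from the ticket exist.)"""
    canonical = {"spark": "spark", "pgsql": "pgSQL"}
    try:
        return canonical[value.strip().lower()]
    except KeyError:
        raise ValueError(f"invalid compatibility mode: {value!r}")

print(parse_compatibility_mode("PgSQL"))  # pgSQL
print(parse_compatibility_mode("SPARK"))  # spark
```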
[jira] [Assigned] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors
[ https://issues.apache.org/jira/browse/SPARK-28933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh reassigned SPARK-28933: --- Assignee: Liang-Chi Hsieh > Reduce unnecessary shuffle in ALS when initializing factors > --- > > Key: SPARK-28933 > URL: https://issues.apache.org/jira/browse/SPARK-28933 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > > When initializing factors in ALS, we should use {{mapPartitions}} instead of > the current {{map}}, so we can preserve the existing partitioning of the RDD of > {{InBlock}}. The RDD of {{InBlock}} is already partitioned by src block id, > and we don't change the partitioning when initializing factors.
[jira] [Created] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors
Liang-Chi Hsieh created SPARK-28933: --- Summary: Reduce unnecessary shuffle in ALS when initializing factors Key: SPARK-28933 URL: https://issues.apache.org/jira/browse/SPARK-28933 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh When initializing factors in ALS, we should use {{mapPartitions}} instead of the current {{map}}, so we can preserve the existing partitioning of the RDD of {{InBlock}}. The RDD of {{InBlock}} is already partitioned by src block id, and we don't change the partitioning when initializing factors.
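The partitioning argument above can be modelled outside Spark. In this toy Python sketch (names are illustrative; the real change is in ALS's Scala code), an "RDD" is just a list of partitions, and a per-partition transform keeps every record in the partition it came from, which is why the existing partitioning survives and no shuffle is needed.

```python
import random

def map_partitions(partitions, f):
    """Apply f to every element of each partition, leaving each result in
    the partition it came from. No record crosses a partition boundary,
    so the existing partitioning is preserved."""
    return [[f(x) for x in part] for part in partitions]

# Three partitions of (src block id, ratings) pairs, already keyed by block.
in_blocks = [[(0, "a"), (0, "b")], [(1, "c")], [(2, "d"), (2, "e")]]

rng = random.Random(42)

def init_factor(kv):
    # Toy stand-in for ALS's random factor initialization.
    return (kv[0], [rng.random()])

factors = map_partitions(in_blocks, init_factor)

def partition_keys(rdd):
    return [[k for k, _ in part] for part in rdd]

# Every factor stays in the partition of its source block.
assert partition_keys(factors) == partition_keys(in_blocks)
```

A plain element-wise `map` would produce the same values, but Spark cannot know the keys were untouched, so a later join by block id would trigger a shuffle; `mapPartitions` with partition preservation avoids that.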
[jira] [Updated] (SPARK-28932) Maven install fails on JDK11
[ https://issues.apache.org/jira/browse/SPARK-28932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28932: -- Component/s: (was: Spark Core) Build > Maven install fails on JDK11 > > > Key: SPARK-28932 > URL: https://issues.apache.org/jira/browse/SPARK-28932 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Blocker > > {code} > mvn clean install -pl common/network-common -DskipTests > error: fatal error: object scala in compiler mirror not found. > one error found > [INFO] > > [INFO] BUILD FAILURE > [INFO] > > {code}
[jira] [Assigned] (SPARK-28932) Maven install fails on JDK11
[ https://issues.apache.org/jira/browse/SPARK-28932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28932: - Target Version/s: 3.0.0 Assignee: Dongjoon Hyun > Maven install fails on JDK11 > > > Key: SPARK-28932 > URL: https://issues.apache.org/jira/browse/SPARK-28932 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Blocker > > {code} > mvn clean install -pl common/network-common -DskipTests > error: fatal error: object scala in compiler mirror not found. > one error found > [INFO] > > [INFO] BUILD FAILURE > [INFO] > > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28804) Document DESCRIBE QUERY in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-28804. - Fix Version/s: 3.0.0 Assignee: Dilip Biswal Resolution: Fixed > Document DESCRIBE QUERY in SQL Reference. > - > > Key: SPARK-28804 > URL: https://issues.apache.org/jira/browse/SPARK-28804 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: Dilip Biswal >Assignee: Dilip Biswal >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26046) Add a way for StreamingQueryManager to remove all listeners
[ https://issues.apache.org/jira/browse/SPARK-26046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mukul Murthy updated SPARK-26046: - Description: StreamingQueryManager should have a way to clear out all listeners. There's addListener(listener) and removeListener(listener), but not removeAllListeners. We should expose a new method -removeAllListeners() that calls listenerBus.removeAllListeners (added here: [https://github.com/apache/spark/commit/9690eba16efe6d25261934d8b73a221972b684f3])- listListeners() that can be used to remove listeners. (was: StreamingQueryManager should have a way to clear out all listeners. There's addListener(listener) and removeListener(listener), but not removeAllListeners. We should expose a new method removeAllListeners() that calls listenerBus.removeAllListeners (added here: [https://github.com/apache/spark/commit/9690eba16efe6d25261934d8b73a221972b684f3]). ) > Add a way for StreamingQueryManager to remove all listeners > --- > > Key: SPARK-26046 > URL: https://issues.apache.org/jira/browse/SPARK-26046 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Mukul Murthy >Priority: Major > > StreamingQueryManager should have a way to clear out all listeners. There's > addListener(listener) and removeListener(listener), but not > removeAllListeners. We should expose a new method -removeAllListeners() that > calls listenerBus.removeAllListeners (added here: > [https://github.com/apache/spark/commit/9690eba16efe6d25261934d8b73a221972b684f3])- > listListeners() that can be used to remove listeners. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28932) Maven install fails on JDK11
Dongjoon Hyun created SPARK-28932: - Summary: Maven install fails on JDK11 Key: SPARK-28932 URL: https://issues.apache.org/jira/browse/SPARK-28932 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Dongjoon Hyun {code} mvn clean install -pl common/network-common -DskipTests error: fatal error: object scala in compiler mirror not found. one error found [INFO] [INFO] BUILD FAILURE [INFO] {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-26046) Add a way for StreamingQueryManager to remove all listeners
[ https://issues.apache.org/jira/browse/SPARK-26046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mukul Murthy reopened SPARK-26046: -- From some other discussions I've had, I actually think it's reasonable to have a way to remove all listeners. I don't think it should be a removeAllListeners API, as originally discussed, but StreamingQueryManager could have a listListeners API which the caller could then choose to use to remove each listener manually. > Add a way for StreamingQueryManager to remove all listeners > --- > > Key: SPARK-26046 > URL: https://issues.apache.org/jira/browse/SPARK-26046 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Mukul Murthy >Priority: Major > > StreamingQueryManager should have a way to clear out all listeners. There's > addListener(listener) and removeListener(listener), but not > removeAllListeners. We should expose a new method removeAllListeners() that > calls listenerBus.removeAllListeners (added here: > [https://github.com/apache/spark/commit/9690eba16efe6d25261934d8b73a221972b684f3]). > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
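The caller-side "remove all" flow proposed in the comment above can be sketched as follows. This is a toy Python model of the proposed API, not Spark's actual StreamingQueryManager; the method names are illustrative.

```python
# Hypothetical sketch of the proposed design: instead of a dedicated
# removeAllListeners(), the manager exposes list_listeners() and the
# caller removes each listener itself.

class StreamingQueryManagerSketch:
    def __init__(self):
        self._listeners = []

    def add_listener(self, listener):
        self._listeners.append(listener)

    def remove_listener(self, listener):
        self._listeners.remove(listener)

    def list_listeners(self):
        # Return a copy so callers can iterate while removing.
        return list(self._listeners)

mgr = StreamingQueryManagerSketch()
mgr.add_listener("metrics-listener")
mgr.add_listener("alerting-listener")

# Caller-side "remove all": list, then remove one by one.
for listener in mgr.list_listeners():
    mgr.remove_listener(listener)

assert mgr.list_listeners() == []
```

Returning a copy from `list_listeners()` is the key design detail: it lets the caller mutate the manager's listener set while iterating without hitting concurrent-modification issues.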
[jira] [Resolved] (SPARK-28894) Jenkins does not report test results of SQLQueryTestSuite in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-28894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28894. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25630 [https://github.com/apache/spark/pull/25630] > Jenkins does not report test results of SQLQueryTestSuite in Jenkins > > > Key: SPARK-28894 > URL: https://issues.apache.org/jira/browse/SPARK-28894 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109834/testReport/junit/org.apache.spark.sql/SQLQueryTestSuite/ > We don't know which file has an error before reading the logs. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28894) Jenkins does not report test results of SQLQueryTestSuite in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-28894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28894: - Assignee: Hyukjin Kwon > Jenkins does not report test results of SQLQueryTestSuite in Jenkins > > > Key: SPARK-28894 > URL: https://issues.apache.org/jira/browse/SPARK-28894 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109834/testReport/junit/org.apache.spark.sql/SQLQueryTestSuite/ > We don't know which file has an error before reading the logs. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated SPARK-28921: --- Affects Version/s: 2.3.3 > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10) > - > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.3 >Reporter: Paul Schweigert >Priority: Critical > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE : > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 4.4.2 would solve this issue. 
-- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28925) Update Kubernetes-client to 4.4.2 to be compatible with Kubernetes 1.13 and 1.14
[ https://issues.apache.org/jira/browse/SPARK-28925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919907#comment-16919907 ] Andy Grove edited comment on SPARK-28925 at 8/30/19 9:54 PM: - This also impacts Spark 2.3.3 on EKS 1.11 due to security patches that were rolled out in the past week. {code:java} Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.10-eks-7f15cc", GitCommit:"7f15ccb4e58f112866f7ddcfebf563f199558488", GitTreeState:"clean", BuildDate:"2019-08-19T17:46:02Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"} {code} was (Author: andygrove): This also impacts Spark 2.3.3 > Update Kubernetes-client to 4.4.2 to be compatible with Kubernetes 1.13 and > 1.14 > > > Key: SPARK-28925 > URL: https://issues.apache.org/jira/browse/SPARK-28925 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.3 >Reporter: Eric >Priority: Minor > > Hello, > If you use Spark with Kubernetes 1.13 or 1.14 you will see this error: > {code:java} > {"time": "2019-08-28T09:56:11.866Z", "lvl":"INFO", "logger": > "org.apache.spark.internal.Logging", > "thread":"kubernetes-executor-snapshots-subscribers-0","msg":"Going to > request 1 executors from Kubernetes."} > {"time": "2019-08-28T09:56:12.028Z", "lvl":"WARN", "logger": > "io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2", > "thread":"OkHttp https://kubernetes.default.svc/...","msg":"Exec Failure: > HTTP 403, Status: 403 - "} > java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' > {code} > Apparently the bug is fixed here: > [https://github.com/fabric8io/kubernetes-client/pull/1669] > We have currently compiled Spark source code with Kubernetes-client 4.4.2 and > it's working great on our cluster. We are using Kubernetes 1.13.10. > > Could it be possible to update that dependency version? > > Thanks! 
-- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28925) Update Kubernetes-client to 4.4.2 to be compatible with Kubernetes 1.13 and 1.14
[ https://issues.apache.org/jira/browse/SPARK-28925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919907#comment-16919907 ] Andy Grove edited comment on SPARK-28925 at 8/30/19 9:55 PM: - This also impacts Spark 2.3.3 on EKS 1.11 due to security patches that were rolled out in the past week. {code:java} Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.10-eks-7f15cc", GitCommit:"7f15ccb4e58f112866f7ddcfebf563f199558488", GitTreeState:"clean", BuildDate:"2019-08-19T17:46:02Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"} {code} I experimented with replacing {{kubernetes-client.jar}} with version 4.4.2 and it did resolve this issue, but caused other issues, so isn't a real option for a workaround for my use case. was (Author: andygrove): This also impacts Spark 2.3.3 on EKS 1.11 due to security patches that were rolled out in the past week. {code:java} Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.10-eks-7f15cc", GitCommit:"7f15ccb4e58f112866f7ddcfebf563f199558488", GitTreeState:"clean", BuildDate:"2019-08-19T17:46:02Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"} {code} > Update Kubernetes-client to 4.4.2 to be compatible with Kubernetes 1.13 and > 1.14 > > > Key: SPARK-28925 > URL: https://issues.apache.org/jira/browse/SPARK-28925 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.3 >Reporter: Eric >Priority: Minor > > Hello, > If you use Spark with Kubernetes 1.13 or 1.14 you will see this error: > {code:java} > {"time": "2019-08-28T09:56:11.866Z", "lvl":"INFO", "logger": > "org.apache.spark.internal.Logging", > "thread":"kubernetes-executor-snapshots-subscribers-0","msg":"Going to > request 1 executors from Kubernetes."} > {"time": "2019-08-28T09:56:12.028Z", "lvl":"WARN", "logger": > "io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2", > "thread":"OkHttp https://kubernetes.default.svc/...","msg":"Exec 
Failure: > HTTP 403, Status: 403 - "} > java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' > {code} > Apparently the bug is fixed here: > [https://github.com/fabric8io/kubernetes-client/pull/1669] > We have currently compiled Spark source code with Kubernetes-client 4.4.2 and > it's working great on our cluster. We are using Kubernetes 1.13.10. > > Could it be possible to update that dependency version? > > Thanks! -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28925) Update Kubernetes-client to 4.4.2 to be compatible with Kubernetes 1.13 and 1.14
[ https://issues.apache.org/jira/browse/SPARK-28925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated SPARK-28925: --- Affects Version/s: 2.3.3 > Update Kubernetes-client to 4.4.2 to be compatible with Kubernetes 1.13 and > 1.14 > > > Key: SPARK-28925 > URL: https://issues.apache.org/jira/browse/SPARK-28925 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.3 >Reporter: Eric >Priority: Minor > > Hello, > If you use Spark with Kubernetes 1.13 or 1.14 you will see this error: > {code:java} > {"time": "2019-08-28T09:56:11.866Z", "lvl":"INFO", "logger": > "org.apache.spark.internal.Logging", > "thread":"kubernetes-executor-snapshots-subscribers-0","msg":"Going to > request 1 executors from Kubernetes."} > {"time": "2019-08-28T09:56:12.028Z", "lvl":"WARN", "logger": > "io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2", > "thread":"OkHttp https://kubernetes.default.svc/...","msg":"Exec Failure: > HTTP 403, Status: 403 - "} > java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' > {code} > Apparently the bug is fixed here: > [https://github.com/fabric8io/kubernetes-client/pull/1669] > We have currently compiled Spark source code with Kubernetes-client 4.4.2 and > it's working great on our cluster. We are using Kubernetes 1.13.10. > > Could it be possible to update that dependency version? > > Thanks! -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28925) Update Kubernetes-client to 4.4.2 to be compatible with Kubernetes 1.13 and 1.14
[ https://issues.apache.org/jira/browse/SPARK-28925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919907#comment-16919907 ] Andy Grove commented on SPARK-28925: This also impacts Spark 2.3.3 > Update Kubernetes-client to 4.4.2 to be compatible with Kubernetes 1.13 and > 1.14 > > > Key: SPARK-28925 > URL: https://issues.apache.org/jira/browse/SPARK-28925 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Eric >Priority: Minor > > Hello, > If you use Spark with Kubernetes 1.13 or 1.14 you will see this error: > {code:java} > {"time": "2019-08-28T09:56:11.866Z", "lvl":"INFO", "logger": > "org.apache.spark.internal.Logging", > "thread":"kubernetes-executor-snapshots-subscribers-0","msg":"Going to > request 1 executors from Kubernetes."} > {"time": "2019-08-28T09:56:12.028Z", "lvl":"WARN", "logger": > "io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2", > "thread":"OkHttp https://kubernetes.default.svc/...","msg":"Exec Failure: > HTTP 403, Status: 403 - "} > java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' > {code} > Apparently the bug is fixed here: > [https://github.com/fabric8io/kubernetes-client/pull/1669] > We have currently compiled Spark source code with Kubernetes-client 4.4.2 and > it's working great on our cluster. We are using Kubernetes 1.13.10. > > Could it be possible to update that dependency version? > > Thanks! -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21181) Suppress memory leak errors reported by netty
[ https://issues.apache.org/jira/browse/SPARK-21181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919903#comment-16919903 ] Thangamani Murugasamy commented on SPARK-21181: --- I have the same problem in Spark 2.3: ERROR util.ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. See http://netty.io/wiki/reference-counted-objects.html for more information. Recent access records: [Stage 0:===> (25 + 5) / 30]19/08/30 16:39:07 ERROR datasources.FileFormatWriter: Aborting job null. java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:136) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:144) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:140) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > Suppress memory leak errors reported by netty > - > > Key: SPARK-21181 > URL: https://issues.apache.org/jira/browse/SPARK-21181 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.0 >Reporter: Dhruve Ashar >Assignee: Dhruve Ashar >Priority: Minor > Fix For: 2.1.2, 2.2.0, 2.3.0 > > > We are seeing netty report memory leak errors like the one below after > switching to 2.1. > {code} > ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not called before > it's garbage-collected. Enable advanced leak reporting to find out where the > leak occurred. 
To enable advanced leak reporting, specify the JVM option > '-Dio.netty.leakDetection.level=advanced' or call > ResourceLeakDetector.setLevel() See > http://netty.io/wiki/reference-counted-objects.html for more information. > {code} > Looking a bit deeper, Spark is not leaking any memory here, but it is > confusing for the user to see the error message in the driver logs. > After enabling, '-Dio.netty.leakDetection.level=advanced', netty reveals the > SparkSaslServer to be the source of these leaks. > Sample trace :https://gist.github.com/dhruve/b299ebc35aa0a185c244a0468927daf1 -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
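For users who want the advanced leak reporting mentioned above, one way to pass the JVM option to a Spark application is via Spark's standard `extraJavaOptions` settings. The application class and jar below are placeholders; adjust them to your job.

```shell
# Enable netty's advanced leak reporting on both driver and executors.
# com.example.MyApp and my-app.jar are illustrative placeholders.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dio.netty.leakDetection.level=advanced" \
  --conf "spark.executor.extraJavaOptions=-Dio.netty.leakDetection.level=advanced" \
  --class com.example.MyApp my-app.jar
```

With this level set, netty records recent access points for leaked buffers, which is what revealed SparkSaslServer as the source in the trace linked in the ticket.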
[jira] [Commented] (SPARK-28891) do-release-docker.sh in master does not work for branch-2.3
[ https://issues.apache.org/jira/browse/SPARK-28891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919896#comment-16919896 ] Dongjoon Hyun commented on SPARK-28891: --- Since 2.3.4 vote passed, this is merged to `branch-2.3` as a final commit. `branch-2.3` is now locked since it becomes EOL. > do-release-docker.sh in master does not work for branch-2.3 > --- > > Key: SPARK-28891 > URL: https://issues.apache.org/jira/browse/SPARK-28891 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.3.4 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.3.4 > > > According to [~maropu], > [do-release-docker.sh|https://github.com/apache/spark/blob/master/dev/create-release/do-release-docker.sh] > in master branch worked for 2.3.3 release for branch-2.3. > After updates in [this PR|https://github.com/apache/spark/pull/23098], > {{do-release-docker.sh}} does not work for branch-2.3 now as shown: > {code} > ... > Checked out revision 35358. > Copying release tarballs > cp: cannot stat 'pyspark-*': No such file or directory > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28891) do-release-docker.sh in master does not work for branch-2.3
[ https://issues.apache.org/jira/browse/SPARK-28891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28891: - Assignee: Kazuaki Ishizaki > do-release-docker.sh in master does not work for branch-2.3 > --- > > Key: SPARK-28891 > URL: https://issues.apache.org/jira/browse/SPARK-28891 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.3.4 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > > According to [~maropu], > [do-release-docker.sh|https://github.com/apache/spark/blob/master/dev/create-release/do-release-docker.sh] > in master branch worked for 2.3.3 release for branch-2.3. > After updates in [this PR|https://github.com/apache/spark/pull/23098], > {{do-release-docker.sh}} does not work for branch-2.3 now as shown: > {code} > ... > Checked out revision 35358. > Copying release tarballs > cp: cannot stat 'pyspark-*': No such file or directory > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28891) do-release-docker.sh in master does not work for branch-2.3
[ https://issues.apache.org/jira/browse/SPARK-28891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28891. --- Fix Version/s: 2.3.4 Resolution: Fixed Issue resolved by pull request 25607 [https://github.com/apache/spark/pull/25607] > do-release-docker.sh in master does not work for branch-2.3 > --- > > Key: SPARK-28891 > URL: https://issues.apache.org/jira/browse/SPARK-28891 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.3.4 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.3.4 > > > According to [~maropu], > [do-release-docker.sh|https://github.com/apache/spark/blob/master/dev/create-release/do-release-docker.sh] > in master branch worked for 2.3.3 release for branch-2.3. > After updates in [this PR|https://github.com/apache/spark/pull/23098], > {{do-release-docker.sh}} does not work for branch-2.3 now as shown: > {code} > ... > Checked out revision 35358. > Copying release tarballs > cp: cannot stat 'pyspark-*': No such file or directory > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27931) Accept 'on' and 'off' as input for boolean data type
[ https://issues.apache.org/jira/browse/SPARK-27931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-27931: - Assignee: YoungGyu Chun > Accept 'on' and 'off' as input for boolean data type > > > Key: SPARK-27931 > URL: https://issues.apache.org/jira/browse/SPARK-27931 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: YoungGyu Chun >Priority: Major > > This ticket contains three things: > 1. Accept 'on' and 'off' as input for boolean data type > {code:sql} > SELECT cast('no' as boolean) AS false; > SELECT cast('off' as boolean) AS false; > {code} > 2. Accept unique prefixes thereof: > {code:sql} > SELECT cast('of' as boolean) AS false; > SELECT cast('fal' as boolean) AS false; > {code} > 3. Trim the string when cast to boolean type > {code:sql} > SELECT cast('true ' as boolean) AS true; > SELECT cast(' FALSE' as boolean) AS true; > {code} > More details: > [https://www.postgresql.org/docs/devel/datatype-boolean.html] > > [https://github.com/postgres/postgres/blob/REL_12_BETA1/src/backend/utils/adt/bool.c#L25] > > [https://github.com/postgres/postgres/commit/05a7db05826c5eb68173b6d7ef1553c19322ef48] > > [https://github.com/postgres/postgres/commit/9729c9360886bee7feddc6a1124b0742de4b9f3d] > Other DBs: > [http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html] > [https://my.vertica.com/docs/5.0/HTML/Master/2983.htm] > > [https://github.com/prestosql/presto/blob/b845cd66da3eb1fcece50efba83ea12bc40afbaa/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L108-L138] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27931) Accept 'on' and 'off' as input for boolean data type
[ https://issues.apache.org/jira/browse/SPARK-27931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27931. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25458 [https://github.com/apache/spark/pull/25458] > Accept 'on' and 'off' as input for boolean data type > > > Key: SPARK-27931 > URL: https://issues.apache.org/jira/browse/SPARK-27931 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: YoungGyu Chun >Priority: Major > Fix For: 3.0.0 > > > This ticket contains three things: > 1. Accept 'on' and 'off' as input for boolean data type > {code:sql} > SELECT cast('no' as boolean) AS false; > SELECT cast('off' as boolean) AS false; > {code} > 2. Accept unique prefixes thereof: > {code:sql} > SELECT cast('of' as boolean) AS false; > SELECT cast('fal' as boolean) AS false; > {code} > 3. Trim the string when cast to boolean type > {code:sql} > SELECT cast('true ' as boolean) AS true; > SELECT cast(' FALSE' as boolean) AS true; > {code} > More details: > [https://www.postgresql.org/docs/devel/datatype-boolean.html] > > [https://github.com/postgres/postgres/blob/REL_12_BETA1/src/backend/utils/adt/bool.c#L25] > > [https://github.com/postgres/postgres/commit/05a7db05826c5eb68173b6d7ef1553c19322ef48] > > [https://github.com/postgres/postgres/commit/9729c9360886bee7feddc6a1124b0742de4b9f3d] > Other DBs: > [http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html] > [https://my.vertica.com/docs/5.0/HTML/Master/2983.htm] > > [https://github.com/prestosql/presto/blob/b845cd66da3eb1fcece50efba83ea12bc40afbaa/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L108-L138] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
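The three rules in the ticket (accept on/off, accept unique prefixes, trim whitespace) can be sketched in Python. This is an illustrative re-implementation of the PostgreSQL-style parsing the ticket links to, not Spark's actual cast code.

```python
# Hypothetical sketch of PostgreSQL-style boolean parsing:
# trim, lowercase, then accept t/f/1/0 and unambiguous prefixes of
# the word forms (true/yes/on vs false/no/off).

TRUE_WORDS = ("true", "yes", "on")
FALSE_WORDS = ("false", "no", "off")

def parse_bool(s):
    v = s.strip().lower()
    if not v:
        return None
    if v == "1":
        return True
    if v == "0":
        return False
    true_hits = [w for w in TRUE_WORDS if w.startswith(v)]
    false_hits = [w for w in FALSE_WORDS if w.startswith(v)]
    # Accept only unambiguous prefixes: "o" matches both "on" and "off",
    # so it is rejected, while "of" uniquely matches "off".
    if true_hits and not false_hits:
        return True
    if false_hits and not true_hits:
        return False
    return None  # invalid or ambiguous input -> null, as in a lenient cast

assert parse_bool("off") is False
assert parse_bool("of") is False      # unique prefix of "off"
assert parse_bool("o") is None        # ambiguous prefix
assert parse_bool(" FALSE") is False  # trimmed and case-insensitive
```

Note how the prefix rule makes `cast('fal' as boolean)` valid (unique prefix of "false") while `cast('o' as boolean)` stays invalid, matching the PostgreSQL behavior the ticket cites.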
[jira] [Commented] (SPARK-28906) `bin/spark-submit --version` shows incorrect info
[ https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919893#comment-16919893 ] Kazuaki Ishizaki commented on SPARK-28906: -- For user name, we have to pass {{USER}} environment variable to the docker container at the end of {{do-release-docker.sh}}. I created a patch to fix this. For other information to be got by {{git}} command, {{spark-build-info}} script is not executed at the wrong directory (i.e. out of the cloned directory). My guess is the command is executed under the work directory. I did not creat a patch yet. > `bin/spark-submit --version` shows incorrect info > - > > Key: SPARK-28906 > URL: https://issues.apache.org/jira/browse/SPARK-28906 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, > 3.0.0, 2.4.3 >Reporter: Marcelo Vanzin >Priority: Minor > Attachments: image-2019-08-29-05-50-13-526.png > > > Since Spark 2.3.1, `spark-submit` shows a wrong information. > {code} > $ bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.3 > /_/ > Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222 > Branch > Compiled by user on 2019-02-04T13:00:46Z > Revision > Url > Type --help for more information. > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919869#comment-16919869 ] Dongjoon Hyun commented on SPARK-28921: --- Could you make a PR with your test case, [~psschwei]? > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10) > - > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Paul Schweigert >Priority: Critical > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE : > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 4.4.2 would solve this issue. 
-- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28889) Allow UDTs to define custom casting behavior
[ https://issues.apache.org/jira/browse/SPARK-28889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919822#comment-16919822 ] Zachary S Ennenga edited comment on SPARK-28889 at 8/30/19 6:43 PM: While I understand if the spark team is not particularly interested in solving this problem themselves at this time, I'm more concerned with understanding if this is in line with the eventual solution to UDTs and datasets. If it is, I'm about halfway through the PR as is, and I'm happy to complete it. If it's not, I'm curious what the plan is, and if it's represented in Jira, I'd love to know what tickets so I can follow along. was (Author: zennenga): While I understand if the spark team is not particularly interested in solving this problem themselves at this time, I'm more concerned with understanding if this is in line with the eventual solution to UDTs and datasets. If it is, I'm about halfway through the PR as is, and I'm happy to complete it. > Allow UDTs to define custom casting behavior > > > Key: SPARK-28889 > URL: https://issues.apache.org/jira/browse/SPARK-28889 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Zachary S Ennenga >Priority: Minor > > Looking at `org.apache.spark.sql.catalyst.expressions.Cast`, UDTs do not > support any sort of casting except for identity casts, IE: > {code:java} > case (udt1: UserDefinedType[_], udt2: UserDefinedType[_]) if udt1.userClass > == udt2.userClass => > true > {code} > I propose we add an additional piece of functionality here to allow UDTs to > define their own canCast and cast functions to allow users to define their > own cast mechanisms. 
> An example of how this might look: > {code:java} > case (fromType, toType: UserDefinedType[_]) => > toType.canCast(fromType) // Returns boolean > {code} > {code:java} > case (fromType, toType: UserDefinedType[_]) => > toType.cast(fromType) // Returns Casting function > {code} > The UDT base class would contain a default implementation that replicates > current behavior (IE no casting). -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
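To make the proposal concrete, here is a minimal sketch using stand-in types rather than Spark's actual `Cast` or `UserDefinedType` classes. The `canCast`/`cast` members are the proposed (not yet existing) additions, and `LocalDateUDT` is a purely hypothetical example type:

```scala
// Stand-in type hierarchy; NOT Spark's org.apache.spark.sql.types classes.
trait DataType
case object StringType extends DataType
case object IntegerType extends DataType

// Stand-in for UserDefinedType with the proposed overridable hooks.
abstract class UserDefinedTypeSketch extends DataType {
  // Default implementation replicates current behavior: no casting allowed.
  def canCast(from: DataType): Boolean = false
  def cast(from: DataType): Any => Any =
    throw new UnsupportedOperationException(s"no cast defined from $from")
}

// Hypothetical UDT that opts in to casting from StringType,
// e.g. to support dataframe.as[ComplexType] over a string column.
class LocalDateUDT extends UserDefinedTypeSketch {
  override def canCast(from: DataType): Boolean = from == StringType
  override def cast(from: DataType): Any => Any = {
    case s: String => java.time.LocalDate.parse(s)
  }
}
```

With this shape, `Cast` would only need the two dispatch cases shown in the proposal; everything type-specific lives on the UDT.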
[jira] [Commented] (SPARK-28889) Allow UDTs to define custom casting behavior
[ https://issues.apache.org/jira/browse/SPARK-28889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919822#comment-16919822 ] Zachary S Ennenga commented on SPARK-28889: --- While I understand if the spark team is not particularly interested in solving this problem themselves at this time, I'm more concerned with understanding if this is in line with the eventual solution to UDTs and datasets. If it is, I'm about halfway through the PR as is, and I'm happy to complete it. > Allow UDTs to define custom casting behavior > > > Key: SPARK-28889 > URL: https://issues.apache.org/jira/browse/SPARK-28889 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Zachary S Ennenga >Priority: Minor > > Looking at `org.apache.spark.sql.catalyst.expressions.Cast`, UDTs do not > support any sort of casting except for identity casts, IE: > {code:java} > case (udt1: UserDefinedType[_], udt2: UserDefinedType[_]) if udt1.userClass > == udt2.userClass => > true > {code} > I propose we add an additional piece of functionality here to allow UDTs to > define their own canCast and cast functions to allow users to define their > own cast mechanisms. > An example of how this might look: > {code:java} > case (fromType, toType: UserDefinedType[_]) => > toType.canCast(fromType) // Returns boolean > {code} > {code:java} > case (fromType, toType: UserDefinedType[_]) => > toType.cast(fromType) // Returns Casting function > {code} > The UDT base class would contain a default implementation that replicates > current behavior (IE no casting). -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28889) Allow UDTs to define custom casting behavior
[ https://issues.apache.org/jira/browse/SPARK-28889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919819#comment-16919819 ] Zachary S Ennenga commented on SPARK-28889: --- Based on https://issues.apache.org/jira/browse/SPARK-7768 it seems the intent is to make it public again, though it has been pushed back a few times for reasons that aren't really discussed in the ticket. Is there another solution for defining custom encoders for types within datasets before that ticket is set to be completed? If there isn't, and the intent is to solve that problem via UDTs, this enhancement seems useful to solve a specific set of problems, specifically, for automatically transforming simple types in Hive (i.e. string) to complex types (LocalDate) in datasets by using dataframe.as[ComplexType]. > Allow UDTs to define custom casting behavior > > > Key: SPARK-28889 > URL: https://issues.apache.org/jira/browse/SPARK-28889 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Zachary S Ennenga >Priority: Minor > > Looking at `org.apache.spark.sql.catalyst.expressions.Cast`, UDTs do not > support any sort of casting except for identity casts, IE: > {code:java} > case (udt1: UserDefinedType[_], udt2: UserDefinedType[_]) if udt1.userClass > == udt2.userClass => > true > {code} > I propose we add an additional piece of functionality here to allow UDTs to > define their own canCast and cast functions to allow users to define their > own cast mechanisms. > An example of how this might look: > {code:java} > case (fromType, toType: UserDefinedType[_]) => > toType.canCast(fromType) // Returns boolean > {code} > {code:java} > case (fromType, toType: UserDefinedType[_]) => > toType.cast(fromType) // Returns Casting function > {code} > The UDT base class would contain a default implementation that replicates > current behavior (IE no casting). 
-- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Schweigert updated SPARK-28921: Comment: was deleted (was: Possible duplicate of https://issues.apache.org/jira/browse/SPARK-28925) > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10) > - > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Paul Schweigert >Priority: Critical > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE : > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 4.4.2 would solve this issue. 
-- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Schweigert updated SPARK-28921: Comment: was deleted (was: Longer-term solution will be to upgrade the version of the kubernetes-client : [https://github.com/fabric8io/kubernetes-client/pull/1669]) > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10) > - > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Paul Schweigert >Priority: Critical > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE : > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 4.4.2 would solve this issue. 
-- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28930) Spark DESC FORMATTED TABLENAME information display issues
[ https://issues.apache.org/jira/browse/SPARK-28930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28930: -- Description: Spark DESC FORMATTED TABLENAME information display issues.Showing incorrect *Last Access time and* feeling some information displays can make it better. Test steps: 1. Open spark sql 2. Create table with partition CREATE EXTERNAL TABLE IF NOT EXISTS employees_info_extended ( id INT, name STRING, usd_flag STRING, salary DOUBLE, deductions MAP, address STRING ) PARTITIONED BY (entrytime STRING) STORED AS TEXTFILE location 'hdfs://hacluster/user/sparkhive/warehouse'; 3. from spark sql check the table description desc formatted tablename; 4. From scala shell check the table description sql("desc formatted tablename").show() *Issue1:* If there is no comment for spark scala shell shows *"null" in small letters* but all other places Hive beeline/Spark beeline/Spark SQL it is showing in *CAPITAL "NULL*". Better to show same in all places. {code} *scala>* sql("desc formatted employees_info_extended").show(false); +-+---++--- |col_name|data_type|*comment*| +-+---++--- |id|int|*null*| |name|string|*null*| |usd_flag|string|*null*| |salary|double|*null*| |deductions|map|*null*| |address|string|null| |entrytime|string|null| | # Partition Information| | | | # col_name|data_type|comment| |entrytime|string|null| | | | | | # Detailed Table Information| | | |Database|sparkdb__| | |Table|employees_info_extended| | |Owner|root| | *|Created Time |Tue Aug 20 13:42:06 CST 2019| |* *|Last Access |Thu Jan 01 08:00:00 CST 1970| |* |Created By|Spark 2.4.3| | |Type|EXTERNAL| | |Provider|hive| | +-+---++--- only showing top 20 rows *scala>* {code} *Issue 2:* Spark SQL "desc formatted tablename" is not showing the header [# col_name,data_type,comment|#col_name,data_type,comment] in the top of the query result.But header is showing on top of partition description. 
For Better understanding show the header on Top of the query result. {code} *spark-sql>* desc formatted employees_info_extended1; id int *NULL* name string *NULL* usd_flag string NULL salary double NULL deductions map NULL address string NULL entrytime string NULL * ## Partition Information* ## col_name data_type comment* entrytime string *NULL* # Detailed Table Information Database sparkdb__ Table employees_info_extended1 Owner spark *Created Time Tue Aug 20 14:50:37 CST 2019* *Last Access Thu Jan 01 08:00:00 CST 1970* Created By Spark 2.3.2.0201 Type EXTERNAL Provider hive Table Properties [transient_lastDdlTime=1566286655] Location hdfs://hacluster/user/sparkhive/warehouse Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.TextInputFormat OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Storage Properties [serialization.format=1] Partition Provider Catalog Time taken: 0.477 seconds, Fetched 27 row(s) *spark-sql>* {code} *Issue 3:* I created the table on Aug 20.So it is showing created time correct .*But Last access time showing 1970 Jan 01*. It is not good to show Last access time earlier time than the created time.Better to show the correct date and time else show UNKNOWN. *[Created Time,Tue Aug 20 13:42:06 CST 2019,]* *[Last Access,Thu Jan 01 08:00:00 CST 1970,]* was: Spark DESC FORMATTED TABLENAME information display issues.Showing incorrect *Last Access time and* feeling some information displays can make it better. Test steps: 1. Open spark sql 2. Create table with partition CREATE EXTERNAL TABLE IF NOT EXISTS employees_info_extended ( id INT, name STRING, usd_flag STRING, salary DOUBLE, deductions MAP, address STRING ) PARTITIONED BY (entrytime STRING) STORED AS TEXTFILE location 'hdfs://hacluster/user/sparkhive/warehouse'; 3. from spark sql check the table description desc formatted tablename; 4. 
From scala shell check the table description sql("desc formatted tablename").show() *Issue1:* If there is no comment for spark scala shell shows *"null" in small letters* but all other places Hive beeline/Spark beeline/Spark SQL it is showing in *CAPITAL "NULL*". Better to show same in all places. *scala>* sql("desc formatted employees_info_extended").show(false); +-+---++--- |col_name|data_type|*comment*| +-+---++--- |id|int|*null*| |name|string|*null*| |usd_flag|string|*null*| |salary|double|*null*| |deductions|map|*null*| |address|string|null| |entrytime|string|null| | # Partition Information| | | | # col_name|data_type|comment| |entrytime|string|null| | | | | | # Detailed Table Information| | | |Database|sparkdb__|
[jira] [Updated] (SPARK-28930) Spark DESC FORMATTED TABLENAME information display issues
[ https://issues.apache.org/jira/browse/SPARK-28930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28930: -- Component/s: (was: Spark Shell) > Spark DESC FORMATTED TABLENAME information display issues > - > > Key: SPARK-28930 > URL: https://issues.apache.org/jira/browse/SPARK-28930 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: jobit mathew >Priority: Minor > > Spark DESC FORMATTED TABLENAME information display issues.Showing incorrect > *Last Access time and* feeling some information displays can make it better. > Test steps: > 1. Open spark sql > 2. Create table with partition > CREATE EXTERNAL TABLE IF NOT EXISTS employees_info_extended ( id INT, name > STRING, usd_flag STRING, salary DOUBLE, deductions MAP, > address STRING ) PARTITIONED BY (entrytime STRING) STORED AS TEXTFILE > location 'hdfs://hacluster/user/sparkhive/warehouse'; > 3. from spark sql check the table description > desc formatted tablename; > 4. From scala shell check the table description > sql("desc formatted tablename").show() > *Issue1:* > If there is no comment for spark scala shell shows *"null" in small letters* > but all other places Hive beeline/Spark beeline/Spark SQL it is showing in > *CAPITAL "NULL*". Better to show same in all places. 
> > {code} > *scala>* sql("desc formatted employees_info_extended").show(false); > +-+---++--- > |col_name|data_type|*comment*| > +-+---++--- > |id|int|*null*| > |name|string|*null*| > |usd_flag|string|*null*| > |salary|double|*null*| > |deductions|map|*null*| > |address|string|null| > |entrytime|string|null| > | # Partition Information| | | > | # col_name|data_type|comment| > |entrytime|string|null| > | | | | > | # Detailed Table Information| | | > |Database|sparkdb__| | > |Table|employees_info_extended| | > |Owner|root| | > *|Created Time |Tue Aug 20 13:42:06 CST 2019| |* > *|Last Access |Thu Jan 01 08:00:00 CST 1970| |* > |Created By|Spark 2.4.3| | > |Type|EXTERNAL| | > |Provider|hive| | > +-+---++--- > only showing top 20 rows > *scala>* > {code} > *Issue 2:* > Spark SQL "desc formatted tablename" is not showing the header [# > col_name,data_type,comment|#col_name,data_type,comment] in the top of the > query result.But header is showing on top of partition description. For > Better understanding show the header on Top of the query result. 
> {code} > *spark-sql>* desc formatted employees_info_extended1; > id int *NULL* > name string *NULL* > usd_flag string NULL > salary double NULL > deductions map NULL > address string NULL > entrytime string NULL > * > ## Partition Information* > ## col_name data_type comment* > entrytime string *NULL* > # Detailed Table Information > Database sparkdb__ > Table employees_info_extended1 > Owner spark > *Created Time Tue Aug 20 14:50:37 CST 2019* > *Last Access Thu Jan 01 08:00:00 CST 1970* > Created By Spark 2.3.2.0201 > Type EXTERNAL > Provider hive > Table Properties [transient_lastDdlTime=1566286655] > Location hdfs://hacluster/user/sparkhive/warehouse > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.TextInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Storage Properties [serialization.format=1] > Partition Provider Catalog > Time taken: 0.477 seconds, Fetched 27 row(s) > *spark-sql>* > {code} > > *Issue 3:* > I created the table on Aug 20.So it is showing created time correct .*But > Last access time showing 1970 Jan 01*. It is not good to show Last access > time earlier time than the created time.Better to show the correct date and > time else show UNKNOWN. > *[Created Time,Tue Aug 20 13:42:06 CST 2019,]* > *[Last Access,Thu Jan 01 08:00:00 CST 1970,]* -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
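A minimal sketch of how the display issues described above could be normalized. The helper names `formatComment` and `formatLastAccess` are illustrative only, not Spark's actual code:

```scala
// Issue 1: render a missing column comment as a consistent "NULL"
// everywhere (Hive beeline, Spark beeline, Spark SQL, scala shell).
def formatComment(comment: Option[String]): String =
  comment.getOrElse("NULL")

// Issue 3: a stored last-access time of 0 ms is just the epoch
// placeholder, which prints as "Thu Jan 01 ... 1970"; showing UNKNOWN
// is clearer than a time earlier than the table's creation time.
def formatLastAccess(lastAccessMs: Long): String =
  if (lastAccessMs <= 0) "UNKNOWN"
  else java.time.Instant.ofEpochMilli(lastAccessMs).toString
```

Issue 2 (the missing header row) is a pure output-layout change and has no equivalent one-liner; it would be addressed where the `DESC FORMATTED` rows are emitted.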
[jira] [Assigned] (SPARK-28571) Shuffle storage API: Use API in SortShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-28571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-28571: -- Assignee: Matt Cheah > Shuffle storage API: Use API in SortShuffleWriter > - > > Key: SPARK-28571 > URL: https://issues.apache.org/jira/browse/SPARK-28571 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Matt Cheah >Assignee: Matt Cheah >Priority: Major > > Use the APIs introduced in SPARK-28209 in the SortShuffleWriter. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28571) Shuffle storage API: Use API in SortShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-28571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-28571. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25342 [https://github.com/apache/spark/pull/25342] > Shuffle storage API: Use API in SortShuffleWriter > - > > Key: SPARK-28571 > URL: https://issues.apache.org/jira/browse/SPARK-28571 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Matt Cheah >Assignee: Matt Cheah >Priority: Major > Fix For: 3.0.0 > > > Use the APIs introduced in SPARK-28209 in the SortShuffleWriter. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28759) Upgrade scala-maven-plugin to 4.2.0
[ https://issues.apache.org/jira/browse/SPARK-28759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28759. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25633 [https://github.com/apache/spark/pull/25633] > Upgrade scala-maven-plugin to 4.2.0 > --- > > Key: SPARK-28759 > URL: https://issues.apache.org/jira/browse/SPARK-28759 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28759) Upgrade scala-maven-plugin to 4.2.0
[ https://issues.apache.org/jira/browse/SPARK-28759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28759: - Assignee: Hyukjin Kwon > Upgrade scala-maven-plugin to 4.2.0 > --- > > Key: SPARK-28759 > URL: https://issues.apache.org/jira/browse/SPARK-28759 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28866) Persist item factors RDD when checkpointing in ALS
[ https://issues.apache.org/jira/browse/SPARK-28866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-28866. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25576 [https://github.com/apache/spark/pull/25576] > Persist item factors RDD when checkpointing in ALS > -- > > Key: SPARK-28866 > URL: https://issues.apache.org/jira/browse/SPARK-28866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Minor > Fix For: 3.0.0 > > > In ALS ML implementation, if `implicitPrefs` is false, we checkpoint the RDD > of item factors, between intervals. Before checkpointing and materializing > RDD, this RDD was not persisted. It causes recomputation. In an experiment, > there is performance difference between persisting and no persisting before > checkpointing the RDD. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28866) Persist item factors RDD when checkpointing in ALS
[ https://issues.apache.org/jira/browse/SPARK-28866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-28866: - Assignee: Liang-Chi Hsieh > Persist item factors RDD when checkpointing in ALS > -- > > Key: SPARK-28866 > URL: https://issues.apache.org/jira/browse/SPARK-28866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Minor > > In ALS ML implementation, if `implicitPrefs` is false, we checkpoint the RDD > of item factors, between intervals. Before checkpointing and materializing > RDD, this RDD was not persisted. It causes recomputation. In an experiment, > there is performance difference between persisting and no persisting before > checkpointing the RDD. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
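The persist-before-checkpoint reasoning behind SPARK-28866 can be illustrated without a cluster: checkpointing materializes the lineage in addition to the subsequent action, so an unpersisted dataset is computed more than once. This toy stand-in (not Spark's RDD API) makes the recomputation visible:

```scala
// Minimal analogy for an RDD lineage: compute() is the expensive lineage,
// persist() caches its result, materialize() plays the role of an action
// or a checkpoint write.
class LazyData[T](compute: () => T) {
  var computeCount = 0                  // how many times the lineage ran
  private var cached: Option[T] = None

  def persist(): this.type = {
    if (cached.isEmpty) { computeCount += 1; cached = Some(compute()) }
    this
  }

  // Without persist(), every materialization recomputes from scratch.
  def materialize(): T =
    cached.getOrElse { computeCount += 1; compute() }
}
```

In the ALS fix, persisting the item-factor RDD before checkpointing ensures the expensive factor computation runs once instead of once per materialization.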
[jira] [Commented] (SPARK-28900) Test Pyspark, SparkR on JDK 11 with run-tests
[ https://issues.apache.org/jira/browse/SPARK-28900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919697#comment-16919697 ] Josh Rosen commented on SPARK-28900: Some quick notes / braindump (I may write more later): * AFAIK _release_ publishing / artifact signing hasn't taken place on Jenkins for a while now (I'm not sure if we're still doing snapshot publishing there, though). Given this, we should delete unused publishing builders and their associated credentials (which I think have been rotated anyways). I'm _pretty_ sure this is technically feasible, but it's been a long time since I've last investigated. If we delete the publishing builders then it should be fairly straightforward to dump a snapshot of the JJB scripts into a public repo (sans-git-history, perhaps). * We should consider removing code / builders for old branches which will never be patched (such as {{branch-1.6}}). This may simplify the build scripts. * Strong +1 from me towards using Dockerized build container: a standard Docker environment would let us remove most of the legacy build cruft. ** IIRC Dockerization of these builds in AMPLab Jenkins was historically blocked by the old version of CentOS's Docker support: the Docker daemon would lock up / freeze if launching many PR builder jobs in parallel. This should be fixed for the newer Ubuntu hosts, though. ** Alternatively, eventually porting all of this to Bazel and sourcing all languages' dependencies and toolchains from there instead from the local environment would sidestep a lot of these problems. * I think we may already have some mechanism which builds conda environments / virtualenvs for the PySpark packaging tests? Maybe that could be used for the regular PySpark tests as well? 
> Test Pyspark, SparkR on JDK 11 with run-tests > - > > Key: SPARK-28900 > URL: https://issues.apache.org/jira/browse/SPARK-28900 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Sean Owen >Priority: Major > > Right now, we are testing JDK 11 with a Maven-based build, as in > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2/ > It looks like _all_ of the Maven-based jobs 'manually' build and invoke > tests, and only run tests via Maven -- that is, they do not run Pyspark or > SparkR tests. The SBT-based builds do, because they use the {{dev/run-tests}} > script that is meant to be for this purpose. > In fact, there seem to be a couple flavors of copy-pasted build configs. SBT > builds look like: > {code} > #!/bin/bash > set -e > # Configure per-build-executor Ivy caches to avoid SBT Ivy lock contention > export HOME="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER" > mkdir -p "$HOME" > export SBT_OPTS="-Duser.home=$HOME -Dsbt.ivy.home=$HOME/.ivy2" > export SPARK_VERSIONS_SUITE_IVY_PATH="$HOME/.ivy2" > # Add a pre-downloaded version of Maven to the path so that we avoid the > flaky download step. > export > PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH" > git clean -fdx > ./dev/run-tests > {code} > Maven builds looks like: > {code} > #!/bin/bash > set -x > set -e > rm -rf ./work > git clean -fdx > # Generate random point for Zinc > export ZINC_PORT > ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)") > # Use per-build-executor Ivy caches to avoid SBT Ivy lock contention: > export > SPARK_VERSIONS_SUITE_IVY_PATH="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER/.ivy2" > mkdir -p "$SPARK_VERSIONS_SUITE_IVY_PATH" > # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental > compiler seems to > # ignore our JAVA_HOME and use the system javac instead. 
> export PATH="$JAVA_HOME/bin:$PATH" > # Add a pre-downloaded version of Maven to the path so that we avoid the > flaky download step. > export > PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH" > MVN="build/mvn -DzincPort=$ZINC_PORT" > set +e > if [[ $HADOOP_PROFILE == hadoop-1 ]]; then > # Note that there is no -Pyarn flag here for Hadoop 1: > $MVN \ > -DskipTests \ > -P"$HADOOP_PROFILE" \ > -Dhadoop.version="$HADOOP_VERSION" \ > -Phive \ > -Phive-thriftserver \ > -Pkinesis-asl \ > -Pmesos \ > clean package > retcode1=$? > $MVN \ > -P"$HADOOP_PROFILE" \ > -Dhadoop.version="$HADOOP_VERSION" \ > -Phive \ > -Phive-thriftserver \ > -Pkinesis-asl \ > -Pmesos \ > --fail-at-end \ > test > retcode2=$? > else > $MVN \ > -DskipTests \ > -P"$HADOOP_PROFILE" \ > -Pyarn \ > -Phive \ > -Phive-thriftserver \
[jira] [Updated] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Schweigert updated SPARK-28921: Description: Spark jobs are failing on latest versions of Kubernetes when jobs attempt to provision executor pods (jobs like Spark-Pi that do not launch executors run without a problem): Here's an example error message: {code:java} 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes. 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: HTTP 403, Status: 403 - java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} Looks like the issue is caused by fixes for a recent CVE : CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] Looks like upgrading kubernetes-client to 4.4.2 would solve this issue. was: Spark jobs are failing on latest versions of Kubernetes when jobs attempt to provision executor pods (jobs like Spark-Pi that do not launch executors run without a problem): Here's an example error message: {code:java} 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes. 
19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: HTTP 403, Status: 403 - java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} Looks like the issue is caused by the internal master Kubernetes url not having the port specified: [https://github.com/apache/spark/blob/master//resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L82:7] Using the master with the port (443) seems to fix the problem. > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10) > - > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Paul Schweigert >Priority: Critical > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. 
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE : > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 4.4.2 would solve this issue. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (SPARK-28931) Fix couple of bugs in FsHistoryProviderSuite
Jungtaek Lim created SPARK-28931: Summary: Fix couple of bugs in FsHistoryProviderSuite Key: SPARK-28931 URL: https://issues.apache.org/jira/browse/SPARK-28931 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Jungtaek Lim There are a couple of bugs in FsHistoryProviderSuite itself. # When creating a log file via {{newLogFile}}, the codec is ignored, leading to a wrong file name. (No one tends to write tests for test code, and the bug doesn't affect existing tests, so it is not easy to catch.) # When writing events to a log file via {{writeFile}}, the metadata (in the case of the new format) gets written to the file regardless of its codec, and its content is then overwritten by another stream, so no Spark version information is available. This affects an existing test, hence the wrong expected value used to work around the bug. Note that these are bugs in test code only; non-test code works fine.
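Bug (1) above says {{newLogFile}} drops the codec when building the file name. A minimal hedged sketch of what a codec-aware name could look like — the method name and shape follow the issue text, but the exact signature and naming scheme in FsHistoryProviderSuite are assumptions here, not taken from the source:

{code:java}
// Sketch only: append the codec's short name as an extension when one is
// given, so the test helper produces the same names the provider expects.
// newLogFile / codec are the identifiers mentioned in the issue; the
// signature is hypothetical.
def newLogFile(appId: String, codec: Option[String]): String =
  codec.map(c => s"$appId.$c").getOrElse(appId)
{code}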
[jira] [Updated] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Schweigert updated SPARK-28921: Priority: Critical (was: Minor) > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10) > - > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Paul Schweigert >Priority: Critical > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by the internal master Kubernetes url not > having the port specified: > [https://github.com/apache/spark/blob/master//resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L82:7] > > Using the master with the port (443) seems to fix the problem. 
[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919608#comment-16919608 ] Paul Schweigert commented on SPARK-28921: - Possible duplicate of https://issues.apache.org/jira/browse/SPARK-28925 > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10) > - > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Paul Schweigert >Priority: Minor > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by the internal master Kubernetes url not > having the port specified: > [https://github.com/apache/spark/blob/master//resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L82:7] > > Using the master with the port (443) seems to fix the problem. 
[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919596#comment-16919596 ] Paul Schweigert commented on SPARK-28921: - Longer-term solution will be to upgrade the version of the kubernetes-client : [https://github.com/fabric8io/kubernetes-client/pull/1669] > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10) > - > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Paul Schweigert >Priority: Minor > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by the internal master Kubernetes url not > having the port specified: > [https://github.com/apache/spark/blob/master//resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L82:7] > > Using the 
master with the port (443) seems to fix the problem.
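The comments above describe two remedies: upgrade kubernetes-client, or, as a workaround, give the in-cluster master URL an explicit port. A hedged sketch of the workaround — the constant lives at the Constants.scala location linked above, and 443 is the conventional HTTPS API-server port assumed here, not a value confirmed by the thread:

{code:java}
// Sketch of the workaround only, not the upstream fix: spell out the port
// on the internal master URL so the fabric8/okhttp watch request is not
// rejected with HTTP 403 on patched Kubernetes versions.
val KUBERNETES_MASTER_INTERNAL_URL = "https://kubernetes.default.svc:443"
{code}

Equivalently, a job can be submitted with the port written into the k8s:// master URL instead of relying on the default.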
[jira] [Commented] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919583#comment-16919583 ] Stavros Kontopoulos commented on SPARK-28025: - Thanks I will have a look :) > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. 
Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}
[jira] [Comment Edited] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919577#comment-16919577 ] Stavros Kontopoulos edited comment on SPARK-28025 at 8/30/19 2:15 PM: -- [~kabhwan] cool I have a look. was (Author: skonto): [~kabhwan] which PR? > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. 
Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}
[jira] [Commented] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919580#comment-16919580 ] Gabor Somogyi commented on SPARK-28025: --- [~skonto], this one: [https://github.com/apache/spark/pull/25488] > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. 
Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}
[jira] [Commented] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919577#comment-16919577 ] Stavros Kontopoulos commented on SPARK-28025: - [~kabhwan] which PR? > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. 
Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}
[jira] [Commented] (SPARK-28930) Spark DESC FORMATTED TABLENAME information display issues
[ https://issues.apache.org/jira/browse/SPARK-28930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919563#comment-16919563 ] Sujith Chacko commented on SPARK-28930: --- @ [~jobitmathew] As i remember Issue 3 is already handled as part of SPARK-24812 some time back, need to recheck. other issues i will check and get back to you. cc [~dongjoon] > Spark DESC FORMATTED TABLENAME information display issues > - > > Key: SPARK-28930 > URL: https://issues.apache.org/jira/browse/SPARK-28930 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.4.3 >Reporter: jobit mathew >Priority: Minor > > Spark DESC FORMATTED TABLENAME information display issues.Showing incorrect > *Last Access time and* feeling some information displays can make it better. > Test steps: > 1. Open spark sql > 2. Create table with partition > CREATE EXTERNAL TABLE IF NOT EXISTS employees_info_extended ( id INT, name > STRING, usd_flag STRING, salary DOUBLE, deductions MAP, > address STRING ) PARTITIONED BY (entrytime STRING) STORED AS TEXTFILE > location 'hdfs://hacluster/user/sparkhive/warehouse'; > 3. from spark sql check the table description > desc formatted tablename; > 4. From scala shell check the table description > sql("desc formatted tablename").show() > *Issue1:* > If there is no comment for spark scala shell shows *"null" in small letters* > but all other places Hive beeline/Spark beeline/Spark SQL it is showing in > *CAPITAL "NULL*". Better to show same in all places. 
> > *scala>* sql("desc formatted employees_info_extended").show(false); > +-+---++--- > |col_name|data_type|*comment*| > +-+---++--- > |id|int|*null*| > |name|string|*null*| > |usd_flag|string|*null*| > |salary|double|*null*| > |deductions|map|*null*| > |address|string|null| > |entrytime|string|null| > | # Partition Information| | | > | # col_name|data_type|comment| > |entrytime|string|null| > | | | | > | # Detailed Table Information| | | > |Database|sparkdb__| | > |Table|employees_info_extended| | > |Owner|root| | > *|Created Time |Tue Aug 20 13:42:06 CST 2019| |* > *|Last Access |Thu Jan 01 08:00:00 CST 1970| |* > |Created By|Spark 2.4.3| | > |Type|EXTERNAL| | > |Provider|hive| | > +-+---++--- > only showing top 20 rows > *scala>* > *Issue 2:* > Spark SQL "desc formatted tablename" is not showing the header [# > col_name,data_type,comment|#col_name,data_type,comment] in the top of the > query result.But header is showing on top of partition description. For > Better understanding show the header on Top of the query result. 
> *spark-sql>* desc formatted employees_info_extended1; > id int *NULL* > name string *NULL* > usd_flag string NULL > salary double NULL > deductions map NULL > address string NULL > entrytime string NULL > * > ## Partition Information* > ## col_name data_type comment* > entrytime string *NULL* > # Detailed Table Information > Database sparkdb__ > Table employees_info_extended1 > Owner spark > *Created Time Tue Aug 20 14:50:37 CST 2019* > *Last Access Thu Jan 01 08:00:00 CST 1970* > Created By Spark 2.3.2.0201 > Type EXTERNAL > Provider hive > Table Properties [transient_lastDdlTime=1566286655] > Location hdfs://hacluster/user/sparkhive/warehouse > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.TextInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Storage Properties [serialization.format=1] > Partition Provider Catalog > Time taken: 0.477 seconds, Fetched 27 row(s) > *spark-sql>* > > *Issue 3:* > I created the table on Aug 20.So it is showing created time correct .*But > Last access time showing 1970 Jan 01*. It is not good to show Last access > time earlier time than the created time.Better to show the correct date and > time else show UNKNOWN. > *[Created Time,Tue Aug 20 13:42:06 CST 2019,]* > *[Last Access,Thu Jan 01 08:00:00 CST 1970,]* -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28930) Spark DESC FORMATTED TABLENAME information display issues
[ https://issues.apache.org/jira/browse/SPARK-28930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jobit mathew updated SPARK-28930: - Description: Spark DESC FORMATTED TABLENAME information display issues.Showing incorrect *Last Access time and* feeling some information displays can make it better. Test steps: 1. Open spark sql 2. Create table with partition CREATE EXTERNAL TABLE IF NOT EXISTS employees_info_extended ( id INT, name STRING, usd_flag STRING, salary DOUBLE, deductions MAP, address STRING ) PARTITIONED BY (entrytime STRING) STORED AS TEXTFILE location 'hdfs://hacluster/user/sparkhive/warehouse'; 3. from spark sql check the table description desc formatted tablename; 4. From scala shell check the table description sql("desc formatted tablename").show() *Issue1:* If there is no comment for spark scala shell shows *"null" in small letters* but all other places Hive beeline/Spark beeline/Spark SQL it is showing in *CAPITAL "NULL*". Better to show same in all places. *scala>* sql("desc formatted employees_info_extended").show(false); +-+---++--- |col_name|data_type|*comment*| +-+---++--- |id|int|*null*| |name|string|*null*| |usd_flag|string|*null*| |salary|double|*null*| |deductions|map|*null*| |address|string|null| |entrytime|string|null| | # Partition Information| | | | # col_name|data_type|comment| |entrytime|string|null| | | | | | # Detailed Table Information| | | |Database|sparkdb__| | |Table|employees_info_extended| | |Owner|root| | *|Created Time |Tue Aug 20 13:42:06 CST 2019| |* *|Last Access |Thu Jan 01 08:00:00 CST 1970| |* |Created By|Spark 2.4.3| | |Type|EXTERNAL| | |Provider|hive| | +-+---++--- only showing top 20 rows *scala>* *Issue 2:* Spark SQL "desc formatted tablename" is not showing the header [# col_name,data_type,comment|#col_name,data_type,comment] in the top of the query result.But header is showing on top of partition description. For Better understanding show the header on Top of the query result. 
*spark-sql>* desc formatted employees_info_extended1; id int *NULL* name string *NULL* usd_flag string NULL salary double NULL deductions map NULL address string NULL entrytime string NULL * ## Partition Information* ## col_name data_type comment* entrytime string *NULL* # Detailed Table Information Database sparkdb__ Table employees_info_extended1 Owner spark *Created Time Tue Aug 20 14:50:37 CST 2019* *Last Access Thu Jan 01 08:00:00 CST 1970* Created By Spark 2.3.2.0201 Type EXTERNAL Provider hive Table Properties [transient_lastDdlTime=1566286655] Location hdfs://hacluster/user/sparkhive/warehouse Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.TextInputFormat OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Storage Properties [serialization.format=1] Partition Provider Catalog Time taken: 0.477 seconds, Fetched 27 row(s) *spark-sql>* *Issue 3:* I created the table on Aug 20.So it is showing created time correct .*But Last access time showing 1970 Jan 01*. It is not good to show Last access time earlier time than the created time.Better to show the correct date and time else show UNKNOWN. *[Created Time,Tue Aug 20 13:42:06 CST 2019,]* *[Last Access,Thu Jan 01 08:00:00 CST 1970,]* was: Spark DESC FORMATTED TABLENAME information display issues.Showing incorrect *Last Access time and* feeling some information displays can make it better. Test steps: 1. Open spark sql 2. Create table with partition CREATE EXTERNAL TABLE IF NOT EXISTS employees_info_extended ( id INT, name STRING, usd_flag STRING, salary DOUBLE, deductions MAP, address STRING ) PARTITIONED BY (entrytime STRING) STORED AS TEXTFILE location 'hdfs://hacluster/user/sparkhive/warehouse'; 3. from spark sql check the table description desc formatted tablename; 4. 
From scala shell check the table description sql("desc formatted tablename").show() Issue1: If there is no comment for spark scala shell shows *"null" in small letters* but all other places Hive beeline/Spark beeline/Spark SQL it is showing in *CAPITAL "NULL*". Better to show same in all places. *scala>* sql("desc formatted employees_info_extended").show(false); +++---+ |col_name |data_type |*comment*| +++---+ |id |int |*null* | |name |string |*null* | |usd_flag |string |*null* | |salary |double |*null* | |deductions |map |*null* | |address |string |null | |entrytime |string |null | |# Partition Information | | | |# col_name |data_type |comment| |entrytime |string |null | | | | | |# Detailed Table Information| | | |Database |sparkdb__ | |
[jira] [Commented] (SPARK-28929) Spark Logging level should be INFO instead of Debug in Executor Plugin API[SPARK-24918]
[ https://issues.apache.org/jira/browse/SPARK-28929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919524#comment-16919524 ] Rakesh Raushan commented on SPARK-28929: i am working on this. > Spark Logging level should be INFO instead of Debug in Executor Plugin > API[SPARK-24918] > --- > > Key: SPARK-28929 > URL: https://issues.apache.org/jira/browse/SPARK-28929 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.2, 2.4.3 >Reporter: jobit mathew >Priority: Minor > > Spark Logging level should be INFO instead of Debug in Executor Plugin > API[SPARK-24918]. > Currently logging level for Executor Plugin API[SPARK-24918] is DEBUG > logDebug(s"Initializing the following plugins: $\{pluginNames.mkString(", > ")}") > logDebug(s"Successfully loaded plugin " + > plugin.getClass().getCanonicalName()) > logDebug("Finished initializing plugins") > It is better to change to INFO instead of DEBUG. > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-28929) Spark Logging level should be INFO instead of Debug in Executor Plugin API[SPARK-24918]
[ https://issues.apache.org/jira/browse/SPARK-28929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pavithra ramachandran updated SPARK-28929: -- Comment: was deleted (was: I am working on this ) > Spark Logging level should be INFO instead of Debug in Executor Plugin > API[SPARK-24918] > --- > > Key: SPARK-28929 > URL: https://issues.apache.org/jira/browse/SPARK-28929 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.2, 2.4.3 >Reporter: jobit mathew >Priority: Minor > > Spark Logging level should be INFO instead of Debug in Executor Plugin > API[SPARK-24918]. > Currently logging level for Executor Plugin API[SPARK-24918] is DEBUG > logDebug(s"Initializing the following plugins: $\{pluginNames.mkString(", > ")}") > logDebug(s"Successfully loaded plugin " + > plugin.getClass().getCanonicalName()) > logDebug("Finished initializing plugins") > It is better to change to INFO instead of DEBUG. > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28929) Spark Logging level should be INFO instead of Debug in Executor Plugin API[SPARK-24918]
[ https://issues.apache.org/jira/browse/SPARK-28929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919523#comment-16919523 ] pavithra ramachandran commented on SPARK-28929: --- I am working on this > Spark Logging level should be INFO instead of Debug in Executor Plugin > API[SPARK-24918] > --- > > Key: SPARK-28929 > URL: https://issues.apache.org/jira/browse/SPARK-28929 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.2, 2.4.3 >Reporter: jobit mathew >Priority: Minor > > Spark Logging level should be INFO instead of Debug in Executor Plugin > API[SPARK-24918]. > Currently logging level for Executor Plugin API[SPARK-24918] is DEBUG > logDebug(s"Initializing the following plugins: $\{pluginNames.mkString(", > ")}") > logDebug(s"Successfully loaded plugin " + > plugin.getClass().getCanonicalName()) > logDebug("Finished initializing plugins") > It is better to change to INFO instead of DEBUG. > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
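The SPARK-28929 request above amounts to raising the level of the three statements quoted in the description. A minimal sketch, assuming the enclosing class mixes in Spark's {{org.apache.spark.internal.Logging}} trait (which supplies {{logInfo}}) and that {{pluginNames}} and {{plugin}} are in scope as in the quoted code:

{code:java}
// Sketch only: the same three messages from the issue, emitted at INFO
// instead of DEBUG so plugin loading is visible at the default log level.
logInfo(s"Initializing the following plugins: ${pluginNames.mkString(", ")}")
logInfo(s"Successfully loaded plugin " + plugin.getClass().getCanonicalName())
logInfo("Finished initializing plugins")
{code}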
[jira] [Updated] (SPARK-28930) Spark DESC FORMATTED TABLENAME information display issues
[ https://issues.apache.org/jira/browse/SPARK-28930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jobit mathew updated SPARK-28930: - Environment: (was: _*_emphasized text_*_) > Spark DESC FORMATTED TABLENAME information display issues > - > > Key: SPARK-28930 > URL: https://issues.apache.org/jira/browse/SPARK-28930 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.4.3 >Reporter: jobit mathew >Priority: Minor > > Spark DESC FORMATTED TABLENAME information display issues.Showing incorrect > *Last Access time and* feeling some information displays can make it better. > Test steps: > 1. Open spark sql > 2. Create table with partition > CREATE EXTERNAL TABLE IF NOT EXISTS employees_info_extended ( id INT, name > STRING, usd_flag STRING, salary DOUBLE, deductions MAP, > address STRING ) PARTITIONED BY (entrytime STRING) STORED AS TEXTFILE > location 'hdfs://hacluster/user/sparkhive/warehouse'; > 3. from spark sql check the table description > desc formatted tablename; > 4. From scala shell check the table description > sql("desc formatted tablename").show() > Issue1: > If there is no comment for spark scala shell shows *"null" in small letters* > but all other places Hive beeline/Spark beeline/Spark SQL it is showing in > *CAPITAL "NULL*". Better to show same in all places. 
> > *scala>* sql("desc formatted employees_info_extended").show(false); > +++---+ > |col_name |data_type |*comment*| > +++---+ > |id |int |*null* | > |name |string |*null* | > |usd_flag |string |*null* | > |salary |double |*null* | > |deductions |map |*null* | > |address |string |null | > |entrytime |string |null | > |# Partition Information | | | > |# col_name |data_type |comment| > |entrytime |string |null | > | | | | > |# Detailed Table Information| | | > |Database |sparkdb__ | | > |Table |employees_info_extended | | > |Owner |root | | > *|Created Time |Tue Aug 20 13:42:06 CST 2019| |* > *|Last Access |Thu Jan 01 08:00:00 CST 1970| |* > |Created By |Spark 2.4.3 | | > |Type |EXTERNAL | | > |Provider |hive | | > +++---+ > only showing top 20 rows > *scala>* > Issue 2: > Spark SQL "desc formatted tablename" is not showing the header [# > col_name,data_type,comment|#col_name,data_type,comment] in the top of the > query result.But header is showing on top of partition description. For > Better understanding show the header on Top of the query result. 
> *spark-sql>* desc formatted employees_info_extended1; > id int *NULL* > name string *NULL* > usd_flag string NULL > salary double NULL > deductions map NULL > address string NULL > entrytime string NULL > *# Partition Information* > *# col_name data_type comment* > entrytime string *NULL* > # Detailed Table Information > Database sparkdb__ > Table employees_info_extended1 > Owner spark > *Created Time Tue Aug 20 14:50:37 CST 2019* > *Last Access Thu Jan 01 08:00:00 CST 1970* > Created By Spark 2.3.2.0201 > Type EXTERNAL > Provider hive > Table Properties [transient_lastDdlTime=1566286655] > Location hdfs://hacluster/user/sparkhive/warehouse > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.TextInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Storage Properties [serialization.format=1] > Partition Provider Catalog > Time taken: 0.477 seconds, Fetched 27 row(s) > *spark-sql>* > > Issue 3: > I created the table on Aug 20.So it is showing created time correct .*But > Last access time showing 1970 Jan 01*. It is not good to show Last access > time earlier time than the created time.Better to show the correct date and > time else show UNKNOWN. > *[Created Time,Tue Aug 20 13:42:06 CST 2019,]* > *[Last Access,Thu Jan 01 08:00:00 CST 1970,]* -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28930) Spark DESC FORMATTED TABLENAME information display issues
jobit mathew created SPARK-28930: Summary: Spark DESC FORMATTED TABLENAME information display issues Key: SPARK-28930 URL: https://issues.apache.org/jira/browse/SPARK-28930 Project: Spark Issue Type: Bug Components: Spark Shell, SQL Affects Versions: 2.4.3 Environment: _*_emphasized text_*_ Reporter: jobit mathew Spark DESC FORMATTED TABLENAME has information display issues. It shows an incorrect *Last Access time*, and some of the information display could be improved. Test steps: 1. Open spark sql 2. Create table with partition CREATE EXTERNAL TABLE IF NOT EXISTS employees_info_extended ( id INT, name STRING, usd_flag STRING, salary DOUBLE, deductions MAP, address STRING ) PARTITIONED BY (entrytime STRING) STORED AS TEXTFILE location 'hdfs://hacluster/user/sparkhive/warehouse'; 3. From spark sql check the table description desc formatted tablename; 4. From scala shell check the table description sql("desc formatted tablename").show() Issue 1: If there is no column comment, the spark scala shell shows *"null" in lowercase letters*, but in all other places (Hive beeline/Spark beeline/Spark SQL) it is shown in *CAPITAL "NULL"*. Better to show the same in all places. 
*scala>* sql("desc formatted employees_info_extended").show(false); +++---+ |col_name |data_type |*comment*| +++---+ |id |int |*null* | |name |string |*null* | |usd_flag |string |*null* | |salary |double |*null* | |deductions |map |*null* | |address |string |null | |entrytime |string |null | |# Partition Information | | | |# col_name |data_type |comment| |entrytime |string |null | | | | | |# Detailed Table Information| | | |Database |sparkdb__ | | |Table |employees_info_extended | | |Owner |root | | *|Created Time |Tue Aug 20 13:42:06 CST 2019| |* *|Last Access |Thu Jan 01 08:00:00 CST 1970| |* |Created By |Spark 2.4.3 | | |Type |EXTERNAL | | |Provider |hive | | +++---+ only showing top 20 rows *scala>* Issue 2: Spark SQL "desc formatted tablename" is not showing the header [# col_name,data_type,comment|#col_name,data_type,comment] in the top of the query result.But header is showing on top of partition description. For Better understanding show the header on Top of the query result. *spark-sql>* desc formatted employees_info_extended1; id int *NULL* name string *NULL* usd_flag string NULL salary double NULL deductions map NULL address string NULL entrytime string NULL *# Partition Information* *# col_name data_type comment* entrytime string *NULL* # Detailed Table Information Database sparkdb__ Table employees_info_extended1 Owner spark *Created Time Tue Aug 20 14:50:37 CST 2019* *Last Access Thu Jan 01 08:00:00 CST 1970* Created By Spark 2.3.2.0201 Type EXTERNAL Provider hive Table Properties [transient_lastDdlTime=1566286655] Location hdfs://hacluster/user/sparkhive/warehouse Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.TextInputFormat OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Storage Properties [serialization.format=1] Partition Provider Catalog Time taken: 0.477 seconds, Fetched 27 row(s) *spark-sql>* Issue 3: I created the table on Aug 20.So it is showing created time 
correctly. *But the Last Access time shows 1970 Jan 01*. It is not good to show a Last Access time earlier than the created time. Better to show the correct date and time, or else show UNKNOWN. *[Created Time,Tue Aug 20 13:42:06 CST 2019,]* *[Last Access,Thu Jan 01 08:00:00 CST 1970,]* -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28929) Spark Logging level should be INFO instead of Debug in Executor Plugin API[SPARK-24918]
jobit mathew created SPARK-28929: Summary: Spark Logging level should be INFO instead of Debug in Executor Plugin API[SPARK-24918] Key: SPARK-28929 URL: https://issues.apache.org/jira/browse/SPARK-28929 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.3, 2.4.2 Reporter: jobit mathew Spark Logging level should be INFO instead of Debug in Executor Plugin API[SPARK-24918]. Currently logging level for Executor Plugin API[SPARK-24918] is DEBUG logDebug(s"Initializing the following plugins: $\{pluginNames.mkString(", ")}") logDebug(s"Successfully loaded plugin " + plugin.getClass().getCanonicalName()) logDebug("Finished initializing plugins") It is better to change to INFO instead of DEBUG. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
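To illustrate why the current DEBUG level hides the plugin lifecycle messages of SPARK-28929: Spark's default log4j template sets the root level to INFO, so DEBUG lines are dropped unless the operator opts in per logger. A minimal sketch (the logger name below is an assumption, based on the plugin being initialized inside the executor; it is not taken from the issue):

```
# conf/log4j.properties (sketch)
log4j.rootCategory=INFO, console
# With the plugin messages logged at DEBUG, seeing them requires opting in:
log4j.logger.org.apache.spark.executor.Executor=DEBUG
# If they were logged at INFO as proposed, the default template would show them.
```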
[jira] [Commented] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919514#comment-16919514 ] Steve Loughran commented on SPARK-28025: Has anyone considered enhancing org.apache.hadoop.fs.ChecksumFileSystem to say "if "file.bytes-per-checksum" == 0 then checksums are disabled? Currently it fails if bytes per CRC <= 0, but you could make the 0 value a switch to say "none". > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. 
Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
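The `find` commands quoted in the report above can be wrapped into a small monitoring script for a checkpoint volume. A sketch, under the assumption of a scratch directory and sample file names that are purely illustrative (not taken from the issue):

```shell
# Count total files vs leaked .crc files under a checkpoint directory.
# CKPT_DIR and the sample files below are illustrative assumptions.
CKPT_DIR="${CKPT_DIR:-/tmp/spark-ckpt-demo}"
mkdir -p "$CKPT_DIR/state/0"
# Simulate one committed delta file plus its leaked checksum sidecar.
touch "$CKPT_DIR/state/0/1.delta" "$CKPT_DIR/state/0/.1.delta.crc"
total=$(find "$CKPT_DIR" -type f | wc -l)
crc=$(find "$CKPT_DIR" -type f -name "*.crc" | wc -l)
echo "total=$total crc=$crc"
```

Note that `find -name '*.crc'` matches the hidden `.1.delta.crc` sidecar as well, since `fnmatch` in `find` does not treat a leading dot specially.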
[jira] [Updated] (SPARK-28759) Upgrade scala-maven-plugin to 4.2.0
[ https://issues.apache.org/jira/browse/SPARK-28759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28759: - Summary: Upgrade scala-maven-plugin to 4.2.0 (was: Upgrade scala-maven-plugin to 4.1.1) > Upgrade scala-maven-plugin to 4.2.0 > --- > > Key: SPARK-28759 > URL: https://issues.apache.org/jira/browse/SPARK-28759 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919507#comment-16919507 ] Jungtaek Lim commented on SPARK-28025: -- [~skonto] Please take a look at my PR as my PR didn't follow your workaround. We identified which Hadoop issue we are facing, and took a workaround as deleting crc file manually. > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. 
Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28911) Unify Kafka source option pattern
[ https://issues.apache.org/jira/browse/SPARK-28911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi resolved SPARK-28911. --- Resolution: Won't Do Based on the discussion on the PR this can be closed too. > Unify Kafka source option pattern > - > > Key: SPARK-28911 > URL: https://issues.apache.org/jira/browse/SPARK-28911 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: wenxuanguan >Priority: Major > > Pattern of datasource options is Camel-Case, such as CheckpointLocation, and > only some Kafka source option is separated with dot, Such as > fetchOffset.numRetries. > Also we can distinguish the Kafka original options from pattern, such as > kafka.bootstrap.servers -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-28911) Unify Kafka source option pattern
[ https://issues.apache.org/jira/browse/SPARK-28911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi closed SPARK-28911. - > Unify Kafka source option pattern > - > > Key: SPARK-28911 > URL: https://issues.apache.org/jira/browse/SPARK-28911 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: wenxuanguan >Priority: Major > > Pattern of datasource options is Camel-Case, such as CheckpointLocation, and > only some Kafka source option is separated with dot, Such as > fetchOffset.numRetries. > Also we can distinguish the Kafka original options from pattern, such as > kafka.bootstrap.servers -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919471#comment-16919471 ] Stavros Kontopoulos edited comment on SPARK-28025 at 8/30/19 11:54 AM: --- @[~dongjoon] [~zsxwing] this needs to be re-opened. When using the workaround we recently hit this issue: [https://github.com/broadinstitute/gatk/issues/1389] which can be fixed easily with a derived class like in this PR: [https://github.com/broadinstitute/gatk/pull/1421/files] but this is a bit of inconvenient. However, I believe as well that this should be fixed in Spark (less surprises) otherwise we need to document it as [~kabhwan] said above. was (Author: skonto): @[~dongjoon] [~zsxwing] this needs to be re-opened. When using the workaround we recently hit this issue: [https://github.com/broadinstitute/gatk/issues/1389] which can be fixed easily with a derived class like in this PR: [https://github.com/broadinstitute/gatk/pull/1421/files] However, I believe as well that this should be fixed in Spark (less surprises) otherwise we need to document it as [~kabhwan] said above. > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. 
It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919471#comment-16919471 ] Stavros Kontopoulos commented on SPARK-28025: - @[~dongjoon] [~zsxwing] this needs to be re-opened. When using the workaround we recently hit this issue: [https://github.com/broadinstitute/gatk/issues/1389] which can be fixed easily with a derived class like in this PR: [https://github.com/broadinstitute/gatk/pull/1421/files] However, I believe as well that this should be fixed in Spark (less surprises) otherwise we need to document it as [~kabhwan] said above. > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. 
> These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28928) Take over Kafka delegation token protocol on sources/sinks
[ https://issues.apache.org/jira/browse/SPARK-28928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-28928: -- Summary: Take over Kafka delegation token protocol on sources/sinks (was: Take over delegation token protocol on sources/sinks) > Take over Kafka delegation token protocol on sources/sinks > -- > > Key: SPARK-28928 > URL: https://issues.apache.org/jira/browse/SPARK-28928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > At the moment there are 3 places where communication protocol with Kafka > cluster has to be configured: > * On delegation token > * On source > * On sink > Most of the time users are using the same protocol on all these places > (within one Kafka cluster). It would be better to declare it in one place > (delegation token side) and Kafka sources/sinks can take this config over. > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28928) Take over delegation token protocol on sources/sinks
[ https://issues.apache.org/jira/browse/SPARK-28928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919463#comment-16919463 ] Gabor Somogyi commented on SPARK-28928: --- I'm working on this. > Take over delegation token protocol on sources/sinks > > > Key: SPARK-28928 > URL: https://issues.apache.org/jira/browse/SPARK-28928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > At the moment there are 3 places where communication protocol with Kafka > cluster has to be configured: > * On delegation token > * On source > * On sink > Most of the time users are using the same protocol on all these places > (within one Kafka cluster). It would be better to declare it in one place > (delegation token side) and Kafka sources/sinks can take this config over. > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28928) Take over delegation token protocol on sources/sinks
Gabor Somogyi created SPARK-28928: - Summary: Take over delegation token protocol on sources/sinks Key: SPARK-28928 URL: https://issues.apache.org/jira/browse/SPARK-28928 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Gabor Somogyi At the moment there are 3 places where communication protocol with Kafka cluster has to be configured: * On delegation token * On source * On sink Most of the time users are using the same protocol on all these places (within one Kafka cluster). It would be better to declare it in one place (delegation token side) and Kafka sources/sinks can take this config over. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
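The three configuration points described in SPARK-28928 can be sketched as a config fragment. Hedged heavily: the delegation-token key name below is an illustrative assumption, not a confirmed Spark config key; the `kafka.`-prefixed options are the connector's documented pattern of passing Kafka client properties through sources/sinks.

```
# Illustrative only: today the protocol is declared in three places
# for a single Kafka cluster.
spark.kafka.security.protocol=SASL_SSL        # 1) delegation token side (key name assumed)
# 2) source: .option("kafka.security.protocol", "SASL_SSL")
# 3) sink:   .option("kafka.security.protocol", "SASL_SSL")
```

The proposal is to keep only the first declaration and let sources and sinks inherit it.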
[jira] [Comment Edited] (SPARK-28906) `bin/spark-submit --version` shows incorrect info
[ https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919420#comment-16919420 ] Kazuaki Ishizaki edited comment on SPARK-28906 at 8/30/19 10:51 AM: In {{jars/spark-core_2.11-2.3.*.jar}}, {{spark-version-info.properties}} exists. This file is different between 2.3.0 and 2.3.4. This file is generated by `build/spark-build-info`. {code} $ cat spark-version-info.properties.230 version=2.3.0 user=sameera revision=a0d7949896e70f427e7f3942ff340c9484ff0aab branch=master date=2018-02-22T19:24:38Z url=g...@github.com:sameeragarwal/spark.git $ cat spark-version-info.properties.234 version=2.3.4 user= revision= branch= date=2019-08-26T08:29:39Z url= {code} was (Author: kiszk): In {{jars/spark-core_2.11-2.3.*.jar}}, {{spark-version-info.properties}} exists. This file is different between 2.3.0 and 2.3.4. {code} $ cat spark-version-info.properties.230 version=2.3.0 user=sameera revision=a0d7949896e70f427e7f3942ff340c9484ff0aab branch=master date=2018-02-22T19:24:38Z url=g...@github.com:sameeragarwal/spark.git $ cat spark-version-info.properties.234 version=2.3.4 user= revision= branch= date=2019-08-26T08:29:39Z url= {code} > `bin/spark-submit --version` shows incorrect info > - > > Key: SPARK-28906 > URL: https://issues.apache.org/jira/browse/SPARK-28906 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, > 3.0.0, 2.4.3 >Reporter: Marcelo Vanzin >Priority: Minor > Attachments: image-2019-08-29-05-50-13-526.png > > > Since Spark 2.3.1, `spark-submit` shows a wrong information. > {code} > $ bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.3 > /_/ > Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222 > Branch > Compiled by user on 2019-02-04T13:00:46Z > Revision > Url > Type --help for more information. 
> {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28906) `bin/spark-submit --version` shows incorrect info
[ https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919420#comment-16919420 ] Kazuaki Ishizaki commented on SPARK-28906: -- In {{jars/spark-core_2.11-2.3.*.jar}}, {{spark-version-info.properties}} exists. This file is different between 2.3.0 and 2.3.4. {code} $ cat spark-version-info.properties.230 version=2.3.0 user=sameera revision=a0d7949896e70f427e7f3942ff340c9484ff0aab branch=master date=2018-02-22T19:24:38Z url=g...@github.com:sameeragarwal/spark.git $ cat spark-version-info.properties.234 version=2.3.4 user= revision= branch= date=2019-08-26T08:29:39Z url= {code} > `bin/spark-submit --version` shows incorrect info > - > > Key: SPARK-28906 > URL: https://issues.apache.org/jira/browse/SPARK-28906 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, > 3.0.0, 2.4.3 >Reporter: Marcelo Vanzin >Priority: Minor > Attachments: image-2019-08-29-05-50-13-526.png > > > Since Spark 2.3.1, `spark-submit` shows a wrong information. > {code} > $ bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.3 > /_/ > Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222 > Branch > Compiled by user on 2019-02-04T13:00:46Z > Revision > Url > Type --help for more information. > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
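The empty build fields in the 2.3.4 properties file shown in the comment above can be detected mechanically. A sketch, assuming a scratch path in `/tmp` (the file contents are copied verbatim from the comment):

```shell
# Recreate the 2.3.4 properties file quoted above and count empty build fields.
cat > /tmp/spark-version-info.properties <<'EOF'
version=2.3.4
user=
revision=
branch=
date=2019-08-26T08:29:39Z
url=
EOF
# Empty values mean build/spark-build-info failed to capture the git metadata.
missing=$(grep -cE '^(user|revision|branch|url)=$' /tmp/spark-version-info.properties)
echo "missing fields: $missing"
```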
[jira] [Comment Edited] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919346#comment-16919346 ] Björn edited comment on SPARK-28444 at 8/30/19 9:07 AM: We're running into the same issue. As I'm developing with a local ansible/Vagrant setup (Kubernetes deployed through kubeadm, 3 nodes) I did some testings on different versions with SparkPi example (Spark 2.4.3). My results were: * spark-submit in cluster and client mode works fine in local docker-desktop running Kubernetes 1.14.3 * spark-submit in client mode works fine for Kubernetes 1.15.3 in the Vagrant multinode cluster * spark-submit in cluster mode does not work for Kubernetes 1.15.3, 1.14.3,1.13.10 in the Vagrant multinode cluster spark-submit in cluster mode starts the driver, spawns the executors but fails when trying to watch the pod with HTTP 403 Exception with an empty message (esp. not complaining about permissions). The log is more or less the same as posted above. I think neither the compatibility nor the permissions (as executor pods can be created with the service account) are the cause for this. Does anyone have ideas how to further debug this? was (Author: dolkemeier): We're running into the same issue. As I'm developing with a local ansible/Vagrant setup (Kubernetes deployed through kubeadm, 3 nodes) I did some testings on different versions with SparkPi example (Spark 2.4.3). My results were: * spark-submit in cluster and client mode works fine in local docker-desktop running Kubernetes 1.14.3 * spark-submit in client mode works fine for Kubernetes 1.15.3 in the Vagrant multinode cluster * spark-submit in cluster mode does not work for Kubernetes 1.15.3, 1.14.3,1.13.10 in the Vagrant multinode cluster spark-submit in cluster mode starts the driver, spawns the executors but fails when trying to watch the pod with HTTP 403 Exception with an empty message (esp. not complaining about permissions). 
The log is more or less the same as posted above. I think neither the compatibility nor the permissions (as executor pods can be created with the service account, spark sa has cluster role) are the cause for this. Does anyone have ideas how to further debug this? > Bump Kubernetes Client Version to 4.3.0 > --- > > Key: SPARK-28444 > URL: https://issues.apache.org/jira/browse/SPARK-28444 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Patrick Winter >Priority: Major > > Spark is currently using the Kubernetes client version 4.1.2. This client > does not support the current Kubernetes version 1.14, as can be seen on the > [compatibility > matrix|https://github.com/fabric8io/kubernetes-client#compatibility-matrix]. > Therefore the Kubernetes client should be bumped up to version 4.3.0. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919346#comment-16919346 ] Björn commented on SPARK-28444: --- We're running into the same issue. As I'm developing with a local ansible/Vagrant setup (Kubernetes deployed through kubeadm, 3 nodes) I did some testing on different versions with the SparkPi example (Spark 2.4.3). My results were: * spark-submit in cluster and client mode works fine in local docker-desktop running Kubernetes 1.14.3 * spark-submit in client mode works fine for Kubernetes 1.15.3 in the Vagrant multinode cluster * spark-submit in cluster mode does not work for Kubernetes 1.15.3, 1.14.3, 1.13.10 in the Vagrant multinode cluster spark-submit in cluster mode starts the driver, spawns the executors but fails when trying to watch the pod with an HTTP 403 Exception with an empty message (esp. not complaining about permissions). The log is more or less the same as posted above. I think neither the compatibility nor the permissions (as executor pods can be created with the service account, spark sa has cluster role) are the cause for this. Does anyone have ideas how to further debug this? > Bump Kubernetes Client Version to 4.3.0 > --- > > Key: SPARK-28444 > URL: https://issues.apache.org/jira/browse/SPARK-28444 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Patrick Winter >Priority: Major > > Spark is currently using the Kubernetes client version 4.1.2. This client > does not support the current Kubernetes version 1.14, as can be seen on the > [compatibility > matrix|https://github.com/fabric8io/kubernetes-client#compatibility-matrix]. > Therefore the Kubernetes client should be bumped up to version 4.3.0. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28906) `bin/spark-submit --version` shows incorrect info
[ https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919330#comment-16919330 ] Kazuaki Ishizaki commented on SPARK-28906: -- I attached output of 2.3.0 and 2.3.4 in one comment as below. Let me see the script, too. ``` $ spark-2.3.0-bin-hadoop2.6/bin/spark-submit --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.0 /_/ Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_212 Branch master Compiled by user sameera on 2018-02-22T19:24:38Z Revision a0d7949896e70f427e7f3942ff340c9484ff0aab Url g...@github.com:sameeragarwal/spark.git Type --help for more information. $ spark-2.3.4-bin-hadoop2.6/bin/spark-submit --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.4 /_/ Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_212 Branch Compiled by user on 2019-08-26T08:29:39Z Revision Url Type --help for more information. ``` > `bin/spark-submit --version` shows incorrect info > - > > Key: SPARK-28906 > URL: https://issues.apache.org/jira/browse/SPARK-28906 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, > 3.0.0, 2.4.3 >Reporter: Marcelo Vanzin >Priority: Minor > Attachments: image-2019-08-29-05-50-13-526.png > > > Since Spark 2.3.1, `spark-submit` shows a wrong information. > {code} > $ bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.3 > /_/ > Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222 > Branch > Compiled by user on 2019-02-04T13:00:46Z > Revision > Url > Type --help for more information. > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28913) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919300#comment-16919300 ] Qiang Wang commented on SPARK-28913: Sorry, I can not find the way to close the issue, just ignore it which was created by mistake. > ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets > with 12 billion instances > > > Key: SPARK-28913 > URL: https://issues.apache.org/jira/browse/SPARK-28913 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Qiang Wang >Assignee: Xiangrui Meng >Priority: Major > > The stack trace is below: > {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 > BlockManager: Block rdd_10916_493 could not be removed as it was not found on > disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for > task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) > java.lang.ArrayIndexOutOfBoundsException: 6741 at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) > at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at > 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at scala.collection.immutable.List.foreach(List.scala:381) at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at > org.apache.spark.scheduler.Task.run(Task.scala:108) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > {quote} > This exception happened sometimes. And we also found that the AUC metric was > not stable when evaluating the inner product of the user factors and the item > factors with the same dataset and configuration. AUC varied from 0.60 to 0.67 > which was not stable for production environment. > Dataset capacity: ~12 billion ratings > Here is the our code: > {code:java} > val hivedata = sc.sql(sqltext).select(id,dpid,score).coalesce(numPartitions) > val predataItem = hivedata.rdd.map(r=>(r._1._1,(r._1._2,r._2.sum))) > .groupByKey().zipWithIndex() > .persist(StorageLevel.MEMORY_AND_DISK_SER) > val predataUser = > predataItem.flatMap(r=>r._1._2.map(y=>(y._1,(r._2.toInt,y._2
[jira] [Commented] (SPARK-28926) CLONE - ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919299#comment-16919299 ] Qiang Wang commented on SPARK-28926: Sorry, I can not find the way to close the issue, just ignore it which was created by mistake. > CLONE - ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for > datasets with 12 billion instances > > > Key: SPARK-28926 > URL: https://issues.apache.org/jira/browse/SPARK-28926 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Qiang Wang >Assignee: Xiangrui Meng >Priority: Major > > The stack trace is below: > {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 > BlockManager: Block rdd_10916_493 could not be removed as it was not found on > disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for > task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) > java.lang.ArrayIndexOutOfBoundsException: 6741 at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) > at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at > 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at scala.collection.immutable.List.foreach(List.scala:381) at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at > org.apache.spark.scheduler.Task.run(Task.scala:108) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > {quote} > This exception happened sometimes. And we also found that the AUC metric was > not stable when evaluating the inner product of the user factors and the item > factors with the same dataset and configuration. AUC varied from 0.60 to 0.67 > which was not stable for production environment. > Dataset capacity: ~12 billion ratings > Here is the our code: > {code:java} > val hivedata = sc.sql(sqltext).select(id,dpid,score).coalesce(numPartitions) > val predataItem = hivedata.rdd.map(r=>(r._1._1,(r._1._2,r._2.sum))) > .groupByKey().zipWithIndex() > .persist(StorageLevel.MEMORY_AND_DISK_SER) > val predataUser = >
[jira] [Created] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
Qiang Wang created SPARK-28927: -- Summary: ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances Key: SPARK-28927 URL: https://issues.apache.org/jira/browse/SPARK-28927 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.2.1 Reporter: Qiang Wang The stack trace is below: {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 BlockManager: Block rdd_10916_493 could not be removed as it was not found on disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) java.lang.ArrayIndexOutOfBoundsException: 6741 at org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) at org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This exception happened sometimes. And we also found that the AUC metric was not stable when evaluating the inner product of the user factors and the item factors with the same dataset and configuration. AUC varied from 0.60 to 0.67 which was not stable for production environment. 
Dataset capacity: ~12 billion ratings. Here is our code: val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, y._2.toFloat))) .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER) case class ALSData(user:Int, item:Int, rating:Float) extends Serializable val ratingData = trainData.map(x => ALSData(x._1, x._2, x._3)).toDF() val als = new ALS val paramMap = ParamMap(als.alpha -> 25000). put(als.checkpointInterval, 5). put(als.implicitPrefs, true). put(als.itemCol, "item"). put(als.maxIter, 60). put(als.nonnegative, false). put(als.numItemBlocks, 600). put(als.numUserBlocks, 600). put(als.regParam, 4.5). put(als.rank, 25). put(als.userCol, "user") als.fit(ratingData, paramMap)
[jira] [Created] (SPARK-28926) CLONE - ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
Qiang Wang created SPARK-28926: -- Summary: CLONE - ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances Key: SPARK-28926 URL: https://issues.apache.org/jira/browse/SPARK-28926 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 2.2.1 Reporter: Qiang Wang Assignee: Xiangrui Meng The stack trace is below: {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 BlockManager: Block rdd_10916_493 could not be removed as it was not found on disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) java.lang.ArrayIndexOutOfBoundsException: 6741 at org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) at org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This exception happened sometimes. And we also found that the AUC metric was not stable when evaluating the inner product of the user factors and the item factors with the same dataset and configuration. AUC varied from 0.60 to 0.67 which was not stable for production environment. 
Dataset capacity: ~12 billion ratings Here is the our code: {code:java} val hivedata = sc.sql(sqltext).select(id,dpid,score).coalesce(numPartitions) val predataItem = hivedata.rdd.map(r=>(r._1._1,(r._1._2,r._2.sum))) .groupByKey().zipWithIndex() .persist(StorageLevel.MEMORY_AND_DISK_SER) val predataUser = predataItem.flatMap(r=>r._1._2.map(y=>(y._1,(r._2.toInt,y._2 .aggregateByKey(zeroValueArr,numPartitions)((a,b)=> a += b,(a,b)=>a ++ b).map(r=>(r._1,r._2.toIterable)) .zipWithIndex().persist(StorageLevel.MEMORY_AND_DISK_SER) //x._2 is the item_id, y._1 is the user_id, y._2 is the rating val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, y._2.toFloat))) .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER) case class ALSData(user:Int, item:Int, rating:Float) extends Serializable val ratingData = trainData.map(x =>
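The {{zipWithIndex()}} calls in the snippet above assign each distinct item and user a dense integer index that later becomes an array offset inside the ALS blocks. If the RDD feeding {{zipWithIndex}} is re-evaluated with a different element ordering (e.g. after a lost partition is recomputed), the same key can receive a different index, which is one plausible source of both the out-of-bounds access and the unstable AUC. A plain-Python sketch of that indexing step (a hypothetical helper for illustration, not the Spark code) that makes the assignment deterministic by sorting the keys first:

```python
def dense_index(keys):
    """Map each distinct key to a dense integer id in [0, n).

    Sorting the distinct keys first makes the assignment deterministic,
    unlike an order-dependent zipWithIndex over a recomputed
    distributed dataset.
    """
    return {k: i for i, k in enumerate(sorted(set(keys)))}

idx = dense_index(["user_b", "user_a", "user_b", "user_c"])
print(idx)  # {'user_a': 0, 'user_b': 1, 'user_c': 2}
```

Persisting (or checkpointing) the indexed RDD before any downstream use, as the code partially does, serves the same purpose: the index assignment must not be recomputed with a different ordering.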
[jira] [Resolved] (SPARK-28872) Will Spark SQL support auto analyze for tables or partitions like Hive by setting hive.stats.autogather=true
[ https://issues.apache.org/jira/browse/SPARK-28872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28872. -- Resolution: Invalid > Will Spark SQL support auto analyze for tables or partitions like Hive by > setting hive.stats.autogather=true > -- > > Key: SPARK-28872 > URL: https://issues.apache.org/jira/browse/SPARK-28872 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.4.3 >Reporter: Shao >Priority: Major > > As in the summary: will Spark SQL support auto analyze for tables or > partitions in the future?
[jira] [Resolved] (SPARK-28874) Pyspark date_format adds one year in the last days of the year
[ https://issues.apache.org/jira/browse/SPARK-28874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28874. -- Resolution: Invalid > Pyspark date_format adds one year in the last days of the year > --- > > Key: SPARK-28874 > URL: https://issues.apache.org/jira/browse/SPARK-28874 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0, 2.3.0 >Reporter: Luis >Priority: Major > > Pyspark date_format adds one year in the last days of the year: > Example: > {code:python} > from pyspark.sql.functions import date_format, lit > spark.range(1).select(date_format(lit("2010-12-26"), "-MM-dd")).show() > {code} > {code} > +---+ > |date_format(2010-12-26, -MM-dd)| > +---+ > | 2011-12-26| > +---+ > {code}
[jira] [Commented] (SPARK-28874) Pyspark date_format adds one year in the last days of the year
[ https://issues.apache.org/jira/browse/SPARK-28874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919294#comment-16919294 ] Hyukjin Kwon commented on SPARK-28874: -- Use {{y}}. {{Y}} is the week-based year; the DateTimeFormatter pattern table lists it as {{Y week-based-year year 1996; 96}}. See https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html > Pyspark date_format adds one year in the last days of the year > --- > > Key: SPARK-28874 > URL: https://issues.apache.org/jira/browse/SPARK-28874 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0, 2.3.0 >Reporter: Luis >Priority: Major > > Pyspark date_format adds one year in the last days of the year: > Example: > {code:python} > from pyspark.sql.functions import date_format, lit > spark.range(1).select(date_format(lit("2010-12-26"), "-MM-dd")).show() > {code} > {code} > +---+ > |date_format(2010-12-26, -MM-dd)| > +---+ > | 2011-12-26| > +---+ > {code}
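The calendar-year versus week-based-year distinction behind this bug can be reproduced outside Spark. A minimal sketch in plain Python, where strftime's %Y is the calendar year and %G is the ISO week-based year (note that Java's 'Y' pattern uses locale-dependent week rules, so the exact boundary dates can differ from ISO, but the failure mode is the same):

```python
from datetime import date

# 2008-12-29 is a Monday that falls in ISO week 1 of 2009,
# so the week-based year disagrees with the calendar year.
d = date(2008, 12, 29)
print(d.strftime("%Y-%m-%d"))  # calendar year: 2008-12-29
print(d.strftime("%G-%m-%d"))  # ISO week-based year: 2009-12-29
```

In Spark's date_format, the fix is accordingly to use a lowercase 'y' year pattern rather than uppercase 'Y'.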
[jira] [Created] (SPARK-28925) Update Kubernetes-client to 4.4.2 to be compatible with Kubernetes 1.13 and 1.14
Eric created SPARK-28925: Summary: Update Kubernetes-client to 4.4.2 to be compatible with Kubernetes 1.13 and 1.14 Key: SPARK-28925 URL: https://issues.apache.org/jira/browse/SPARK-28925 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.4.3 Reporter: Eric Hello, if you use Spark with Kubernetes 1.13 or 1.14, you will see this error: {code:java} {"time": "2019-08-28T09:56:11.866Z", "lvl":"INFO", "logger": "org.apache.spark.internal.Logging", "thread":"kubernetes-executor-snapshots-subscribers-0","msg":"Going to request 1 executors from Kubernetes."} {"time": "2019-08-28T09:56:12.028Z", "lvl":"WARN", "logger": "io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2", "thread":"OkHttp https://kubernetes.default.svc/...","msg":"Exec Failure: HTTP 403, Status: 403 - "} java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' {code} The bug is apparently fixed here: [https://github.com/fabric8io/kubernetes-client/pull/1669] We have compiled the Spark source code with Kubernetes-client 4.4.2 and it is working great on our cluster (Kubernetes 1.13.10). Would it be possible to update that dependency version? Thanks!
[jira] [Updated] (SPARK-28874) Pyspark date_format adds one year in the last days of the year
[ https://issues.apache.org/jira/browse/SPARK-28874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28874: - Summary: Pyspark date_format adds one year in the last days of the year (was: Pyspark bug in date_format) > Pyspark date_format adds one year in the last days of the year > --- > > Key: SPARK-28874 > URL: https://issues.apache.org/jira/browse/SPARK-28874 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0, 2.3.0 >Reporter: Luis >Priority: Major > > Pyspark date_format adds one year in the last days of the year: > Example: > {code:python} > from pyspark.sql.functions import date_format, lit > spark.range(1).select(date_format(lit("2010-12-26"), "-MM-dd")).show() > {code} > {code} > +---+ > |date_format(2010-12-26, -MM-dd)| > +---+ > | 2011-12-26| > +---+ > {code}
[jira] [Updated] (SPARK-28874) Pyspark bug in date_format
[ https://issues.apache.org/jira/browse/SPARK-28874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28874: - Description: Pyspark date_format add one years in the last days off year : Example : {code:python} spark.range(1).select(date_format(lit("2010-12-26"), "-MM-dd")).show() {code} {code} +---+ |date_format(2010-12-26, -MM-dd)| +---+ | 2011-12-26| +---+ {code} was: Pyspark date_format add one years in the last days off year : Example : {code:python} from datetime import datetime from dateutil.relativedelta import relativedelta import pandas as pd from pyspark.sql.functions import date_format, col from pyspark.sql.types import * start_date = datetime(2010,1,1) end_date = datetime(2055,1,1) indx_ts = pd.date_range(start_date.strftime('%m/%d/%Y'), end_date.strftime('%m/%d/%Y'), freq='D') data_date = [ {"d":datetime.utcfromtimestamp(x.tolist()/1e9)} for x in indx_ts.values ] df_p = spark.createDataFrame(data_date,StructType([StructField('d', DateType(), True)])) df_string = df_p.withColumn("date_string" ,date_format(col("d"), "-MM-dd")) df_string.filter("d!=date_string").show(1000) {code} {code} +--+---+ | d|date_string| +--+---+ |2010-12-26| 2011-12-26| |2010-12-27| 2011-12-27| |2010-12-28| 2011-12-28| |2010-12-29| 2011-12-29| |2010-12-30| 2011-12-30| |2010-12-31| 2011-12-31| |2012-12-30| 2013-12-30| |2012-12-31| 2013-12-31| |2013-12-29| 2014-12-29| |2013-12-30| 2014-12-30| |2013-12-31| 2014-12-31| |2014-12-28| 2015-12-28| |2014-12-29| 2015-12-29| |2014-12-30| 2015-12-30| |2014-12-31| 2015-12-31| |2015-12-27| 2016-12-27| |2015-12-28| 2016-12-28| |2015-12-29| 2016-12-29| |2015-12-30| 2016-12-30| |2015-12-31| 2016-12-31| |2017-12-31| 2018-12-31| |2018-12-30| 2019-12-30| |2018-12-31| 2019-12-31| |2019-12-29| 2020-12-29| |2019-12-30| 2020-12-30| |2019-12-31| 2020-12-31| |2020-12-27| 2021-12-27| |2020-12-28| 2021-12-28| |2020-12-29| 2021-12-29| |2020-12-30| 2021-12-30| |2020-12-31| 2021-12-31| |2021-12-26| 2022-12-26| 
|2021-12-27| 2022-12-27| |2021-12-28| 2022-12-28| |2021-12-29| 2022-12-29| |2021-12-30| 2022-12-30| |2021-12-31| 2022-12-31| |2023-12-31| 2024-12-31| |2024-12-29| 2025-12-29| |2024-12-30| 2025-12-30| |2024-12-31| 2025-12-31| |2025-12-28| 2026-12-28| |2025-12-29| 2026-12-29| |2025-12-30| 2026-12-30| |2025-12-31| 2026-12-31| |2026-12-27| 2027-12-27| |2026-12-28| 2027-12-28| |2026-12-29| 2027-12-29| |2026-12-30| 2027-12-30| |2026-12-31| 2027-12-31| |2027-12-26| 2028-12-26| |2027-12-27| 2028-12-27| |2027-12-28| 2028-12-28| |2027-12-29| 2028-12-29| |2027-12-30| 2028-12-30| |2027-12-31| 2028-12-31| |2028-12-31| 2029-12-31| |2029-12-30| 2030-12-30| |2029-12-31| 2030-12-31| |2030-12-29| 2031-12-29| |2030-12-30| 2031-12-30| |2030-12-31| 2031-12-31| |2031-12-28| 2032-12-28| |2031-12-29| 2032-12-29| |2031-12-30| 2032-12-30| |2031-12-31| 2032-12-31| |2032-12-26| 2033-12-26| |2032-12-27| 2033-12-27| |2032-12-28| 2033-12-28| |2032-12-29| 2033-12-29| |2032-12-30| 2033-12-30| |2032-12-31| 2033-12-31| |2034-12-31| 2035-12-31| |2035-12-30| 2036-12-30| |2035-12-31| 2036-12-31| |2036-12-28| 2037-12-28| |2036-12-29| 2037-12-29| |2036-12-30| 2037-12-30| |2036-12-31| 2037-12-31| |2037-12-27| 2038-12-27| |2037-12-28| 2038-12-28| |2037-12-29| 2038-12-29| |2037-12-30| 2038-12-30| |2037-12-31| 2038-12-31| |2038-12-26| 2039-12-26| |2038-12-27| 2039-12-27| |2038-12-28| 2039-12-28| |2038-12-29| 2039-12-29| |2038-12-30| 2039-12-30| |2038-12-31| 2039-12-31| |2040-12-30| 2041-12-30| |2040-12-31| 2041-12-31| |2041-12-29| 2042-12-29| |2041-12-30| 2042-12-30| |2041-12-31| 2042-12-31| |2042-12-28| 2043-12-28| |2042-12-29| 2043-12-29| |2042-12-30| 2043-12-30| |2042-12-31| 2043-12-31| |2043-12-27| 2044-12-27| |2043-12-28| 2044-12-28| |2043-12-29| 2044-12-29| |2043-12-30| 2044-12-30| |2043-12-31| 2044-12-31| |2045-12-31| 2046-12-31| |2046-12-30| 2047-12-30| |2046-12-31| 2047-12-31| |2047-12-29| 2048-12-29| |2047-12-30| 2048-12-30| |2047-12-31| 2048-12-31| |2048-12-27| 2049-12-27| |2048-12-28| 2049-12-28| 
|2048-12-29| 2049-12-29| |2048-12-30| 2049-12-30| |2048-12-31| 2049-12-31| |2049-12-26| 2050-12-26| |2049-12-27| 2050-12-27| |2049-12-28| 2050-12-28| |2049-12-29| 2050-12-29| |2049-12-30| 2050-12-30| |2049-12-31| 2050-12-31| |2051-12-31| 2052-12-31| |2052-12-29| 2053-12-29| |2052-12-30| 2053-12-30| |2052-12-31| 2053-12-31| |2053-12-28| 2054-12-28| |2053-12-29| 2054-12-29| |2053-12-30| 2054-12-30| |2053-12-31| 2054-12-31| |2054-12-27| 2055-12-27| |2054-12-28| 2055-12-28| |2054-12-29| 2055-12-29| |2054-12-30| 2055-12-30| |2054-12-31| 2055-12-31| +--+---+ {code} > Pyspark bug in date_format > -- > > Key: SPARK-28874 > URL: https://issues.apache.org/jira/browse/SPARK-28874 >
[jira] [Resolved] (SPARK-28668) Support the V2SessionCatalog with AlterTable commands
[ https://issues.apache.org/jira/browse/SPARK-28668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-28668. - Fix Version/s: 3.0.0 Assignee: Burak Yavuz Resolution: Fixed > Support the V2SessionCatalog with AlterTable commands > - > > Key: SPARK-28668 > URL: https://issues.apache.org/jira/browse/SPARK-28668 > Project: Spark > Issue Type: Planned Work > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Blocker > Fix For: 3.0.0 > > > We need to support the V2SessionCatalog with AlterTable commands so that V2 > DataSources can leverage DDL through SQL ALTER TABLE commands.
[jira] [Updated] (SPARK-28874) Pyspark bug in date_format
[ https://issues.apache.org/jira/browse/SPARK-28874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28874: - Description: Pyspark date_format adds one year in the last days of the year: Example: {code:python} from pyspark.sql.functions import date_format, lit spark.range(1).select(date_format(lit("2010-12-26"), "-MM-dd")).show() {code} {code} +---+ |date_format(2010-12-26, -MM-dd)| +---+ | 2011-12-26| +---+ {code} was: Pyspark date_format adds one year in the last days of the year: Example: {code:python} spark.range(1).select(date_format(lit("2010-12-26"), "-MM-dd")).show() {code} {code} +---+ |date_format(2010-12-26, -MM-dd)| +---+ | 2011-12-26| +---+ {code} > Pyspark bug in date_format > -- > > Key: SPARK-28874 > URL: https://issues.apache.org/jira/browse/SPARK-28874 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0, 2.3.0 >Reporter: Luis >Priority: Major > > Pyspark date_format adds one year in the last days of the year: > Example: > {code:python} > from pyspark.sql.functions import date_format, lit > spark.range(1).select(date_format(lit("2010-12-26"), "-MM-dd")).show() > {code} > {code} > +---+ > |date_format(2010-12-26, -MM-dd)| > +---+ > | 2011-12-26| > +---+ > {code}