[jira] [Issue Comment Deleted] (SPARK-26335) Add an option for Dataset#show not to care about wide characters when padding them

2018-12-11 Thread Keiji Yoshida (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keiji Yoshida updated SPARK-26335:
--
Comment: was deleted

(was: A screenshot of OASIS has been attached.)

> Add an option for Dataset#show not to care about wide characters when padding 
> them
> --
>
> Key: SPARK-26335
> URL: https://issues.apache.org/jira/browse/SPARK-26335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Keiji Yoshida
>Priority: Major
> Attachments: Screen Shot 2018-12-11 at 17.53.54.png
>
>
> h2. Issue
> https://issues.apache.org/jira/browse/SPARK-25108 makes Dataset#show care 
> about wide characters when padding them. That is useful for humans to read a 
> result of Dataset#show. On the other hand, that makes it impossible for 
> programs to parse a result of Dataset#show because each cell's length can be 
> different from its header's length. My company develops and manages a 
> Jupyter/Apache Zeppelin-like visualization tool named "OASIS" 
> ([https://databricks.com/session/oasis-collaborative-data-analysis-platform-using-apache-spark]).
>  On this application, a result of Dataset#show on a Scala or Python process 
> is parsed to visualize it as an HTML table format. (A screenshot of OASIS has 
> been attached to this ticket as a file named "Screen Shot 2018-12-11 at 
> 17.53.54.png".)
> h2. Solution
> Add an option for Dataset#show not to care about wide characters when padding 
> them.






[jira] [Updated] (SPARK-26335) Add an option for Dataset#show not to care about wide characters when padding them

2018-12-11 Thread Keiji Yoshida (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keiji Yoshida updated SPARK-26335:
--
Description: 
h2. Issue

https://issues.apache.org/jira/browse/SPARK-25108 makes Dataset#show take wide characters into account when padding them. That is useful when humans read the result of Dataset#show. On the other hand, it makes it impossible for programs to parse the result of Dataset#show, because each cell's length can differ from its header's length. My company develops and manages a Jupyter/Apache Zeppelin-like visualization tool named "OASIS" ([https://databricks.com/session/oasis-collaborative-data-analysis-platform-using-apache-spark]). In this application, the result of Dataset#show from a Scala or Python process is parsed and rendered as an HTML table. (A screenshot of OASIS has been attached to this ticket as a file named "Screen Shot 2018-12-11 at 17.53.54.png".)
h2. Solution

Add an option for Dataset#show not to care about wide characters when padding 
them.
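
For illustration, here is a minimal sketch of why fixed-width output matters for parsers and what such an option could look like. The flag name below is purely hypothetical and is not an existing Spark API; it only indicates the intended behaviour (pad by character count, ignoring display width).
{code:scala}
// Assumes a SparkSession `spark` is already in scope.
import spark.implicits._

val df = Seq(("あいうえお", 1), ("abcde", 2)).toDF("name", "id")

// Since SPARK-25108, cells containing full-width characters are padded by
// display width, so a cell's character length can differ from its header's
// length and a parser that assumes fixed-width columns breaks.
df.show()

// Hypothetical option (illustration only): pad every cell by character count,
// keeping each column exactly as wide as its header.
// df.show(numRows = 20, truncate = 20, padByDisplayWidth = false)
{code}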

  was:
h2. Issue

https://issues.apache.org/jira/browse/SPARK-25108 makes Dataset#show care about 
wide characters when padding them. That is useful for humans to read a result 
of Dataset#show. On the other hand, that makes it impossible for programs to 
parse a result of Dataset#show because each cell's length can be different from 
its header's length. My company develops and manages a Jupyter/Apache 
Zeppelin-like visualization tool named "OASIS" 
([https://databricks.com/session/oasis-collaborative-data-analysis-platform-using-apache-spark]).
 On this application, a result of Dataset#show on a Scala or Python process is 
parsed to visualize it as an HTML table format. ()
h2. Solution

Add an option for Dataset#show not to care about wide characters when padding 
them.


> Add an option for Dataset#show not to care about wide characters when padding 
> them
> --
>
> Key: SPARK-26335
> URL: https://issues.apache.org/jira/browse/SPARK-26335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Keiji Yoshida
>Priority: Major
> Attachments: Screen Shot 2018-12-11 at 17.53.54.png
>
>
> h2. Issue
> https://issues.apache.org/jira/browse/SPARK-25108 makes Dataset#show care 
> about wide characters when padding them. That is useful for humans to read a 
> result of Dataset#show. On the other hand, that makes it impossible for 
> programs to parse a result of Dataset#show because each cell's length can be 
> different from its header's length. My company develops and manages a 
> Jupyter/Apache Zeppelin-like visualization tool named "OASIS" 
> ([https://databricks.com/session/oasis-collaborative-data-analysis-platform-using-apache-spark]).
>  On this application, a result of Dataset#show on a Scala or Python process 
> is parsed to visualize it as an HTML table format. (A screenshot of OASIS has 
> been attached to this ticket as a file named "Screen Shot 2018-12-11 at 
> 17.53.54.png".)
> h2. Solution
> Add an option for Dataset#show not to care about wide characters when padding 
> them.






[jira] [Updated] (SPARK-26335) Add an option for Dataset#show not to care about wide characters when padding them

2018-12-11 Thread Keiji Yoshida (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keiji Yoshida updated SPARK-26335:
--
Description: 
h2. Issue

https://issues.apache.org/jira/browse/SPARK-25108 makes Dataset#show care about 
wide characters when padding them. That is useful for humans to read a result 
of Dataset#show. On the other hand, that makes it impossible for programs to 
parse a result of Dataset#show because each cell's length can be different from 
its header's length. My company develops and manages a Jupyter/Apache 
Zeppelin-like visualization tool named "OASIS" 
([https://databricks.com/session/oasis-collaborative-data-analysis-platform-using-apache-spark]).
 On this application, a result of Dataset#show on a Scala or Python process is 
parsed to visualize it as an HTML table format. ()
h2. Solution

Add an option for Dataset#show not to care about wide characters when padding 
them.

  was:
https://issues.apache.org/jira/browse/SPARK-25108 makes Dataset#show care about 
wide characters when padding them. That is useful for humans to read a result 
of Dataset#show. On the other hand, that makes it impossible for programs to 
parse a result of Dataset#show because each cell's length can be different 
from its header's length. My company develops and manages a Jupyter/Apache 
Zeppelin-like visualization tool named "OASIS" 
([https://databricks.com/session/oasis-collaborative-data-analysis-platform-using-apache-spark]).
 On this application, a result of Dataset#show is parsed to visualize it as an 
HTML table format.

So, it is preferable to add an option for Dataset#show not to care about wide 
characters when padding them by adding a parameter such as "fixedColLength" to 
Dataset#show.


> Add an option for Dataset#show not to care about wide characters when padding 
> them
> --
>
> Key: SPARK-26335
> URL: https://issues.apache.org/jira/browse/SPARK-26335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Keiji Yoshida
>Priority: Major
> Attachments: Screen Shot 2018-12-11 at 17.53.54.png
>
>
> h2. Issue
> https://issues.apache.org/jira/browse/SPARK-25108 makes Dataset#show care 
> about wide characters when padding them. That is useful for humans to read a 
> result of Dataset#show. On the other hand, that makes it impossible for 
> programs to parse a result of Dataset#show because each cell's length can be 
> different from its header's length. My company develops and manages a 
> Jupyter/Apache Zeppelin-like visualization tool named "OASIS" 
> ([https://databricks.com/session/oasis-collaborative-data-analysis-platform-using-apache-spark]).
>  On this application, a result of Dataset#show on a Scala or Python process 
> is parsed to visualize it as an HTML table format. ()
> h2. Solution
> Add an option for Dataset#show not to care about wide characters when padding 
> them.






[jira] [Commented] (SPARK-23674) Add Spark ML Listener for Tracking ML Pipeline Status

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718527#comment-16718527
 ] 

ASF GitHub Bot commented on SPARK-23674:


HyukjinKwon opened a new pull request #23263: [SPARK-23674][ML] Adds Spark ML 
Events
URL: https://github.com/apache/spark/pull/23263
 
 
   ## What changes were proposed in this pull request?
   
   This PR proposes to add ML events so that other developers can track them and take actions on them.
   
   ## Introduction
   
   ML events (like SQL events) can be quite useful when people want to track and act on the corresponding ML operations. For instance, I have been working on integrating Apache Spark with [Apache Atlas](https://atlas.apache.org/QuickStart.html). With some custom changes on top of this PR, I can visualise the ML pipeline as below:
   
   
![spark_ml_streaming_lineage](https://user-images.githubusercontent.com/6477701/49682779-394bca80-faf5-11e8-85b8-5fae28b784b3.png)
   
   Another benefit worth considering is that these events can be combined with the existing SQL/Streaming events, for instance to trace where the input `Dataset` originated. With current Apache Spark, I can already visualise SQL operations as below:
   
   ![screen shot 2018-12-10 at 9 41 36 
am](https://user-images.githubusercontent.com/6477701/49706269-d9bdfe00-fc5f-11e8-943a-3309d1856ba5.png)
   
   I think we can combine those existing lineages to easily understand where the data comes from and where it goes. Currently, the ML side is a gap, so the lineages can't be fully connected in current Apache Spark.
   
   On top of that, tracking SQL/Streaming operations has already proven how useful this kind of event is. Likewise, I would like to propose ML events as well (as `@Unstable` APIs for now, with no guarantee about stability).
   
   ## Implementation Details
   
   ### Sends event (but not expose ML specific listener)
   
   **`mllib/src/main/scala/org/apache/spark/ml/events.scala`**
   
   ```scala
   @Unstable
   case class ...StartEvent(caller, input)
   @Unstable
   case class ...EndEvent(caller, output)
   
   object MLEvents {
 // Wrappers to send events:
 // def with...Event(body) = {
 //   body()
 //   SparkContext.getOrCreate().listenerBus.post(event)
 // }
   }
   ```
   
   This approach mimics both:
   
   **1. Catalog events (see 
`org/apache/spark/sql/catalyst/catalog/events.scala`)**
   
   - This allows a Catalog-specific listener, `ExternalCatalogEventListener`, to be added.
   
   - It's implemented by wrapping the whole `ExternalCatalog` in `ExternalCatalogWithListener`,
   which delegates the operations to `ExternalCatalog`.
   
   This is not quite possible in this case because most instances (like `Pipeline`) are created directly in most cases. We might be able to do that by extending `ListenerBus` for all possible instances, but IMHO that's too invasive. Also, exposing another ML-specific listener sounds like too much at this stage. Therefore, I simply borrowed the file name and structure here.
   
   **2. SQL execution events (see 
`org/apache/spark/sql/execution/SQLExecution.scala`)**
   
   - Add an object that wraps a body to send events
   
   The current approach is rather close to this. It has a `with...` wrapper to send events. I borrowed this approach to be consistent.
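   
   To illustrate how a downstream developer could consume such events, here is a minimal sketch. The event case class below (`FitStartEvent` and its fields) is an assumption made up for this example; the actual event classes added by this PR may be named and shaped differently.
   
   ```scala
   import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
   
   // Hypothetical event shape, for illustration only.
   case class FitStartEvent(estimatorUid: String, datasetSchema: String) extends SparkListenerEvent
   
   // Because the events are posted to the shared listener bus, a plain
   // SparkListener can react to them via onOtherEvent.
   class MLLineageListener extends SparkListener {
     override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
       case FitStartEvent(uid, schema) =>
         println(s"fit() started: estimator=$uid, input schema=$schema")
       case _ => // ignore unrelated events
     }
   }
   
   // Registration, assuming a SparkSession `spark` is in scope:
   // spark.sparkContext.addSparkListener(new MLLineageListener)
   ```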
   
   
   ### Add `...Impl` methods to wrap each to send events
   
   **`mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala`**
   
   ```diff
   - def save(...) = { saveImpl(...) }
   + def save(...) = MLEvents.withSaveInstanceEvent { saveImpl(...) }
 def saveImpl(...): Unit = ...
   ```
   
 Note that `saveImpl` was already implemented unlike other instances below.
   
   
   ```diff
   - def load(...): T
   + def load(...): T = MLEvents.withLoadInstanceEvent { loadImpl(...) }
   + def loadImpl(...): T
   ```
   
   **`mllib/src/main/scala/org/apache/spark/ml/Estimator.scala`**
   
   ```diff
   - def fit(...): Model
   + def fit(...): Model = MLEvents.withFitEvent { fitImpl(...) }
   + def fitImpl(...): Model
   ```
   
   **`mllib/src/main/scala/org/apache/spark/ml/Transformer.scala`**
   
   ```diff
   - def transform(...): DataFrame
   + def transform(...): DataFrame = MLEvents.withTransformEvent { 
transformImpl(...) }
   + def transformImpl(...): DataFrame
   ```
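   
   As an illustration, a third-party `Transformer` written against this proposed scheme might look like the sketch below. This is only a sketch: overriding `transformImpl` instead of `transform` is what this PR proposes, and the concrete column logic is made up for the example.
   
   ```scala
   import org.apache.spark.ml.Transformer
   import org.apache.spark.ml.param.ParamMap
   import org.apache.spark.ml.util.Identifiable
   import org.apache.spark.sql.{DataFrame, Dataset}
   import org.apache.spark.sql.functions.upper
   import org.apache.spark.sql.types.StructType
   
   // Callers keep using transform(), which fires the events and delegates here.
   class UpperCaseTransformer(override val uid: String) extends Transformer {
     def this() = this(Identifiable.randomUID("upper"))
   
     override protected def transformImpl(dataset: Dataset[_]): DataFrame =
       dataset.withColumn("text", upper(dataset("text")))
   
     override def transformSchema(schema: StructType): StructType = schema
   
     override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
   }
   ```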
   
   This approach follows the existing pattern in ML, as shown below:
   
   **1. `transform` and `transformImpl`**
   
   
https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L202-L213
   
   
https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L191-L196
   
   

[jira] [Commented] (SPARK-23674) Add Spark ML Listener for Tracking ML Pipeline Status

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718526#comment-16718526
 ] 

ASF GitHub Bot commented on SPARK-23674:


felixcheung closed pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
URL: https://github.com/apache/spark/pull/23263
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/mllib/src/main/scala/org/apache/spark/ml/Estimator.scala 
b/mllib/src/main/scala/org/apache/spark/ml/Estimator.scala
index 1247882d6c1bd..a3c4db06862f8 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/Estimator.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/Estimator.scala
@@ -65,7 +65,19 @@ abstract class Estimator[M <: Model[M]] extends 
PipelineStage {
* Fits a model to the input data.
*/
   @Since("2.0.0")
-  def fit(dataset: Dataset[_]): M
+  def fit(dataset: Dataset[_]): M = MLEvents.withFitEvent(this, dataset) {
+fitImpl(dataset)
+  }
+
+  /**
+   * `fit()` handles events and then calls this method. Subclasses should override this
+   * method to implement the actual fitting of a model to the input data.
+   */
+  @Since("3.0.0")
+  protected def fitImpl(dataset: Dataset[_]): M = {
+// Keep this default body for backward compatibility.
+throw new UnsupportedOperationException("fitImpl is not implemented.")
+  }
 
   /**
* Fits multiple models to the input data with multiple sets of parameters.
diff --git a/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala 
b/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
index 103082b7b9766..1c781faff129e 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala
@@ -132,7 +132,8 @@ class Pipeline @Since("1.4.0") (
* @return fitted pipeline
*/
   @Since("2.0.0")
-  override def fit(dataset: Dataset[_]): PipelineModel = {
+  override def fit(dataset: Dataset[_]): PipelineModel = super.fit(dataset)
+  override protected def fitImpl(dataset: Dataset[_]): PipelineModel = {
 transformSchema(dataset.schema, logging = true)
 val theStages = $(stages)
 // Search for the last estimator.
@@ -210,7 +211,7 @@ object Pipeline extends MLReadable[Pipeline] {
 /** Checked against metadata when loading model */
 private val className = classOf[Pipeline].getName
 
-override def load(path: String): Pipeline = {
+override protected def loadImpl(path: String): Pipeline = {
   val (uid: String, stages: Array[PipelineStage]) = 
SharedReadWrite.load(className, sc, path)
   new Pipeline(uid).setStages(stages)
 }
@@ -301,7 +302,8 @@ class PipelineModel private[ml] (
   }
 
   @Since("2.0.0")
-  override def transform(dataset: Dataset[_]): DataFrame = {
+  override def transform(dataset: Dataset[_]): DataFrame = super.transform(dataset)
+  override protected def transformImpl(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)
 stages.foldLeft(dataset.toDF)((cur, transformer) => 
transformer.transform(cur))
   }
@@ -344,7 +346,7 @@ object PipelineModel extends MLReadable[PipelineModel] {
 /** Checked against metadata when loading model */
 private val className = classOf[PipelineModel].getName
 
-override def load(path: String): PipelineModel = {
+override protected def loadImpl(path: String): PipelineModel = {
   val (uid: String, stages: Array[PipelineStage]) = 
SharedReadWrite.load(className, sc, path)
   val transformers = stages map {
 case stage: Transformer => stage
diff --git a/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala 
b/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala
index d8f3dfa874439..3731ddae0160c 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala
@@ -94,7 +94,10 @@ abstract class Predictor[
   /** @group setParam */
   def setPredictionCol(value: String): Learner = set(predictionCol, 
value).asInstanceOf[Learner]
 
-  override def fit(dataset: Dataset[_]): M = {
+  // Explicitly call parent's load. Otherwise, MiMa complains.
+  override def fit(dataset: Dataset[_]): M = super.fit(dataset)
+
+  override protected def fitImpl(dataset: Dataset[_]): M = {
 // This handles a few items such as schema validation.
 // Developers only need to implement train().
 transformSchema(dataset.schema, logging = true)
@@ -199,7 +202,8 @@ abstract class PredictionModel[FeaturesType, M <: 
PredictionModel[FeaturesType,
* @param dataset input dataset
* @return transformed dataset with [[predictionCol]] of type `Double`
*/
-  override def transform(dataset: Dataset[_]): DataFrame = {
+  

[jira] [Comment Edited] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2018-12-11 Thread Ayush Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718510#comment-16718510
 ] 

Ayush Chauhan edited comment on SPARK-19185 at 12/12/18 6:52 AM:
-

Is this issue resolved, given that SPARK-22606 is still in progress? I am getting this error with Spark 2.3.2 and spark-streaming-kafka-0-10.

So is this workaround the solution to this problem? Is there any downside to setting this property?
{code:java}
spark.streaming.kafka.consumer.cache.enabled=false 
{code}
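
For reference, a minimal sketch of one way to apply this workaround when building the streaming context (the property can equally be passed via --conf to spark-submit). Note that this only disables the consumer cache; it is a workaround, not a fix for the underlying locking issue.
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Disable the cached Kafka consumer so concurrent tasks do not share one
// KafkaConsumer instance; expect some extra consumer setup cost per task.
val conf = new SparkConf()
  .setAppName("kafka-windowing-workaround")
  .set("spark.streaming.kafka.consumer.cache.enabled", "false")

val ssc = new StreamingContext(conf, Seconds(10))
{code}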


was (Author: ayushchauhan):
Is this issue resolved because 
[SPARK-22606|https://issues.apache.org/jira/browse/SPARK-22606] is still in 
progress? Because I am getting this error in spark 2.3.2 and 
spark-streaming-kafka-0-10. 

So this workaround is the solution of this problem? Is there any downside of 
setting this property?
{code:java}
spark.streaming.kafka.consumer.cache.enabled=false 
{code}

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>Assignee: Gabor Somogyi
>Priority: Major
>  Labels: streaming, windowing
> Fix For: 2.4.0
>
>
> We've been running into ConcurrentModificationExceptions "KafkaConsumer is 
> not safe for multi-threaded access" with the CachedKafkaConsumer. I've been 
> working through debugging this issue and after looking through some of the 
> spark source code I think this is a bug.
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 

[jira] [Comment Edited] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2018-12-11 Thread Ayush Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718510#comment-16718510
 ] 

Ayush Chauhan edited comment on SPARK-19185 at 12/12/18 6:52 AM:
-

Is this issue resolved because SPARK-22606 is still in progress? I am getting 
this error in spark 2.3.2 and spark-streaming-kafka-0-10. 

So this workaround is the solution to this problem? Is there any downside of 
setting this property?
{code:java}
spark.streaming.kafka.consumer.cache.enabled=false 
{code}


was (Author: ayushchauhan):
Is this issue resolved because SPARK-22606 is still in progress? I am getting 
this error in spark 2.3.2 and spark-streaming-kafka-0-10. 

So this workaround is the solution to this problem? Is there any downside of 
setting this property?
{code:java}
spark.streaming.kafka.consumer.cache.enabled=false 
{code}

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>Assignee: Gabor Somogyi
>Priority: Major
>  Labels: streaming, windowing
> Fix For: 2.4.0
>
>
> We've been running into ConcurrentModificationExceptions "KafkaConsumer is 
> not safe for multi-threaded access" with the CachedKafkaConsumer. I've been 
> working through debugging this issue and after looking through some of the 
> spark source code I think this is a bug.
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 

[jira] [Comment Edited] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2018-12-11 Thread Ayush Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718510#comment-16718510
 ] 

Ayush Chauhan edited comment on SPARK-19185 at 12/12/18 6:51 AM:
-

Is this issue resolved because 
[SPARK-22606|https://issues.apache.org/jira/browse/SPARK-22606] is still in 
progress? Because I am getting this error in spark 2.3.2 and 
spark-streaming-kafka-0-10. 

So this workaround is the solution of this problem? Is there any downside of 
setting this property?
{code:java}
spark.streaming.kafka.consumer.cache.enabled=false 
{code}


was (Author: ayushchauhan):
Is this issue resolved? Because I am getting this error in spark 2.3.2 and 
spark-streaming-kafka-0-10. 

So this workaround is the solution of this problem? Is there any downside of 
setting this property?
{code:java}
spark.streaming.kafka.consumer.cache.enabled=false 
{code}

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>Assignee: Gabor Somogyi
>Priority: Major
>  Labels: streaming, windowing
> Fix For: 2.4.0
>
>
> We've been running into ConcurrentModificationExceptions "KafkaConsumer is 
> not safe for multi-threaded access" with the CachedKafkaConsumer. I've been 
> working through debugging this issue and after looking through some of the 
> spark source code I think this is a bug.
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 

[jira] [Commented] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2018-12-11 Thread Ayush Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718510#comment-16718510
 ] 

Ayush Chauhan commented on SPARK-19185:
---

Is this issue resolved? Because I am getting this error in spark 2.3.2 and 
spark-streaming-kafka-0-10. 

So this workaround is the solution of this problem? Is there any downside of 
setting this property?
{code:java}
spark.streaming.kafka.consumer.cache.enabled=false 
{code}

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>Assignee: Gabor Somogyi
>Priority: Major
>  Labels: streaming, windowing
> Fix For: 2.4.0
>
>
> We've been running into ConcurrentModificationExceptions "KafkaConsumer is 
> not safe for multi-threaded access" with the CachedKafkaConsumer. I've been 
> working through debugging this issue and after looking through some of the 
> spark source code I think this is a bug.
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951)
>   at 
> 

[jira] [Commented] (SPARK-26265) deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718438#comment-16718438
 ] 

ASF GitHub Bot commented on SPARK-26265:


viirya opened a new pull request #23294: [SPARK-26265][Core][Followup] Put 
freePage into a finally block
URL: https://github.com/apache/spark/pull/23294
 
 
   ## What changes were proposed in this pull request?
   
   Based on the [comment](https://github.com/apache/spark/pull/23272#discussion_r240735509), it seems better to put `freePage` into a `finally` block. This patch is a follow-up that does so.
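   
   For context, a schematic sketch of the pattern (simplified names, not the actual Spark code): the resource is acquired before the `try`, and releasing it moves into `finally` so it is freed even when the body throws.
   
   ```scala
   // Schematic only: `finally` guarantees the release runs on both the
   // success path and the failure path.
   def withPage[T](allocate: () => Long)(free: Long => Unit)(body: Long => T): T = {
     val page = allocate()
     try {
       body(page)   // may throw
     } finally {
       free(page)   // always runs
     }
   }
   ```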
   
   ## How was this patch tested?
   
   Existing tests.




> deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator
> --
>
> Key: SPARK-26265
> URL: https://issues.apache.org/jira/browse/SPARK-26265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: qian han
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> The application is running on a cluster with 72000 cores and 182000G mem.
> Enviroment:
> |spark.dynamicAllocation.minExecutors|5|
> |spark.dynamicAllocation.initialExecutors|30|
> |spark.dynamicAllocation.maxExecutors|400|
> |spark.executor.cores|4|
> |spark.executor.memory|20g|
>  
>   
> Stage description:
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:364)
>  org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:422) 
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:357) 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:193)
>  
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>  sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  java.lang.reflect.Method.invoke(Method.java:498) 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
>  org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198) 
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228) 
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>  
> jstack information as follow:
> Found one Java-level deadlock: = 
> "Thread-ScriptTransformation-Feed": waiting to lock monitor 
> 0x00e0cb18 (object 0x0002f1641538, a 
> org.apache.spark.memory.TaskMemoryManager), which is held by "Executor task 
> launch worker for task 18899" "Executor task launch worker for task 18899": 
> waiting to lock monitor 0x00e09788 (object 0x000302faa3b0, a 
> org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator), which is held by 
> "Thread-ScriptTransformation-Feed" Java stack information for the threads 
> listed above: === 
> "Thread-ScriptTransformation-Feed": at 
> org.apache.spark.memory.TaskMemoryManager.freePage(TaskMemoryManager.java:332)
>  - waiting to lock <0x0002f1641538> (a 
> org.apache.spark.memory.TaskMemoryManager) at 
> org.apache.spark.memory.MemoryConsumer.freePage(MemoryConsumer.java:130) at 
> org.apache.spark.unsafe.map.BytesToBytesMap.access$300(BytesToBytesMap.java:66)
>  at 
> org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.advanceToNextPage(BytesToBytesMap.java:274)
>  - locked <0x000302faa3b0> (a 
> org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator) at 
> org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.next(BytesToBytesMap.java:313)
>  at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap$1.next(UnsafeFixedWidthAggregationMap.java:173)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> 

[jira] [Commented] (SPARK-26327) Metrics in FileSourceScanExec not update correctly while relation.partitionSchema is set

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718428#comment-16718428
 ] 

ASF GitHub Bot commented on SPARK-26327:


dongjoon-hyun closed pull request #23287: [SPARK-26327][SQL][BACKPORT-2.4] Bug 
fix for `FileSourceScanExec` metrics update
URL: https://github.com/apache/spark/pull/23287
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
index 36ed016773b67..5433c30afd6bb 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
@@ -185,19 +185,14 @@ case class FileSourceScanExec(
   partitionSchema = relation.partitionSchema,
   relation.sparkSession.sessionState.conf)
 
+  private var metadataTime = 0L
+
   @transient private lazy val selectedPartitions: Seq[PartitionDirectory] = {
 val optimizerMetadataTimeNs = 
relation.location.metadataOpsTimeNs.getOrElse(0L)
 val startTime = System.nanoTime()
 val ret = relation.location.listFiles(partitionFilters, dataFilters)
 val timeTakenMs = ((System.nanoTime() - startTime) + 
optimizerMetadataTimeNs) / 1000 / 1000
-
-metrics("numFiles").add(ret.map(_.files.size.toLong).sum)
-metrics("metadataTime").add(timeTakenMs)
-
-val executionId = 
sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY)
-SQLMetrics.postDriverMetricUpdates(sparkContext, executionId,
-  metrics("numFiles") :: metrics("metadataTime") :: Nil)
-
+metadataTime = timeTakenMs
 ret
   }
 
@@ -308,6 +303,8 @@ case class FileSourceScanExec(
   }
 
   private lazy val inputRDD: RDD[InternalRow] = {
+// Update metrics so that they take effect in both the code generation node and the normal node.
+updateDriverMetrics()
 val readFile: (PartitionedFile) => Iterator[InternalRow] =
   relation.fileFormat.buildReaderWithPartitionValues(
 sparkSession = relation.sparkSession,
@@ -524,6 +521,19 @@ case class FileSourceScanExec(
 }
   }
 
+  /**
+   * Send the updated metrics to the driver. By the time this function is called,
+   * selectedPartitions has been initialized. See SPARK-26327 for more detail.
+   */
+  private def updateDriverMetrics() = {
+metrics("numFiles").add(selectedPartitions.map(_.files.size.toLong).sum)
+metrics("metadataTime").add(metadataTime)
+
+val executionId = 
sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY)
+SQLMetrics.postDriverMetricUpdates(sparkContext, executionId,
+  metrics("numFiles") :: metrics("metadataTime") :: Nil)
+  }
+
   override def doCanonicalize(): FileSourceScanExec = {
 FileSourceScanExec(
   relation,
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
index 085a445488480..c550bf20b92b5 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
@@ -570,4 +570,19 @@ class SQLMetricsSuite extends SparkFunSuite with 
SQLMetricsTestUtils with Shared
   }
 }
   }
+
+  test("SPARK-26327: FileSourceScanExec metrics") {
+withTable("testDataForScan") {
+  spark.range(10).selectExpr("id", "id % 3 as p")
+.write.partitionBy("p").saveAsTable("testDataForScan")
+  // The execution plan only has 1 FileScan node.
+  val df = spark.sql(
+"SELECT * FROM testDataForScan WHERE p = 1")
+  testSparkPlanMetrics(df, 1, Map(
+0L -> (("Scan parquet default.testdataforscan", Map(
+  "number of output rows" -> 3L,
+  "number of files" -> 2L
+  )
+}
+  }
 }


 




> Metrics in FileSourceScanExec not update correctly while 
> relation.partitionSchema is set
> 
>
> Key: SPARK-26327
> URL: https://issues.apache.org/jira/browse/SPARK-26327
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuanjian Li

[jira] [Resolved] (SPARK-26333) FsHistoryProviderSuite failed because setReadable doesn't work in RedHat

2018-12-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26333.

Resolution: Not A Bug

> FsHistoryProviderSuite failed because setReadable doesn't work in RedHat
> 
>
> Key: SPARK-26333
> URL: https://issues.apache.org/jira/browse/SPARK-26333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: deshanxiao
>Priority: Major
>
> FsHistoryProviderSuite failed in case "SPARK-3697: ignore files that cannot 
> be read.". I try to invoke logFile2.canRead after invoking 
> "setReadable(false, false)" . And I find that the result of 
> "logFile2.canRead" is true but in my ubuntu16.04 return false.
> The environment:
> RedHat:
> Linux version 3.10.0-693.2.2.el7.x86_64 (buil...@kbuilder.dev.centos.org) 
> (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Sep 12 
> 22:26:13 UTC 2017
> JDK
> Java version: 1.8.0_151, vendor: Oracle Corporation
> {code:java}
>  org.scalatest.exceptions.TestFailedException: 2 was not equal to 1
>   at 
> org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340)
>   at 
> org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668)
>   at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:183)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:182)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite.org$apache$spark$deploy$history$FsHistoryProviderSuite$$updateAndCheck(FsHistoryProviderSuite.scala:841)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:182)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51)
>   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
> {code}






[jira] [Commented] (SPARK-26332) Spark sql write orc table on viewFS throws exception

2018-12-11 Thread Bang Xiao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718393#comment-16718393
 ] 

Bang Xiao commented on SPARK-26332:
---

[~joshrosen] Hi, I see that you maintain the hive1.2.1-spark2 branch. Could you please check this?

> Spark sql write orc table on viewFS throws exception
> 
>
> Key: SPARK-26332
> URL: https://issues.apache.org/jira/browse/SPARK-26332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: Bang Xiao
>Priority: Major
>
> Using SparkSQL write orc table on viewFs will cause exception:
> {code:java}
> Task failed while writing rows.
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.fs.viewfs.NotInMountpointException: 
> getDefaultReplication on empty path is invalid
> at 
> org.apache.hadoop.fs.viewfs.ViewFileSystem.getDefaultReplication(ViewFileSystem.java:634)
> at org.apache.hadoop.hive.ql.io.orc.WriterImpl.getStream(WriterImpl.java:2103)
> at 
> org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:2120)
> at 
> org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:352)
> at 
> org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168)
> at 
> org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157)
> at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2413)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:86)
> at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:392)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
> ... 8 more
> Suppressed: org.apache.hadoop.fs.viewfs.NotInMountpointException: 
> getDefaultReplication on empty path is invalid
> at 
> org.apache.hadoop.fs.viewfs.ViewFileSystem.getDefaultReplication(ViewFileSystem.java:634)
> at org.apache.hadoop.hive.ql.io.orc.WriterImpl.getStream(WriterImpl.java:2103)
> at 
> org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:2120)
> at org.apache.hadoop.hive.ql.io.orc.WriterImpl.close(WriterImpl.java:2425)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.close(OrcOutputFormat.java:106)
> at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.close(HiveFileFormat.scala:154)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$1.apply$mcV$sp(FileFormatWriter.scala:275)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1423)
> ... 9 more{code}
> this exception can be reproduced by follow sqls:
> {code:java}
> spark-sql> CREATE EXTERNAL TABLE test_orc(test_id INT, test_age INT, 
> test_rank INT) STORED AS ORC LOCATION 
> 'viewfs://nsX/user/hive/warehouse/ultraman_tmp.db/test_orc';
> spark-sql> CREATE TABLE source(id INT, age INT, rank INT);
> spark-sql> INSERT INTO source VALUES(1,1,1);
> spark-sql> INSERT OVERWRITE TABLE test_orc SELECT * FROM source;
> {code}
> this is related to https://issues.apache.org/jira/browse/HIVE-10790.  and 
> resolved 

[jira] [Commented] (SPARK-26333) FsHistoryProviderSuite failed because setReadable doesn't work in RedHat

2018-12-11 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718397#comment-16718397
 ] 

Marcelo Vanzin commented on SPARK-26333:


{{setReadable}} works as root. But being root, you can read anything, 
regardless of permissions.
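
For illustration, a minimal sketch of the check involved, assuming it is run as a regular (non-root) user. Run as root, `canRead` still returns true regardless of the permission bits, which is consistent with the "2 was not equal to 1" failure in the report.
{code:scala}
import java.nio.file.Files

// As a non-root user this prints false; as root it typically prints true,
// because root bypasses file permission checks.
val logFile = Files.createTempFile("events", ".log").toFile
logFile.setReadable(false, false) // clear the read bit for owner and others
println(logFile.canRead)
logFile.delete()
{code}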

> FsHistoryProviderSuite failed because setReadable doesn't work in RedHat
> 
>
> Key: SPARK-26333
> URL: https://issues.apache.org/jira/browse/SPARK-26333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: deshanxiao
>Priority: Major
>
> FsHistoryProviderSuite failed in the case "SPARK-3697: ignore files that cannot 
> be read.". I invoked logFile2.canRead after calling "setReadable(false, false)" 
> and found that "logFile2.canRead" returns true on RedHat, whereas on my 
> Ubuntu 16.04 it returns false.
> The environment:
> RedHat:
> Linux version 3.10.0-693.2.2.el7.x86_64 (buil...@kbuilder.dev.centos.org) 
> (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Sep 12 
> 22:26:13 UTC 2017
> JDK
> Java version: 1.8.0_151, vendor: Oracle Corporation
> {code:java}
>  org.scalatest.exceptions.TestFailedException: 2 was not equal to 1
>   at 
> org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340)
>   at 
> org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668)
>   at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:183)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:182)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite.org$apache$spark$deploy$history$FsHistoryProviderSuite$$updateAndCheck(FsHistoryProviderSuite.scala:841)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:182)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51)
>   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26343) Speed up running the kubernetes integration tests locally

2018-12-11 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718396#comment-16718396
 ] 

Marcelo Vanzin commented on SPARK-26343:


You can do that in sbt ("kubernetes-integration-tests/test"), but expanding 
that to maven would be good too.

> Speed up running the kubernetes integration tests locally
> -
>
> Key: SPARK-26343
> URL: https://issues.apache.org/jira/browse/SPARK-26343
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
>
> The Kubernetes integration tests right now allow you to specify a Docker tag, 
> but even when you do they still require a tgz to extract, and they don't 
> really need that extracted version. We could make it easier and faster for 
> folks to run the integration tests locally by not requiring a distribution 
> tarball.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26342) Support for NFS mount for Kubernetes

2018-12-11 Thread Eric Carlson (JIRA)
Eric Carlson created SPARK-26342:


 Summary: Support for NFS mount for Kubernetes
 Key: SPARK-26342
 URL: https://issues.apache.org/jira/browse/SPARK-26342
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Eric Carlson


Currently only hostPath, emptyDir, and PVC volume types are accepted for 
Kubernetes-deployed drivers and executors. The ability to mount NFS paths 
would allow access to a common and easy-to-deploy shared storage solution.
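
For reference, a rough sketch of the volume configuration pattern the existing 
types already use in Spark 2.4, written against SparkConf; the nfs keys in the 
comments are purely hypothetical and only illustrate the naming scheme such 
support would presumably follow.

{code:scala}
import org.apache.spark.SparkConf

// hostPath example using the existing spark.kubernetes.*.volumes.* key scheme.
val conf = new SparkConf()
  .set("spark.kubernetes.driver.volumes.hostPath.sparkdata.mount.path", "/data/spark")
  .set("spark.kubernetes.driver.volumes.hostPath.sparkdata.options.path", "/data/spark")

// Hypothetical keys if an "nfs" volume type were added with the same scheme:
// .set("spark.kubernetes.driver.volumes.nfs.shared.mount.path", "/mnt/shared")
// .set("spark.kubernetes.driver.volumes.nfs.shared.options.server", "nfs.example.com")
// .set("spark.kubernetes.driver.volumes.nfs.shared.options.path", "/exports/shared")
{code}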



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26344) Support for flexVolume mount for Kubernetes

2018-12-11 Thread Eric Carlson (JIRA)
Eric Carlson created SPARK-26344:


 Summary: Support for flexVolume mount for Kubernetes
 Key: SPARK-26344
 URL: https://issues.apache.org/jira/browse/SPARK-26344
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Eric Carlson


Currently only hostPath, emptyDir, and PVC volume types are accepted for 
Kubernetes-deployed drivers and executors.

flexVolume types allow for pluggable volume drivers to be used in Kubernetes - 
a widely used example of this is the Rook deployment of CephFS, which provides 
a POSIX-compliant distributed filesystem integrated into K8s.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26333) FsHistoryProviderSuite failed because setReadable doesn't work in RedHat

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718387#comment-16718387
 ] 

ASF GitHub Bot commented on SPARK-26333:


deshanxiao closed pull request #23282: 
[SPARK-26333][TEST]FsHistoryProviderSuite failed because setReadable doesn't 
work in RedHat
URL: https://github.com/apache/spark/pull/23282
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala
 
b/core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala
index c1ae27aa940f6..e376ce85be345 100644
--- 
a/core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala
+++ 
b/core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala
@@ -176,6 +176,7 @@ class FsHistoryProviderSuite extends SparkFunSuite with 
BeforeAndAfter with Matc
   SparkListenerApplicationEnd(2L)
   )
 logFile2.setReadable(false, false)
+assume(!logFile2.canRead)
 
 updateAndCheck(provider) { list =>
   list.size should be (1)


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
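
A note on the guard used in the diff above: ScalaTest's assume() cancels (skips) a 
test whose precondition does not hold rather than failing it. A rough, 
self-contained sketch with an illustrative test name and temp file, not the actual 
suite:

{code:scala}
import java.io.File
import org.scalatest.FunSuite

class ReadableGuardExample extends FunSuite {
  test("ignore files that cannot be read") {
    val logFile = File.createTempFile("events", ".log")
    try {
      logFile.setReadable(false, false)
      // Cancels (rather than fails) the test if the file is still readable,
      // e.g. when the suite runs as root.
      assume(!logFile.canRead)
      // ... the rest of the test can rely on the file being unreadable ...
    } finally {
      logFile.delete()
    }
  }
}
{code}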


> FsHistoryProviderSuite failed because setReadable doesn't work in RedHat
> 
>
> Key: SPARK-26333
> URL: https://issues.apache.org/jira/browse/SPARK-26333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: deshanxiao
>Priority: Major
>
> FsHistoryProviderSuite failed in the case "SPARK-3697: ignore files that cannot 
> be read.". I invoked logFile2.canRead after calling "setReadable(false, false)" 
> and found that "logFile2.canRead" returns true on RedHat, whereas on my 
> Ubuntu 16.04 it returns false.
> The environment:
> RedHat:
> Linux version 3.10.0-693.2.2.el7.x86_64 (buil...@kbuilder.dev.centos.org) 
> (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Sep 12 
> 22:26:13 UTC 2017
> JDK
> Java version: 1.8.0_151, vendor: Oracle Corporation
> {code:java}
>  org.scalatest.exceptions.TestFailedException: 2 was not equal to 1
>   at 
> org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340)
>   at 
> org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668)
>   at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:183)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:182)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite.org$apache$spark$deploy$history$FsHistoryProviderSuite$$updateAndCheck(FsHistoryProviderSuite.scala:841)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:182)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51)
>   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 

[jira] [Created] (SPARK-26343) Running the kubernetes

2018-12-11 Thread holdenk (JIRA)
holdenk created SPARK-26343:
---

 Summary: Running the kubernetes 
 Key: SPARK-26343
 URL: https://issues.apache.org/jira/browse/SPARK-26343
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Tests
Affects Versions: 3.0.0
Reporter: holdenk
Assignee: holdenk


The Kubernetes integration tests right now allow you to specify a Docker tag, but 
even when you do they still require a tgz to extract, and they don't really need 
that extracted version. We could make it easier and faster for folks to run the 
integration tests locally by not requiring a distribution tarball.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26343) Speed up running the kubernetes integration tests locally

2018-12-11 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-26343:

Summary: Speed up running the kubernetes integration tests locally  (was: 
Running the kubernetes )

> Speed up running the kubernetes integration tests locally
> -
>
> Key: SPARK-26343
> URL: https://issues.apache.org/jira/browse/SPARK-26343
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
>
> The Kubernetes integration tests right now allow you to specify a Docker tag, 
> but even when you do they still require a tgz to extract, and they don't 
> really need that extracted version. We could make it easier and faster for 
> folks to run the integration tests locally by not requiring a distribution 
> tarball.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26333) FsHistoryProviderSuite failed because setReadable doesn't work in RedHat

2018-12-11 Thread deshanxiao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718385#comment-16718385
 ] 

deshanxiao commented on SPARK-26333:


[~vanzin] Yes, you are right! Thank you very much! But why doesn't setReadable 
work as root? 

> FsHistoryProviderSuite failed because setReadable doesn't work in RedHat
> 
>
> Key: SPARK-26333
> URL: https://issues.apache.org/jira/browse/SPARK-26333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: deshanxiao
>Priority: Major
>
> FsHistoryProviderSuite failed in the case "SPARK-3697: ignore files that cannot 
> be read.". I invoked logFile2.canRead after calling "setReadable(false, false)" 
> and found that "logFile2.canRead" returns true on RedHat, whereas on my 
> Ubuntu 16.04 it returns false.
> The environment:
> RedHat:
> Linux version 3.10.0-693.2.2.el7.x86_64 (buil...@kbuilder.dev.centos.org) 
> (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Sep 12 
> 22:26:13 UTC 2017
> JDK
> Java version: 1.8.0_151, vendor: Oracle Corporation
> {code:java}
>  org.scalatest.exceptions.TestFailedException: 2 was not equal to 1
>   at 
> org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340)
>   at 
> org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668)
>   at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:183)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:182)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite.org$apache$spark$deploy$history$FsHistoryProviderSuite$$updateAndCheck(FsHistoryProviderSuite.scala:841)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:182)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51)
>   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26193) Implement shuffle write metrics in SQL

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718368#comment-16718368
 ] 

ASF GitHub Bot commented on SPARK-26193:


asfgit closed pull request #23286: [SPARK-26193][SQL][Follow Up] Read metrics 
rename and display text changes
URL: https://github.com/apache/spark/pull/23286
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala 
b/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
index 2a8d1dd995e27..35664ff515d4b 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
@@ -92,7 +92,7 @@ private[spark] class ShuffleMapTask(
   threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime
 } else 0L
 
-dep.shuffleWriterProcessor.writeProcess(rdd, dep, partitionId, context, 
partition)
+dep.shuffleWriterProcessor.write(rdd, dep, partitionId, context, partition)
   }
 
   override def preferredLocations: Seq[TaskLocation] = preferredLocs
diff --git 
a/core/src/main/scala/org/apache/spark/shuffle/ShuffleWriteProcessor.scala 
b/core/src/main/scala/org/apache/spark/shuffle/ShuffleWriteProcessor.scala
index f5213157a9a85..5b0c7e9f2b0b4 100644
--- a/core/src/main/scala/org/apache/spark/shuffle/ShuffleWriteProcessor.scala
+++ b/core/src/main/scala/org/apache/spark/shuffle/ShuffleWriteProcessor.scala
@@ -41,7 +41,7 @@ private[spark] class ShuffleWriteProcessor extends 
Serializable with Logging {
* get from [[ShuffleManager]] and triggers rdd compute, finally return the 
[[MapStatus]] for
* this task.
*/
-  def writeProcess(
+  def write(
   rdd: RDD[_],
   dep: ShuffleDependency[_, _, _],
   partitionId: Int,
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/ShuffledRowRDD.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/ShuffledRowRDD.scala
index 9b05faaed0459..079ff25fcb67e 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/ShuffledRowRDD.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/ShuffledRowRDD.scala
@@ -22,7 +22,7 @@ import java.util.Arrays
 import org.apache.spark._
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.catalyst.InternalRow
-import org.apache.spark.sql.execution.metric.{SQLMetric, 
SQLShuffleMetricsReporter}
+import org.apache.spark.sql.execution.metric.{SQLMetric, 
SQLShuffleReadMetricsReporter}
 
 /**
  * The [[Partition]] used by [[ShuffledRowRDD]]. A post-shuffle partition
@@ -157,9 +157,9 @@ class ShuffledRowRDD(
   override def compute(split: Partition, context: TaskContext): 
Iterator[InternalRow] = {
 val shuffledRowPartition = split.asInstanceOf[ShuffledRowRDDPartition]
 val tempMetrics = context.taskMetrics().createTempShuffleReadMetrics()
-// `SQLShuffleMetricsReporter` will update its own metrics for SQL 
exchange operator,
+// `SQLShuffleReadMetricsReporter` will update its own metrics for SQL 
exchange operator,
 // as well as the `tempMetrics` for basic shuffle metrics.
-val sqlMetricsReporter = new SQLShuffleMetricsReporter(tempMetrics, 
metrics)
+val sqlMetricsReporter = new SQLShuffleReadMetricsReporter(tempMetrics, 
metrics)
 // The range of pre-shuffle partitions that we are fetching at here is
 // [startPreShufflePartitionIndex, endPreShufflePartitionIndex - 1].
 val reader =
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
index 0c2020572e721..da7b0c6f43fbc 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
@@ -31,7 +31,7 @@ import org.apache.spark.sql.catalyst.expressions.{Attribute, 
BoundReference, Uns
 import 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering
 import org.apache.spark.sql.catalyst.plans.physical._
 import org.apache.spark.sql.execution._
-import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics, 
SQLShuffleMetricsReporter, SQLShuffleWriteMetricsReporter}
+import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics, 
SQLShuffleReadMetricsReporter, SQLShuffleWriteMetricsReporter}
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types.StructType
 import org.apache.spark.util.MutablePair
@@ -50,7 +50,7 @@ case class ShuffleExchangeExec(
   private lazy val writeMetrics =
 

[jira] [Issue Comment Deleted] (SPARK-26333) FsHistoryProviderSuite failed because setReadable doesn't work in RedHat

2018-12-11 Thread deshanxiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

deshanxiao updated SPARK-26333:
---
Comment: was deleted

(was: [~vanzin] No, I am not running as root.)

> FsHistoryProviderSuite failed because setReadable doesn't work in RedHat
> 
>
> Key: SPARK-26333
> URL: https://issues.apache.org/jira/browse/SPARK-26333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: deshanxiao
>Priority: Major
>
> FsHistoryProviderSuite failed in the case "SPARK-3697: ignore files that cannot 
> be read.". I invoked logFile2.canRead after calling "setReadable(false, false)" 
> and found that "logFile2.canRead" returns true on RedHat, whereas on my 
> Ubuntu 16.04 it returns false.
> The environment:
> RedHat:
> Linux version 3.10.0-693.2.2.el7.x86_64 (buil...@kbuilder.dev.centos.org) 
> (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Sep 12 
> 22:26:13 UTC 2017
> JDK
> Java version: 1.8.0_151, vendor: Oracle Corporation
> {code:java}
>  org.scalatest.exceptions.TestFailedException: 2 was not equal to 1
>   at 
> org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340)
>   at 
> org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668)
>   at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:183)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:182)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite.org$apache$spark$deploy$history$FsHistoryProviderSuite$$updateAndCheck(FsHistoryProviderSuite.scala:841)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:182)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51)
>   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26333) FsHistoryProviderSuite failed because setReadable doesn't work in RedHat

2018-12-11 Thread deshanxiao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718337#comment-16718337
 ] 

deshanxiao commented on SPARK-26333:


[~vanzin] No, I am not running as root.

> FsHistoryProviderSuite failed because setReadable doesn't work in RedHat
> 
>
> Key: SPARK-26333
> URL: https://issues.apache.org/jira/browse/SPARK-26333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: deshanxiao
>Priority: Major
>
> FsHistoryProviderSuite failed in the case "SPARK-3697: ignore files that cannot 
> be read.". I invoked logFile2.canRead after calling "setReadable(false, false)" 
> and found that "logFile2.canRead" returns true on RedHat, whereas on my 
> Ubuntu 16.04 it returns false.
> The environment:
> RedHat:
> Linux version 3.10.0-693.2.2.el7.x86_64 (buil...@kbuilder.dev.centos.org) 
> (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Sep 12 
> 22:26:13 UTC 2017
> JDK
> Java version: 1.8.0_151, vendor: Oracle Corporation
> {code:java}
>  org.scalatest.exceptions.TestFailedException: 2 was not equal to 1
>   at 
> org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340)
>   at 
> org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668)
>   at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:183)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:182)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite.org$apache$spark$deploy$history$FsHistoryProviderSuite$$updateAndCheck(FsHistoryProviderSuite.scala:841)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:182)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51)
>   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203)
>   at 
> org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20628) Keep track of nodes which are going to be shut down & avoid scheduling new tasks

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718305#comment-16718305
 ] 

ASF GitHub Bot commented on SPARK-20628:


SparkQA commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of 
nodes (/ spot instances) which are going to be shutdown
URL: https://github.com/apache/spark/pull/19045#issuecomment-446420850
 
 
   **[Test build #7 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/7/testReport)**
 for PR 19045 at commit 
[`af048f5`](https://github.com/apache/spark/commit/af048f5753cd99b68d2e5f8d268c52a119a2d84a).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Keep track of nodes which are going to be shut down & avoid scheduling new 
> tasks
> 
>
> Key: SPARK-20628
> URL: https://issues.apache.org/jira/browse/SPARK-20628
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: holdenk
>Priority: Major
>
> Keep track of nodes which are going to be shut down. We considered adding 
> this for YARN but took a different approach; for instances where we can't 
> control instance termination (EC2, GCE, etc.), though, this may make more sense.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26265) deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718296#comment-16718296
 ] 

ASF GitHub Bot commented on SPARK-26265:


viirya commented on issue #23289: [SPARK-26265][Core][BRANCH-2.4] Fix deadlock 
in BytesToBytesMap.MapIterator when locking both BytesToBytesMap.MapIterator 
and TaskMemoryManager
URL: https://github.com/apache/spark/pull/23289#issuecomment-446418768
 
 
   Thanks @dongjoon-hyun @cloud-fan 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator
> --
>
> Key: SPARK-26265
> URL: https://issues.apache.org/jira/browse/SPARK-26265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: qian han
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> The application is running on a cluster with 72000 cores and 182000G mem.
> Environment:
> |spark.dynamicAllocation.minExecutors|5|
> |spark.dynamicAllocation.initialExecutors|30|
> |spark.dynamicAllocation.maxExecutors|400|
> |spark.executor.cores|4|
> |spark.executor.memory|20g|
>  
>   
> Stage description:
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:364)
>  org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:422) 
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:357) 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:193)
>  
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>  sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  java.lang.reflect.Method.invoke(Method.java:498) 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
>  org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198) 
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228) 
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>  
> jstack information as follows:
> Found one Java-level deadlock:
> "Thread-ScriptTransformation-Feed": waiting to lock monitor 
> 0x00e0cb18 (object 0x0002f1641538, a 
> org.apache.spark.memory.TaskMemoryManager), which is held by "Executor task 
> launch worker for task 18899" "Executor task launch worker for task 18899": 
> waiting to lock monitor 0x00e09788 (object 0x000302faa3b0, a 
> org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator), which is held by 
> "Thread-ScriptTransformation-Feed" Java stack information for the threads 
> listed above:
> "Thread-ScriptTransformation-Feed": at 
> org.apache.spark.memory.TaskMemoryManager.freePage(TaskMemoryManager.java:332)
>  - waiting to lock <0x0002f1641538> (a 
> org.apache.spark.memory.TaskMemoryManager) at 
> org.apache.spark.memory.MemoryConsumer.freePage(MemoryConsumer.java:130) at 
> org.apache.spark.unsafe.map.BytesToBytesMap.access$300(BytesToBytesMap.java:66)
>  at 
> org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.advanceToNextPage(BytesToBytesMap.java:274)
>  - locked <0x000302faa3b0> (a 
> org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator) at 
> org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.next(BytesToBytesMap.java:313)
>  at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap$1.next(UnsafeFixedWidthAggregationMap.java:173)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> 
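
A distilled, non-Spark sketch of the lock-ordering fix applied for this ticket (the 
actual diff to BytesToBytesMap.MapIterator is quoted in the following message): the 
page to free is remembered inside the iterator's synchronized block and only 
released afterwards, so the iterator's monitor is never held while the 
memory-manager lock is taken.

{code:scala}
import scala.collection.mutable.ArrayBuffer

// Illustrative stand-in types; the real code lives in BytesToBytesMap.MapIterator.
class IteratorSketch(dataPages: ArrayBuffer[AnyRef]) {
  private var currentPage: AnyRef = null

  def advanceToNextPage(freePage: AnyRef => Unit): Unit = {
    var pageToFree: AnyRef = null
    synchronized {
      if (currentPage != null) {
        dataPages -= currentPage
        // Defer the call that needs the memory-manager lock until after this
        // iterator's monitor is released, breaking the lock cycle.
        pageToFree = currentPage
      }
      currentPage = dataPages.headOption.orNull
    }
    if (pageToFree != null) {
      freePage(pageToFree) // now taken outside the synchronized block above
    }
  }
}
{code}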

[jira] [Commented] (SPARK-26265) deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718297#comment-16718297
 ] 

ASF GitHub Bot commented on SPARK-26265:


viirya closed pull request #23289: [SPARK-26265][Core][BRANCH-2.4] Fix deadlock 
in BytesToBytesMap.MapIterator when locking both BytesToBytesMap.MapIterator 
and TaskMemoryManager
URL: https://github.com/apache/spark/pull/23289
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java 
b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
index 9b6cbab38cbcc..64650336c9371 100644
--- a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
+++ b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
@@ -267,11 +267,18 @@ private MapIterator(int numRecords, Location loc, boolean 
destructive) {
 }
 
 private void advanceToNextPage() {
+  // SPARK-26265: We will first lock this `MapIterator` and then 
`TaskMemoryManager` when going
+  // to free a memory page by calling `freePage`. At the same time, it is 
possibly that another
+  // memory consumer first locks `TaskMemoryManager` and then this 
`MapIterator` when it
+  // acquires memory and causes spilling on this `MapIterator`. To avoid 
deadlock here, we keep
+  // reference to the page to free and free it after releasing the lock of 
`MapIterator`.
+  MemoryBlock pageToFree = null;
+
   synchronized (this) {
 int nextIdx = dataPages.indexOf(currentPage) + 1;
 if (destructive && currentPage != null) {
   dataPages.remove(currentPage);
-  freePage(currentPage);
+  pageToFree = currentPage;
   nextIdx --;
 }
 if (dataPages.size() > nextIdx) {
@@ -295,6 +302,9 @@ private void advanceToNextPage() {
   }
 }
   }
+  if (pageToFree != null) {
+freePage(pageToFree);
+  }
 }
 
 @Override
diff --git a/core/src/test/java/org/apache/spark/memory/TestMemoryConsumer.java 
b/core/src/test/java/org/apache/spark/memory/TestMemoryConsumer.java
index 0bbaea6b834b8..6aa577d1bf797 100644
--- a/core/src/test/java/org/apache/spark/memory/TestMemoryConsumer.java
+++ b/core/src/test/java/org/apache/spark/memory/TestMemoryConsumer.java
@@ -38,12 +38,12 @@ public long spill(long size, MemoryConsumer trigger) throws 
IOException {
 return used;
   }
 
-  void use(long size) {
+  public void use(long size) {
 long got = taskMemoryManager.acquireExecutionMemory(size, this);
 used += got;
   }
 
-  void free(long size) {
+  public void free(long size) {
 used -= size;
 taskMemoryManager.releaseExecutionMemory(size, this);
   }
diff --git 
a/core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java
 
b/core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java
index 53a233f698c7a..278d28f7bf479 100644
--- 
a/core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java
+++ 
b/core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java
@@ -33,6 +33,8 @@
 
 import org.apache.spark.SparkConf;
 import org.apache.spark.executor.ShuffleWriteMetrics;
+import org.apache.spark.memory.MemoryMode;
+import org.apache.spark.memory.TestMemoryConsumer;
 import org.apache.spark.memory.TaskMemoryManager;
 import org.apache.spark.memory.TestMemoryManager;
 import org.apache.spark.network.util.JavaUtils;
@@ -667,4 +669,49 @@ public void testPeakMemoryUsed() {
 }
   }
 
+  @Test
+  public void avoidDeadlock() throws InterruptedException {
+memoryManager.limit(PAGE_SIZE_BYTES);
+MemoryMode mode = useOffHeapMemoryAllocator() ? MemoryMode.OFF_HEAP: 
MemoryMode.ON_HEAP;
+TestMemoryConsumer c1 = new TestMemoryConsumer(taskMemoryManager, mode);
+BytesToBytesMap map =
+  new BytesToBytesMap(taskMemoryManager, blockManager, serializerManager, 
1, 0.5, 1024, false);
+
+Thread thread = new Thread(() -> {
+  int i = 0;
+  long used = 0;
+  while (i < 10) {
+c1.use(1000);
+used += 1000;
+i++;
+  }
+  c1.free(used);
+});
+
+try {
+  int i;
+  for (i = 0; i < 1024; i++) {
+final long[] arr = new long[]{i};
+final BytesToBytesMap.Location loc = map.lookup(arr, 
Platform.LONG_ARRAY_OFFSET, 8);
+loc.append(arr, Platform.LONG_ARRAY_OFFSET, 8, arr, 
Platform.LONG_ARRAY_OFFSET, 8);
+  }
+
+  // Starts to require memory at another memory consumer.
+  thread.start();
+
+  BytesToBytesMap.MapIterator iter = map.destructiveIterator();
+  for (i = 0; i < 1024; i++) {
+

[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718298#comment-16718298
 ] 

ASF GitHub Bot commented on SPARK-26311:


HeartSaVioR commented on issue #23260: [SPARK-26311][YARN] New feature: custom 
log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446418872
 
 
   @vanzin 
   IMHO we can't go with the approach #20326 took, since the cluster ID is not 
available in the default log URL and so cannot be retrieved. Also, relying on the 
pattern (`container_`) in the URL feels fragile to me: while #20326 only extracted 
the container ID from the URL, we need to get some other parameters as well, which 
opens a bigger possibility of breakage.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark has been setting static log URLs for YARN applications, which point to the 
> NodeManager webapp. Normally this works for both running and finished apps, but 
> there are also other approaches to maintaining application logs, such as an 
> external log service, which keeps the application log URL from becoming a dead 
> link when the NodeManager is not accessible (node decommissioned, elastic nodes, 
> etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, which 
> end users can set to point application logs to an external log service.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24561) User-defined window functions with pandas udf (bounded window)

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718246#comment-16718246
 ] 

ASF GitHub Bot commented on SPARK-24561:


SparkQA commented on issue #22305: [SPARK-24561][SQL][Python] User-defined 
window aggregation functions with Pandas UDF (bounded window)
URL: https://github.com/apache/spark/pull/22305#issuecomment-446412545
 
 
   **[Test build #99989 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99989/testReport)**
 for PR 22305 at commit 
[`5d3bbd6`](https://github.com/apache/spark/commit/5d3bbd6bfbdd6b332c65f9748ab066d9eb6a480d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> User-defined window functions with pandas udf (bounded window)
> --
>
> Key: SPARK-24561
> URL: https://issues.apache.org/jira/browse/SPARK-24561
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Li Jin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24561) User-defined window functions with pandas udf (bounded window)

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718250#comment-16718250
 ] 

ASF GitHub Bot commented on SPARK-24561:


AmplabJenkins removed a comment on issue #22305: [SPARK-24561][SQL][Python] 
User-defined window aggregation functions with Pandas UDF (bounded window)
URL: https://github.com/apache/spark/pull/22305#issuecomment-446412883
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> User-defined window functions with pandas udf (bounded window)
> --
>
> Key: SPARK-24561
> URL: https://issues.apache.org/jira/browse/SPARK-24561
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Li Jin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26193) Implement shuffle write metrics in SQL

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718254#comment-16718254
 ] 

ASF GitHub Bot commented on SPARK-26193:


rxin commented on issue #23286: [SPARK-26193][SQL][Follow Up] Read metrics 
rename and display text changes
URL: https://github.com/apache/spark/pull/23286#issuecomment-446413488
 
 
   LGTM too


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement shuffle write metrics in SQL
> --
>
> Key: SPARK-26193
> URL: https://issues.apache.org/jira/browse/SPARK-26193
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24561) User-defined window functions with pandas udf (bounded window)

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718253#comment-16718253
 ] 

ASF GitHub Bot commented on SPARK-24561:


HyukjinKwon commented on a change in pull request #22305: 
[SPARK-24561][SQL][Python] User-defined window aggregation functions with 
Pandas UDF (bounded window)
URL: https://github.com/apache/spark/pull/22305#discussion_r240841612
 
 

 ##
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/window/AggregateProcessor.scala
 ##
 @@ -113,7 +113,7 @@ private[window] object AggregateProcessor {
  * This class manages the processing of a number of aggregate functions. See 
the documentation of
  * the object for more information.
  */
-private[window] final class AggregateProcessor(
+private[sql] final class AggregateProcessor(
 
 Review comment:
   Yea, see also some JIRAs like SPARK-16964.
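
For readers of the diff hunk above, a small self-contained sketch (package and 
class names are made up) of what widening the qualifier from private[window] to 
private[sql] changes: the class becomes visible to everything under the sql 
package rather than only to the window sub-package.

{code:scala}
// Hypothetical packages; mirrors the access-modifier change discussed above.
package org.example.sql.window {
  private[sql] class AggregateProcessorLike  // reachable anywhere under org.example.sql
  private[window] class WindowOnlyHelper     // reachable only inside org.example.sql.window
}

package org.example.sql {
  object Elsewhere {
    val ok = new window.AggregateProcessorLike  // compiles: private[sql] covers this package
    // val no = new window.WindowOnlyHelper     // would not compile outside the window package
  }
}
{code}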


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> User-defined window functions with pandas udf (bounded window)
> --
>
> Key: SPARK-24561
> URL: https://issues.apache.org/jira/browse/SPARK-24561
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Li Jin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24561) User-defined window functions with pandas udf (bounded window)

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718251#comment-16718251
 ] 

ASF GitHub Bot commented on SPARK-24561:


AmplabJenkins removed a comment on issue #22305: [SPARK-24561][SQL][Python] 
User-defined window aggregation functions with Pandas UDF (bounded window)
URL: https://github.com/apache/spark/pull/22305#issuecomment-446412887
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99989/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> User-defined window functions with pandas udf (bounded window)
> --
>
> Key: SPARK-24561
> URL: https://issues.apache.org/jira/browse/SPARK-24561
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Li Jin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24561) User-defined window functions with pandas udf (bounded window)

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718248#comment-16718248
 ] 

ASF GitHub Bot commented on SPARK-24561:


AmplabJenkins commented on issue #22305: [SPARK-24561][SQL][Python] 
User-defined window aggregation functions with Pandas UDF (bounded window)
URL: https://github.com/apache/spark/pull/22305#issuecomment-446412883
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> User-defined window functions with pandas udf (bounded window)
> --
>
> Key: SPARK-24561
> URL: https://issues.apache.org/jira/browse/SPARK-24561
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Li Jin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24561) User-defined window functions with pandas udf (bounded window)

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718249#comment-16718249
 ] 

ASF GitHub Bot commented on SPARK-24561:


AmplabJenkins commented on issue #22305: [SPARK-24561][SQL][Python] 
User-defined window aggregation functions with Pandas UDF (bounded window)
URL: https://github.com/apache/spark/pull/22305#issuecomment-446412887
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99989/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> User-defined window functions with pandas udf (bounded window)
> --
>
> Key: SPARK-24561
> URL: https://issues.apache.org/jira/browse/SPARK-24561
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Li Jin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24561) User-defined window functions with pandas udf (bounded window)

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718247#comment-16718247
 ] 

ASF GitHub Bot commented on SPARK-24561:


SparkQA removed a comment on issue #22305: [SPARK-24561][SQL][Python] 
User-defined window aggregation functions with Pandas UDF (bounded window)
URL: https://github.com/apache/spark/pull/22305#issuecomment-446348650
 
 
   **[Test build #99989 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99989/testReport)**
 for PR 22305 at commit 
[`5d3bbd6`](https://github.com/apache/spark/commit/5d3bbd6bfbdd6b332c65f9748ab066d9eb6a480d).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> User-defined window functions with pandas udf (bounded window)
> --
>
> Key: SPARK-24561
> URL: https://issues.apache.org/jira/browse/SPARK-24561
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Li Jin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25642) Add new Metrics in External Shuffle Service to help determine Network performance and Connection Handling capabilities of the Shuffle Service

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718222#comment-16718222
 ] 

ASF GitHub Bot commented on SPARK-25642:


AmplabJenkins removed a comment on issue #22498: [SPARK-25642] : Adding two new 
metrics to record the number of registered connections as well as the number of 
active connections to YARN Shuffle Service
URL: https://github.com/apache/spark/pull/22498#issuecomment-446407716
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add new Metrics in External Shuffle Service to help determine Network 
> performance and Connection Handling capabilities of the Shuffle Service
> -
>
> Key: SPARK-25642
> URL: https://issues.apache.org/jira/browse/SPARK-25642
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> Recently, the ability to expose the metrics for the YARN Shuffle Service was 
> added as part of [SPARK-18364|https://github.com/apache/spark/pull/22485]. 
> We need to add some metrics to be able to determine the number of active 
> connections as well as open connections to the external shuffle service, to 
> benchmark network and connection issues in large cluster environments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25642) Add new Metrics in External Shuffle Service to help determine Network performance and Connection Handling capabilities of the Shuffle Service

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718216#comment-16718216
 ] 

ASF GitHub Bot commented on SPARK-25642:


vanzin commented on a change in pull request #22498: [SPARK-25642] : Adding two 
new metrics to record the number of registered connections as well as the 
number of active connections to YARN Shuffle Service
URL: https://github.com/apache/spark/pull/22498#discussion_r240835732
 
 

 ##
 File path: 
common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
 ##
 @@ -199,6 +191,18 @@ protected void serviceInit(Configuration conf) throws 
Exception {
   port = shuffleServer.getPort();
   boundPort = port;
   String authEnabledString = authEnabled ? "enabled" : "not enabled";
+
+  // register metrics on the block handler into the Node Manager's metrics 
system.
+  blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections",
+  shuffleServer.getRegisteredConnections());
 
 Review comment:
   nit: indented too far, here and in others.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add new Metrics in External Shuffle Service to help determine Network 
> performance and Connection Handling capabilities of the Shuffle Service
> -
>
> Key: SPARK-25642
> URL: https://issues.apache.org/jira/browse/SPARK-25642
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> Recently, the ability to expose metrics for the YARN Shuffle Service was added
> as part of [SPARK-18364|https://github.com/apache/spark/pull/22485]. We need to
> add some metrics to determine the number of active connections as well as the
> number of open connections to the external shuffle service, in order to help
> diagnose network and connection issues in large cluster environments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25642) Add new Metrics in External Shuffle Service to help determine Network performance and Connection Handling capabilities of the Shuffle Service

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718217#comment-16718217
 ] 

ASF GitHub Bot commented on SPARK-25642:


vanzin commented on issue #22498: [SPARK-25642] : Adding two new metrics to 
record the number of registered connections as well as the number of active 
connections to YARN Shuffle Service
URL: https://github.com/apache/spark/pull/22498#issuecomment-446406987
 
 
   retest this please


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add new Metrics in External Shuffle Service to help determine Network 
> performance and Connection Handling capabilities of the Shuffle Service
> -
>
> Key: SPARK-25642
> URL: https://issues.apache.org/jira/browse/SPARK-25642
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> Recently, the ability to expose metrics for the YARN Shuffle Service was added
> as part of [SPARK-18364|https://github.com/apache/spark/pull/22485]. We need to
> add some metrics to determine the number of active connections as well as the
> number of open connections to the external shuffle service, in order to help
> diagnose network and connection issues in large cluster environments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25642) Add new Metrics in External Shuffle Service to help determine Network performance and Connection Handling capabilities of the Shuffle Service

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718223#comment-16718223
 ] 

ASF GitHub Bot commented on SPARK-25642:


AmplabJenkins removed a comment on issue #22498: [SPARK-25642] : Adding two new 
metrics to record the number of registered connections as well as the number of 
active connections to YARN Shuffle Service
URL: https://github.com/apache/spark/pull/22498#issuecomment-446407720
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5992/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add new Metrics in External Shuffle Service to help determine Network 
> performance and Connection Handling capabilities of the Shuffle Service
> -
>
> Key: SPARK-25642
> URL: https://issues.apache.org/jira/browse/SPARK-25642
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> Recently, the ability to expose metrics for the YARN Shuffle Service was added
> as part of [SPARK-18364|https://github.com/apache/spark/pull/22485]. We need to
> add some metrics to determine the number of active connections as well as the
> number of open connections to the external shuffle service, in order to help
> diagnose network and connection issues in large cluster environments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26321) Split a SQL in a correct way

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718213#comment-16718213
 ] 

ASF GitHub Bot commented on SPARK-26321:


sadhen commented on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way
URL: https://github.com/apache/spark/pull/23276#issuecomment-446406264
 
 
   OK, I will add more description.
   
   @srowen This is actually a trivial PR! Please review.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Split a SQL in a correct way
> 
>
> Key: SPARK-26321
> URL: https://issues.apache.org/jira/browse/SPARK-26321
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Darcy Shen
>Priority: Major
>
> First:
> ./build/mvn -Phive-thriftserver -DskipTests package
>  
> Then:
> $ bin/spark-sql
>  
> 18/12/10 19:35:02 INFO SparkSQLCLIDriver: Time taken: 4.483 seconds, Fetched 
> 1 row(s)
>  spark-sql> select "1;2";
>  Error in query:
>  no viable alternative at input 'select "'(line 1, pos 7)
> == SQL ==
>  select "1
>  ---^^^
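
The root cause is that the CLI splits the input line on every ';', even when the
semicolon sits inside a string literal. A quote-aware splitter along the
following lines avoids that; this is only an illustrative sketch (REPL-style),
not the code in the PR:

import scala.collection.mutable.ArrayBuffer

// Split a command line on ';' while respecting single- and double-quoted
// literals. Illustrative sketch only; the real fix lives in the CLI driver.
def splitSemiColon(line: String): Seq[String] = {
  val statements = ArrayBuffer.empty[String]
  val current = new StringBuilder
  var insideSingleQuote = false
  var insideDoubleQuote = false
  line.foreach {
    case '\'' if !insideDoubleQuote =>
      insideSingleQuote = !insideSingleQuote
      current += '\''
    case '"' if !insideSingleQuote =>
      insideDoubleQuote = !insideDoubleQuote
      current += '"'
    case ';' if !insideSingleQuote && !insideDoubleQuote =>
      statements += current.toString
      current.clear()
    case c =>
      current += c
  }
  if (current.nonEmpty) statements += current.toString
  statements.toSeq
}

// splitSemiColon("select \"1;2\";") yields a single statement: select "1;2"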



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25642) Add new Metrics in External Shuffle Service to help determine Network performance and Connection Handling capabilities of the Shuffle Service

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718220#comment-16718220
 ] 

ASF GitHub Bot commented on SPARK-25642:


AmplabJenkins commented on issue #22498: [SPARK-25642] : Adding two new metrics 
to record the number of registered connections as well as the number of active 
connections to YARN Shuffle Service
URL: https://github.com/apache/spark/pull/22498#issuecomment-446407720
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5992/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add new Metrics in External Shuffle Service to help determine Network 
> performance and Connection Handling capabilities of the Shuffle Service
> -
>
> Key: SPARK-25642
> URL: https://issues.apache.org/jira/browse/SPARK-25642
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> Recently, the ability to expose metrics for the YARN Shuffle Service was added
> as part of [SPARK-18364|https://github.com/apache/spark/pull/22485]. We need to
> add some metrics to determine the number of active connections as well as the
> number of open connections to the external shuffle service, in order to help
> diagnose network and connection issues in large cluster environments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25642) Add new Metrics in External Shuffle Service to help determine Network performance and Connection Handling capabilities of the Shuffle Service

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718221#comment-16718221
 ] 

ASF GitHub Bot commented on SPARK-25642:


SparkQA commented on issue #22498: [SPARK-25642] : Adding two new metrics to 
record the number of registered connections as well as the number of active 
connections to YARN Shuffle Service
URL: https://github.com/apache/spark/pull/22498#issuecomment-446407732
 
 
   **[Test build #6 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/6/testReport)**
 for PR 22498 at commit 
[`70472a2`](https://github.com/apache/spark/commit/70472a255e5da3ea4522959e26f5c403641e1ce6).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add new Metrics in External Shuffle Service to help determine Network 
> performance and Connection Handling capabilities of the Shuffle Service
> -
>
> Key: SPARK-25642
> URL: https://issues.apache.org/jira/browse/SPARK-25642
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> Recently, the ability to expose metrics for the YARN Shuffle Service was added
> as part of [SPARK-18364|https://github.com/apache/spark/pull/22485]. We need to
> add some metrics to determine the number of active connections as well as the
> number of open connections to the external shuffle service, in order to help
> diagnose network and connection issues in large cluster environments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25642) Add new Metrics in External Shuffle Service to help determine Network performance and Connection Handling capabilities of the Shuffle Service

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718219#comment-16718219
 ] 

ASF GitHub Bot commented on SPARK-25642:


AmplabJenkins commented on issue #22498: [SPARK-25642] : Adding two new metrics 
to record the number of registered connections as well as the number of active 
connections to YARN Shuffle Service
URL: https://github.com/apache/spark/pull/22498#issuecomment-446407716
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add new Metrics in External Shuffle Service to help determine Network 
> performance and Connection Handling capabilities of the Shuffle Service
> -
>
> Key: SPARK-25642
> URL: https://issues.apache.org/jira/browse/SPARK-25642
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> Recently, the ability to expose metrics for the YARN Shuffle Service was added
> as part of [SPARK-18364|https://github.com/apache/spark/pull/22485]. We need to
> add some metrics to determine the number of active connections as well as the
> number of open connections to the external shuffle service, in order to help
> diagnose network and connection issues in large cluster environments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718209#comment-16718209
 ] 

ASF GitHub Bot commented on SPARK-24938:


AmplabJenkins removed a comment on issue #22114: [SPARK-24938][Core] Prevent 
Netty from using onheap memory for headers without regard for configuration
URL: https://github.com/apache/spark/pull/22114#issuecomment-446405325
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Understand usage of netty's onheap memory use, even with offheap pools
> --
>
> Key: SPARK-24938
> URL: https://issues.apache.org/jira/browse/SPARK-24938
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> After adding some instrumentation (using SPARK-24918 and
> https://github.com/squito/spark-memory), we've observed that Netty uses a large
> amount of onheap memory in its pools, in addition to the expected offheap
> memory. We should figure out why it is using that memory, and whether it is
> really necessary.
> It might be just this one line:
> https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82
> which means that even with a small burst of messages, each arena will grow by
> 16 MB, which could lead to a 128 MB spike from an almost entirely unused pool.
> Switching to requesting a buffer from the default pool would probably fix this.
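
For illustration only (this paraphrases the suspicion above rather than the
actual MessageEncoder code): with Netty's pooled allocator, even a tiny heap
allocation makes the owning arena claim a whole chunk, 16 MB by default, while
an unpooled buffer for a small, short-lived header costs only one small array.
The header size below is a made-up number.

import io.netty.buffer.{ByteBuf, PooledByteBufAllocator, Unpooled}

object HeaderAllocSketch {
  val headerSize = 24 // hypothetical frame-header size in bytes

  // Pooled heap allocation: the first request on an arena makes it claim a
  // full chunk (16 MB by default), even though only a few bytes are needed.
  def pooledHeader(): ByteBuf =
    PooledByteBufAllocator.DEFAULT.heapBuffer(headerSize)

  // Unpooled allocation: one small backing array, no arena growth.
  def unpooledHeader(): ByteBuf =
    Unpooled.buffer(headerSize)
}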



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718192#comment-16718192
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446404202
 
 
   Merged build finished. Test FAILed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Benchmark performance of In and InSet expressions
> -
>
> Key: SPARK-26203
> URL: https://issues.apache.org/jira/browse/SPARK-26203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> The {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of
> possible values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all
> values are literals. This was done for performance reasons, to avoid O\(n\)
> time complexity for {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed since then
> (e.g., generation of Java code to evaluate expressions), so it is worth
> measuring the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x slower than
> {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type inside
> {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this
> information, we can come up with solutions.
> Based on my preliminary investigation, quite a few optimizations are possible,
> and they frequently depend on the specific data type.
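
A rough, self-contained way to observe the switch-over described above. The
threshold config name comes from the description; the benchmark shape and the
numbers of literals are only a sketch, not the benchmark added by the PR:

import org.apache.spark.sql.SparkSession

object InVsInSetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("in-vs-inset-sketch")
      .getOrCreate()

    val df = spark.range(0L, 10000000L).toDF("id")

    // Time a filter with the given literal values; with more than
    // spark.sql.optimizer.inSetConversionThreshold literals (10 by default),
    // the optimizer rewrites In(...) into InSet(...).
    def timeFilter(values: Seq[Long]): Long = {
      val start = System.nanoTime()
      df.where(df("id").isin(values: _*)).count()
      (System.nanoTime() - start) / 1000000
    }

    println(s"In    (5 literals):  ${timeFilter(1L to 5L)} ms")
    println(s"InSet (50 literals): ${timeFilter(1L to 50L)} ms")
    // df.where(df("id").isin((1L to 50L): _*)).explain(true) shows the rewrite.

    spark.stop()
  }
}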



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718210#comment-16718210
 ] 

ASF GitHub Bot commented on SPARK-24938:


AmplabJenkins removed a comment on issue #22114: [SPARK-24938][Core] Prevent 
Netty from using onheap memory for headers without regard for configuration
URL: https://github.com/apache/spark/pull/22114#issuecomment-446405327
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5991/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Understand usage of netty's onheap memory use, even with offheap pools
> --
>
> Key: SPARK-24938
> URL: https://issues.apache.org/jira/browse/SPARK-24938
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> After adding some instrumentation (using SPARK-24918 and
> https://github.com/squito/spark-memory), we've observed that Netty uses a large
> amount of onheap memory in its pools, in addition to the expected offheap
> memory. We should figure out why it is using that memory, and whether it is
> really necessary.
> It might be just this one line:
> https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82
> which means that even with a small burst of messages, each arena will grow by
> 16 MB, which could lead to a 128 MB spike from an almost entirely unused pool.
> Switching to requesting a buffer from the default pool would probably fix this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25877) Put all feature-related code in the feature step itself

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718208#comment-16718208
 ] 

ASF GitHub Bot commented on SPARK-25877:


mccheah commented on issue #23220: [SPARK-25877][k8s] Move all feature logic to 
feature classes.
URL: https://github.com/apache/spark/pull/23220#issuecomment-446405494
 
 
   +1 from me, would like @liyinan926 to take a second look


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Put all feature-related code in the feature step itself
> ---
>
> Key: SPARK-25877
> URL: https://issues.apache.org/jira/browse/SPARK-25877
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This is a child task of SPARK-25874. It covers moving all feature-related code
> into the feature steps themselves, including the logic that decides whether a
> step should be applied.
> Please refer to the parent bug for further details.
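
A hedged sketch of the shape this describes; the trait, method, and class names
below are illustrative assumptions, not the real Kubernetes feature-step API:

import org.apache.spark.SparkConf

object FeatureStepSketch {
  // Illustrative stand-in for the real pod wrapper used by the k8s backend.
  case class SparkPod(labels: Map[String, String])

  trait FeatureStep {
    // Each step owns the decision of whether it applies for this conf.
    def isEnabled(conf: SparkConf): Boolean
    def configurePod(pod: SparkPod): SparkPod
  }

  class VolumesStep extends FeatureStep {
    override def isEnabled(conf: SparkConf): Boolean =
      conf.getAllWithPrefix("spark.kubernetes.driver.volumes.").nonEmpty
    override def configurePod(pod: SparkPod): SparkPod =
      pod.copy(labels = pod.labels + ("has-volumes" -> "true")) // placeholder logic
  }

  // The pod builder no longer hard-codes per-feature conditions; it simply
  // applies every step that declares itself enabled.
  def buildPod(conf: SparkConf, steps: Seq[FeatureStep], initial: SparkPod): SparkPod =
    steps.filter(_.isEnabled(conf)).foldLeft(initial)((pod, step) => step.configurePod(pod))
}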



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718205#comment-16718205
 ] 

ASF GitHub Bot commented on SPARK-24938:


AmplabJenkins commented on issue #22114: [SPARK-24938][Core] Prevent Netty from 
using onheap memory for headers without regard for configuration
URL: https://github.com/apache/spark/pull/22114#issuecomment-446405325
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Understand usage of netty's onheap memory use, even with offheap pools
> --
>
> Key: SPARK-24938
> URL: https://issues.apache.org/jira/browse/SPARK-24938
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> After adding some instrumentation (using SPARK-24918 and
> https://github.com/squito/spark-memory), we've observed that Netty uses a large
> amount of onheap memory in its pools, in addition to the expected offheap
> memory. We should figure out why it is using that memory, and whether it is
> really necessary.
> It might be just this one line:
> https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82
> which means that even with a small burst of messages, each arena will grow by
> 16 MB, which could lead to a 128 MB spike from an almost entirely unused pool.
> Switching to requesting a buffer from the default pool would probably fix this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718207#comment-16718207
 ] 

ASF GitHub Bot commented on SPARK-24938:


SparkQA commented on issue #22114: [SPARK-24938][Core] Prevent Netty from using 
onheap memory for headers without regard for configuration
URL: https://github.com/apache/spark/pull/22114#issuecomment-446405364
 
 
   **[Test build #5 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/5/testReport)**
 for PR 22114 at commit 
[`c2f9ed1`](https://github.com/apache/spark/commit/c2f9ed10776842ffe0746fcc89b157675fa6c455).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Understand usage of netty's onheap memory use, even with offheap pools
> --
>
> Key: SPARK-24938
> URL: https://issues.apache.org/jira/browse/SPARK-24938
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> After adding some instrumentation (using SPARK-24918 and
> https://github.com/squito/spark-memory), we've observed that Netty uses a large
> amount of onheap memory in its pools, in addition to the expected offheap
> memory. We should figure out why it is using that memory, and whether it is
> really necessary.
> It might be just this one line:
> https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82
> which means that even with a small burst of messages, each arena will grow by
> 16 MB, which could lead to a 128 MB spike from an almost entirely unused pool.
> Switching to requesting a buffer from the default pool would probably fix this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718206#comment-16718206
 ] 

ASF GitHub Bot commented on SPARK-24938:


AmplabJenkins commented on issue #22114: [SPARK-24938][Core] Prevent Netty from 
using onheap memory for headers without regard for configuration
URL: https://github.com/apache/spark/pull/22114#issuecomment-446405327
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5991/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Understand usage of netty's onheap memory use, even with offheap pools
> --
>
> Key: SPARK-24938
> URL: https://issues.apache.org/jira/browse/SPARK-24938
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> After adding some instrumentation (using SPARK-24918 and
> https://github.com/squito/spark-memory), we've observed that Netty uses a large
> amount of onheap memory in its pools, in addition to the expected offheap
> memory. We should figure out why it is using that memory, and whether it is
> really necessary.
> It might be just this one line:
> https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82
> which means that even with a small burst of messages, each arena will grow by
> 16 MB, which could lead to a 128 MB spike from an almost entirely unused pool.
> Switching to requesting a buffer from the default pool would probably fix this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25877) Put all feature-related code in the feature step itself

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718201#comment-16718201
 ] 

ASF GitHub Bot commented on SPARK-25877:


mccheah commented on a change in pull request #23220: [SPARK-25877][k8s] Move 
all feature logic to feature classes.
URL: https://github.com/apache/spark/pull/23220#discussion_r240833837
 
 

 ##
 File path: 
resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/PodBuilderSuite.scala
 ##
 @@ -0,0 +1,177 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.deploy.k8s
+
+import java.io.File
+
+import io.fabric8.kubernetes.api.model.{Config => _, _}
+import io.fabric8.kubernetes.client.KubernetesClient
+import io.fabric8.kubernetes.client.dsl.{MixedOperation, PodResource}
+import org.mockito.Matchers.any
+import org.mockito.Mockito.{mock, never, verify, when}
+import scala.collection.JavaConverters._
+
+import org.apache.spark.{SparkConf, SparkException, SparkFunSuite}
+import org.apache.spark.deploy.k8s._
+import org.apache.spark.internal.config.ConfigEntry
+
+abstract class PodBuilderSuite extends SparkFunSuite {
+
+  protected def templateFileConf: ConfigEntry[_]
+
+  protected def buildPod(sparkConf: SparkConf, client: KubernetesClient): 
SparkPod
+
+  private val baseConf = new SparkConf(false)
+.set(Config.CONTAINER_IMAGE, "spark-executor:latest")
+
+  test("use empty initial pod if template is not specified") {
+val client = mock(classOf[KubernetesClient])
+buildPod(baseConf.clone(), client)
+verify(client, never()).pods()
+  }
+
+  test("load pod template if specified") {
+val client = mockKubernetesClient()
+val sparkConf = baseConf.clone().set(templateFileConf.key, 
"template-file.yaml")
+val pod = buildPod(sparkConf, client)
+verifyPod(pod)
+  }
+
+  test("complain about misconfigured pod template") {
+val client = mockKubernetesClient(
+  new PodBuilder()
+.withNewMetadata()
+.addToLabels("test-label-key", "test-label-value")
+.endMetadata()
+.build())
+val sparkConf = baseConf.clone().set(templateFileConf.key, 
"template-file.yaml")
+val exception = intercept[SparkException] {
+  buildPod(sparkConf, client)
+}
+assert(exception.getMessage.contains("Could not load pod from template 
file."))
+  }
+
+  private def mockKubernetesClient(pod: Pod = podWithSupportedFeatures()): 
KubernetesClient = {
+val kubernetesClient = mock(classOf[KubernetesClient])
+val pods =
+  mock(classOf[MixedOperation[Pod, PodList, DoneablePod, PodResource[Pod, 
DoneablePod]]])
+val podResource = mock(classOf[PodResource[Pod, DoneablePod]])
+when(kubernetesClient.pods()).thenReturn(pods)
+when(pods.load(any(classOf[File]))).thenReturn(podResource)
+when(podResource.get()).thenReturn(pod)
+kubernetesClient
+  }
+
+  private def verifyPod(pod: SparkPod): Unit = {
 
 Review comment:
   > That's also an argument for not restoring the mocks, which would go 
against what this change is doing. This test should account for modifications 
made by other steps, since if they modify something unexpected, that can change 
the semantics of the feature (pod template support).
   
   Wouldn't most of those unexpected changes come from the unit tests of the 
individual steps? Granted this test can catch when a change in one step impacts 
behavior in another step, which is important. Given that this isn't changing 
prior code I'm fine with leaving this as-is and addressing again later if it 
becomes a problem.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Put all feature-related code in the feature step itself
> ---
>
> Key: SPARK-25877
> URL: https://issues.apache.org/jira/browse/SPARK-25877
> 

[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718199#comment-16718199
 ] 

ASF GitHub Bot commented on SPARK-24938:


vanzin commented on issue #22114: [SPARK-24938][Core] Prevent Netty from using 
onheap memory for headers without regard for configuration
URL: https://github.com/apache/spark/pull/22114#issuecomment-446404726
 
 
   retest this please


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Understand usage of netty's onheap memory use, even with offheap pools
> --
>
> Key: SPARK-24938
> URL: https://issues.apache.org/jira/browse/SPARK-24938
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> After adding some instrumentation (using SPARK-24918 and
> https://github.com/squito/spark-memory), we've observed that Netty uses a large
> amount of onheap memory in its pools, in addition to the expected offheap
> memory. We should figure out why it is using that memory, and whether it is
> really necessary.
> It might be just this one line:
> https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82
> which means that even with a small burst of messages, each arena will grow by
> 16 MB, which could lead to a 128 MB spike from an almost entirely unused pool.
> Switching to requesting a buffer from the default pool would probably fix this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718197#comment-16718197
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446404207
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/3/
   Test FAILed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Benchmark performance of In and InSet expressions
> -
>
> Key: SPARK-26203
> URL: https://issues.apache.org/jira/browse/SPARK-26203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> The {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of
> possible values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all
> values are literals. This was done for performance reasons, to avoid O\(n\)
> time complexity for {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed since then
> (e.g., generation of Java code to evaluate expressions), so it is worth
> measuring the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x slower than
> {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type inside
> {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this
> information, we can come up with solutions.
> Based on my preliminary investigation, quite a few optimizations are possible,
> and they frequently depend on the specific data type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13356) WebUI missing input information when recovering from driver failure

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718180#comment-16718180
 ] 

ASF GitHub Bot commented on SPARK-13356:


vanzin commented on issue #11228: [SPARK-13356][Streaming] WebUI missing input 
information when recovering from driver failure
URL: https://github.com/apache/spark/pull/11228#issuecomment-446402361
 
 
   This looks very out of date. I'll close this for now, but if you want to 
update and reopen the PR, just push to your branch.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> WebUI missing input information when recovering from driver failure
> 
>
> Key: SPARK-13356
> URL: https://issues.apache.org/jira/browse/SPARK-13356
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.2, 1.6.0
>Reporter: jeanlyn
>Priority: Minor
> Attachments: DirectKafkaScreenshot.jpg
>
>
> The WebUI is missing some input information when streaming recovers from a
> checkpoint, which may mislead people into thinking that data was lost during
> the recovery from the failure.
> For example:
> !DirectKafkaScreenshot.jpg!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718194#comment-16718194
 ] 

ASF GitHub Bot commented on SPARK-26203:


SparkQA removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446401531
 
 
   **[Test build #3 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/3/testReport)**
 for PR 23291 at commit 
[`987bea4`](https://github.com/apache/spark/commit/987bea48350ed2e3862b965e07d5d5335e1d86c2).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Benchmark performance of In and InSet expressions
> -
>
> Key: SPARK-26203
> URL: https://issues.apache.org/jira/browse/SPARK-26203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> The {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of
> possible values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all
> values are literals. This was done for performance reasons, to avoid O\(n\)
> time complexity for {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed since then
> (e.g., generation of Java code to evaluate expressions), so it is worth
> measuring the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x slower than
> {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type inside
> {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this
> information, we can come up with solutions.
> Based on my preliminary investigation, quite a few optimizations are possible,
> and they frequently depend on the specific data type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718193#comment-16718193
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446404207
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/3/
   Test FAILed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Benchmark performance of In and InSet expressions
> -
>
> Key: SPARK-26203
> URL: https://issues.apache.org/jira/browse/SPARK-26203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> The {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of
> possible values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all
> values are literals. This was done for performance reasons, to avoid O\(n\)
> time complexity for {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed since then
> (e.g., generation of Java code to evaluate expressions), so it is worth
> measuring the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x slower than
> {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type inside
> {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this
> information, we can come up with solutions.
> Based on my preliminary investigation, quite a few optimizations are possible,
> and they frequently depend on the specific data type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718195#comment-16718195
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446404202
 
 
   Merged build finished. Test FAILed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Benchmark performance of In and InSet expressions
> -
>
> Key: SPARK-26203
> URL: https://issues.apache.org/jira/browse/SPARK-26203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> The {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of
> possible values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all
> values are literals. This was done for performance reasons, to avoid O\(n\)
> time complexity for {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed since then
> (e.g., generation of Java code to evaluate expressions), so it is worth
> measuring the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x slower than
> {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type inside
> {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this
> information, we can come up with solutions.
> Based on my preliminary investigation, quite a few optimizations are possible,
> and they frequently depend on the specific data type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718191#comment-16718191
 ] 

ASF GitHub Bot commented on SPARK-26203:


SparkQA commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of 
In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446404185
 
 
   **[Test build #3 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/3/testReport)**
 for PR 23291 at commit 
[`987bea4`](https://github.com/apache/spark/commit/987bea48350ed2e3862b965e07d5d5335e1d86c2).
* This patch **fails to generate documentation**.
* This patch merges cleanly.
* This patch adds no public classes.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Benchmark performance of In and InSet expressions
> -
>
> Key: SPARK-26203
> URL: https://issues.apache.org/jira/browse/SPARK-26203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> The {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of
> possible values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all
> values are literals. This was done for performance reasons, to avoid O\(n\)
> time complexity for {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed since then
> (e.g., generation of Java code to evaluate expressions), so it is worth
> measuring the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x slower than
> {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type inside
> {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this
> information, we can come up with solutions.
> Based on my preliminary investigation, quite a few optimizations are possible,
> and they frequently depend on the specific data type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718183#comment-16718183
 ] 

ASF GitHub Bot commented on SPARK-26311:


vanzin commented on issue #23260: [SPARK-26311][YARN] New feature: custom log 
URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446402857
 
 
   This is the other PR I was referring to: #20326


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark has been setting static log URLs for YARN applications, which point to
> the NodeManager web app. Normally this works for both running and finished
> apps, but there are also other approaches to maintaining application logs, such
> as an external log service, which avoids the application log URL becoming a
> dead link when the NodeManager is not accessible (node decommissioned, elastic
> nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, which
> end users can set to point application logs to an external log service.
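
A sketch of the kind of pattern substitution such a configuration could enable;
the config key and placeholder names below are purely illustrative assumptions,
not the ones the PR introduces:

object CustomLogUrlSketch {
  // Hypothetical user-supplied pattern, e.g. set through a conf such as
  // "spark.yarn.custom.log.url" (illustrative name only).
  val pattern =
    "http://logserver.example.com/logs/{{CLUSTER_ID}}/{{CONTAINER_ID}}/{{FILE_NAME}}"

  // Expand {{PLACEHOLDER}} tokens with per-container values.
  def buildLogUrl(pattern: String, values: Map[String, String]): String =
    values.foldLeft(pattern) { case (url, (key, value)) =>
      url.replace(s"{{$key}}", value)
    }

  def main(args: Array[String]): Unit = {
    println(buildLogUrl(pattern, Map(
      "CLUSTER_ID"   -> "yarn-prod",
      "CONTAINER_ID" -> "container_1544500000000_0001_01_000002",
      "FILE_NAME"    -> "stderr")))
  }
}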



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13356) WebUI missing input information when recovering from driver failure

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718181#comment-16718181
 ] 

ASF GitHub Bot commented on SPARK-13356:


vanzin closed pull request #11228: [SPARK-13356][Streaming] WebUI missing input 
information when recovering from driver failure
URL: https://github.com/apache/spark/pull/11228
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala
 
b/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala
index 7e57bb18cbd50..01404b7a53540 100644
--- 
a/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala
+++ 
b/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala
@@ -313,16 +313,32 @@ private[spark] class DirectKafkaInputDStream[K, V](
 
 override def restore(): Unit = {
   batchForTime.toSeq.sortBy(_._1)(Time.ordering).foreach { case (t, b) =>
- logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", 
"]")}")
- generatedRDDs += t -> new KafkaRDD[K, V](
-   context.sparkContext,
-   executorKafkaParams,
-   b.map(OffsetRange(_)),
-   getPreferredHosts,
-   // during restore, it's possible same partition will be consumed 
from multiple
-   // threads, so dont use cache
-   false
- )
+logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", 
"]")}")
+val recoverOffsets = b.map(OffsetRange(_))
+val rdd = new KafkaRDD[K, V](
+  context.sparkContext,
+  executorKafkaParams,
+  recoverOffsets,
+  getPreferredHosts,
+  // during restore, it's possible same partition will be consumed 
from multiple
+  // threads, so dont use cache
+  false
+)
+// Report the record number and metadata of this batch interval to 
InputInfoTracker.
+val description = recoverOffsets.filter { offsetRange =>
+  // Don't display empty ranges.
+  offsetRange.fromOffset != offsetRange.untilOffset
+}.map { offsetRange =>
+  s"topic: ${offsetRange.topic}\tpartition: 
${offsetRange.partition}\t" +
+s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}"
+}.mkString("\n")
+// Copy offsetRanges to immutable.List to prevent from being modified 
by the user
+val metadata = Map(
+  "offsets" -> recoverOffsets.toList,
+  StreamInputInfo.METADATA_KEY_DESCRIPTION -> description)
+val inputInfo = StreamInputInfo(id, rdd.count, metadata)
+ssc.scheduler.inputInfoTracker.reportInfo(t, inputInfo)
+generatedRDDs += t -> rdd
   }
 }
   }
diff --git 
a/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala
 
b/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala
index c3c799375bbeb..274d1cbbb07e1 100644
--- 
a/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala
+++ 
b/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala
@@ -210,9 +210,26 @@ class DirectKafkaInputDStream[
   val leaders = KafkaCluster.checkErrors(kc.findLeaders(topics))
 
   batchForTime.toSeq.sortBy(_._1)(Time.ordering).foreach { case (t, b) =>
- logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", 
"]")}")
- generatedRDDs += t -> new KafkaRDD[K, V, U, T, R](
-   context.sparkContext, kafkaParams, b.map(OffsetRange(_)), leaders, 
messageHandler)
+logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", 
"]")}")
+val recoverOffsets = b.map(OffsetRange(_))
+val rdd = new KafkaRDD[K, V, U, T, R](
+  context.sparkContext, kafkaParams, recoverOffsets, leaders, 
messageHandler)
+
+val description = recoverOffsets.filter { offsetRange =>
+  // Don't display empty ranges.
+  offsetRange.fromOffset != offsetRange.untilOffset
+}.map { offsetRange =>
+  s"topic: ${offsetRange.topic}\tpartition: 
${offsetRange.partition}\t" +
+s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}"
+}.mkString("\n")
+
+// Copy offsetRanges to immutable.List to prevent from being modified 
by the user
+val metadata = Map(
+  "offsets" -> recoverOffsets.toList,
+  StreamInputInfo.METADATA_KEY_DESCRIPTION -> description)
+val inputInfo 

[jira] [Commented] (SPARK-25064) Total Tasks in WebUI does not match Active+Failed+Complete Tasks

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718179#comment-16718179
 ] 

ASF GitHub Bot commented on SPARK-25064:


vanzin commented on issue #22051: [SPARK-25064][WEBUI] Add killed tasks count 
info to WebUI
URL: https://github.com/apache/spark/pull/22051#issuecomment-446402102
 
 
   Hmm, I'm not so sure that "tasks killed on executor x" is a very interesting 
metric. If this information was altogether missing, sure, but this seems to 
only touch the per-executor information.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Total Tasks in WebUI does not match Active+Failed+Complete Tasks
> 
>
> Key: SPARK-25064
> URL: https://issues.apache.org/jira/browse/SPARK-25064
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.2, 2.3.0, 2.3.1
>Reporter: StanZhai
>Priority: Minor
> Attachments: 1533128402933_3.png
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718177#comment-16718177
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446401600
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5989/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Benchmark performance of In and InSet expressions
> -
>
> Key: SPARK-26203
> URL: https://issues.apache.org/jira/browse/SPARK-26203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> The {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible 
> values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values 
> are literals. This was done for performance reasons, to avoid O\(n\) time 
> complexity for {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed since then 
> (e.g., generation of Java code to evaluate expressions), so it is worth 
> measuring the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x slower than 
> {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type inside 
> {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this 
> information, we can come up with solutions.
> Based on my preliminary investigation, we can apply quite a few optimizations, 
> which frequently depend on the specific data type.
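
For readers skimming the archive, a minimal spark-shell sketch (not part of the ticket; only the config key above is taken from it) of how the In-to-InSet conversion can be observed:

{code:java}
// Lower the threshold so that an In expression with more than two literal
// values is rewritten to InSet by the optimizer.
import spark.implicits._

spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2")

val df = Seq(1, 2, 3, 4, 5).toDF("v").filter($"v".isin(10, 20, 30))
// The optimized plan should show the InSet-backed filter instead of In.
df.explain(true)
{code}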



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718176#comment-16718176
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446401593
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Benchmark performance of In and InSet expressions
> -
>
> Key: SPARK-26203
> URL: https://issues.apache.org/jira/browse/SPARK-26203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible 
> values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values 
> are literals. This was done for performance reasons to avoid O\(n\) time 
> complexity for {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed since then 
> (e.g., generation of Java code to evaluate expressions), so it is worth 
> measuring the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x slower than 
> {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type inside 
> {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this 
> information, we can come up with solutions.
> Based on my preliminary investigation, we can apply quite a few optimizations, 
> which frequently depend on the specific data type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718174#comment-16718174
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446401600
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5989/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Benchmark performance of In and InSet expressions
> -
>
> Key: SPARK-26203
> URL: https://issues.apache.org/jira/browse/SPARK-26203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible 
> values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values 
> are literals. This was done for performance reasons to avoid O\(n\) time 
> complexity for {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed since then 
> (e.g., generation of Java code to evaluate expressions), so it is worth 
> measuring the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x slower than 
> {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type inside 
> {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this 
> information, we can come up with solutions.
> Based on my preliminary investigation, we can apply quite a few optimizations, 
> which frequently depend on the specific data type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718173#comment-16718173
 ] 

ASF GitHub Bot commented on SPARK-26203:


AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446401593
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Benchmark performance of In and InSet expressions
> -
>
> Key: SPARK-26203
> URL: https://issues.apache.org/jira/browse/SPARK-26203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible 
> values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values 
> are literals. This was done for performance reasons to avoid O\(n\) time 
> complexity for {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed since then 
> (e.g., generation of Java code to evaluate expressions), so it is worth 
> measuring the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x slower than 
> {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type inside 
> {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this 
> information, we can come up with solutions.
> Based on my preliminary investigation, we can apply quite a few optimizations, 
> which frequently depend on the specific data type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718172#comment-16718172
 ] 

ASF GitHub Bot commented on SPARK-26203:


SparkQA commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of 
In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#issuecomment-446401531
 
 
   **[Test build #3 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/3/testReport)**
 for PR 23291 at commit 
[`987bea4`](https://github.com/apache/spark/commit/987bea48350ed2e3862b965e07d5d5335e1d86c2).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Benchmark performance of In and InSet expressions
> -
>
> Key: SPARK-26203
> URL: https://issues.apache.org/jira/browse/SPARK-26203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible 
> values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values 
> are literals. This was done for performance reasons to avoid O\(n\) time 
> complexity for {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed since then 
> (e.g., generation of Java code to evaluate expressions), so it is worth 
> measuring the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x slower than 
> {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type inside 
> {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this 
> information, we can come up with solutions.
> Based on my preliminary investigation, we can apply quite a few optimizations, 
> which frequently depend on the specific data type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22640) Can not switch python exec in executor side

2018-12-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-22640:
--

Assignee: Marcelo Vanzin

> Can not switch python exec in executor side
> ---
>
> Key: SPARK-22640
> URL: https://issues.apache.org/jira/browse/SPARK-22640
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Kent Yao
>Assignee: Marcelo Vanzin
>Priority: Major
>
> {code:java}
> PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python \
> bin/spark-submit --master yarn --deploy-mode client \ 
> --archives ~/anaconda3/envs/py3.zip \
> --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python \
> --conf spark.executorEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python  \
> /home/hadoop/data/apache-spark/spark-2.1.2-bin-hadoop2.7/examples/src/main/python/mllib/correlations_example.py
> {code}
> In the case above, I created a Python environment, delivered it via 
> `--archives`, then referenced it on the executor node via 
> `spark.executorEnv.PYSPARK_PYTHON`.
> But the executor seemed to use `PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python` 
> instead of `spark.executorEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python`.
> {code:java}
> java.io.IOException: Cannot run program 
> "/home/hadoop/anaconda3/envs/py3/bin/python": error=2, No such file or 
> directory
>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
>   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: error=2, No such file or directory
>   at java.lang.UNIXProcess.forkAndExec(Native Method)
>   at java.lang.UNIXProcess.(UNIXProcess.java:248)
>   at java.lang.ProcessImpl.start(ProcessImpl.java:134)
>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
>   ... 23 more
> {code}
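
As an aside (not from the ticket), a quick way to see which PYSPARK_PYTHON the executors actually observe is to read the environment variable from inside a task; a minimal Scala sketch, assuming an active SparkContext named `sc`:

{code:java}
// Collect the PYSPARK_PYTHON value seen by each executor JVM. If
// spark.executorEnv.PYSPARK_PYTHON took effect, this should print the
// py3.zip path rather than the driver-side anaconda path.
val observed = sc.parallelize(1 to 8, 8)
  .map(_ => sys.env.getOrElse("PYSPARK_PYTHON", "<unset>"))
  .distinct()
  .collect()
println(observed.mkString(", "))
{code}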



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22640) Can not switch python exec in executor side

2018-12-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-22640:
--

Assignee: (was: Marcelo Vanzin)

> Can not switch python exec in executor side
> ---
>
> Key: SPARK-22640
> URL: https://issues.apache.org/jira/browse/SPARK-22640
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python \
> bin/spark-submit --master yarn --deploy-mode client \ 
> --archives ~/anaconda3/envs/py3.zip \
> --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python \
> --conf spark.executorEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python  \
> /home/hadoop/data/apache-spark/spark-2.1.2-bin-hadoop2.7/examples/src/main/python/mllib/correlations_example.py
> {code}
> In the case above, I created a Python environment, delivered it via 
> `--archives`, then referenced it on the executor node via 
> `spark.executorEnv.PYSPARK_PYTHON`.
> But the executor seemed to use `PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python` 
> instead of `spark.executorEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python`.
> {code:java}
> java.io.IOException: Cannot run program 
> "/home/hadoop/anaconda3/envs/py3/bin/python": error=2, No such file or 
> directory
>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
>   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: error=2, No such file or directory
>   at java.lang.UNIXProcess.forkAndExec(Native Method)
>   at java.lang.UNIXProcess.(UNIXProcess.java:248)
>   at java.lang.ProcessImpl.start(ProcessImpl.java:134)
>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
>   ... 23 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718140#comment-16718140
 ] 

ASF GitHub Bot commented on SPARK-26311:


AmplabJenkins commented on issue #23260: [SPARK-26311][YARN] New feature: 
custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446394147
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/2/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark has been setting static log URLs for YARN applications, which point to 
> the NodeManager webapp. Normally this works for both running and finished 
> apps, but there are also other approaches to maintaining application logs, 
> such as an external log service, which keeps the application log URL from 
> becoming a dead link when the NodeManager is not accessible (node 
> decommissioned, elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, which 
> end users can set to point application logs to the external log 
> service.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718142#comment-16718142
 ] 

ASF GitHub Bot commented on SPARK-26311:


AmplabJenkins removed a comment on issue #23260: [SPARK-26311][YARN] New 
feature: custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446394143
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark has been setting static log URLs for YARN applications, which point to 
> the NodeManager webapp. Normally this works for both running and finished 
> apps, but there are also other approaches to maintaining application logs, 
> such as an external log service, which keeps the application log URL from 
> becoming a dead link when the NodeManager is not accessible (node 
> decommissioned, elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, which 
> end users can set to point application logs to the external log 
> service.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718143#comment-16718143
 ] 

ASF GitHub Bot commented on SPARK-26311:


AmplabJenkins removed a comment on issue #23260: [SPARK-26311][YARN] New 
feature: custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446394147
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/2/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark has been setting static log URLs for YARN applications, which point to 
> the NodeManager webapp. Normally this works for both running and finished 
> apps, but there are also other approaches to maintaining application logs, 
> such as an external log service, which keeps the application log URL from 
> becoming a dead link when the NodeManager is not accessible (node 
> decommissioned, elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, which 
> end users can set to point application logs to the external log 
> service.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718139#comment-16718139
 ] 

ASF GitHub Bot commented on SPARK-26311:


AmplabJenkins commented on issue #23260: [SPARK-26311][YARN] New feature: 
custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446394143
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark has been setting static log URLs for YARN applications, which point to 
> the NodeManager webapp. Normally this works for both running and finished 
> apps, but there are also other approaches to maintaining application logs, 
> such as an external log service, which keeps the application log URL from 
> becoming a dead link when the NodeManager is not accessible (node 
> decommissioned, elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, which 
> end users can set to point application logs to the external log 
> service.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718137#comment-16718137
 ] 

ASF GitHub Bot commented on SPARK-26311:


SparkQA removed a comment on issue #23260: [SPARK-26311][YARN] New feature: 
custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446389268
 
 
   **[Test build #2 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/2/testReport)**
 for PR 23260 at commit 
[`4c865fd`](https://github.com/apache/spark/commit/4c865fdc06b327bc9fb5865d9b5cbd600a892539).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark has been setting static log URLs for YARN applications, which point to 
> the NodeManager webapp. Normally this works for both running and finished 
> apps, but there are also other approaches to maintaining application logs, 
> such as an external log service, which keeps the application log URL from 
> becoming a dead link when the NodeManager is not accessible (node 
> decommissioned, elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, which 
> end users can set to point application logs to the external log 
> service.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718136#comment-16718136
 ] 

ASF GitHub Bot commented on SPARK-26311:


SparkQA commented on issue #23260: [SPARK-26311][YARN] New feature: custom log 
URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446394066
 
 
   **[Test build #2 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/2/testReport)**
 for PR 23260 at commit 
[`4c865fd`](https://github.com/apache/spark/commit/4c865fdc06b327bc9fb5865d9b5cbd600a892539).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark has been setting static log URLs for YARN applications, which point to 
> the NodeManager webapp. Normally this works for both running and finished 
> apps, but there are also other approaches to maintaining application logs, 
> such as an external log service, which keeps the application log URL from 
> becoming a dead link when the NodeManager is not accessible (node 
> decommissioned, elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, which 
> end users can set to point application logs to the external log 
> service.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11085) Add support for HTTP proxy

2018-12-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-11085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11085.

Resolution: Duplicate

I believe SPARK-17568 should take care of this, since that allows you to use a 
custom Ivy config file.

> Add support for HTTP proxy 
> ---
>
> Key: SPARK-11085
> URL: https://issues.apache.org/jira/browse/SPARK-11085
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Reporter: Dustin Cote
>Priority: Minor
>
> Add a way to update ivysettings.xml for the spark-shell and spark-submit to 
> support proxy settings for clusters that need to access a remote repository 
> through an http proxy.  Typically this would be done like:
> JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=proxy.host -Dhttp.proxyPort=8080 
> -Dhttps.proxyHost=proxy.host.secure -Dhttps.proxyPort=8080"
> Directly in the ivysettings.xml would look like:
>  
>  proxyport="8080" 
> nonproxyhosts="nonproxy.host"/> 
>  
> Even better would be a way to customize the ivysettings.xml with command 
> options.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718112#comment-16718112
 ] 

ASF GitHub Bot commented on SPARK-26311:


HeartSaVioR commented on a change in pull request #23260: [SPARK-26311][YARN] 
New feature: custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#discussion_r240817633
 
 

 ##
 File path: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala
 ##
 @@ -246,13 +253,57 @@ private[yarn] class ExecutorRunnable(
   sys.env.get("SPARK_USER").foreach { user =>
 val containerId = ConverterUtils.toString(c.getId)
 val address = c.getNodeHttpAddress
-val baseUrl = s"$httpScheme$address/node/containerlogs/$containerId/$user"
 
-env("SPARK_LOG_URL_STDERR") = s"$baseUrl/stderr?start=-4096"
-env("SPARK_LOG_URL_STDOUT") = s"$baseUrl/stdout?start=-4096"
+val customLogUrl = sparkConf.get(config.CUSTOM_LOG_URL)
+
+val envNameToFileNameMap = Map("SPARK_LOG_URL_STDERR" -> "stderr",
+  "SPARK_LOG_URL_STDOUT" -> "stdout")
+val logUrls = ExecutorRunnable.buildLogUrls(customLogUrl, httpScheme, address,
+  clusterId, containerId, user, envNameToFileNameMap)
+logUrls.foreach { case (envName, url) =>
+  env(envName) = url
+}
   }
 }
 
 env
   }
 }
+
+private[yarn] object ExecutorRunnable {
+  def buildLogUrls(
+logUrlPattern: String,
+httpScheme: String,
+nodeHttpAddress: String,
+clusterId: Option[String],
+containerId: String,
+user: String,
+envNameToFileNameMap: Map[String, String]): Map[String, String] = {
 
 Review comment:
   Ah yes I just confused indent rule between method parameters and return... 
Nice catch. Will address.
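
For context, a rough sketch of the kind of placeholder substitution the quoted `buildLogUrls` signature implies. This is illustrative only, not the patch itself: the `{{...}}` token names come from the test snippets in this thread, and the error handling for a missing mandatory value is omitted.

{code:java}
// Substitute whatever pattern values are available into the URL template.
def substitutePattern(pattern: String, values: Map[String, Option[String]]): String =
  values.foldLeft(pattern) { case (url, (key, value)) =>
    value.map(v => url.replace(s"{{$key}}", v)).getOrElse(url)
  }

// substitutePattern(
//   "{{HttpScheme}}{{NodeHttpAddress}}/logs/{{ContainerId}}/{{FileName}}",
//   Map("HttpScheme" -> Some("https://"), "NodeHttpAddress" -> Some("nm:8042"),
//       "ContainerId" -> Some("container_1"), "FileName" -> Some("stdout")))
// => "https://nm:8042/logs/container_1/stdout"
{code}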


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark has been setting static log URLs for YARN applications, which point to 
> the NodeManager webapp. Normally this works for both running and finished 
> apps, but there are also other approaches to maintaining application logs, 
> such as an external log service, which keeps the application log URL from 
> becoming a dead link when the NodeManager is not accessible (node 
> decommissioned, elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, which 
> end users can set to point application logs to the external log 
> service.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718111#comment-16718111
 ] 

ASF GitHub Bot commented on SPARK-26311:


HeartSaVioR commented on a change in pull request #23260: [SPARK-26311][YARN] 
New feature: custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#discussion_r240817431
 
 

 ##
 File path: 
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnLogUrlSuite.scala
 ##
 @@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.yarn
+
+import org.apache.spark.SparkFunSuite
+
+class YarnLogUrlSuite extends SparkFunSuite {
+
+  private val testHttpScheme = "https://"
+  private val testNodeHttpAddress = "nodeManager:1234"
+  private val testContainerId = "testContainer"
+  private val testUser = "testUser"
+  private val testEnvNameToFileNameMap = Map("TEST_ENV_STDOUT" -> "stdout",
+"TEST_ENV_STDERR" -> "stderr")
+
+  test("Custom log URL - leverage all patterns, all values for patterns are available") {
+val logUrlPattern = "{{HttpScheme}}{{NodeHttpAddress}}/logs/clusters/{{ClusterId}}" +
+  "/containers/{{ContainerId}}/users/{{User}}/files/{{FileName}}"
+
+val clusterId = Some("testCluster")
+
+val logUrls = ExecutorRunnable.buildLogUrls(logUrlPattern, testHttpScheme, testNodeHttpAddress,
+  clusterId, testContainerId, testUser, testEnvNameToFileNameMap)
+
+val expectedLogUrls = testEnvNameToFileNameMap.map { case (envName, fileName) =>
+  envName -> (s"$testHttpScheme$testNodeHttpAddress/logs/clusters/${clusterId.get}" +
+s"/containers/$testContainerId/users/$testUser/files/$fileName")
+}
+
+assert(logUrls === expectedLogUrls)
+  }
+
+  test("Custom log URL - optional pattern is not used in log URL") {
+// here {{ClusterId}} is excluded in this pattern
+val logUrlPattern = "{{HttpScheme}}{{NodeHttpAddress}}/logs/containers/{{ContainerId}}" +
+  "/users/{{User}}/files/{{FileName}}"
+
+// suppose the value of {{ClusterId}} pattern is not available
+val clusterId = None
+
+// This should not throw an exception: the value for optional pattern is not available
+// but we also don't use the pattern in log URL.
+val logUrls = ExecutorRunnable.buildLogUrls(logUrlPattern, testHttpScheme, testNodeHttpAddress,
+  clusterId, testContainerId, testUser, testEnvNameToFileNameMap)
+
+val expectedLogUrls = testEnvNameToFileNameMap.map { case (envName, fileName) =>
+  envName -> (s"$testHttpScheme$testNodeHttpAddress/logs/containers/$testContainerId" +
+s"/users/$testUser/files/$fileName")
+}
+
+assert(logUrls === expectedLogUrls)
+  }
+
+  test("Custom log URL - optional pattern is used in log URL but the value " +
+"is not present") {
 
 Review comment:
   Will address.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark has been setting static log URLs for YARN applications, which point to 
> the NodeManager webapp. Normally this works for both running and finished 
> apps, but there are also other approaches to maintaining application logs, 
> such as an external log service, which keeps the application log URL from 
> becoming a dead link when the NodeManager is not accessible (node 
> decommissioned, elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, which 
> end users can set to point 

[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718107#comment-16718107
 ] 

ASF GitHub Bot commented on SPARK-26311:


HeartSaVioR commented on issue #23260: [SPARK-26311][YARN] New feature: custom 
log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446387143
 
 
   > My understanding is that this allows pointing the Spark UI directly at the 
history server (old JHS or new ATS) instead of hardcoding the NM URL and 
relying on the NM redirecting you, since the NM may not exist later on.
   
   Yes, exactly. That's one of the issues this patch enables us to deal with, and 
another one is cluster awareness. The existence of `the clusterId of RM` 
shows that YARN opens up the possibility of maintaining multiple YARN 
clusters and providing centralized services which operate across multiple YARN 
clusters.
   
   > when perhaps if there was a way to hook this up on the Spark history 
server side only, that may be more useful.
   > I think someone tried that in the past but the SHS change was very 
YARN-specific, which made it kinda sub-optimal.
   
   I agree this matters less for running applications than for finished 
applications. Currently Spark just sets executor log URLs in the environment on 
the resource manager side and uses them. The usages are broad, and I'm not sure 
we can determine which resource manager the application ran on, and whether the 
application is running or finished, in every usage. (I'm not familiar with the 
UI side.) So this patch takes the easiest way to deal with it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark has been setting static log URLs for YARN applications, which point to 
> the NodeManager webapp. Normally this works for both running and finished 
> apps, but there are also other approaches to maintaining application logs, 
> such as an external log service, which keeps the application log URL from 
> becoming a dead link when the NodeManager is not accessible (node 
> decommissioned, elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, which 
> end users can set to point application logs to the external log 
> service.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718114#comment-16718114
 ] 

ASF GitHub Bot commented on SPARK-26311:


SparkQA commented on issue #23260: [SPARK-26311][YARN] New feature: custom log 
URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446389268
 
 
   **[Test build #2 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/2/testReport)**
 for PR 23260 at commit 
[`4c865fd`](https://github.com/apache/spark/commit/4c865fdc06b327bc9fb5865d9b5cbd600a892539).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark has been setting static log URLs for YARN applications, which point to 
> the NodeManager webapp. Normally this works for both running and finished 
> apps, but there are also other approaches to maintaining application logs, 
> such as an external log service, which keeps the application log URL from 
> becoming a dead link when the NodeManager is not accessible (node 
> decommissioned, elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, which 
> end users can set to point application logs to the external log 
> service.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718109#comment-16718109
 ] 

ASF GitHub Bot commented on SPARK-26311:


HeartSaVioR edited a comment on issue #23260: [SPARK-26311][YARN] New feature: 
custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446387143
 
 
   @squito @vanzin 
   
   > I assume for your setup, you are not using {{NodeHttpAddress}} and are 
replacing it with something else centralized?
   
   Yes, exactly.
   
   > My understanding is that this allows pointing the Spark UI directly at the 
history server (old JHS or new ATS) instead of hardcoding the NM URL and 
relying on the NM redirecting you, since the NM may not exist later on.
   
   Yes, exactly. That's one of the issues this patch enables us to deal with, and 
another one is cluster awareness. The existence of `the clusterId of RM` 
shows that YARN opens up the possibility of maintaining multiple YARN 
clusters and providing centralized services which operate across multiple YARN 
clusters.
   
   > when perhaps if there was a way to hook this up on the Spark history 
server side only, that may be more useful.
   > I think someone tried that in the past but the SHS change was very 
YARN-specific, which made it kinda sub-optimal.
   
   I agree this matters less for running applications than for finished 
applications. Currently Spark just sets executor log URLs in the environment on 
the resource manager side and uses them. The usages are broad, and I'm not sure 
we can determine which resource manager the application ran on, and whether the 
application is running or finished, in every usage. (I'm not familiar with the 
UI side.) So this patch takes the easiest way to deal with it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark has been setting static log URLs for YARN applications, which point to 
> the NodeManager webapp. Normally this works for both running and finished 
> apps, but there are also other approaches to maintaining application logs, 
> such as an external log service, which keeps the application log URL from 
> becoming a dead link when the NodeManager is not accessible (node 
> decommissioned, elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, which 
> end users can set to point application logs to the external log 
> service.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718108#comment-16718108
 ] 

ASF GitHub Bot commented on SPARK-26311:


HeartSaVioR edited a comment on issue #23260: [SPARK-26311][YARN] New feature: 
custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446387143
 
 
   @squito @vanzin 
   
   > My understanding is that this allows pointing the Spark UI directly at the 
history server (old JHS or new ATS) instead of hardcoding the NM URL and 
relying on the NM redirecting you, since the NM may not exist later on.
   
   Yes, exactly. That's one of the issues this patch enables us to deal with, and 
another one is cluster awareness. The existence of `the clusterId of RM` 
shows that YARN opens up the possibility of maintaining multiple YARN 
clusters and providing centralized services which operate across multiple YARN 
clusters.
   
   > when perhaps if there was a way to hook this up on the Spark history 
server side only, that may be more useful.
   > I think someone tried that in the past but the SHS change was very 
YARN-specific, which made it kinda sub-optimal.
   
   I agree this matters less for running applications than for finished 
applications. Currently Spark just sets executor log URLs in the environment on 
the resource manager side and uses them. The usages are broad, and I'm not sure 
we can determine which resource manager the application ran on, and whether the 
application is running or finished, in every usage. (I'm not familiar with the 
UI side.) So this patch takes the easiest way to deal with it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark has been setting static log URLs for YARN applications, which point to 
> the NodeManager webapp. Normally this works for both running and finished 
> apps, but there are also other approaches to maintaining application logs, 
> such as an external log service, which keeps the application log URL from 
> becoming a dead link when the NodeManager is not accessible (node 
> decommissioned, elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, which 
> end users can set to point application logs to the external log 
> service.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26322) Simplify kafka delegation token sasl.mechanism configuration

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718085#comment-16718085
 ] 

ASF GitHub Bot commented on SPARK-26322:


SparkQA commented on issue #23274: [SPARK-26322][SS] Add 
spark.kafka.sasl.token.mechanism to ease delegation token configuration.
URL: https://github.com/apache/spark/pull/23274#issuecomment-446381197
 
 
   **[Test build #1 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/1/testReport)**
 for PR 23274 at commit 
[`de35aa2`](https://github.com/apache/spark/commit/de35aa2479400c11a48658501a838bebd83553cd).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Simplify kafka delegation token sasl.mechanism configuration
> 
>
> Key: SPARK-26322
> URL: https://issues.apache.org/jira/browse/SPARK-26322
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> When a Kafka delegation token is obtained, a SCRAM sasl.mechanism has to be 
> configured for authentication. This currently has to be configured on each related 
> source/sink, which is inconvenient from a user perspective. Such granularity is 
> not required, and this configuration can be implemented with one central 
> parameter.
> Kafka currently supports two SCRAM-related sasl.mechanism values:
> - SCRAM-SHA-256
> - SCRAM-SHA-512
> and these are configured on brokers, which makes this configuration global.
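
For illustration only (the config name comes from the PR titles above and was still being renamed at this point, so treat it as an assumption), a central setting would make a Structured Streaming read look roughly like this, with no per-source sasl.mechanism option:

{code:java}
// Sketch: the SCRAM mechanism is configured once, e.g. via
//   --conf spark.kafka.sasl.token.mechanism=SCRAM-SHA-512
// instead of on every Kafka source/sink; the read itself is unchanged.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()
{code}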



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26322) Simplify kafka delegation token sasl.mechanism configuration

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718083#comment-16718083
 ] 

ASF GitHub Bot commented on SPARK-26322:


gaborgsomogyi commented on issue #23274: [SPARK-26322][SS] Add 
spark.kafka.token.sasl.mechanism to ease delegation token configuration.
URL: https://github.com/apache/spark/pull/23274#issuecomment-446380248
 
 
   retest this, please


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Simplify kafka delegation token sasl.mechanism configuration
> 
>
> Key: SPARK-26322
> URL: https://issues.apache.org/jira/browse/SPARK-26322
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> When a Kafka delegation token is obtained, a SCRAM sasl.mechanism has to be 
> configured for authentication. This currently has to be configured on each related 
> source/sink, which is inconvenient from a user perspective. Such granularity is 
> not required, and this configuration can be implemented with one central 
> parameter.
> Kafka currently supports two SCRAM-related sasl.mechanism values:
> - SCRAM-SHA-256
> - SCRAM-SHA-512
> and these are configured on brokers, which makes this configuration global.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25877) Put all feature-related code in the feature step itself

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718071#comment-16718071
 ] 

ASF GitHub Bot commented on SPARK-25877:


vanzin commented on a change in pull request #23220: [SPARK-25877][k8s] Move 
all feature logic to feature classes.
URL: https://github.com/apache/spark/pull/23220#discussion_r240805992
 
 

 ##
 File path: 
resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/PodBuilderSuite.scala
 ##
 @@ -0,0 +1,177 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.deploy.k8s
+
+import java.io.File
+
+import io.fabric8.kubernetes.api.model.{Config => _, _}
+import io.fabric8.kubernetes.client.KubernetesClient
+import io.fabric8.kubernetes.client.dsl.{MixedOperation, PodResource}
+import org.mockito.Matchers.any
+import org.mockito.Mockito.{mock, never, verify, when}
+import scala.collection.JavaConverters._
+
+import org.apache.spark.{SparkConf, SparkException, SparkFunSuite}
+import org.apache.spark.deploy.k8s._
+import org.apache.spark.internal.config.ConfigEntry
+
+abstract class PodBuilderSuite extends SparkFunSuite {
+
+  protected def templateFileConf: ConfigEntry[_]
+
+  protected def buildPod(sparkConf: SparkConf, client: KubernetesClient): 
SparkPod
+
+  private val baseConf = new SparkConf(false)
+.set(Config.CONTAINER_IMAGE, "spark-executor:latest")
+
+  test("use empty initial pod if template is not specified") {
+val client = mock(classOf[KubernetesClient])
+buildPod(baseConf.clone(), client)
+verify(client, never()).pods()
+  }
+
+  test("load pod template if specified") {
+val client = mockKubernetesClient()
+val sparkConf = baseConf.clone().set(templateFileConf.key, 
"template-file.yaml")
+val pod = buildPod(sparkConf, client)
+verifyPod(pod)
+  }
+
+  test("complain about misconfigured pod template") {
+val client = mockKubernetesClient(
+  new PodBuilder()
+.withNewMetadata()
+.addToLabels("test-label-key", "test-label-value")
+.endMetadata()
+.build())
+val sparkConf = baseConf.clone().set(templateFileConf.key, 
"template-file.yaml")
+val exception = intercept[SparkException] {
+  buildPod(sparkConf, client)
+}
+assert(exception.getMessage.contains("Could not load pod from template 
file."))
+  }
+
+  private def mockKubernetesClient(pod: Pod = podWithSupportedFeatures()): 
KubernetesClient = {
+val kubernetesClient = mock(classOf[KubernetesClient])
+val pods =
+  mock(classOf[MixedOperation[Pod, PodList, DoneablePod, PodResource[Pod, 
DoneablePod]]])
+val podResource = mock(classOf[PodResource[Pod, DoneablePod]])
+when(kubernetesClient.pods()).thenReturn(pods)
+when(pods.load(any(classOf[File]))).thenReturn(podResource)
+when(podResource.get()).thenReturn(pod)
+kubernetesClient
+  }
+
+  private def verifyPod(pod: SparkPod): Unit = {
+val metadata = pod.pod.getMetadata
+assert(metadata.getLabels.containsKey("test-label-key"))
+assert(metadata.getAnnotations.containsKey("test-annotation-key"))
+assert(metadata.getNamespace === "namespace")
+assert(metadata.getOwnerReferences.asScala.exists(_.getName == 
"owner-reference"))
+val spec = pod.pod.getSpec
+assert(!spec.getContainers.asScala.exists(_.getName == 
"executor-container"))
+assert(spec.getDnsPolicy === "dns-policy")
+assert(spec.getHostAliases.asScala.exists(_.getHostnames.asScala.exists(_ 
== "hostname")))
 
 Review comment:
   I'm not modifying this code, just moving it from its previous location.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

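For readers outside the k8s module, the abstract suite quoted above is meant to be extended once per pod builder; a hypothetical concrete subclass could look roughly like the following. The template-file config entry name and the builder call are assumptions for illustration, not code from the PR.

    // Hypothetical sketch: only PodBuilderSuite, templateFileConf, buildPod and
    // SparkPod come from the quoted test; everything else is assumed.
    class DriverPodBuilderSuite extends PodBuilderSuite {

      override protected def templateFileConf: ConfigEntry[_] =
        Config.KUBERNETES_DRIVER_PODTEMPLATE_FILE  // assumed driver template-file entry

      override protected def buildPod(sparkConf: SparkConf, client: KubernetesClient): SparkPod = {
        // In the real suites this delegates to the driver/executor pod builder
        // under test; elided here because that API is not shown above.
        ???
      }
    }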

> Put all feature-related code in the feature step itself
> ---
>
> Key: SPARK-25877
> URL: 

[jira] [Commented] (SPARK-25877) Put all feature-related code in the feature step itself

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718070#comment-16718070
 ] 

ASF GitHub Bot commented on SPARK-25877:


vanzin commented on a change in pull request #23220: [SPARK-25877][k8s] Move 
all feature logic to feature classes.
URL: https://github.com/apache/spark/pull/23220#discussion_r240805965
 
 

 ##
 File path: 
resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/PodBuilderSuite.scala
 ##
 @@ -0,0 +1,177 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.deploy.k8s
+
+import java.io.File
+
+import io.fabric8.kubernetes.api.model.{Config => _, _}
+import io.fabric8.kubernetes.client.KubernetesClient
+import io.fabric8.kubernetes.client.dsl.{MixedOperation, PodResource}
+import org.mockito.Matchers.any
+import org.mockito.Mockito.{mock, never, verify, when}
+import scala.collection.JavaConverters._
+
+import org.apache.spark.{SparkConf, SparkException, SparkFunSuite}
+import org.apache.spark.deploy.k8s._
+import org.apache.spark.internal.config.ConfigEntry
+
+abstract class PodBuilderSuite extends SparkFunSuite {
+
+  protected def templateFileConf: ConfigEntry[_]
+
+  protected def buildPod(sparkConf: SparkConf, client: KubernetesClient): 
SparkPod
+
+  private val baseConf = new SparkConf(false)
+.set(Config.CONTAINER_IMAGE, "spark-executor:latest")
+
+  test("use empty initial pod if template is not specified") {
+val client = mock(classOf[KubernetesClient])
+buildPod(baseConf.clone(), client)
+verify(client, never()).pods()
+  }
+
+  test("load pod template if specified") {
+val client = mockKubernetesClient()
+val sparkConf = baseConf.clone().set(templateFileConf.key, 
"template-file.yaml")
+val pod = buildPod(sparkConf, client)
+verifyPod(pod)
+  }
+
+  test("complain about misconfigured pod template") {
+val client = mockKubernetesClient(
+  new PodBuilder()
+.withNewMetadata()
+.addToLabels("test-label-key", "test-label-value")
+.endMetadata()
+.build())
+val sparkConf = baseConf.clone().set(templateFileConf.key, 
"template-file.yaml")
+val exception = intercept[SparkException] {
+  buildPod(sparkConf, client)
+}
+assert(exception.getMessage.contains("Could not load pod from template 
file."))
+  }
+
+  private def mockKubernetesClient(pod: Pod = podWithSupportedFeatures()): 
KubernetesClient = {
+val kubernetesClient = mock(classOf[KubernetesClient])
+val pods =
+  mock(classOf[MixedOperation[Pod, PodList, DoneablePod, PodResource[Pod, 
DoneablePod]]])
+val podResource = mock(classOf[PodResource[Pod, DoneablePod]])
+when(kubernetesClient.pods()).thenReturn(pods)
+when(pods.load(any(classOf[File]))).thenReturn(podResource)
+when(podResource.get()).thenReturn(pod)
+kubernetesClient
+  }
+
+  private def verifyPod(pod: SparkPod): Unit = {
+val metadata = pod.pod.getMetadata
+assert(metadata.getLabels.containsKey("test-label-key"))
+assert(metadata.getAnnotations.containsKey("test-annotation-key"))
+assert(metadata.getNamespace === "namespace")
+assert(metadata.getOwnerReferences.asScala.exists(_.getName == 
"owner-reference"))
+val spec = pod.pod.getSpec
+assert(!spec.getContainers.asScala.exists(_.getName == 
"executor-container"))
 
 Review comment:
   I'm not modifying this code, just moving it from its previous location.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Put all feature-related code in the feature step itself
> ---
>
> Key: SPARK-25877
> URL: https://issues.apache.org/jira/browse/SPARK-25877
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>

[jira] [Commented] (SPARK-26239) Add configurable auth secret source in k8s backend

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718068#comment-16718068
 ] 

ASF GitHub Bot commented on SPARK-26239:


asfgit closed pull request #23252: [SPARK-26239] File-based secret key loading 
for SASL.
URL: https://github.com/apache/spark/pull/23252
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala 
b/core/src/main/scala/org/apache/spark/SecurityManager.scala
index 96e4b53b24181..15783c952c231 100644
--- a/core/src/main/scala/org/apache/spark/SecurityManager.scala
+++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala
@@ -17,8 +17,11 @@
 
 package org.apache.spark
 
+import java.io.File
 import java.net.{Authenticator, PasswordAuthentication}
 import java.nio.charset.StandardCharsets.UTF_8
+import java.nio.file.Files
+import java.util.Base64
 
 import org.apache.hadoop.io.Text
 import org.apache.hadoop.security.{Credentials, UserGroupInformation}
@@ -43,7 +46,8 @@ import org.apache.spark.util.Utils
  */
 private[spark] class SecurityManager(
 sparkConf: SparkConf,
-val ioEncryptionKey: Option[Array[Byte]] = None)
+val ioEncryptionKey: Option[Array[Byte]] = None,
+authSecretFileConf: ConfigEntry[Option[String]] = AUTH_SECRET_FILE)
   extends Logging with SecretKeyHolder {
 
   import SecurityManager._
@@ -328,6 +332,7 @@ private[spark] class SecurityManager(
 .orElse(Option(secretKey))
 .orElse(Option(sparkConf.getenv(ENV_AUTH_SECRET)))
 .orElse(sparkConf.getOption(SPARK_AUTH_SECRET_CONF))
+.orElse(secretKeyFromFile())
 .getOrElse {
   throw new IllegalArgumentException(
 s"A secret key must be specified via the $SPARK_AUTH_SECRET_CONF 
config")
@@ -348,7 +353,6 @@ private[spark] class SecurityManager(
*/
   def initializeAuth(): Unit = {
 import SparkMasterRegex._
-val k8sRegex = "k8s.*".r
 
 if (!sparkConf.get(NETWORK_AUTH_ENABLED)) {
   return
@@ -371,7 +375,14 @@ private[spark] class SecurityManager(
 return
 }
 
-secretKey = Utils.createSecret(sparkConf)
+if (sparkConf.get(AUTH_SECRET_FILE_DRIVER).isDefined !=
+sparkConf.get(AUTH_SECRET_FILE_EXECUTOR).isDefined) {
+  throw new IllegalArgumentException(
+"Invalid secret configuration: Secret files must be specified for both 
the driver and the" +
+  " executors, not only one or the other.")
+}
+
+secretKey = secretKeyFromFile().getOrElse(Utils.createSecret(sparkConf))
 
 if (storeInUgi) {
   val creds = new Credentials()
@@ -380,6 +391,22 @@ private[spark] class SecurityManager(
 }
   }
 
+  private def secretKeyFromFile(): Option[String] = {
+sparkConf.get(authSecretFileConf).flatMap { secretFilePath =>
+  sparkConf.getOption(SparkLauncher.SPARK_MASTER).map {
+case k8sRegex() =>
+  val secretFile = new File(secretFilePath)
+  require(secretFile.isFile, s"No file found containing the secret key 
at $secretFilePath.")
+  val base64Key = 
Base64.getEncoder.encodeToString(Files.readAllBytes(secretFile.toPath))
+  require(!base64Key.isEmpty, s"Secret key from file located at 
$secretFilePath is empty.")
+  base64Key
+case _ =>
+  throw new IllegalArgumentException(
+"Secret keys provided via files is only allowed in Kubernetes 
mode.")
+  }
+}
+  }
+
   // Default SecurityManager only has a single secret key, so ignore appId.
   override def getSaslUser(appId: String): String = getSaslUser()
   override def getSecretKey(appId: String): String = getSecretKey()
@@ -387,6 +414,7 @@ private[spark] class SecurityManager(
 
 private[spark] object SecurityManager {
 
+  val k8sRegex = "k8s.*".r
   val SPARK_AUTH_CONF = NETWORK_AUTH_ENABLED.key
   val SPARK_AUTH_SECRET_CONF = "spark.authenticate.secret"
   // This is used to set auth secret to an executor's env variable. It should 
have the same
diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala 
b/core/src/main/scala/org/apache/spark/SparkEnv.scala
index 66038eeaea54f..de0c8579d9acc 100644
--- a/core/src/main/scala/org/apache/spark/SparkEnv.scala
+++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala
@@ -232,8 +232,8 @@ object SparkEnv extends Logging {
 if (isDriver) {
   assert(listenerBus != null, "Attempted to create driver SparkEnv with 
null listener bus!")
 }
-
-val securityManager = new SecurityManager(conf, ioEncryptionKey)
+val authSecretFileConf = if (isDriver) AUTH_SECRET_FILE_DRIVER else 
AUTH_SECRET_FILE_EXECUTOR
+val securityManager = new SecurityManager(conf, 

[jira] [Commented] (SPARK-26322) Simplify kafka delegation token sasl.mechanism configuration

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718066#comment-16718066
 ] 

ASF GitHub Bot commented on SPARK-26322:


AmplabJenkins removed a comment on issue #23274: [SPARK-26322][SS] Add 
spark.kafka.token.sasl.mechanism to ease delegation token configuration.
URL: https://github.com/apache/spark/pull/23274#issuecomment-446376257
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99986/
   Test FAILed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Simplify kafka delegation token sasl.mechanism configuration
> 
>
> Key: SPARK-26322
> URL: https://issues.apache.org/jira/browse/SPARK-26322
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> When a Kafka delegation token is obtained, a SCRAM sasl.mechanism has to be 
> configured for authentication. Today this must be configured on each related 
> source/sink, which is inconvenient from the user's perspective. Such 
> granularity is not required, and the setting can be provided through one 
> central parameter.
> Kafka currently supports two SCRAM-related sasl.mechanism values:
> - SCRAM-SHA-256
> - SCRAM-SHA-512
> These are configured on the brokers, which makes the setting effectively global.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25877) Put all feature-related code in the feature step itself

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718067#comment-16718067
 ] 

ASF GitHub Bot commented on SPARK-25877:


vanzin commented on a change in pull request #23220: [SPARK-25877][k8s] Move 
all feature logic to feature classes.
URL: https://github.com/apache/spark/pull/23220#discussion_r240805744
 
 

 ##
 File path: 
resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/PodBuilderSuite.scala
 ##
 @@ -0,0 +1,177 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.deploy.k8s
+
+import java.io.File
+
+import io.fabric8.kubernetes.api.model.{Config => _, _}
+import io.fabric8.kubernetes.client.KubernetesClient
+import io.fabric8.kubernetes.client.dsl.{MixedOperation, PodResource}
+import org.mockito.Matchers.any
+import org.mockito.Mockito.{mock, never, verify, when}
+import scala.collection.JavaConverters._
+
+import org.apache.spark.{SparkConf, SparkException, SparkFunSuite}
+import org.apache.spark.deploy.k8s._
+import org.apache.spark.internal.config.ConfigEntry
+
+abstract class PodBuilderSuite extends SparkFunSuite {
+
+  protected def templateFileConf: ConfigEntry[_]
+
+  protected def buildPod(sparkConf: SparkConf, client: KubernetesClient): 
SparkPod
+
+  private val baseConf = new SparkConf(false)
+.set(Config.CONTAINER_IMAGE, "spark-executor:latest")
+
+  test("use empty initial pod if template is not specified") {
+val client = mock(classOf[KubernetesClient])
+buildPod(baseConf.clone(), client)
+verify(client, never()).pods()
+  }
+
+  test("load pod template if specified") {
+val client = mockKubernetesClient()
+val sparkConf = baseConf.clone().set(templateFileConf.key, 
"template-file.yaml")
+val pod = buildPod(sparkConf, client)
+verifyPod(pod)
+  }
+
+  test("complain about misconfigured pod template") {
+val client = mockKubernetesClient(
+  new PodBuilder()
+.withNewMetadata()
+.addToLabels("test-label-key", "test-label-value")
+.endMetadata()
+.build())
+val sparkConf = baseConf.clone().set(templateFileConf.key, 
"template-file.yaml")
+val exception = intercept[SparkException] {
+  buildPod(sparkConf, client)
+}
+assert(exception.getMessage.contains("Could not load pod from template 
file."))
+  }
+
+  private def mockKubernetesClient(pod: Pod = podWithSupportedFeatures()): 
KubernetesClient = {
+val kubernetesClient = mock(classOf[KubernetesClient])
+val pods =
+  mock(classOf[MixedOperation[Pod, PodList, DoneablePod, PodResource[Pod, 
DoneablePod]]])
+val podResource = mock(classOf[PodResource[Pod, DoneablePod]])
+when(kubernetesClient.pods()).thenReturn(pods)
+when(pods.load(any(classOf[File]))).thenReturn(podResource)
+when(podResource.get()).thenReturn(pod)
+kubernetesClient
+  }
+
+  private def verifyPod(pod: SparkPod): Unit = {
 
 Review comment:
   I actually did not write this test. I copy & pasted it with zero 
modifications from the previous class, and I'd prefer to keep it that way.
   
   That's also an argument for *not* restoring the mocks, which would go 
against what this change is doing. This test should account for modifications 
made by other steps, since if they modify something unexpected, that can change 
the semantics of the feature (pod template support).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Put all feature-related code in the feature step itself
> ---
>
> Key: SPARK-25877
> URL: https://issues.apache.org/jira/browse/SPARK-25877
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This is a 

[jira] [Commented] (SPARK-19827) spark.ml R API for PIC

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718051#comment-16718051
 ] 

ASF GitHub Bot commented on SPARK-19827:


dongjoon-hyun commented on a change in pull request #23072: 
[SPARK-19827][R]spark.ml R API for PIC
URL: https://github.com/apache/spark/pull/23072#discussion_r240803593
 
 

 ##
 File path: R/pkg/R/mllib_clustering.R
 ##
 @@ -610,3 +616,59 @@ setMethod("write.ml", signature(object = "LDAModel", path 
= "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call 
\code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
 
 Review comment:
   I'll be more careful about this roxygen2 stuff.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

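The R wrapper under review delegates to the spark.ml PowerIterationClustering implementation. A minimal sketch of the underlying Scala call, assuming the 2.4+ org.apache.spark.ml.clustering.PowerIterationClustering API and its assignClusters method:

    import org.apache.spark.ml.clustering.PowerIterationClustering
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("pic-sketch").getOrCreate()

    // Edges of the similarity graph: (src, dst, weight).
    val edges = spark.createDataFrame(Seq(
      (0L, 1L, 1.0), (1L, 2L, 1.0), (3L, 4L, 1.0)
    )).toDF("src", "dst", "weight")

    val pic = new PowerIterationClustering()
      .setK(2)
      .setMaxIter(10)
      .setWeightCol("weight")

    // Returns a DataFrame of (id, cluster) assignments, which is the result the
    // R spark.assignClusters wrapper exposes.
    pic.assignClusters(edges).show()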

> spark.ml R API for PIC
> --
>
> Key: SPARK-19827
> URL: https://issues.apache.org/jira/browse/SPARK-19827
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26322) Simplify kafka delegation token sasl.mechanism configuration

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718064#comment-16718064
 ] 

ASF GitHub Bot commented on SPARK-26322:


AmplabJenkins removed a comment on issue #23274: [SPARK-26322][SS] Add 
spark.kafka.token.sasl.mechanism to ease delegation token configuration.
URL: https://github.com/apache/spark/pull/23274#issuecomment-446376247
 
 
   Merged build finished. Test FAILed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Simplify kafka delegation token sasl.mechanism configuration
> 
>
> Key: SPARK-26322
> URL: https://issues.apache.org/jira/browse/SPARK-26322
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> When a Kafka delegation token is obtained, a SCRAM sasl.mechanism has to be 
> configured for authentication. Today this must be configured on each related 
> source/sink, which is inconvenient from the user's perspective. Such 
> granularity is not required, and the setting can be provided through one 
> central parameter.
> Kafka currently supports two SCRAM-related sasl.mechanism values:
> - SCRAM-SHA-256
> - SCRAM-SHA-512
> These are configured on the brokers, which makes the setting effectively global.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26322) Simplify kafka delegation token sasl.mechanism configuration

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718061#comment-16718061
 ] 

ASF GitHub Bot commented on SPARK-26322:


AmplabJenkins commented on issue #23274: [SPARK-26322][SS] Add 
spark.kafka.token.sasl.mechanism to ease delegation token configuration.
URL: https://github.com/apache/spark/pull/23274#issuecomment-446376247
 
 
   Merged build finished. Test FAILed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Simplify kafka delegation token sasl.mechanism configuration
> 
>
> Key: SPARK-26322
> URL: https://issues.apache.org/jira/browse/SPARK-26322
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> When a Kafka delegation token is obtained, a SCRAM sasl.mechanism has to be 
> configured for authentication. Today this must be configured on each related 
> source/sink, which is inconvenient from the user's perspective. Such 
> granularity is not required, and the setting can be provided through one 
> central parameter.
> Kafka currently supports two SCRAM-related sasl.mechanism values:
> - SCRAM-SHA-256
> - SCRAM-SHA-512
> These are configured on the brokers, which makes the setting effectively global.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26322) Simplify kafka delegation token sasl.mechanism configuration

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718059#comment-16718059
 ] 

ASF GitHub Bot commented on SPARK-26322:


SparkQA removed a comment on issue #23274: [SPARK-26322][SS] Add 
spark.kafka.token.sasl.mechanism to ease delegation token configuration.
URL: https://github.com/apache/spark/pull/23274#issuecomment-446303207
 
 
   **[Test build #99986 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99986/testReport)**
 for PR 23274 at commit 
[`de35aa2`](https://github.com/apache/spark/commit/de35aa2479400c11a48658501a838bebd83553cd).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Simplify kafka delegation token sasl.mechanism configuration
> 
>
> Key: SPARK-26322
> URL: https://issues.apache.org/jira/browse/SPARK-26322
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> When a Kafka delegation token is obtained, a SCRAM sasl.mechanism has to be 
> configured for authentication. Today this must be configured on each related 
> source/sink, which is inconvenient from the user's perspective. Such 
> granularity is not required, and the setting can be provided through one 
> central parameter.
> Kafka currently supports two SCRAM-related sasl.mechanism values:
> - SCRAM-SHA-256
> - SCRAM-SHA-512
> These are configured on the brokers, which makes the setting effectively global.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26239) Add configurable auth secret source in k8s backend

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718060#comment-16718060
 ] 

ASF GitHub Bot commented on SPARK-26239:


mccheah commented on issue #23252: [SPARK-26239] File-based secret key loading 
for SASL.
URL: https://github.com/apache/spark/pull/23252#issuecomment-446376235
 
 
   Oh I fixed it just now


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add configurable auth secret source in k8s backend
> --
>
> Key: SPARK-26239
> URL: https://issues.apache.org/jira/browse/SPARK-26239
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Matt Cheah
>Priority: Major
> Fix For: 3.0.0
>
>
> This is a follow up to SPARK-26194, which aims to add auto-generated secrets 
> similar to the YARN backend.
> There's a desire to support different ways to generate and propagate these 
> auth secrets (e.g. using things like Vault). Need to investigate:
> - exposing configuration to support that
> - changing SecurityManager so that it can delegate some of the 
> secret-handling logic to custom implementations
> - figuring out whether this can also be used in client-mode, where the driver 
> is not created by the k8s backend in Spark.
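
For context on the change that closed this ticket (file-based SASL secrets, PR #23252, quoted earlier in this archive): the merged patch reads the secret from a file, base64-encodes it, accepts it only with a k8s:// master, and requires the driver and executor files to be configured together. A configuration sketch, assuming the key strings bound to AUTH_SECRET_FILE_DRIVER / AUTH_SECRET_FILE_EXECUTOR are spark.authenticate.secret.driver.file and spark.authenticate.secret.executor.file:

    import org.apache.spark.SparkConf

    // Sketch only: key names are assumptions based on the config entries in the
    // merged diff; both files must be set, and both typically point at the same
    // secret mounted into the driver and executor pods.
    val conf = new SparkConf()
      .setMaster("k8s://https://kube-apiserver:6443")
      .set("spark.authenticate", "true")
      .set("spark.authenticate.secret.driver.file", "/mnt/secrets/spark-sasl/secret-key")
      .set("spark.authenticate.secret.executor.file", "/mnt/secrets/spark-sasl/secret-key")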



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26322) Simplify kafka delegation token sasl.mechanism configuration

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718062#comment-16718062
 ] 

ASF GitHub Bot commented on SPARK-26322:


AmplabJenkins commented on issue #23274: [SPARK-26322][SS] Add 
spark.kafka.token.sasl.mechanism to ease delegation token configuration.
URL: https://github.com/apache/spark/pull/23274#issuecomment-446376257
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99986/
   Test FAILed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Simplify kafka delegation token sasl.mechanism configuration
> 
>
> Key: SPARK-26322
> URL: https://issues.apache.org/jira/browse/SPARK-26322
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> When a Kafka delegation token is obtained, a SCRAM sasl.mechanism has to be 
> configured for authentication. Today this must be configured on each related 
> source/sink, which is inconvenient from the user's perspective. Such 
> granularity is not required, and the setting can be provided through one 
> central parameter.
> Kafka currently supports two SCRAM-related sasl.mechanism values:
> - SCRAM-SHA-256
> - SCRAM-SHA-512
> These are configured on the brokers, which makes the setting effectively global.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26239) Add configurable auth secret source in k8s backend

2018-12-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26239:
--

Assignee: Matt Cheah

> Add configurable auth secret source in k8s backend
> --
>
> Key: SPARK-26239
> URL: https://issues.apache.org/jira/browse/SPARK-26239
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Matt Cheah
>Priority: Major
> Fix For: 3.0.0
>
>
> This is a follow up to SPARK-26194, which aims to add auto-generated secrets 
> similar to the YARN backend.
> There's a desire to support different ways to generate and propagate these 
> auth secrets (e.g. using things like Vault). Need to investigate:
> - exposing configuration to support that
> - changing SecurityManager so that it can delegate some of the 
> secret-handling logic to custom implementations
> - figuring out whether this can also be used in client-mode, where the driver 
> is not created by the k8s backend in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26239) Add configurable auth secret source in k8s backend

2018-12-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26239.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23252
[https://github.com/apache/spark/pull/23252]

> Add configurable auth secret source in k8s backend
> --
>
> Key: SPARK-26239
> URL: https://issues.apache.org/jira/browse/SPARK-26239
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Matt Cheah
>Priority: Major
> Fix For: 3.0.0
>
>
> This is a follow up to SPARK-26194, which aims to add auto-generated secrets 
> similar to the YARN backend.
> There's a desire to support different ways to generate and propagate these 
> auth secrets (e.g. using things like Vault). Need to investigate:
> - exposing configuration to support that
> - changing SecurityManager so that it can delegate some of the 
> secret-handling logic to custom implementations
> - figuring out whether this can also be used in client-mode, where the driver 
> is not created by the k8s backend in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26322) Simplify kafka delegation token sasl.mechanism configuration

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718058#comment-16718058
 ] 

ASF GitHub Bot commented on SPARK-26322:


SparkQA commented on issue #23274: [SPARK-26322][SS] Add 
spark.kafka.token.sasl.mechanism to ease delegation token configuration.
URL: https://github.com/apache/spark/pull/23274#issuecomment-446375942
 
 
   **[Test build #99986 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99986/testReport)**
 for PR 23274 at commit 
[`de35aa2`](https://github.com/apache/spark/commit/de35aa2479400c11a48658501a838bebd83553cd).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Simplify kafka delegation token sasl.mechanism configuration
> 
>
> Key: SPARK-26322
> URL: https://issues.apache.org/jira/browse/SPARK-26322
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> When a Kafka delegation token is obtained, a SCRAM sasl.mechanism has to be 
> configured for authentication. Today this must be configured on each related 
> source/sink, which is inconvenient from the user's perspective. Such 
> granularity is not required, and the setting can be provided through one 
> central parameter.
> Kafka currently supports two SCRAM-related sasl.mechanism values:
> - SCRAM-SHA-256
> - SCRAM-SHA-512
> These are configured on the brokers, which makes the setting effectively global.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718056#comment-16718056
 ] 

ASF GitHub Bot commented on SPARK-26203:


dongjoon-hyun commented on a change in pull request #23291: [SPARK-26203][SQL] 
Benchmark performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291#discussion_r240804589
 
 

 ##
 File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala
 ##
 @@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.catalyst.expressions.In
+import org.apache.spark.sql.catalyst.expressions.InSet
+import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
+import org.apache.spark.sql.functions.{array, struct}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+
+/**
+ * A benchmark that compares the performance of [[In]] and [[InSet]] 
expressions.
+ *
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class  
+ *   2. build/sbt "sql/test:runMain "
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt 
"sql/test:runMain "
+ *  Results will be written to "benchmarks/InSetBenchmark-results.txt".
+ * }}}
+ */
+object InSetBenchmark extends SqlBasedBenchmark {
+
+  import spark.implicits._
+
+  def byteBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark 
= {
+val name = s"$numItems bytes"
+val values = (1 to numItems).map(v => s"CAST($v AS tinyint)")
+val df = spark.range(1, numRows).select($"id".cast(ByteType))
+benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def shortBenchmark(numItems: Int, numRows: Long, minNumIters: Int): 
Benchmark = {
+val name = s"$numItems shorts"
+val values = (1 to numItems).map(v => s"CAST($v AS smallint)")
+val df = spark.range(1, numRows).select($"id".cast(ShortType))
+benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def intBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark 
= {
+val name = s"$numItems ints"
+val values = 1 to numItems
+val df = spark.range(1, numRows).select($"id".cast(IntegerType))
+benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def longBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark 
= {
+val name = s"$numItems longs"
+val values = (1 to numItems).map(v => s"${v}L")
+val df = spark.range(1, numRows).toDF("id")
+benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def floatBenchmark(numItems: Int, numRows: Long, minNumIters: Int): 
Benchmark = {
+val name = s"$numItems floats"
+val values = (1 to numItems).map(v => s"CAST($v AS float)")
+val df = spark.range(1, numRows).select($"id".cast(FloatType))
+benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def doubleBenchmark(numItems: Int, numRows: Long, minNumIters: Int): 
Benchmark = {
+val name = s"$numItems doubles"
+val values = 1.0 to numItems by 1.0
+val df = spark.range(1, numRows).select($"id".cast(DoubleType))
+benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def smallDecimalBenchmark(numItems: Int, numRows: Long, minNumIters: Int): 
Benchmark = {
+val name = s"$numItems small decimals"
+val values = (1 to numItems).map(v => s"CAST($v AS decimal(12, 1))")
+val df = spark.range(1, numRows).select($"id".cast(DecimalType(12, 1)))
+benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def largeDecimalBenchmark(numItems: Int, numRows: Long, minNumIters: Int): 
Benchmark = {
+val name = s"$numItems large decimals"
+val values = (1 to numItems).map(v => s"9223372036854775812.10539$v")
+val df = spark.range(1, numRows).select($"id".cast(DecimalType(30, 7)))
+benchmark(name, df, values, numRows, minNumIters)
+  }
+
+  def stringBenchmark(numItems: Int, numRows: Long, minNumIters: Int): 
Benchmark = {
+val name = s"$numItems strings"
+val values = (1 to numItems).map(n => s"'$n'")
+val df = spark.range(1, 

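For orientation, the difference the benchmark measures can be seen on a toy query. A sketch, assuming the usual OptimizeIn rule whose list-size threshold is spark.sql.optimizer.inSetConversionThreshold (default 10):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("in-vs-inset").getOrCreate()
    import spark.implicits._

    val df = spark.range(0, 1000000L).toDF("id")

    // A short value list stays as an In expression (a linear scan of the list);
    // a longer list is rewritten to InSet, which probes a hash set instead.
    df.filter($"id".isin(1, 2, 3)).explain(true)
    df.filter($"id".isin((1 to 100): _*)).explain(true)
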
[jira] [Commented] (SPARK-26239) Add configurable auth secret source in k8s backend

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718053#comment-16718053
 ] 

ASF GitHub Bot commented on SPARK-26239:


vanzin commented on a change in pull request #23252: [SPARK-26239] File-based 
secret key loading for SASL.
URL: https://github.com/apache/spark/pull/23252#discussion_r240802787
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/SecurityManager.scala
 ##
 @@ -367,11 +371,18 @@ private[spark] class SecurityManager(
 
   case _ =>
 require(sparkConf.contains(SPARK_AUTH_SECRET_CONF),
-  s"A secret key must be specified via the $SPARK_AUTH_SECRET_CONF 
config.")
+  s"A secret key must be specified via the $SPARK_AUTH_SECRET_CONF 
config")
 
 Review comment:
   Undo this change.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add configurable auth secret source in k8s backend
> --
>
> Key: SPARK-26239
> URL: https://issues.apache.org/jira/browse/SPARK-26239
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This is a follow up to SPARK-26194, which aims to add auto-generated secrets 
> similar to the YARN backend.
> There's a desire to support different ways to generate and propagate these 
> auth secrets (e.g. using things like Vault). Need to investigate:
> - exposing configuration to support that
> - changing SecurityManager so that it can delegate some of the 
> secret-handling logic to custom implementations
> - figuring out whether this can also be used in client-mode, where the driver 
> is not created by the k8s backend in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


