[jira] [Issue Comment Deleted] (SPARK-26335) Add an option for Dataset#show not to care about wide characters when padding them
[ https://issues.apache.org/jira/browse/SPARK-26335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keiji Yoshida updated SPARK-26335: -- Comment: was deleted (was: A screenshot of OASIS has been attached.) > Add an option for Dataset#show not to care about wide characters when padding > them > -- > > Key: SPARK-26335 > URL: https://issues.apache.org/jira/browse/SPARK-26335 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Keiji Yoshida >Priority: Major > Attachments: Screen Shot 2018-12-11 at 17.53.54.png > > > h2. Issue > https://issues.apache.org/jira/browse/SPARK-25108 makes Dataset#show care > about wide characters when padding them. That is useful for humans to read a > result of Dataset#show. On the other hand, that makes it impossible for > programs to parse a result of Dataset#show because each cell's length can be > different from its header's length. My company develops and manages a > Jupyter/Apache Zeppelin-like visualization tool named "OASIS" > ([https://databricks.com/session/oasis-collaborative-data-analysis-platform-using-apache-spark]). > On this application, a result of Dataset#show on a Scala or Python process > is parsed to visualize it as an HTML table format. (A screenshot of OASIS has > been attached to this ticket as a file named "Screen Shot 2018-12-11 at > 17.53.54.png".) > h2. Solution > Add an option for Dataset#show not to care about wide characters when padding > them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26335) Add an option for Dataset#show not to care about wide characters when padding them
[ https://issues.apache.org/jira/browse/SPARK-26335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keiji Yoshida updated SPARK-26335: -- Description: h2. Issue https://issues.apache.org/jira/browse/SPARK-25108 makes Dataset#show care about wide characters when padding them. That is useful for humans to read a result of Dataset#show. On the other hand, that makes it impossible for programs to parse a result of Dataset#show because each cell's length can be different from its header's length. My company develops and manages a Jupyter/Apache Zeppelin-like visualization tool named "OASIS" ([https://databricks.com/session/oasis-collaborative-data-analysis-platform-using-apache-spark]). On this application, a result of Dataset#show on a Scala or Python process is parsed to visualize it as an HTML table format. (A screenshot of OASIS has been attached to this ticket as a file named "Screen Shot 2018-12-11 at 17.53.54.png".) h2. Solution Add an option for Dataset#show not to care about wide characters when padding them. was: h2. Issue https://issues.apache.org/jira/browse/SPARK-25108 makes Dataset#show care about wide characters when padding them. That is useful for humans to read a result of Dataset#show. On the other hand, that makes it impossible for programs to parse a result of Dataset#show because each cell's length can be different from its header's length. My company develops and manages a Jupyter/Apache Zeppelin-like visualization tool named "OASIS" ([https://databricks.com/session/oasis-collaborative-data-analysis-platform-using-apache-spark]). On this application, a result of Dataset#show on a Scala or Python process is parsed to visualize it as an HTML table format. () h2. Solution Add an option for Dataset#show not to care about wide characters when padding them. > Add an option for Dataset#show not to care about wide characters when padding > them > -- > > Key: SPARK-26335 > URL: https://issues.apache.org/jira/browse/SPARK-26335 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Keiji Yoshida >Priority: Major > Attachments: Screen Shot 2018-12-11 at 17.53.54.png > > > h2. Issue > https://issues.apache.org/jira/browse/SPARK-25108 makes Dataset#show care > about wide characters when padding them. That is useful for humans to read a > result of Dataset#show. On the other hand, that makes it impossible for > programs to parse a result of Dataset#show because each cell's length can be > different from its header's length. My company develops and manages a > Jupyter/Apache Zeppelin-like visualization tool named "OASIS" > ([https://databricks.com/session/oasis-collaborative-data-analysis-platform-using-apache-spark]). > On this application, a result of Dataset#show on a Scala or Python process > is parsed to visualize it as an HTML table format. (A screenshot of OASIS has > been attached to this ticket as a file named "Screen Shot 2018-12-11 at > 17.53.54.png".) > h2. Solution > Add an option for Dataset#show not to care about wide characters when padding > them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26335) Add an option for Dataset#show not to care about wide characters when padding them
[ https://issues.apache.org/jira/browse/SPARK-26335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keiji Yoshida updated SPARK-26335: -- Description: h2. Issue https://issues.apache.org/jira/browse/SPARK-25108 makes Dataset#show care about wide characters when padding them. That is useful for humans to read a result of Dataset#show. On the other hand, that makes it impossible for programs to parse a result of Dataset#show because each cell's length can be different from its header's length. My company develops and manages a Jupyter/Apache Zeppelin-like visualization tool named "OASIS" ([https://databricks.com/session/oasis-collaborative-data-analysis-platform-using-apache-spark]). On this application, a result of Dataset#show on a Scala or Python process is parsed to visualize it as an HTML table format. () h2. Solution Add an option for Dataset#show not to care about wide characters when padding them. was: https://issues.apache.org/jira/browse/SPARK-25108 makes Dataset#show care about wide characters when padding them. That is useful for humans to read a result of Dataset#show. On the other hand, that makes it impossible for programs to parse a result of Dataset#show because each cell's length can be difference from its header's length. My company develops and manages a Jupyter/Apache Zeppelin-like visualization tool named "OASIS" ([https://databricks.com/session/oasis-collaborative-data-analysis-platform-using-apache-spark]). On this application, a result of Dataset#show is parsed to visualize it as an HTML table format. So, it is preferable to add an option for Dataset#show not to care about wide characters when padding them by adding a parameter such as "fixedColLength" to Dataset#show. > Add an option for Dataset#show not to care about wide characters when padding > them > -- > > Key: SPARK-26335 > URL: https://issues.apache.org/jira/browse/SPARK-26335 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Keiji Yoshida >Priority: Major > Attachments: Screen Shot 2018-12-11 at 17.53.54.png > > > h2. Issue > https://issues.apache.org/jira/browse/SPARK-25108 makes Dataset#show care > about wide characters when padding them. That is useful for humans to read a > result of Dataset#show. On the other hand, that makes it impossible for > programs to parse a result of Dataset#show because each cell's length can be > different from its header's length. My company develops and manages a > Jupyter/Apache Zeppelin-like visualization tool named "OASIS" > ([https://databricks.com/session/oasis-collaborative-data-analysis-platform-using-apache-spark]). > On this application, a result of Dataset#show on a Scala or Python process > is parsed to visualize it as an HTML table format. () > h2. Solution > Add an option for Dataset#show not to care about wide characters when padding > them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
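For illustration, the snippet below reproduces the parsing problem described in the ticket and sketches the kind of opt-out being requested. The {{padWideCharsAsTwo}} parameter is hypothetical and is not part of the Dataset#show API; everything else uses the standard Spark 2.4 API.
{code:scala}
import org.apache.spark.sql.SparkSession

object WideCharShowExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("wide-char-show").getOrCreate()
    import spark.implicits._

    // Since SPARK-25108, full-width characters such as the Japanese string below
    // count as two display columns, so the rendered cell can end up with a different
    // string length than its ASCII header -- which is what breaks fixed-width parsing.
    val df = Seq(("データ", 1), ("abc", 2)).toDF("name", "value")
    df.show()

    // Hypothetical opt-out proposed by this ticket (NOT an existing overload):
    // df.show(numRows = 20, truncate = 20, padWideCharsAsTwo = false)

    spark.stop()
  }
}
{code}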
[jira] [Commented] (SPARK-23674) Add Spark ML Listener for Tracking ML Pipeline Status
[ https://issues.apache.org/jira/browse/SPARK-23674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718527#comment-16718527 ] ASF GitHub Bot commented on SPARK-23674: HyukjinKwon opened a new pull request #23263: [SPARK-23674][ML] Adds Spark ML Events URL: https://github.com/apache/spark/pull/23263 ## What changes were proposed in this pull request? This PR proposes to add ML events so that other developers can track and add some actions for them. ## Introduction ML events (like SQL events) can be quite useful when people want to track and make some actions for corresponding ML operations. For instance, I have been working on integrating Apache Spark with [Apache Atlas](https://atlas.apache.org/QuickStart.html). With some custom changes with this PR, I can visualise ML pipeline as below: ![spark_ml_streaming_lineage](https://user-images.githubusercontent.com/6477701/49682779-394bca80-faf5-11e8-85b8-5fae28b784b3.png) Another good thing that might have to be considered is, that we can interact this with other SQL/Streaming events. For instance, where the input `Dataset` is originated. For instance, with current Apache Spark, I can visualise SQL operations as below: ![screen shot 2018-12-10 at 9 41 36 am](https://user-images.githubusercontent.com/6477701/49706269-d9bdfe00-fc5f-11e8-943a-3309d1856ba5.png) I think we can combine those existing lineages together to easily understand where the data comes and goes. Currently, ML side is a hole so the lineages can't be connected for the current Apache Spark .. To add up, I think it's not to mention how useful it is to track the SQL/Streaming operations. Likewise, I would like to propose ML events as well (as lowest stability `@Unstable` APIs for now - no guarantee about stability). ## Implementation Details ### Sends event (but not expose ML specific listener) **`mllib/src/main/scala/org/apache/spark/ml/events.scala`** ```scala @Unstable case class ...StartEvent(caller, input) @Unstable case class ...EndEvent(caller, output) object MLEvents { // Wrappers to send events: // def with...Event(body) = { // body() // SparkContext.getOrCreate().listenerBus.post(event) // } } ``` This way mimics both: **1. Catalog events (see `org/apache/spark/sql/catalyst/catalog/events.scala`)** - This allows a Catalog specific listener to be added `ExternalCatalogEventListener` - It's implemented in a way of wrapping whole `ExternalCatalog` named `ExternalCatalogWithListener` which delegates the operations to `ExternalCatalog` This is not quite possible in this case because most of instances (like `Pipeline`) will be directly created in most of cases. We might be able to do that via extending `ListenerBus` for all possible instances but IMHO it's too invasive. Also, exposing another ML specific listener sounds a bit too much at this stage. Therefore, I simply borrowed file name and structures here **2. SQL execution events (see `org/apache/spark/sql/execution/SQLExecution.scala`)** - Add an object that wraps a body to send events Current apporach is rather close to this. It has a `with...` wrapper to send events. I borrowed this approach to be consistent. ### Add `...Impl` methods to wrap each to send events **`mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala`** ```diff - def save(...) = { saveImpl(...) } + def save(...) = MLEvents.withSaveInstanceEvent { saveImpl(...) } def saveImpl(...): Unit = ... ``` Note that `saveImpl` was already implemented unlike other instances below. 
```diff - def load(...): T + def load(...): T = MLEvents.withLoadInstanceEvent { loadImpl(...) } + def loadImpl(...): T ``` **`mllib/src/main/scala/org/apache/spark/ml/Estimator.scala`** ```diff - def fit(...): Model + def fit(...): Model = MLEvents.withFitEvent { fitImpl(...) } + def fitImpl(...): Model ``` **`mllib/src/main/scala/org/apache/spark/ml/Transformer.scala`** ```diff - def transform(...): DataFrame + def transform(...): DataFrame = MLEvents.withTransformEvent { transformImpl(...) } + def transformImpl(...): DataFrame ``` This approach follows the existing pattern in ML, shown below: **1. `transform` and `transformImpl`** https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L202-L213 https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L191-L196
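To make the intent concrete, here is a rough sketch of how a downstream tool (for example an Atlas integration) could consume such events from the shared listener bus. The `FitStartEvent` case class and its fields are invented for illustration; the PR only sketches the real classes as `...StartEvent` / `...EndEvent` in `events.scala`.
```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

// Hypothetical event shape, for illustration only.
case class FitStartEvent(estimatorClass: String, inputSchema: String) extends SparkListenerEvent

// Because ML events would be posted on the same listener bus as SQL/streaming events,
// a plain SparkListener can pick them up via onOtherEvent and forward them to a
// lineage system.
class MLLineageListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case FitStartEvent(estimator, schema) =>
      println(s"fit() started on $estimator with input schema $schema")
    case _ => // not an ML event; ignore
  }
}

// Registered through the usual mechanism:
//   sparkContext.addSparkListener(new MLLineageListener())
```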
[jira] [Commented] (SPARK-23674) Add Spark ML Listener for Tracking ML Pipeline Status
[ https://issues.apache.org/jira/browse/SPARK-23674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718526#comment-16718526 ] ASF GitHub Bot commented on SPARK-23674: felixcheung closed pull request #23263: [SPARK-23674][ML] Adds Spark ML Events URL: https://github.com/apache/spark/pull/23263 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/mllib/src/main/scala/org/apache/spark/ml/Estimator.scala b/mllib/src/main/scala/org/apache/spark/ml/Estimator.scala index 1247882d6c1bd..a3c4db06862f8 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/Estimator.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/Estimator.scala @@ -65,7 +65,19 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage { * Fits a model to the input data. */ @Since("2.0.0") - def fit(dataset: Dataset[_]): M + def fit(dataset: Dataset[_]): M = MLEvents.withFitEvent(this, dataset) { +fitImpl(dataset) + } + + /** + * `fit()` handles events and then calls this method. Subclasses should override this + * method to implement the actual fiting a model to the input data. + */ + @Since("3.0.0") + protected def fitImpl(dataset: Dataset[_]): M = { +// Keep this default body for backward compatibility. +throw new UnsupportedOperationException("fitImpl is not implemented.") + } /** * Fits multiple models to the input data with multiple sets of parameters. diff --git a/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala b/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala index 103082b7b9766..1c781faff129e 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala @@ -132,7 +132,8 @@ class Pipeline @Since("1.4.0") ( * @return fitted pipeline */ @Since("2.0.0") - override def fit(dataset: Dataset[_]): PipelineModel = { + override def fit(dataset: Dataset[_]): PipelineModel = super.fit(dataset) + override protected def fitImpl(dataset: Dataset[_]): PipelineModel = { transformSchema(dataset.schema, logging = true) val theStages = $(stages) // Search for the last estimator. 
@@ -210,7 +211,7 @@ object Pipeline extends MLReadable[Pipeline] { /** Checked against metadata when loading model */ private val className = classOf[Pipeline].getName -override def load(path: String): Pipeline = { +override protected def loadImpl(path: String): Pipeline = { val (uid: String, stages: Array[PipelineStage]) = SharedReadWrite.load(className, sc, path) new Pipeline(uid).setStages(stages) } @@ -301,7 +302,8 @@ class PipelineModel private[ml] ( } @Since("2.0.0") - override def transform(dataset: Dataset[_]): DataFrame = { + override def transform(dataset: Dataset[_]): DataFrame = super.transform(dataset) + override protected def transformImpl(dataset: Dataset[_]): DataFrame = { transformSchema(dataset.schema, logging = true) stages.foldLeft(dataset.toDF)((cur, transformer) => transformer.transform(cur)) } @@ -344,7 +346,7 @@ object PipelineModel extends MLReadable[PipelineModel] { /** Checked against metadata when loading model */ private val className = classOf[PipelineModel].getName -override def load(path: String): PipelineModel = { +override protected def loadImpl(path: String): PipelineModel = { val (uid: String, stages: Array[PipelineStage]) = SharedReadWrite.load(className, sc, path) val transformers = stages map { case stage: Transformer => stage diff --git a/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala b/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala index d8f3dfa874439..3731ddae0160c 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala @@ -94,7 +94,10 @@ abstract class Predictor[ /** @group setParam */ def setPredictionCol(value: String): Learner = set(predictionCol, value).asInstanceOf[Learner] - override def fit(dataset: Dataset[_]): M = { + // Explictly call parent's load. Otherwise, MiMa complains. + override def fit(dataset: Dataset[_]): M = super.fit(dataset) + + override protected def fitImpl(dataset: Dataset[_]): M = { // This handles a few items such as schema validation. // Developers only need to implement train(). transformSchema(dataset.schema, logging = true) @@ -199,7 +202,8 @@ abstract class PredictionModel[FeaturesType, M <: PredictionModel[FeaturesType, * @param dataset input dataset * @return transformed dataset with [[predictionCol]] of type `Double` */ - override def transform(dataset: Dataset[_]): DataFrame = { +
[jira] [Comment Edited] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
[ https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718510#comment-16718510 ] Ayush Chauhan edited comment on SPARK-19185 at 12/12/18 6:52 AM: - Is this issue resolved because SPARK-22606 is still in progress? I am getting this error in spark 2.3.2 and spark-streaming-kafka-0-10. So this workaround is the solution to this problem? Is there any downside of setting this property? {code:java} spark.streaming.kafka.consumer.cache.enabled=false {code} was (Author: ayushchauhan): Is this issue resolved because [SPARK-22606|https://issues.apache.org/jira/browse/SPARK-22606] is still in progress? Because I am getting this error in spark 2.3.2 and spark-streaming-kafka-0-10. So this workaround is the solution of this problem? Is there any downside of setting this property? {code:java} spark.streaming.kafka.consumer.cache.enabled=false {code} > ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing > - > > Key: SPARK-19185 > URL: https://issues.apache.org/jira/browse/SPARK-19185 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 > Environment: Spark 2.0.2 > Spark Streaming Kafka 010 > Mesos 0.28.0 - client mode > spark.executor.cores 1 > spark.mesos.extra.cores 1 >Reporter: Kalvin Chau >Assignee: Gabor Somogyi >Priority: Major > Labels: streaming, windowing > Fix For: 2.4.0 > > > We've been running into ConcurrentModificationExcpetions "KafkaConsumer is > not safe for multi-threaded access" with the CachedKafkaConsumer. I've been > working through debugging this issue and after looking through some of the > spark source code I think this is a bug. > Our set up is: > Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using > Spark-Streaming-Kafka-010 > spark.executor.cores 1 > spark.mesos.extra.cores 1 > Batch interval: 10s, window interval: 180s, and slide interval: 30s > We would see the exception when in one executor there are two task worker > threads assigned the same Topic+Partition, but a different set of offsets. > They would both get the same CachedKafkaConsumer, and whichever task thread > went first would seek and poll for all the records, and at the same time the > second thread would try to seek to its offset but fail because it is unable > to acquire the lock. 
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y > Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z > Time1 E0 Task0 - Seeks and starts to poll > Time1 E0 Task1 - Attempts to seek, but fails > Here are some relevant logs: > {code} > 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394204414 -> 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394238058 -> 4394257712 > 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394204414 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: > Initial fetch for spark-executor-consumer test-topic 2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Seeking to test-topic-2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting > block rdd_199_2 failed due to an exception > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block > rdd_199_2 could not be removed as it was not found on disk or in memory > 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in > task 49.0 in stage 45.0 (TID 3201) > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at
[jira] [Comment Edited] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
[ https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718510#comment-16718510 ] Ayush Chauhan edited comment on SPARK-19185 at 12/12/18 6:52 AM: - Is this issue resolved because SPARK-22606 is still in progress? I am getting this error in spark 2.3.2 and spark-streaming-kafka-0-10. So this workaround is the solution to this problem? Is there any downside of setting this property? {code:java} spark.streaming.kafka.consumer.cache.enabled=false {code} was (Author: ayushchauhan): Is this issue resolved because SPARK-22606 is still in progress? I am getting this error in spark 2.3.2 and spark-streaming-kafka-0-10. So this workaround is the solution to this problem? Is there any downside of setting this property? {code:java} spark.streaming.kafka.consumer.cache.enabled=false {code} > ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing > - > > Key: SPARK-19185 > URL: https://issues.apache.org/jira/browse/SPARK-19185 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 > Environment: Spark 2.0.2 > Spark Streaming Kafka 010 > Mesos 0.28.0 - client mode > spark.executor.cores 1 > spark.mesos.extra.cores 1 >Reporter: Kalvin Chau >Assignee: Gabor Somogyi >Priority: Major > Labels: streaming, windowing > Fix For: 2.4.0 > > > We've been running into ConcurrentModificationExcpetions "KafkaConsumer is > not safe for multi-threaded access" with the CachedKafkaConsumer. I've been > working through debugging this issue and after looking through some of the > spark source code I think this is a bug. > Our set up is: > Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using > Spark-Streaming-Kafka-010 > spark.executor.cores 1 > spark.mesos.extra.cores 1 > Batch interval: 10s, window interval: 180s, and slide interval: 30s > We would see the exception when in one executor there are two task worker > threads assigned the same Topic+Partition, but a different set of offsets. > They would both get the same CachedKafkaConsumer, and whichever task thread > went first would seek and poll for all the records, and at the same time the > second thread would try to seek to its offset but fail because it is unable > to acquire the lock. 
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y > Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z > Time1 E0 Task0 - Seeks and starts to poll > Time1 E0 Task1 - Attempts to seek, but fails > Here are some relevant logs: > {code} > 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394204414 -> 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394238058 -> 4394257712 > 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394204414 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: > Initial fetch for spark-executor-consumer test-topic 2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Seeking to test-topic-2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting > block rdd_199_2 failed due to an exception > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block > rdd_199_2 could not be removed as it was not found on disk or in memory > 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in > task 49.0 in stage 45.0 (TID 3201) > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at
[jira] [Comment Edited] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
[ https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718510#comment-16718510 ] Ayush Chauhan edited comment on SPARK-19185 at 12/12/18 6:51 AM: - Is this issue resolved because [SPARK-22606|https://issues.apache.org/jira/browse/SPARK-22606] is still in progress? Because I am getting this error in spark 2.3.2 and spark-streaming-kafka-0-10. So this workaround is the solution of this problem? Is there any downside of setting this property? {code:java} spark.streaming.kafka.consumer.cache.enabled=false {code} was (Author: ayushchauhan): Is this issue resolved? Because I am getting this error in spark 2.3.2 and spark-streaming-kafka-0-10. So this workaround is the solution of this problem? Is there any downside of setting this property? {code:java} spark.streaming.kafka.consumer.cache.enabled=false {code} > ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing > - > > Key: SPARK-19185 > URL: https://issues.apache.org/jira/browse/SPARK-19185 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 > Environment: Spark 2.0.2 > Spark Streaming Kafka 010 > Mesos 0.28.0 - client mode > spark.executor.cores 1 > spark.mesos.extra.cores 1 >Reporter: Kalvin Chau >Assignee: Gabor Somogyi >Priority: Major > Labels: streaming, windowing > Fix For: 2.4.0 > > > We've been running into ConcurrentModificationExcpetions "KafkaConsumer is > not safe for multi-threaded access" with the CachedKafkaConsumer. I've been > working through debugging this issue and after looking through some of the > spark source code I think this is a bug. > Our set up is: > Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using > Spark-Streaming-Kafka-010 > spark.executor.cores 1 > spark.mesos.extra.cores 1 > Batch interval: 10s, window interval: 180s, and slide interval: 30s > We would see the exception when in one executor there are two task worker > threads assigned the same Topic+Partition, but a different set of offsets. > They would both get the same CachedKafkaConsumer, and whichever task thread > went first would seek and poll for all the records, and at the same time the > second thread would try to seek to its offset but fail because it is unable > to acquire the lock. 
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y > Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z > Time1 E0 Task0 - Seeks and starts to poll > Time1 E0 Task1 - Attempts to seek, but fails > Here are some relevant logs: > {code} > 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394204414 -> 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394238058 -> 4394257712 > 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394204414 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: > Initial fetch for spark-executor-consumer test-topic 2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Seeking to test-topic-2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting > block rdd_199_2 failed due to an exception > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block > rdd_199_2 could not be removed as it was not found on disk or in memory > 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in > task 49.0 in stage 45.0 (TID 3201) > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at
[jira] [Commented] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
[ https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718510#comment-16718510 ] Ayush Chauhan commented on SPARK-19185: --- Is this issue resolved? Because I am getting this error in spark 2.3.2 and spark-streaming-kafka-0-10. So this workaround is the solution of this problem? Is there any downside of setting this property? {code:java} spark.streaming.kafka.consumer.cache.enabled=false {code} > ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing > - > > Key: SPARK-19185 > URL: https://issues.apache.org/jira/browse/SPARK-19185 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 > Environment: Spark 2.0.2 > Spark Streaming Kafka 010 > Mesos 0.28.0 - client mode > spark.executor.cores 1 > spark.mesos.extra.cores 1 >Reporter: Kalvin Chau >Assignee: Gabor Somogyi >Priority: Major > Labels: streaming, windowing > Fix For: 2.4.0 > > > We've been running into ConcurrentModificationExcpetions "KafkaConsumer is > not safe for multi-threaded access" with the CachedKafkaConsumer. I've been > working through debugging this issue and after looking through some of the > spark source code I think this is a bug. > Our set up is: > Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using > Spark-Streaming-Kafka-010 > spark.executor.cores 1 > spark.mesos.extra.cores 1 > Batch interval: 10s, window interval: 180s, and slide interval: 30s > We would see the exception when in one executor there are two task worker > threads assigned the same Topic+Partition, but a different set of offsets. > They would both get the same CachedKafkaConsumer, and whichever task thread > went first would seek and poll for all the records, and at the same time the > second thread would try to seek to its offset but fail because it is unable > to acquire the lock. 
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y > Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z > Time1 E0 Task0 - Seeks and starts to poll > Time1 E0 Task1 - Attempts to seek, but fails > Here are some relevant logs: > {code} > 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394204414 -> 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394238058 -> 4394257712 > 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394204414 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: > Initial fetch for spark-executor-consumer test-topic 2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Seeking to test-topic-2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting > block rdd_199_2 failed due to an exception > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block > rdd_199_2 could not be removed as it was not found on disk or in memory > 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in > task 49.0 in stage 45.0 (TID 3201) > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951) > at >
[jira] [Commented] (SPARK-26265) deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator
[ https://issues.apache.org/jira/browse/SPARK-26265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718438#comment-16718438 ] ASF GitHub Bot commented on SPARK-26265: viirya opened a new pull request #23294: [SPARK-26265][Core][Followup] Put freePage into a finally block URL: https://github.com/apache/spark/pull/23294 ## What changes were proposed in this pull request? Based on the [comment](https://github.com/apache/spark/pull/23272#discussion_r240735509), it seems to be better to put `freePage` into a `finally` block. This patch as a follow-up to do so. ## How was this patch tested? Existing tests. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator > -- > > Key: SPARK-26265 > URL: https://issues.apache.org/jira/browse/SPARK-26265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: qian han >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.1, 3.0.0 > > > The application is running on a cluster with 72000 cores and 182000G mem. > Enviroment: > |spark.dynamicAllocation.minExecutors|5| > |spark.dynamicAllocation.initialExecutors|30| > |spark.dynamicAllocation.maxExecutors|400| > |spark.executor.cores|4| > |spark.executor.memory|20g| > > > Stage description: > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:364) > org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:422) > org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:357) > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:193) > > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > java.lang.reflect.Method.invoke(Method.java:498) > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894) > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198) > org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228) > org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) > org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > > jstack information as follow: > Found one Java-level deadlock: = > "Thread-ScriptTransformation-Feed": waiting to lock monitor > 0x00e0cb18 (object 0x0002f1641538, a > org.apache.spark.memory.TaskMemoryManager), which is held by "Executor task > launch worker for task 18899" "Executor task launch worker for task 18899": > waiting to lock monitor 0x00e09788 (object 0x000302faa3b0, a > org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator), which is held by > "Thread-ScriptTransformation-Feed" Java stack information for the threads > listed above: === > "Thread-ScriptTransformation-Feed": at > org.apache.spark.memory.TaskMemoryManager.freePage(TaskMemoryManager.java:332) > - waiting to lock <0x0002f1641538> (a > org.apache.spark.memory.TaskMemoryManager) at > org.apache.spark.memory.MemoryConsumer.freePage(MemoryConsumer.java:130) at > 
org.apache.spark.unsafe.map.BytesToBytesMap.access$300(BytesToBytesMap.java:66) > at > org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.advanceToNextPage(BytesToBytesMap.java:274) > - locked <0x000302faa3b0> (a > org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator) at > org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.next(BytesToBytesMap.java:313) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap$1.next(UnsafeFixedWidthAggregationMap.java:173) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at > scala.collection.Iterator$class.foreach(Iterator.scala:893) at >
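The follow-up PR referenced above applies the standard try/finally resource pattern. The sketch below is illustrative only (the {{PageHolder}} type is hypothetical, not the actual BytesToBytesMap code): the point is that the page is released exactly once even if advancing to the next page throws.
{code:scala}
// Hypothetical stand-in for a memory page; illustration only.
final class PageHolder(val id: Int) {
  def free(): Unit = println(s"freed page $id")
}

def advanceToNextPage(pageToFree: PageHolder)(advance: => Unit): Unit = {
  try {
    advance                 // locate/map the next page; may throw
  } finally {
    if (pageToFree != null) {
      pageToFree.free()     // runs on both the success and the failure path
    }
  }
}
{code}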
[jira] [Commented] (SPARK-26327) Metrics in FileSourceScanExec not update correctly while relation.partitionSchema is set
[ https://issues.apache.org/jira/browse/SPARK-26327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718428#comment-16718428 ] ASF GitHub Bot commented on SPARK-26327: dongjoon-hyun closed pull request #23287: [SPARK-26327][SQL][BACKPORT-2.4] Bug fix for `FileSourceScanExec` metrics update URL: https://github.com/apache/spark/pull/23287 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala index 36ed016773b67..5433c30afd6bb 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala @@ -185,19 +185,14 @@ case class FileSourceScanExec( partitionSchema = relation.partitionSchema, relation.sparkSession.sessionState.conf) + private var metadataTime = 0L + @transient private lazy val selectedPartitions: Seq[PartitionDirectory] = { val optimizerMetadataTimeNs = relation.location.metadataOpsTimeNs.getOrElse(0L) val startTime = System.nanoTime() val ret = relation.location.listFiles(partitionFilters, dataFilters) val timeTakenMs = ((System.nanoTime() - startTime) + optimizerMetadataTimeNs) / 1000 / 1000 - -metrics("numFiles").add(ret.map(_.files.size.toLong).sum) -metrics("metadataTime").add(timeTakenMs) - -val executionId = sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY) -SQLMetrics.postDriverMetricUpdates(sparkContext, executionId, - metrics("numFiles") :: metrics("metadataTime") :: Nil) - +metadataTime = timeTakenMs ret } @@ -308,6 +303,8 @@ case class FileSourceScanExec( } private lazy val inputRDD: RDD[InternalRow] = { +// Update metrics for taking effect in both code generation node and normal node. +updateDriverMetrics() val readFile: (PartitionedFile) => Iterator[InternalRow] = relation.fileFormat.buildReaderWithPartitionValues( sparkSession = relation.sparkSession, @@ -524,6 +521,19 @@ case class FileSourceScanExec( } } + /** + * Send the updated metrics to driver, while this function calling, selectedPartitions has + * been initialized. See SPARK-26327 for more detail. 
+ */ + private def updateDriverMetrics() = { +metrics("numFiles").add(selectedPartitions.map(_.files.size.toLong).sum) +metrics("metadataTime").add(metadataTime) + +val executionId = sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY) +SQLMetrics.postDriverMetricUpdates(sparkContext, executionId, + metrics("numFiles") :: metrics("metadataTime") :: Nil) + } + override def doCanonicalize(): FileSourceScanExec = { FileSourceScanExec( relation, diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala index 085a445488480..c550bf20b92b5 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala @@ -570,4 +570,19 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared } } } + + test("SPARK-26327: FileSourceScanExec metrics") { +withTable("testDataForScan") { + spark.range(10).selectExpr("id", "id % 3 as p") +.write.partitionBy("p").saveAsTable("testDataForScan") + // The execution plan only has 1 FileScan node. + val df = spark.sql( +"SELECT * FROM testDataForScan WHERE p = 1") + testSparkPlanMetrics(df, 1, Map( +0L -> (("Scan parquet default.testdataforscan", Map( + "number of output rows" -> 3L, + "number of files" -> 2L + ) +} + } } This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Metrics in FileSourceScanExec not update correctly while > relation.partitionSchema is set > > > Key: SPARK-26327 > URL: https://issues.apache.org/jira/browse/SPARK-26327 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuanjian Li
[jira] [Resolved] (SPARK-26333) FsHistoryProviderSuite failed because setReadable doesn't work in RedHat
[ https://issues.apache.org/jira/browse/SPARK-26333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26333. Resolution: Not A Bug > FsHistoryProviderSuite failed because setReadable doesn't work in RedHat > > > Key: SPARK-26333 > URL: https://issues.apache.org/jira/browse/SPARK-26333 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: deshanxiao >Priority: Major > > FsHistoryProviderSuite failed in case "SPARK-3697: ignore files that cannot > be read.". I try to invoke logFile2.canRead after invoking > "setReadable(false, false)" . And I find that the result of > "logFile2.canRead" is true but in my ubuntu16.04 return false. > The environment: > RedHat: > Linux version 3.10.0-693.2.2.el7.x86_64 (buil...@kbuilder.dev.centos.org) > (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Sep 12 > 22:26:13 UTC 2017 > JDK > Java version: 1.8.0_151, vendor: Oracle Corporation > {code:java} > org.scalatest.exceptions.TestFailedException: 2 was not equal to 1 > at > org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > at > org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) > at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:183) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:182) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$apache$spark$deploy$history$FsHistoryProviderSuite$$updateAndCheck(FsHistoryProviderSuite.scala:841) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:182) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > {code} -- This message was sent by 
Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26332) Spark sql write orc table on viewFS throws exception
[ https://issues.apache.org/jira/browse/SPARK-26332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718393#comment-16718393 ] Bang Xiao commented on SPARK-26332: --- [~joshrosen] Hi, I found you maintain the branch of hive1.2.1-spark2, Could you please check this? > Spark sql write orc table on viewFS throws exception > > > Key: SPARK-26332 > URL: https://issues.apache.org/jira/browse/SPARK-26332 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.1 >Reporter: Bang Xiao >Priority: Major > > Using SparkSQL write orc table on viewFs will cause exception: > {code:java} > Task failed while writing rows. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.hadoop.fs.viewfs.NotInMountpointException: > getDefaultReplication on empty path is invalid > at > org.apache.hadoop.fs.viewfs.ViewFileSystem.getDefaultReplication(ViewFileSystem.java:634) > at org.apache.hadoop.hive.ql.io.orc.WriterImpl.getStream(WriterImpl.java:2103) > at > org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:2120) > at > org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:352) > at > org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168) > at > org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157) > at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2413) > at > org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:86) > at > org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:392) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272) > ... 
8 more > Suppressed: org.apache.hadoop.fs.viewfs.NotInMountpointException: > getDefaultReplication on empty path is invalid > at > org.apache.hadoop.fs.viewfs.ViewFileSystem.getDefaultReplication(ViewFileSystem.java:634) > at org.apache.hadoop.hive.ql.io.orc.WriterImpl.getStream(WriterImpl.java:2103) > at > org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:2120) > at org.apache.hadoop.hive.ql.io.orc.WriterImpl.close(WriterImpl.java:2425) > at > org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.close(OrcOutputFormat.java:106) > at > org.apache.spark.sql.hive.execution.HiveOutputWriter.close(HiveFileFormat.scala:154) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$1.apply$mcV$sp(FileFormatWriter.scala:275) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1423) > ... 9 more{code} > this exception can be reproduced by follow sqls: > {code:java} > spark-sql> CREATE EXTERNAL TABLE test_orc(test_id INT, test_age INT, > test_rank INT) STORED AS ORC LOCATION > 'viewfs://nsX/user/hive/warehouse/ultraman_tmp.db/test_orc'; > spark-sql> CREATE TABLE source(id INT, age INT, rank INT); > spark-sql> INSERT INTO source VALUES(1,1,1); > spark-sql> INSERT OVERWRITE TABLE test_orc SELECT * FROM source; > {code} > this is related to https://issues.apache.org/jira/browse/HIVE-10790. and > resolved
[jira] [Commented] (SPARK-26333) FsHistoryProviderSuite failed because setReadable doesn't work in RedHat
[ https://issues.apache.org/jira/browse/SPARK-26333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718397#comment-16718397 ] Marcelo Vanzin commented on SPARK-26333: {{setReadable}} works as root. But being root, you can read anything, regardless of permissions. > FsHistoryProviderSuite failed because setReadable doesn't work in RedHat > > > Key: SPARK-26333 > URL: https://issues.apache.org/jira/browse/SPARK-26333 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: deshanxiao >Priority: Major > > FsHistoryProviderSuite failed in case "SPARK-3697: ignore files that cannot > be read.". I try to invoke logFile2.canRead after invoking > "setReadable(false, false)" . And I find that the result of > "logFile2.canRead" is true but in my ubuntu16.04 return false. > The environment: > RedHat: > Linux version 3.10.0-693.2.2.el7.x86_64 (buil...@kbuilder.dev.centos.org) > (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Sep 12 > 22:26:13 UTC 2017 > JDK > Java version: 1.8.0_151, vendor: Oracle Corporation > {code:java} > org.scalatest.exceptions.TestFailedException: 2 was not equal to 1 > at > org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > at > org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) > at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:183) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:182) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$apache$spark$deploy$history$FsHistoryProviderSuite$$updateAndCheck(FsHistoryProviderSuite.scala:841) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:182) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at 
scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
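The comment above is worth illustrating concretely, since the quoted test relies on canRead returning false after setReadable(false, false). The following is a minimal, self-contained sketch (a hypothetical temp file, not the suite's logFile2): run as a regular user it prints canRead = false, but run as root it prints canRead = true, because root bypasses POSIX permission checks. That is exactly the situation the patch later in this thread guards against with assume(!logFile2.canRead).

{code:scala}
import java.io.File

object SetReadableCheck {
  def main(args: Array[String]): Unit = {
    // Create a scratch file and revoke read permission for everyone.
    val f = File.createTempFile("perm-check", ".log")
    val changed = f.setReadable(false, false)
    println(s"setReadable(false, false) returned: $changed")
    // Prints false for a regular user, true when the JVM runs as root.
    println(s"canRead: ${f.canRead}")
    f.delete()
  }
}
{code}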
[jira] [Commented] (SPARK-26343) Speed up running the kubernetes integration tests locally
[ https://issues.apache.org/jira/browse/SPARK-26343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718396#comment-16718396 ] Marcelo Vanzin commented on SPARK-26343: You can do that in sbt ("kubernetes-integration-tests/test"), but expanding that to maven would be good too. > Speed up running the kubernetes integration tests locally > - > > Key: SPARK-26343 > URL: https://issues.apache.org/jira/browse/SPARK-26343 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.0.0 >Reporter: holdenk >Assignee: holdenk >Priority: Trivial > > The Kubernetes integration tests right now allow you to specify a docker tag > but even when you do it also requires a tgz to extract, but then it doesn't > really need that extracted version. We could make it easier/faster for folks > to run the integration tests locally by not requiring a distribution tar ball. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26342) Support for NFS mount for Kubernetes
Eric Carlson created SPARK-26342: Summary: Support for NFS mount for Kubernetes Key: SPARK-26342 URL: https://issues.apache.org/jira/browse/SPARK-26342 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.0 Reporter: Eric Carlson Currently only hostPath, emptyDir, and PVC volume types are accepted for Kubernetes-deployed drivers and executors. The ability to mount NFS paths would allow access to a common and easy-to-deploy shared storage solution. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
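A rough sketch of what driver-side configuration could look like if NFS were added as a volume type: the option names below are hypothetical, merely following the existing spark.kubernetes.{driver|executor}.volumes.[type].[name].* pattern already used for hostPath, emptyDir, and PVC; the actual keys would be defined by whatever change resolves this ticket.

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical NFS volume options (e.g. pasted into spark-shell or passed via --conf);
// shown only to illustrate how an NFS type could slot into the current volume pattern.
val conf = new SparkConf()
  .set("spark.kubernetes.driver.volumes.nfs.spark-data.mount.path", "/data")
  .set("spark.kubernetes.driver.volumes.nfs.spark-data.mount.readOnly", "false")
  .set("spark.kubernetes.driver.volumes.nfs.spark-data.options.server", "nfs.example.com")
  .set("spark.kubernetes.driver.volumes.nfs.spark-data.options.path", "/exports/spark")
{code}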
[jira] [Created] (SPARK-26344) Support for flexVolume mount for Kubernetes
Eric Carlson created SPARK-26344: Summary: Support for flexVolume mount for Kubernetes Key: SPARK-26344 URL: https://issues.apache.org/jira/browse/SPARK-26344 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.0 Reporter: Eric Carlson Currently only hostPath, emptyDir, and PVC volume types are accepted for Kubernetes-deployed drivers and executors. flexVolume types allow for pluggable volume drivers to be used in Kubernetes - a widely used example of this is the Rook deployment of CephFS, which provides a POSIX-compliant distributed filesystem integrated into K8s. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26333) FsHistoryProviderSuite failed because setReadable doesn't work in RedHat
[ https://issues.apache.org/jira/browse/SPARK-26333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718387#comment-16718387 ] ASF GitHub Bot commented on SPARK-26333: deshanxiao closed pull request #23282: [SPARK-26333][TEST]FsHistoryProviderSuite failed because setReadable doesn't work in RedHat URL: https://github.com/apache/spark/pull/23282 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala b/core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala index c1ae27aa940f6..e376ce85be345 100644 --- a/core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala @@ -176,6 +176,7 @@ class FsHistoryProviderSuite extends SparkFunSuite with BeforeAndAfter with Matc SparkListenerApplicationEnd(2L) ) logFile2.setReadable(false, false) +assume(!logFile2.canRead) updateAndCheck(provider) { list => list.size should be (1) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > FsHistoryProviderSuite failed because setReadable doesn't work in RedHat > > > Key: SPARK-26333 > URL: https://issues.apache.org/jira/browse/SPARK-26333 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: deshanxiao >Priority: Major > > FsHistoryProviderSuite failed in case "SPARK-3697: ignore files that cannot > be read.". I try to invoke logFile2.canRead after invoking > "setReadable(false, false)" . And I find that the result of > "logFile2.canRead" is true but in my ubuntu16.04 return false. 
> The environment: > RedHat: > Linux version 3.10.0-693.2.2.el7.x86_64 (buil...@kbuilder.dev.centos.org) > (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Sep 12 > 22:26:13 UTC 2017 > JDK > Java version: 1.8.0_151, vendor: Oracle Corporation > {code:java} > org.scalatest.exceptions.TestFailedException: 2 was not equal to 1 > at > org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > at > org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) > at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:183) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:182) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$apache$spark$deploy$history$FsHistoryProviderSuite$$updateAndCheck(FsHistoryProviderSuite.scala:841) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:182) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at
[jira] [Created] (SPARK-26343) Running the kubernetes
holdenk created SPARK-26343: --- Summary: Running the kubernetes Key: SPARK-26343 URL: https://issues.apache.org/jira/browse/SPARK-26343 Project: Spark Issue Type: Improvement Components: Kubernetes, Tests Affects Versions: 3.0.0 Reporter: holdenk Assignee: holdenk The Kubernetes integration tests right now allow you to specify a docker tag but even when you do it also requires a tgz to extract, but then it doesn't really need that extracted version. We could make it easier/faster for folks to run the integration tests locally by not requiring a distribution tar ball. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26343) Speed up running the kubernetes integration tests locally
[ https://issues.apache.org/jira/browse/SPARK-26343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-26343: Summary: Speed up running the kubernetes integration tests locally (was: Running the kubernetes ) > Speed up running the kubernetes integration tests locally > - > > Key: SPARK-26343 > URL: https://issues.apache.org/jira/browse/SPARK-26343 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.0.0 >Reporter: holdenk >Assignee: holdenk >Priority: Trivial > > The Kubernetes integration tests right now allow you to specify a docker tag > but even when you do it also requires a tgz to extract, but then it doesn't > really need that extracted version. We could make it easier/faster for folks > to run the integration tests locally by not requiring a distribution tar ball. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26333) FsHistoryProviderSuite failed because setReadable doesn't work in RedHat
[ https://issues.apache.org/jira/browse/SPARK-26333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718385#comment-16718385 ] deshanxiao commented on SPARK-26333: [~vanzin] Yes, you are right! Thank you very much! But why setReadable doesn't work as root? > FsHistoryProviderSuite failed because setReadable doesn't work in RedHat > > > Key: SPARK-26333 > URL: https://issues.apache.org/jira/browse/SPARK-26333 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: deshanxiao >Priority: Major > > FsHistoryProviderSuite failed in case "SPARK-3697: ignore files that cannot > be read.". I try to invoke logFile2.canRead after invoking > "setReadable(false, false)" . And I find that the result of > "logFile2.canRead" is true but in my ubuntu16.04 return false. > The environment: > RedHat: > Linux version 3.10.0-693.2.2.el7.x86_64 (buil...@kbuilder.dev.centos.org) > (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Sep 12 > 22:26:13 UTC 2017 > JDK > Java version: 1.8.0_151, vendor: Oracle Corporation > {code:java} > org.scalatest.exceptions.TestFailedException: 2 was not equal to 1 > at > org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > at > org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) > at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:183) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:182) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$apache$spark$deploy$history$FsHistoryProviderSuite$$updateAndCheck(FsHistoryProviderSuite.scala:841) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:182) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:381) > at 
org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26193) Implement shuffle write metrics in SQL
[ https://issues.apache.org/jira/browse/SPARK-26193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718368#comment-16718368 ] ASF GitHub Bot commented on SPARK-26193: asfgit closed pull request #23286: [SPARK-26193][SQL][Follow Up] Read metrics rename and display text changes URL: https://github.com/apache/spark/pull/23286 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala b/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala index 2a8d1dd995e27..35664ff515d4b 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala @@ -92,7 +92,7 @@ private[spark] class ShuffleMapTask( threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime } else 0L -dep.shuffleWriterProcessor.writeProcess(rdd, dep, partitionId, context, partition) +dep.shuffleWriterProcessor.write(rdd, dep, partitionId, context, partition) } override def preferredLocations: Seq[TaskLocation] = preferredLocs diff --git a/core/src/main/scala/org/apache/spark/shuffle/ShuffleWriteProcessor.scala b/core/src/main/scala/org/apache/spark/shuffle/ShuffleWriteProcessor.scala index f5213157a9a85..5b0c7e9f2b0b4 100644 --- a/core/src/main/scala/org/apache/spark/shuffle/ShuffleWriteProcessor.scala +++ b/core/src/main/scala/org/apache/spark/shuffle/ShuffleWriteProcessor.scala @@ -41,7 +41,7 @@ private[spark] class ShuffleWriteProcessor extends Serializable with Logging { * get from [[ShuffleManager]] and triggers rdd compute, finally return the [[MapStatus]] for * this task. */ - def writeProcess( + def write( rdd: RDD[_], dep: ShuffleDependency[_, _, _], partitionId: Int, diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/ShuffledRowRDD.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/ShuffledRowRDD.scala index 9b05faaed0459..079ff25fcb67e 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/ShuffledRowRDD.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/ShuffledRowRDD.scala @@ -22,7 +22,7 @@ import java.util.Arrays import org.apache.spark._ import org.apache.spark.rdd.RDD import org.apache.spark.sql.catalyst.InternalRow -import org.apache.spark.sql.execution.metric.{SQLMetric, SQLShuffleMetricsReporter} +import org.apache.spark.sql.execution.metric.{SQLMetric, SQLShuffleReadMetricsReporter} /** * The [[Partition]] used by [[ShuffledRowRDD]]. A post-shuffle partition @@ -157,9 +157,9 @@ class ShuffledRowRDD( override def compute(split: Partition, context: TaskContext): Iterator[InternalRow] = { val shuffledRowPartition = split.asInstanceOf[ShuffledRowRDDPartition] val tempMetrics = context.taskMetrics().createTempShuffleReadMetrics() -// `SQLShuffleMetricsReporter` will update its own metrics for SQL exchange operator, +// `SQLShuffleReadMetricsReporter` will update its own metrics for SQL exchange operator, // as well as the `tempMetrics` for basic shuffle metrics. 
-val sqlMetricsReporter = new SQLShuffleMetricsReporter(tempMetrics, metrics) +val sqlMetricsReporter = new SQLShuffleReadMetricsReporter(tempMetrics, metrics) // The range of pre-shuffle partitions that we are fetching at here is // [startPreShufflePartitionIndex, endPreShufflePartitionIndex - 1]. val reader = diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala index 0c2020572e721..da7b0c6f43fbc 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala @@ -31,7 +31,7 @@ import org.apache.spark.sql.catalyst.expressions.{Attribute, BoundReference, Uns import org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering import org.apache.spark.sql.catalyst.plans.physical._ import org.apache.spark.sql.execution._ -import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics, SQLShuffleMetricsReporter, SQLShuffleWriteMetricsReporter} +import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics, SQLShuffleReadMetricsReporter, SQLShuffleWriteMetricsReporter} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.StructType import org.apache.spark.util.MutablePair @@ -50,7 +50,7 @@ case class ShuffleExchangeExec( private lazy val writeMetrics =
[jira] [Issue Comment Deleted] (SPARK-26333) FsHistoryProviderSuite failed because setReadable doesn't work in RedHat
[ https://issues.apache.org/jira/browse/SPARK-26333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] deshanxiao updated SPARK-26333: --- Comment: was deleted (was: [~vanzin] No, I am not running as root.) > FsHistoryProviderSuite failed because setReadable doesn't work in RedHat > > > Key: SPARK-26333 > URL: https://issues.apache.org/jira/browse/SPARK-26333 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: deshanxiao >Priority: Major > > FsHistoryProviderSuite failed in case "SPARK-3697: ignore files that cannot > be read.". I try to invoke logFile2.canRead after invoking > "setReadable(false, false)" . And I find that the result of > "logFile2.canRead" is true but in my ubuntu16.04 return false. > The environment: > RedHat: > Linux version 3.10.0-693.2.2.el7.x86_64 (buil...@kbuilder.dev.centos.org) > (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Sep 12 > 22:26:13 UTC 2017 > JDK > Java version: 1.8.0_151, vendor: Oracle Corporation > {code:java} > org.scalatest.exceptions.TestFailedException: 2 was not equal to 1 > at > org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > at > org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) > at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:183) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:182) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$apache$spark$deploy$history$FsHistoryProviderSuite$$updateAndCheck(FsHistoryProviderSuite.scala:841) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:182) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:381) > at 
org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26333) FsHistoryProviderSuite failed because setReadable doesn't work in RedHat
[ https://issues.apache.org/jira/browse/SPARK-26333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718337#comment-16718337 ] deshanxiao commented on SPARK-26333: [~vanzin] No, I am not running as root. > FsHistoryProviderSuite failed because setReadable doesn't work in RedHat > > > Key: SPARK-26333 > URL: https://issues.apache.org/jira/browse/SPARK-26333 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: deshanxiao >Priority: Major > > FsHistoryProviderSuite failed in case "SPARK-3697: ignore files that cannot > be read.". I try to invoke logFile2.canRead after invoking > "setReadable(false, false)" . And I find that the result of > "logFile2.canRead" is true but in my ubuntu16.04 return false. > The environment: > RedHat: > Linux version 3.10.0-693.2.2.el7.x86_64 (buil...@kbuilder.dev.centos.org) > (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Sep 12 > 22:26:13 UTC 2017 > JDK > Java version: 1.8.0_151, vendor: Oracle Corporation > {code:java} > org.scalatest.exceptions.TestFailedException: 2 was not equal to 1 > at > org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > at > org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) > at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:183) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:182) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$apache$spark$deploy$history$FsHistoryProviderSuite$$updateAndCheck(FsHistoryProviderSuite.scala:841) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:182) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:381) > at 
org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20628) Keep track of nodes which are going to be shut down & avoid scheduling new tasks
[ https://issues.apache.org/jira/browse/SPARK-20628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718305#comment-16718305 ] ASF GitHub Bot commented on SPARK-20628: SparkQA commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown URL: https://github.com/apache/spark/pull/19045#issuecomment-446420850 **[Test build #7 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/7/testReport)** for PR 19045 at commit [`af048f5`](https://github.com/apache/spark/commit/af048f5753cd99b68d2e5f8d268c52a119a2d84a). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Keep track of nodes which are going to be shut down & avoid scheduling new > tasks > > > Key: SPARK-20628 > URL: https://issues.apache.org/jira/browse/SPARK-20628 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: holdenk >Priority: Major > > Keep track of nodes which are going to be shut down. We considered adding > this for YARN but took a different approach, for instances where we can't > control instance termination though (EC2, GCE, etc.) this may make more sense. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26265) deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator
[ https://issues.apache.org/jira/browse/SPARK-26265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718296#comment-16718296 ] ASF GitHub Bot commented on SPARK-26265: viirya commented on issue #23289: [SPARK-26265][Core][BRANCH-2.4] Fix deadlock in BytesToBytesMap.MapIterator when locking both BytesToBytesMap.MapIterator and TaskMemoryManager URL: https://github.com/apache/spark/pull/23289#issuecomment-446418768 Thanks @dongjoon-hyun @cloud-fan This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator > -- > > Key: SPARK-26265 > URL: https://issues.apache.org/jira/browse/SPARK-26265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: qian han >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.1, 3.0.0 > > > The application is running on a cluster with 72000 cores and 182000G mem. > Enviroment: > |spark.dynamicAllocation.minExecutors|5| > |spark.dynamicAllocation.initialExecutors|30| > |spark.dynamicAllocation.maxExecutors|400| > |spark.executor.cores|4| > |spark.executor.memory|20g| > > > Stage description: > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:364) > org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:422) > org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:357) > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:193) > > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > java.lang.reflect.Method.invoke(Method.java:498) > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894) > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198) > org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228) > org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) > org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > > jstack information as follow: > Found one Java-level deadlock: = > "Thread-ScriptTransformation-Feed": waiting to lock monitor > 0x00e0cb18 (object 0x0002f1641538, a > org.apache.spark.memory.TaskMemoryManager), which is held by "Executor task > launch worker for task 18899" "Executor task launch worker for task 18899": > waiting to lock monitor 0x00e09788 (object 0x000302faa3b0, a > org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator), which is held by > "Thread-ScriptTransformation-Feed" Java stack information for the threads > listed above: === > "Thread-ScriptTransformation-Feed": at > org.apache.spark.memory.TaskMemoryManager.freePage(TaskMemoryManager.java:332) > - waiting to lock <0x0002f1641538> (a > org.apache.spark.memory.TaskMemoryManager) at > org.apache.spark.memory.MemoryConsumer.freePage(MemoryConsumer.java:130) at > org.apache.spark.unsafe.map.BytesToBytesMap.access$300(BytesToBytesMap.java:66) > at > 
org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.advanceToNextPage(BytesToBytesMap.java:274) > - locked <0x000302faa3b0> (a > org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator) at > org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.next(BytesToBytesMap.java:313) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap$1.next(UnsafeFixedWidthAggregationMap.java:173) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at > scala.collection.Iterator$class.foreach(Iterator.scala:893) at > scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at >
[jira] [Commented] (SPARK-26265) deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator
[ https://issues.apache.org/jira/browse/SPARK-26265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718297#comment-16718297 ] ASF GitHub Bot commented on SPARK-26265: viirya closed pull request #23289: [SPARK-26265][Core][BRANCH-2.4] Fix deadlock in BytesToBytesMap.MapIterator when locking both BytesToBytesMap.MapIterator and TaskMemoryManager URL: https://github.com/apache/spark/pull/23289 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java index 9b6cbab38cbcc..64650336c9371 100644 --- a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java +++ b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java @@ -267,11 +267,18 @@ private MapIterator(int numRecords, Location loc, boolean destructive) { } private void advanceToNextPage() { + // SPARK-26265: We will first lock this `MapIterator` and then `TaskMemoryManager` when going + // to free a memory page by calling `freePage`. At the same time, it is possibly that another + // memory consumer first locks `TaskMemoryManager` and then this `MapIterator` when it + // acquires memory and causes spilling on this `MapIterator`. To avoid deadlock here, we keep + // reference to the page to free and free it after releasing the lock of `MapIterator`. + MemoryBlock pageToFree = null; + synchronized (this) { int nextIdx = dataPages.indexOf(currentPage) + 1; if (destructive && currentPage != null) { dataPages.remove(currentPage); - freePage(currentPage); + pageToFree = currentPage; nextIdx --; } if (dataPages.size() > nextIdx) { @@ -295,6 +302,9 @@ private void advanceToNextPage() { } } } + if (pageToFree != null) { +freePage(pageToFree); + } } @Override diff --git a/core/src/test/java/org/apache/spark/memory/TestMemoryConsumer.java b/core/src/test/java/org/apache/spark/memory/TestMemoryConsumer.java index 0bbaea6b834b8..6aa577d1bf797 100644 --- a/core/src/test/java/org/apache/spark/memory/TestMemoryConsumer.java +++ b/core/src/test/java/org/apache/spark/memory/TestMemoryConsumer.java @@ -38,12 +38,12 @@ public long spill(long size, MemoryConsumer trigger) throws IOException { return used; } - void use(long size) { + public void use(long size) { long got = taskMemoryManager.acquireExecutionMemory(size, this); used += got; } - void free(long size) { + public void free(long size) { used -= size; taskMemoryManager.releaseExecutionMemory(size, this); } diff --git a/core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java b/core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java index 53a233f698c7a..278d28f7bf479 100644 --- a/core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java +++ b/core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java @@ -33,6 +33,8 @@ import org.apache.spark.SparkConf; import org.apache.spark.executor.ShuffleWriteMetrics; +import org.apache.spark.memory.MemoryMode; +import org.apache.spark.memory.TestMemoryConsumer; import org.apache.spark.memory.TaskMemoryManager; import org.apache.spark.memory.TestMemoryManager; import org.apache.spark.network.util.JavaUtils; @@ -667,4 +669,49 @@ public void testPeakMemoryUsed() { } } + @Test + 
public void avoidDeadlock() throws InterruptedException { +memoryManager.limit(PAGE_SIZE_BYTES); +MemoryMode mode = useOffHeapMemoryAllocator() ? MemoryMode.OFF_HEAP: MemoryMode.ON_HEAP; +TestMemoryConsumer c1 = new TestMemoryConsumer(taskMemoryManager, mode); +BytesToBytesMap map = + new BytesToBytesMap(taskMemoryManager, blockManager, serializerManager, 1, 0.5, 1024, false); + +Thread thread = new Thread(() -> { + int i = 0; + long used = 0; + while (i < 10) { +c1.use(1000); +used += 1000; +i++; + } + c1.free(used); +}); + +try { + int i; + for (i = 0; i < 1024; i++) { +final long[] arr = new long[]{i}; +final BytesToBytesMap.Location loc = map.lookup(arr, Platform.LONG_ARRAY_OFFSET, 8); +loc.append(arr, Platform.LONG_ARRAY_OFFSET, 8, arr, Platform.LONG_ARRAY_OFFSET, 8); + } + + // Starts to require memory at another memory consumer. + thread.start(); + + BytesToBytesMap.MapIterator iter = map.destructiveIterator(); + for (i = 0; i < 1024; i++) { +
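Stripped of the Spark specifics, the patch above is a lock-ordering fix: freePage must not be called while the MapIterator lock is held, because freeing memory goes through the TaskMemoryManager, whose lock another consumer may already hold while waiting on the iterator. Below is a minimal standalone sketch of that shape, with hypothetical class names rather than the real Spark classes.

{code:scala}
// Hypothetical classes; only the locking shape mirrors the patch above.
class ManagerSketch {
  def freePage(page: AnyRef): Unit = synchronized {
    // Releasing memory may in turn lock other consumers (e.g. iterators)
    // to trigger spilling, so it must not run under an iterator's lock.
  }
}

class IteratorSketch(manager: ManagerSketch) {
  private var currentPage: AnyRef = new AnyRef

  // Deadlock-prone: takes the manager lock while holding the iterator lock.
  def advanceUnsafe(): Unit = synchronized {
    manager.freePage(currentPage)
  }

  // Patched shape: remember the page under the iterator lock, free it after releasing it.
  def advanceSafe(): Unit = {
    var pageToFree: AnyRef = null
    synchronized {
      pageToFree = currentPage
      currentPage = null
    }
    if (pageToFree != null) manager.freePage(pageToFree)
  }
}
{code}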
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718298#comment-16718298 ] ASF GitHub Bot commented on SPARK-26311: HeartSaVioR commented on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#issuecomment-446418872 @vanzin IMHO we can't go with the approach #20326 took, since the cluster ID is not available in the default log URL and so cannot be retrieved. Also, relying on the (`container_`) pattern in the URL feels fragile to me: while #20326 only extracted the container ID from the URL, we need to get some other parameters as well, which opens a bigger possibility of breakage. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [YARN] New feature: custom log URL for stdout/stderr > > > Key: SPARK-26311 > URL: https://issues.apache.org/jira/browse/SPARK-26311 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Spark has been setting static log URLs for YARN application, which points to > NodeManager webapp. Normally it would work for both running apps and finished > apps, but there're also other approaches on maintaining application logs, > like having external log service which enables to avoid application log url > to be a deadlink when NodeManager is not accessible. (Node decommissioned, > elastic nodes, etc.) > Spark can provide a new configuration for custom log url on YARN mode, which > end users can set it properly to point application log to external log > service. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24561) User-defined window functions with pandas udf (bounded window)
[ https://issues.apache.org/jira/browse/SPARK-24561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718246#comment-16718246 ] ASF GitHub Bot commented on SPARK-24561: SparkQA commented on issue #22305: [SPARK-24561][SQL][Python] User-defined window aggregation functions with Pandas UDF (bounded window) URL: https://github.com/apache/spark/pull/22305#issuecomment-446412545 **[Test build #99989 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99989/testReport)** for PR 22305 at commit [`5d3bbd6`](https://github.com/apache/spark/commit/5d3bbd6bfbdd6b332c65f9748ab066d9eb6a480d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > User-defined window functions with pandas udf (bounded window) > -- > > Key: SPARK-24561 > URL: https://issues.apache.org/jira/browse/SPARK-24561 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Li Jin >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24561) User-defined window functions with pandas udf (bounded window)
[ https://issues.apache.org/jira/browse/SPARK-24561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718250#comment-16718250 ] ASF GitHub Bot commented on SPARK-24561: AmplabJenkins removed a comment on issue #22305: [SPARK-24561][SQL][Python] User-defined window aggregation functions with Pandas UDF (bounded window) URL: https://github.com/apache/spark/pull/22305#issuecomment-446412883 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > User-defined window functions with pandas udf (bounded window) > -- > > Key: SPARK-24561 > URL: https://issues.apache.org/jira/browse/SPARK-24561 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Li Jin >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26193) Implement shuffle write metrics in SQL
[ https://issues.apache.org/jira/browse/SPARK-26193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718254#comment-16718254 ] ASF GitHub Bot commented on SPARK-26193: rxin commented on issue #23286: [SPARK-26193][SQL][Follow Up] Read metrics rename and display text changes URL: https://github.com/apache/spark/pull/23286#issuecomment-446413488 LGTM too This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement shuffle write metrics in SQL > -- > > Key: SPARK-26193 > URL: https://issues.apache.org/jira/browse/SPARK-26193 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24561) User-defined window functions with pandas udf (bounded window)
[ https://issues.apache.org/jira/browse/SPARK-24561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718253#comment-16718253 ] ASF GitHub Bot commented on SPARK-24561: HyukjinKwon commented on a change in pull request #22305: [SPARK-24561][SQL][Python] User-defined window aggregation functions with Pandas UDF (bounded window) URL: https://github.com/apache/spark/pull/22305#discussion_r240841612 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/AggregateProcessor.scala ## @@ -113,7 +113,7 @@ private[window] object AggregateProcessor { * This class manages the processing of a number of aggregate functions. See the documentation of * the object for more information. */ -private[window] final class AggregateProcessor( +private[sql] final class AggregateProcessor( Review comment: Yea, see also some JIRAs like SPARK-16964. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > User-defined window functions with pandas udf (bounded window) > -- > > Key: SPARK-24561 > URL: https://issues.apache.org/jira/browse/SPARK-24561 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Li Jin >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24561) User-defined window functions with pandas udf (bounded window)
[ https://issues.apache.org/jira/browse/SPARK-24561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718251#comment-16718251 ] ASF GitHub Bot commented on SPARK-24561: AmplabJenkins removed a comment on issue #22305: [SPARK-24561][SQL][Python] User-defined window aggregation functions with Pandas UDF (bounded window) URL: https://github.com/apache/spark/pull/22305#issuecomment-446412887 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99989/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > User-defined window functions with pandas udf (bounded window) > -- > > Key: SPARK-24561 > URL: https://issues.apache.org/jira/browse/SPARK-24561 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Li Jin >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24561) User-defined window functions with pandas udf (bounded window)
[ https://issues.apache.org/jira/browse/SPARK-24561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718248#comment-16718248 ] ASF GitHub Bot commented on SPARK-24561: AmplabJenkins commented on issue #22305: [SPARK-24561][SQL][Python] User-defined window aggregation functions with Pandas UDF (bounded window) URL: https://github.com/apache/spark/pull/22305#issuecomment-446412883 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > User-defined window functions with pandas udf (bounded window) > -- > > Key: SPARK-24561 > URL: https://issues.apache.org/jira/browse/SPARK-24561 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Li Jin >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24561) User-defined window functions with pandas udf (bounded window)
[ https://issues.apache.org/jira/browse/SPARK-24561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718249#comment-16718249 ] ASF GitHub Bot commented on SPARK-24561: AmplabJenkins commented on issue #22305: [SPARK-24561][SQL][Python] User-defined window aggregation functions with Pandas UDF (bounded window) URL: https://github.com/apache/spark/pull/22305#issuecomment-446412887 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99989/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > User-defined window functions with pandas udf (bounded window) > -- > > Key: SPARK-24561 > URL: https://issues.apache.org/jira/browse/SPARK-24561 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Li Jin >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24561) User-defined window functions with pandas udf (bounded window)
[ https://issues.apache.org/jira/browse/SPARK-24561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718247#comment-16718247 ] ASF GitHub Bot commented on SPARK-24561: SparkQA removed a comment on issue #22305: [SPARK-24561][SQL][Python] User-defined window aggregation functions with Pandas UDF (bounded window) URL: https://github.com/apache/spark/pull/22305#issuecomment-446348650 **[Test build #99989 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99989/testReport)** for PR 22305 at commit [`5d3bbd6`](https://github.com/apache/spark/commit/5d3bbd6bfbdd6b332c65f9748ab066d9eb6a480d). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > User-defined window functions with pandas udf (bounded window) > -- > > Key: SPARK-24561 > URL: https://issues.apache.org/jira/browse/SPARK-24561 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Li Jin >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25642) Add new Metrics in External Shuffle Service to help determine Network performance and Connection Handling capabilities of the Shuffle Service
[ https://issues.apache.org/jira/browse/SPARK-25642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718222#comment-16718222 ] ASF GitHub Bot commented on SPARK-25642: AmplabJenkins removed a comment on issue #22498: [SPARK-25642] : Adding two new metrics to record the number of registered connections as well as the number of active connections to YARN Shuffle Service URL: https://github.com/apache/spark/pull/22498#issuecomment-446407716 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add new Metrics in External Shuffle Service to help determine Network > performance and Connection Handling capabilities of the Shuffle Service > - > > Key: SPARK-25642 > URL: https://issues.apache.org/jira/browse/SPARK-25642 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Parth Gandhi >Priority: Minor > > Recently, the ability to expose the metrics for YARN Shuffle Service was > added as part of [SPARK-18364|[https://github.com/apache/spark/pull/22485]]. > We need to add some metrics to be able to determine the number of active > connections as well as open connections to the external shuffle service to > benchmark network and connection issues on large cluster environments. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25642) Add new Metrics in External Shuffle Service to help determine Network performance and Connection Handling capabilities of the Shuffle Service
[ https://issues.apache.org/jira/browse/SPARK-25642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718216#comment-16718216 ] ASF GitHub Bot commented on SPARK-25642: vanzin commented on a change in pull request #22498: [SPARK-25642] : Adding two new metrics to record the number of registered connections as well as the number of active connections to YARN Shuffle Service URL: https://github.com/apache/spark/pull/22498#discussion_r240835732 ## File path: common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java ## @@ -199,6 +191,18 @@ protected void serviceInit(Configuration conf) throws Exception { port = shuffleServer.getPort(); boundPort = port; String authEnabledString = authEnabled ? "enabled" : "not enabled"; + + // register metrics on the block handler into the Node Manager's metrics system. + blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections", + shuffleServer.getRegisteredConnections()); Review comment: nit: indented too far, here and in others. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add new Metrics in External Shuffle Service to help determine Network > performance and Connection Handling capabilities of the Shuffle Service > - > > Key: SPARK-25642 > URL: https://issues.apache.org/jira/browse/SPARK-25642 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Parth Gandhi >Priority: Minor > > Recently, the ability to expose the metrics for YARN Shuffle Service was > added as part of [SPARK-18364|[https://github.com/apache/spark/pull/22485]]. > We need to add some metrics to be able to determine the number of active > connections as well as open connections to the external shuffle service to > benchmark network and connection issues on large cluster environments. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25642) Add new Metrics in External Shuffle Service to help determine Network performance and Connection Handling capabilities of the Shuffle Service
[ https://issues.apache.org/jira/browse/SPARK-25642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718217#comment-16718217 ] ASF GitHub Bot commented on SPARK-25642: vanzin commented on issue #22498: [SPARK-25642] : Adding two new metrics to record the number of registered connections as well as the number of active connections to YARN Shuffle Service URL: https://github.com/apache/spark/pull/22498#issuecomment-446406987 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add new Metrics in External Shuffle Service to help determine Network > performance and Connection Handling capabilities of the Shuffle Service > - > > Key: SPARK-25642 > URL: https://issues.apache.org/jira/browse/SPARK-25642 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Parth Gandhi >Priority: Minor > > Recently, the ability to expose the metrics for YARN Shuffle Service was > added as part of [SPARK-18364|[https://github.com/apache/spark/pull/22485]]. > We need to add some metrics to be able to determine the number of active > connections as well as open connections to the external shuffle service to > benchmark network and connection issues on large cluster environments. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25642) Add new Metrics in External Shuffle Service to help determine Network performance and Connection Handling capabilities of the Shuffle Service
[ https://issues.apache.org/jira/browse/SPARK-25642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718223#comment-16718223 ] ASF GitHub Bot commented on SPARK-25642: AmplabJenkins removed a comment on issue #22498: [SPARK-25642] : Adding two new metrics to record the number of registered connections as well as the number of active connections to YARN Shuffle Service URL: https://github.com/apache/spark/pull/22498#issuecomment-446407720 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5992/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add new Metrics in External Shuffle Service to help determine Network > performance and Connection Handling capabilities of the Shuffle Service > - > > Key: SPARK-25642 > URL: https://issues.apache.org/jira/browse/SPARK-25642 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Parth Gandhi >Priority: Minor > > Recently, the ability to expose the metrics for YARN Shuffle Service was > added as part of [SPARK-18364|[https://github.com/apache/spark/pull/22485]]. > We need to add some metrics to be able to determine the number of active > connections as well as open connections to the external shuffle service to > benchmark network and connection issues on large cluster environments. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26321) Split a SQL in a correct way
[ https://issues.apache.org/jira/browse/SPARK-26321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718213#comment-16718213 ] ASF GitHub Bot commented on SPARK-26321: sadhen commented on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way URL: https://github.com/apache/spark/pull/23276#issuecomment-446406264 OK I will add more desc. @srowen This is actually a trivial PR! Please review. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Split a SQL in a correct way > > > Key: SPARK-26321 > URL: https://issues.apache.org/jira/browse/SPARK-26321 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Darcy Shen >Priority: Major > > First: > ./build/mvn -Phive-thriftserver -DskipTests package > > Then: > $ bin/spark-sql > > 18/12/10 19:35:02 INFO SparkSQLCLIDriver: Time taken: 4.483 seconds, Fetched > 1 row(s) > spark-sql> select "1;2"; > Error in query: > no viable alternative at input 'select "'(line 1, pos 7) > == SQL == > select "1 > ---^^^ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
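The parse error quoted in the SPARK-26321 description arises because spark-sql splits the input on every semicolon before parsing, so the literal "1;2" is cut in half. A quote-aware splitter along the lines of the sketch below avoids that. This is only an illustration of the idea, not the code proposed in PR #23276; the object and method names (SqlSplitter, splitStatements) are invented for the example, and escaped quotes and comments are deliberately ignored to keep it short.

{code:scala}
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch only (not the implementation in PR #23276):
// split a command line into SQL statements on ';' while ignoring
// semicolons that appear inside single- or double-quoted literals.
object SqlSplitter {
  def splitStatements(line: String): Seq[String] = {
    val statements = ArrayBuffer.empty[String]
    val current = new StringBuilder
    var inSingleQuote = false
    var inDoubleQuote = false
    for (c <- line) {
      if (c == '\'' && !inDoubleQuote) { inSingleQuote = !inSingleQuote; current += c }
      else if (c == '"' && !inSingleQuote) { inDoubleQuote = !inDoubleQuote; current += c }
      else if (c == ';' && !inSingleQuote && !inDoubleQuote) {
        // Statement boundary: this ';' sits outside any string literal.
        statements += current.toString
        current.clear()
      } else {
        current += c
      }
    }
    if (current.nonEmpty) statements += current.toString
    statements.toSeq
  }
}
{code}

With such a splitter, select "1;2"; stays a single statement instead of being cut inside the string literal.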
[jira] [Commented] (SPARK-25642) Add new Metrics in External Shuffle Service to help determine Network performance and Connection Handling capabilities of the Shuffle Service
[ https://issues.apache.org/jira/browse/SPARK-25642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718220#comment-16718220 ] ASF GitHub Bot commented on SPARK-25642: AmplabJenkins commented on issue #22498: [SPARK-25642] : Adding two new metrics to record the number of registered connections as well as the number of active connections to YARN Shuffle Service URL: https://github.com/apache/spark/pull/22498#issuecomment-446407720 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5992/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add new Metrics in External Shuffle Service to help determine Network > performance and Connection Handling capabilities of the Shuffle Service > - > > Key: SPARK-25642 > URL: https://issues.apache.org/jira/browse/SPARK-25642 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Parth Gandhi >Priority: Minor > > Recently, the ability to expose the metrics for YARN Shuffle Service was > added as part of [SPARK-18364|[https://github.com/apache/spark/pull/22485]]. > We need to add some metrics to be able to determine the number of active > connections as well as open connections to the external shuffle service to > benchmark network and connection issues on large cluster environments. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25642) Add new Metrics in External Shuffle Service to help determine Network performance and Connection Handling capabilities of the Shuffle Service
[ https://issues.apache.org/jira/browse/SPARK-25642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718221#comment-16718221 ] ASF GitHub Bot commented on SPARK-25642: SparkQA commented on issue #22498: [SPARK-25642] : Adding two new metrics to record the number of registered connections as well as the number of active connections to YARN Shuffle Service URL: https://github.com/apache/spark/pull/22498#issuecomment-446407732 **[Test build #6 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/6/testReport)** for PR 22498 at commit [`70472a2`](https://github.com/apache/spark/commit/70472a255e5da3ea4522959e26f5c403641e1ce6). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add new Metrics in External Shuffle Service to help determine Network > performance and Connection Handling capabilities of the Shuffle Service > - > > Key: SPARK-25642 > URL: https://issues.apache.org/jira/browse/SPARK-25642 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Parth Gandhi >Priority: Minor > > Recently, the ability to expose the metrics for YARN Shuffle Service was > added as part of [SPARK-18364|[https://github.com/apache/spark/pull/22485]]. > We need to add some metrics to be able to determine the number of active > connections as well as open connections to the external shuffle service to > benchmark network and connection issues on large cluster environments. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25642) Add new Metrics in External Shuffle Service to help determine Network performance and Connection Handling capabilities of the Shuffle Service
[ https://issues.apache.org/jira/browse/SPARK-25642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718219#comment-16718219 ] ASF GitHub Bot commented on SPARK-25642: AmplabJenkins commented on issue #22498: [SPARK-25642] : Adding two new metrics to record the number of registered connections as well as the number of active connections to YARN Shuffle Service URL: https://github.com/apache/spark/pull/22498#issuecomment-446407716 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add new Metrics in External Shuffle Service to help determine Network > performance and Connection Handling capabilities of the Shuffle Service > - > > Key: SPARK-25642 > URL: https://issues.apache.org/jira/browse/SPARK-25642 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Parth Gandhi >Priority: Minor > > Recently, the ability to expose the metrics for YARN Shuffle Service was > added as part of [SPARK-18364|[https://github.com/apache/spark/pull/22485]]. > We need to add some metrics to be able to determine the number of active > connections as well as open connections to the external shuffle service to > benchmark network and connection issues on large cluster environments. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
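For readers following the SPARK-25642 threads above: metrics such as "registered connections" and "active connections" are typically derived from Netty channel lifecycle events. The sketch below only illustrates that pattern; it is not the code in PR #22498, and the handler name, metric names, and their exact semantics are invented for the example.

{code:scala}
import com.codahale.metrics.MetricRegistry
import io.netty.channel.ChannelHandler.Sharable
import io.netty.channel.{ChannelHandlerContext, ChannelInboundHandlerAdapter}

// Illustrative sketch only (not the change in PR #22498): one shared handler
// instance added to every accepted channel's pipeline, counting connections
// through a Dropwizard MetricRegistry.
@Sharable
class ConnectionCountingHandler(registry: MetricRegistry)
  extends ChannelInboundHandlerAdapter {

  // Total connections ever accepted by the shuffle service.
  private val registeredConnections = registry.counter("registeredConnections")
  // Connections that are currently open.
  private val activeConnections = registry.counter("activeConnections")

  override def channelActive(ctx: ChannelHandlerContext): Unit = {
    registeredConnections.inc()
    activeConnections.inc()
    super.channelActive(ctx)
  }

  override def channelInactive(ctx: ChannelHandlerContext): Unit = {
    // Only the "currently open" counter goes back down when a client disconnects.
    activeConnections.dec()
    super.channelInactive(ctx)
  }
}
{code}

Wiring the registry into the shuffle service's existing metrics exposure (the mechanism the ticket says was added earlier) would then make both values visible to whatever sink is configured.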
[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools
[ https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718209#comment-16718209 ] ASF GitHub Bot commented on SPARK-24938: AmplabJenkins removed a comment on issue #22114: [SPARK-24938][Core] Prevent Netty from using onheap memory for headers without regard for configuration URL: https://github.com/apache/spark/pull/22114#issuecomment-446405325 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Understand usage of netty's onheap memory use, even with offheap pools > -- > > Key: SPARK-24938 > URL: https://issues.apache.org/jira/browse/SPARK-24938 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > We've observed that netty uses large amount of onheap memory in its pools, in > addition to the expected offheap memory when I added some instrumentation > (using SPARK-24918 and https://github.com/squito/spark-memory). We should > figure out why its using that memory, and whether its really necessary. > It might be just this one line: > https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82 > which means that even with a small burst of messages, each arena will grow by > 16MB which could lead to a 128 MB spike of an almost entirely unused pool. > Switching to requesting a buffer from the default pool would probably fix > this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
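Some context for the arena-growth numbers quoted in the SPARK-24938 description: Netty's pooled allocators serve requests out of fixed-size arena chunks, so even a tiny on-heap header allocation can reserve a whole chunk (16 MB by default), and a burst that touches several arenas multiplies that. The snippet below merely contrasts the two allocation styles under discussion; it is not the patch in PR #22114, and the object and method names are made up for the illustration.

{code:scala}
import io.netty.buffer.{ByteBuf, ByteBufAllocator, Unpooled}

// Illustration of the allocation styles discussed in SPARK-24938
// (not the actual fix in PR #22114).
object HeaderAllocationSketch {
  // A frame header is only a few bytes.
  val headerLength: Int = 16

  // Style 1: ask the channel's (pooled) allocator for an on-heap buffer.
  // With a pooled allocator, the first allocation in an arena reserves a whole
  // chunk, which is where a surprising on-heap footprint can come from.
  def fromChannelAllocator(alloc: ByteBufAllocator): ByteBuf =
    alloc.heapBuffer(headerLength)

  // Style 2: allocate the small header as an unpooled buffer, i.e. a plain
  // byte array that does not pin an arena chunk.
  def fromUnpooled(): ByteBuf =
    Unpooled.buffer(headerLength)
}
{code}

Whether unpooled headers, a different allocator, or something else entirely is the right trade-off is exactly what the instrumentation described in the ticket is meant to answer.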
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718192#comment-16718192 ] ASF GitHub Bot commented on SPARK-26203: AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#issuecomment-446404202 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Benchmark performance of In and InSet expressions > - > > Key: SPARK-26203 > URL: https://issues.apache.org/jira/browse/SPARK-26203 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Priority: Major > > {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible > values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values > are literals. This was done for performance reasons to avoid O\(n\) time > complexity for {{In}}. > The original optimization was done in SPARK-3711. A lot has changed after > that (e.g., generation of Java code to evaluate expressions), so it is worth > to measure the performance of this optimization again. > According to my local benchmarks, {{InSet}} can be up to 10x time slower than > {{In}} due to autoboxing and other issues. > The scope of this JIRA is to benchmark every supported data type inside > {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this > information, we can come up with solutions. > Based on my preliminary investigation, we can do quite some optimizations, > which quite frequently depend on a specific data type. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
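For the SPARK-26203 discussion above, the quickest way to see the In-versus-InSet difference outside a dedicated benchmark is to force each form via the conversion threshold mentioned in the description. The sketch below is a rough illustration only, not the benchmark added by PR #23291: the object name is invented, and the wall-clock timing ignores warm-up and codegen effects that a real benchmark would control for.

{code:scala}
import org.apache.spark.sql.SparkSession

// Rough illustration for SPARK-26203 (not the benchmark in PR #23291):
// time the same IN predicate with the optimizer forced to keep it as In,
// and again with it rewritten to InSet.
object InVsInSetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("in-vs-inset-sketch")
      .master("local[*]")
      .getOrCreate()

    val df = spark.range(0L, 10000000L).toDF("id")
    val literals = (0 until 200).mkString(", ")

    def time(label: String): Unit = {
      val start = System.nanoTime()
      df.where(s"id IN ($literals)").count()
      println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
    }

    // A threshold above the list size keeps the predicate as In (linear scan of literals).
    spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", 1000L)
    time("In")

    // A threshold below the list size lets OptimizeIn rewrite the predicate to InSet
    // (hash-set lookup, but with the boxing costs the ticket describes).
    spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", 10L)
    time("InSet")

    spark.stop()
  }
}
{code}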
[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools
[ https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718210#comment-16718210 ] ASF GitHub Bot commented on SPARK-24938: AmplabJenkins removed a comment on issue #22114: [SPARK-24938][Core] Prevent Netty from using onheap memory for headers without regard for configuration URL: https://github.com/apache/spark/pull/22114#issuecomment-446405327 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5991/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Understand usage of netty's onheap memory use, even with offheap pools > -- > > Key: SPARK-24938 > URL: https://issues.apache.org/jira/browse/SPARK-24938 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > We've observed that netty uses large amount of onheap memory in its pools, in > addition to the expected offheap memory when I added some instrumentation > (using SPARK-24918 and https://github.com/squito/spark-memory). We should > figure out why its using that memory, and whether its really necessary. > It might be just this one line: > https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82 > which means that even with a small burst of messages, each arena will grow by > 16MB which could lead to a 128 MB spike of an almost entirely unused pool. > Switching to requesting a buffer from the default pool would probably fix > this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25877) Put all feature-related code in the feature step itself
[ https://issues.apache.org/jira/browse/SPARK-25877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718208#comment-16718208 ] ASF GitHub Bot commented on SPARK-25877: mccheah commented on issue #23220: [SPARK-25877][k8s] Move all feature logic to feature classes. URL: https://github.com/apache/spark/pull/23220#issuecomment-446405494 +1 from me, would like @liyinan926 to take a second look This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Put all feature-related code in the feature step itself > --- > > Key: SPARK-25877 > URL: https://issues.apache.org/jira/browse/SPARK-25877 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > > This is a child task of SPARK-25874. It covers having all the code related to > features in the feature steps themselves, including logic about whether a > step should be applied or not. > Please refer to the parent bug for further details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools
[ https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718205#comment-16718205 ] ASF GitHub Bot commented on SPARK-24938: AmplabJenkins commented on issue #22114: [SPARK-24938][Core] Prevent Netty from using onheap memory for headers without regard for configuration URL: https://github.com/apache/spark/pull/22114#issuecomment-446405325 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Understand usage of netty's onheap memory use, even with offheap pools > -- > > Key: SPARK-24938 > URL: https://issues.apache.org/jira/browse/SPARK-24938 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > We've observed that netty uses large amount of onheap memory in its pools, in > addition to the expected offheap memory when I added some instrumentation > (using SPARK-24918 and https://github.com/squito/spark-memory). We should > figure out why its using that memory, and whether its really necessary. > It might be just this one line: > https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82 > which means that even with a small burst of messages, each arena will grow by > 16MB which could lead to a 128 MB spike of an almost entirely unused pool. > Switching to requesting a buffer from the default pool would probably fix > this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools
[ https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718207#comment-16718207 ] ASF GitHub Bot commented on SPARK-24938: SparkQA commented on issue #22114: [SPARK-24938][Core] Prevent Netty from using onheap memory for headers without regard for configuration URL: https://github.com/apache/spark/pull/22114#issuecomment-446405364 **[Test build #5 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/5/testReport)** for PR 22114 at commit [`c2f9ed1`](https://github.com/apache/spark/commit/c2f9ed10776842ffe0746fcc89b157675fa6c455). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Understand usage of netty's onheap memory use, even with offheap pools > -- > > Key: SPARK-24938 > URL: https://issues.apache.org/jira/browse/SPARK-24938 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > We've observed that netty uses large amount of onheap memory in its pools, in > addition to the expected offheap memory when I added some instrumentation > (using SPARK-24918 and https://github.com/squito/spark-memory). We should > figure out why its using that memory, and whether its really necessary. > It might be just this one line: > https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82 > which means that even with a small burst of messages, each arena will grow by > 16MB which could lead to a 128 MB spike of an almost entirely unused pool. > Switching to requesting a buffer from the default pool would probably fix > this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools
[ https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718206#comment-16718206 ] ASF GitHub Bot commented on SPARK-24938: AmplabJenkins commented on issue #22114: [SPARK-24938][Core] Prevent Netty from using onheap memory for headers without regard for configuration URL: https://github.com/apache/spark/pull/22114#issuecomment-446405327 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5991/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Understand usage of netty's onheap memory use, even with offheap pools > -- > > Key: SPARK-24938 > URL: https://issues.apache.org/jira/browse/SPARK-24938 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > We've observed that netty uses large amount of onheap memory in its pools, in > addition to the expected offheap memory when I added some instrumentation > (using SPARK-24918 and https://github.com/squito/spark-memory). We should > figure out why its using that memory, and whether its really necessary. > It might be just this one line: > https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82 > which means that even with a small burst of messages, each arena will grow by > 16MB which could lead to a 128 MB spike of an almost entirely unused pool. > Switching to requesting a buffer from the default pool would probably fix > this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25877) Put all feature-related code in the feature step itself
[ https://issues.apache.org/jira/browse/SPARK-25877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718201#comment-16718201 ] ASF GitHub Bot commented on SPARK-25877: mccheah commented on a change in pull request #23220: [SPARK-25877][k8s] Move all feature logic to feature classes. URL: https://github.com/apache/spark/pull/23220#discussion_r240833837 ## File path: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/PodBuilderSuite.scala ## @@ -0,0 +1,177 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s + +import java.io.File + +import io.fabric8.kubernetes.api.model.{Config => _, _} +import io.fabric8.kubernetes.client.KubernetesClient +import io.fabric8.kubernetes.client.dsl.{MixedOperation, PodResource} +import org.mockito.Matchers.any +import org.mockito.Mockito.{mock, never, verify, when} +import scala.collection.JavaConverters._ + +import org.apache.spark.{SparkConf, SparkException, SparkFunSuite} +import org.apache.spark.deploy.k8s._ +import org.apache.spark.internal.config.ConfigEntry + +abstract class PodBuilderSuite extends SparkFunSuite { + + protected def templateFileConf: ConfigEntry[_] + + protected def buildPod(sparkConf: SparkConf, client: KubernetesClient): SparkPod + + private val baseConf = new SparkConf(false) +.set(Config.CONTAINER_IMAGE, "spark-executor:latest") + + test("use empty initial pod if template is not specified") { +val client = mock(classOf[KubernetesClient]) +buildPod(baseConf.clone(), client) +verify(client, never()).pods() + } + + test("load pod template if specified") { +val client = mockKubernetesClient() +val sparkConf = baseConf.clone().set(templateFileConf.key, "template-file.yaml") +val pod = buildPod(sparkConf, client) +verifyPod(pod) + } + + test("complain about misconfigured pod template") { +val client = mockKubernetesClient( + new PodBuilder() +.withNewMetadata() +.addToLabels("test-label-key", "test-label-value") +.endMetadata() +.build()) +val sparkConf = baseConf.clone().set(templateFileConf.key, "template-file.yaml") +val exception = intercept[SparkException] { + buildPod(sparkConf, client) +} +assert(exception.getMessage.contains("Could not load pod from template file.")) + } + + private def mockKubernetesClient(pod: Pod = podWithSupportedFeatures()): KubernetesClient = { +val kubernetesClient = mock(classOf[KubernetesClient]) +val pods = + mock(classOf[MixedOperation[Pod, PodList, DoneablePod, PodResource[Pod, DoneablePod]]]) +val podResource = mock(classOf[PodResource[Pod, DoneablePod]]) +when(kubernetesClient.pods()).thenReturn(pods) +when(pods.load(any(classOf[File]))).thenReturn(podResource) +when(podResource.get()).thenReturn(pod) +kubernetesClient + } + + private def verifyPod(pod: SparkPod): Unit 
= { Review comment: > That's also an argument for not restoring the mocks, which would go against what this change is doing. This test should account for modifications made by other steps, since if they modify something unexpected, that can change the semantics of the feature (pod template support). Wouldn't most of those unexpected changes come from the unit tests of the individual steps? Granted this test can catch when a change in one step impacts behavior in another step, which is important. Given that this isn't changing prior code I'm fine with leaving this as-is and addressing again later if it becomes a problem. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Put all feature-related code in the feature step itself > --- > > Key: SPARK-25877 > URL: https://issues.apache.org/jira/browse/SPARK-25877 >
[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools
[ https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718199#comment-16718199 ] ASF GitHub Bot commented on SPARK-24938: vanzin commented on issue #22114: [SPARK-24938][Core] Prevent Netty from using onheap memory for headers without regard for configuration URL: https://github.com/apache/spark/pull/22114#issuecomment-446404726 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Understand usage of netty's onheap memory use, even with offheap pools > -- > > Key: SPARK-24938 > URL: https://issues.apache.org/jira/browse/SPARK-24938 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > We've observed that netty uses large amount of onheap memory in its pools, in > addition to the expected offheap memory when I added some instrumentation > (using SPARK-24918 and https://github.com/squito/spark-memory). We should > figure out why its using that memory, and whether its really necessary. > It might be just this one line: > https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82 > which means that even with a small burst of messages, each arena will grow by > 16MB which could lead to a 128 MB spike of an almost entirely unused pool. > Switching to requesting a buffer from the default pool would probably fix > this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718197#comment-16718197 ] ASF GitHub Bot commented on SPARK-26203: AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#issuecomment-446404207 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/3/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Benchmark performance of In and InSet expressions > - > > Key: SPARK-26203 > URL: https://issues.apache.org/jira/browse/SPARK-26203 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Priority: Major > > {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible > values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values > are literals. This was done for performance reasons to avoid O\(n\) time > complexity for {{In}}. > The original optimization was done in SPARK-3711. A lot has changed after > that (e.g., generation of Java code to evaluate expressions), so it is worth > to measure the performance of this optimization again. > According to my local benchmarks, {{InSet}} can be up to 10x time slower than > {{In}} due to autoboxing and other issues. > The scope of this JIRA is to benchmark every supported data type inside > {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this > information, we can come up with solutions. > Based on my preliminary investigation, we can do quite some optimizations, > which quite frequently depend on a specific data type. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13356) WebUI missing input informations when recovering from dirver failure
[ https://issues.apache.org/jira/browse/SPARK-13356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718180#comment-16718180 ] ASF GitHub Bot commented on SPARK-13356: vanzin commented on issue #11228: [SPARK-13356][Streaming]WebUI missing input informations when recovering from dirver failure URL: https://github.com/apache/spark/pull/11228#issuecomment-446402361 This looks very out of date. I'll close this for now, but if you want to update and reopen the PR, just push to your branch. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > WebUI missing input informations when recovering from dirver failure > > > Key: SPARK-13356 > URL: https://issues.apache.org/jira/browse/SPARK-13356 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.2, 1.6.0 >Reporter: jeanlyn >Priority: Minor > Attachments: DirectKafkaScreenshot.jpg > > > The WebUI is missing some input information when a streaming application recovers from a checkpoint, > which may lead people to think data was lost during recovery from the failure. > For example: > !DirectKafkaScreenshot.jpg! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718194#comment-16718194 ] ASF GitHub Bot commented on SPARK-26203: SparkQA removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#issuecomment-446401531 **[Test build #3 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/3/testReport)** for PR 23291 at commit [`987bea4`](https://github.com/apache/spark/commit/987bea48350ed2e3862b965e07d5d5335e1d86c2). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Benchmark performance of In and InSet expressions > - > > Key: SPARK-26203 > URL: https://issues.apache.org/jira/browse/SPARK-26203 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Priority: Major > > {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible > values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values > are literals. This was done for performance reasons to avoid O\(n\) time > complexity for {{In}}. > The original optimization was done in SPARK-3711. A lot has changed after > that (e.g., generation of Java code to evaluate expressions), so it is worth > to measure the performance of this optimization again. > According to my local benchmarks, {{InSet}} can be up to 10x time slower than > {{In}} due to autoboxing and other issues. > The scope of this JIRA is to benchmark every supported data type inside > {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this > information, we can come up with solutions. > Based on my preliminary investigation, we can do quite some optimizations, > which quite frequently depend on a specific data type. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718193#comment-16718193 ] ASF GitHub Bot commented on SPARK-26203: AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#issuecomment-446404207 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/3/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Benchmark performance of In and InSet expressions > - > > Key: SPARK-26203 > URL: https://issues.apache.org/jira/browse/SPARK-26203 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Priority: Major > > {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible > values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values > are literals. This was done for performance reasons to avoid O\(n\) time > complexity for {{In}}. > The original optimization was done in SPARK-3711. A lot has changed after > that (e.g., generation of Java code to evaluate expressions), so it is worth > to measure the performance of this optimization again. > According to my local benchmarks, {{InSet}} can be up to 10x time slower than > {{In}} due to autoboxing and other issues. > The scope of this JIRA is to benchmark every supported data type inside > {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this > information, we can come up with solutions. > Based on my preliminary investigation, we can do quite some optimizations, > which quite frequently depend on a specific data type. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718195#comment-16718195 ] ASF GitHub Bot commented on SPARK-26203: AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#issuecomment-446404202 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Benchmark performance of In and InSet expressions > - > > Key: SPARK-26203 > URL: https://issues.apache.org/jira/browse/SPARK-26203 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Priority: Major > > {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible > values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values > are literals. This was done for performance reasons to avoid O\(n\) time > complexity for {{In}}. > The original optimization was done in SPARK-3711. A lot has changed after > that (e.g., generation of Java code to evaluate expressions), so it is worth > to measure the performance of this optimization again. > According to my local benchmarks, {{InSet}} can be up to 10x time slower than > {{In}} due to autoboxing and other issues. > The scope of this JIRA is to benchmark every supported data type inside > {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this > information, we can come up with solutions. > Based on my preliminary investigation, we can do quite some optimizations, > which quite frequently depend on a specific data type. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718191#comment-16718191 ] ASF GitHub Bot commented on SPARK-26203: SparkQA commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#issuecomment-446404185 **[Test build #3 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/3/testReport)** for PR 23291 at commit [`987bea4`](https://github.com/apache/spark/commit/987bea48350ed2e3862b965e07d5d5335e1d86c2). * This patch **fails to generate documentation**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Benchmark performance of In and InSet expressions > - > > Key: SPARK-26203 > URL: https://issues.apache.org/jira/browse/SPARK-26203 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Priority: Major > > {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible > values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values > are literals. This was done for performance reasons to avoid O\(n\) time > complexity for {{In}}. > The original optimization was done in SPARK-3711. A lot has changed after > that (e.g., generation of Java code to evaluate expressions), so it is worth > to measure the performance of this optimization again. > According to my local benchmarks, {{InSet}} can be up to 10x time slower than > {{In}} due to autoboxing and other issues. > The scope of this JIRA is to benchmark every supported data type inside > {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this > information, we can come up with solutions. > Based on my preliminary investigation, we can do quite some optimizations, > which quite frequently depend on a specific data type. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718183#comment-16718183 ] ASF GitHub Bot commented on SPARK-26311: vanzin commented on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#issuecomment-446402857 This is the other PR I was referring to: #20326 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [YARN] New feature: custom log URL for stdout/stderr > > > Key: SPARK-26311 > URL: https://issues.apache.org/jira/browse/SPARK-26311 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Spark has been setting static log URLs for YARN application, which points to > NodeManager webapp. Normally it would work for both running apps and finished > apps, but there're also other approaches on maintaining application logs, > like having external log service which enables to avoid application log url > to be a deadlink when NodeManager is not accessible. (Node decommissioned, > elastic nodes, etc.) > Spark can provide a new configuration for custom log url on YARN mode, which > end users can set it properly to point application log to external log > service. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13356) WebUI missing input informations when recovering from dirver failure
[ https://issues.apache.org/jira/browse/SPARK-13356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718181#comment-16718181 ] ASF GitHub Bot commented on SPARK-13356: vanzin closed pull request #11228: [SPARK-13356][Streaming]WebUI missing input informations when recovering from dirver failure URL: https://github.com/apache/spark/pull/11228 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala b/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala index 7e57bb18cbd50..01404b7a53540 100644 --- a/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala +++ b/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala @@ -313,16 +313,32 @@ private[spark] class DirectKafkaInputDStream[K, V]( override def restore(): Unit = { batchForTime.toSeq.sortBy(_._1)(Time.ordering).foreach { case (t, b) => - logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", "]")}") - generatedRDDs += t -> new KafkaRDD[K, V]( - context.sparkContext, - executorKafkaParams, - b.map(OffsetRange(_)), - getPreferredHosts, - // during restore, it's possible same partition will be consumed from multiple - // threads, so dont use cache - false - ) +logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", "]")}") +val recoverOffsets = b.map(OffsetRange(_)) +val rdd = new KafkaRDD[K, V]( + context.sparkContext, + executorKafkaParams, + recoverOffsets, + getPreferredHosts, + // during restore, it's possible same partition will be consumed from multiple + // threads, so dont use cache + false +) +// Report the record number and metadata of this batch interval to InputInfoTracker. +val description = recoverOffsets.filter { offsetRange => + // Don't display empty ranges. 
+ offsetRange.fromOffset != offsetRange.untilOffset +}.map { offsetRange => + s"topic: ${offsetRange.topic}\tpartition: ${offsetRange.partition}\t" + +s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}" +}.mkString("\n") +// Copy offsetRanges to immutable.List to prevent from being modified by the user +val metadata = Map( + "offsets" -> recoverOffsets.toList, + StreamInputInfo.METADATA_KEY_DESCRIPTION -> description) +val inputInfo = StreamInputInfo(id, rdd.count, metadata) +ssc.scheduler.inputInfoTracker.reportInfo(t, inputInfo) +generatedRDDs += t -> rdd } } } diff --git a/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala b/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala index c3c799375bbeb..274d1cbbb07e1 100644 --- a/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala +++ b/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala @@ -210,9 +210,26 @@ class DirectKafkaInputDStream[ val leaders = KafkaCluster.checkErrors(kc.findLeaders(topics)) batchForTime.toSeq.sortBy(_._1)(Time.ordering).foreach { case (t, b) => - logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", "]")}") - generatedRDDs += t -> new KafkaRDD[K, V, U, T, R]( - context.sparkContext, kafkaParams, b.map(OffsetRange(_)), leaders, messageHandler) +logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", "]")}") +val recoverOffsets = b.map(OffsetRange(_)) +val rdd = new KafkaRDD[K, V, U, T, R]( + context.sparkContext, kafkaParams, recoverOffsets, leaders, messageHandler) + +val description = recoverOffsets.filter { offsetRange => + // Don't display empty ranges. + offsetRange.fromOffset != offsetRange.untilOffset +}.map { offsetRange => + s"topic: ${offsetRange.topic}\tpartition: ${offsetRange.partition}\t" + +s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}" +}.mkString("\n") + +// Copy offsetRanges to immutable.List to prevent from being modified by the user +val metadata = Map( + "offsets" -> recoverOffsets.toList, + StreamInputInfo.METADATA_KEY_DESCRIPTION -> description) +val inputInfo
[jira] [Commented] (SPARK-25064) Total Tasks in WebUI does not match Active+Failed+Complete Tasks
[ https://issues.apache.org/jira/browse/SPARK-25064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718179#comment-16718179 ] ASF GitHub Bot commented on SPARK-25064: vanzin commented on issue #22051: [SPARK-25064][WEBUI] Add killed tasks count info to WebUI URL: https://github.com/apache/spark/pull/22051#issuecomment-446402102 Hmm, I'm not so sure that "tasks killed on executor x" is a very interesting metric. If this information was altogether missing, sure, but this seems to only touch the per-executor information. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Total Tasks in WebUI does not match Active+Failed+Complete Tasks > > > Key: SPARK-25064 > URL: https://issues.apache.org/jira/browse/SPARK-25064 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.2, 2.3.0, 2.3.1 >Reporter: StanZhai >Priority: Minor > Attachments: 1533128402933_3.png > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718177#comment-16718177 ] ASF GitHub Bot commented on SPARK-26203: AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#issuecomment-446401600 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5989/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Benchmark performance of In and InSet expressions > - > > Key: SPARK-26203 > URL: https://issues.apache.org/jira/browse/SPARK-26203 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Priority: Major > > {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible > values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values > are literals. This was done for performance reasons to avoid O\(n\) time > complexity for {{In}}. > The original optimization was done in SPARK-3711. A lot has changed after > that (e.g., generation of Java code to evaluate expressions), so it is worth > to measure the performance of this optimization again. > According to my local benchmarks, {{InSet}} can be up to 10x time slower than > {{In}} due to autoboxing and other issues. > The scope of this JIRA is to benchmark every supported data type inside > {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this > information, we can come up with solutions. > Based on my preliminary investigation, we can do quite some optimizations, > which quite frequently depend on a specific data type. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718176#comment-16718176 ] ASF GitHub Bot commented on SPARK-26203: AmplabJenkins removed a comment on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#issuecomment-446401593 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Benchmark performance of In and InSet expressions > - > > Key: SPARK-26203 > URL: https://issues.apache.org/jira/browse/SPARK-26203 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Priority: Major > > {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible > values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values > are literals. This was done for performance reasons to avoid O\(n\) time > complexity for {{In}}. > The original optimization was done in SPARK-3711. A lot has changed after > that (e.g., generation of Java code to evaluate expressions), so it is worth > to measure the performance of this optimization again. > According to my local benchmarks, {{InSet}} can be up to 10x time slower than > {{In}} due to autoboxing and other issues. > The scope of this JIRA is to benchmark every supported data type inside > {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this > information, we can come up with solutions. > Based on my preliminary investigation, we can do quite some optimizations, > which quite frequently depend on a specific data type. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718174#comment-16718174 ] ASF GitHub Bot commented on SPARK-26203: AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#issuecomment-446401600 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5989/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Benchmark performance of In and InSet expressions > - > > Key: SPARK-26203 > URL: https://issues.apache.org/jira/browse/SPARK-26203 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Priority: Major > > {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible > values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values > are literals. This was done for performance reasons to avoid O\(n\) time > complexity for {{In}}. > The original optimization was done in SPARK-3711. A lot has changed after > that (e.g., generation of Java code to evaluate expressions), so it is worth > to measure the performance of this optimization again. > According to my local benchmarks, {{InSet}} can be up to 10x time slower than > {{In}} due to autoboxing and other issues. > The scope of this JIRA is to benchmark every supported data type inside > {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this > information, we can come up with solutions. > Based on my preliminary investigation, we can do quite some optimizations, > which quite frequently depend on a specific data type. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718173#comment-16718173 ] ASF GitHub Bot commented on SPARK-26203: AmplabJenkins commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#issuecomment-446401593 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Benchmark performance of In and InSet expressions > - > > Key: SPARK-26203 > URL: https://issues.apache.org/jira/browse/SPARK-26203 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Priority: Major > > {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible > values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values > are literals. This was done for performance reasons to avoid O\(n\) time > complexity for {{In}}. > The original optimization was done in SPARK-3711. A lot has changed after > that (e.g., generation of Java code to evaluate expressions), so it is worth > to measure the performance of this optimization again. > According to my local benchmarks, {{InSet}} can be up to 10x time slower than > {{In}} due to autoboxing and other issues. > The scope of this JIRA is to benchmark every supported data type inside > {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this > information, we can come up with solutions. > Based on my preliminary investigation, we can do quite some optimizations, > which quite frequently depend on a specific data type. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718172#comment-16718172 ] ASF GitHub Bot commented on SPARK-26203: SparkQA commented on issue #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#issuecomment-446401531 **[Test build #3 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/3/testReport)** for PR 23291 at commit [`987bea4`](https://github.com/apache/spark/commit/987bea48350ed2e3862b965e07d5d5335e1d86c2). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Benchmark performance of In and InSet expressions > - > > Key: SPARK-26203 > URL: https://issues.apache.org/jira/browse/SPARK-26203 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Priority: Major > > {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible > values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values > are literals. This was done for performance reasons to avoid O\(n\) time > complexity for {{In}}. > The original optimization was done in SPARK-3711. A lot has changed after > that (e.g., generation of Java code to evaluate expressions), so it is worth > to measure the performance of this optimization again. > According to my local benchmarks, {{InSet}} can be up to 10x time slower than > {{In}} due to autoboxing and other issues. > The scope of this JIRA is to benchmark every supported data type inside > {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this > information, we can come up with solutions. > Based on my preliminary investigation, we can do quite some optimizations, > which quite frequently depend on a specific data type. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22640) Can not switch python exec in executor side
[ https://issues.apache.org/jira/browse/SPARK-22640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-22640: -- Assignee: Marcelo Vanzin > Can not switch python exec in executor side > --- > > Key: SPARK-22640 > URL: https://issues.apache.org/jira/browse/SPARK-22640 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Kent Yao >Assignee: Marcelo Vanzin >Priority: Major > > {code:java} > PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python \ > bin/spark-submit --master yarn --deploy-mode client \ > --archives ~/anaconda3/envs/py3.zip \ > --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python \ > --conf spark.executorEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python \ > /home/hadoop/data/apache-spark/spark-2.1.2-bin-hadoop2.7/examples/src/main/python/mllib/correlations_example.py > {code} > In the case above, I created a python environment, delivered it via > `--arichives`, then visited it on Executor Node via > `spark.executorEnv.PYSPARK_PYTHON`. > But Executor seemed to use `PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python` > instead of `spark.executorEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python` > {code:java} > java.io.IOException: Cannot run program > "/home/hadoop/anaconda3/envs/py3/bin/python": error=2, No such file or > directory > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) > at > org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163) > at > org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89) > at > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65) > at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) > at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: error=2, No such file or directory > at java.lang.UNIXProcess.forkAndExec(Native Method) > at java.lang.UNIXProcess.(UNIXProcess.java:248) > at java.lang.ProcessImpl.start(ProcessImpl.java:134) > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) > ... 
23 more > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
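The report above comes down to a precedence question: which Python interpreter should win on the executor side. The snippet below is only an illustrative sketch of the precedence the reporter expected (spark.executorEnv.PYSPARK_PYTHON over an inherited PYSPARK_PYTHON environment variable); it is not the actual PythonWorkerFactory logic, and the helper name is hypothetical.

{code:scala}
import org.apache.spark.SparkConf

// Illustrative only: the expected lookup order for the executor-side
// interpreter. The real resolution happens inside PySpark's worker startup.
def resolveExecutorPython(conf: SparkConf): String = {
  conf.getOption("spark.executorEnv.PYSPARK_PYTHON")   // explicit executor setting
    .orElse(sys.env.get("PYSPARK_PYTHON"))             // environment variable inherited from the launcher
    .getOrElse("python")                               // last-resort default
}
{code}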
[jira] [Assigned] (SPARK-22640) Can not switch python exec in executor side
[ https://issues.apache.org/jira/browse/SPARK-22640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-22640: -- Assignee: (was: Marcelo Vanzin) > Can not switch python exec in executor side > --- > > Key: SPARK-22640 > URL: https://issues.apache.org/jira/browse/SPARK-22640 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Kent Yao >Priority: Major > > {code:java} > PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python \ > bin/spark-submit --master yarn --deploy-mode client \ > --archives ~/anaconda3/envs/py3.zip \ > --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python \ > --conf spark.executorEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python \ > /home/hadoop/data/apache-spark/spark-2.1.2-bin-hadoop2.7/examples/src/main/python/mllib/correlations_example.py > {code} > In the case above, I created a python environment, delivered it via > `--arichives`, then visited it on Executor Node via > `spark.executorEnv.PYSPARK_PYTHON`. > But Executor seemed to use `PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python` > instead of `spark.executorEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python` > {code:java} > java.io.IOException: Cannot run program > "/home/hadoop/anaconda3/envs/py3/bin/python": error=2, No such file or > directory > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) > at > org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163) > at > org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89) > at > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65) > at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) > at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: error=2, No such file or directory > at java.lang.UNIXProcess.forkAndExec(Native Method) > at java.lang.UNIXProcess.(UNIXProcess.java:248) > at java.lang.ProcessImpl.start(ProcessImpl.java:134) > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) > ... 23 more > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718140#comment-16718140 ] ASF GitHub Bot commented on SPARK-26311: AmplabJenkins commented on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#issuecomment-446394147 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/2/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [YARN] New feature: custom log URL for stdout/stderr > > > Key: SPARK-26311 > URL: https://issues.apache.org/jira/browse/SPARK-26311 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Spark has been setting static log URLs for YARN application, which points to > NodeManager webapp. Normally it would work for both running apps and finished > apps, but there're also other approaches on maintaining application logs, > like having external log service which enables to avoid application log url > to be a deadlink when NodeManager is not accessible. (Node decommissioned, > elastic nodes, etc.) > Spark can provide a new configuration for custom log url on YARN mode, which > end users can set it properly to point application log to external log > service. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
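The feature under review substitutes {{Name}}-style tokens in a configured URL template. The following is a rough, self-contained sketch of that substitution, based only on the patterns visible in the PR's test suite; the method name, object name and example values are assumptions, not the code from PR 23260.

{code:scala}
object LogUrlPatternDemo {
  // Replace every {{Name}} token that has a value; fail only when a token
  // actually present in the pattern has no value available.
  def buildLogUrl(pattern: String, substitutions: Map[String, Option[String]]): String = {
    substitutions.foldLeft(pattern) { case (url, (name, value)) =>
      val token = s"{{$name}}"
      if (!url.contains(token)) {
        url
      } else {
        val v = value.getOrElse(
          throw new IllegalArgumentException(s"No value available for pattern $token"))
        url.replace(token, v)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val pattern = "{{HttpScheme}}{{NodeHttpAddress}}/logs/clusters/{{ClusterId}}" +
      "/containers/{{ContainerId}}/users/{{User}}/files/{{FileName}}"
    val url = buildLogUrl(pattern, Map(
      "HttpScheme" -> Some("https://"),
      "NodeHttpAddress" -> Some("nodeManager:1234"),
      "ClusterId" -> Some("testCluster"),
      "ContainerId" -> Some("container_e01_0001_01_000002"),
      "User" -> Some("spark"),
      "FileName" -> Some("stderr")))
    println(url)
  }
}
{code}

This mirrors the optional-pattern behaviour exercised later in this thread by YarnLogUrlSuite: a missing value is only an error when the corresponding token is used in the template.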
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718142#comment-16718142 ] ASF GitHub Bot commented on SPARK-26311: AmplabJenkins removed a comment on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#issuecomment-446394143 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [YARN] New feature: custom log URL for stdout/stderr > > > Key: SPARK-26311 > URL: https://issues.apache.org/jira/browse/SPARK-26311 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Spark has been setting static log URLs for YARN application, which points to > NodeManager webapp. Normally it would work for both running apps and finished > apps, but there're also other approaches on maintaining application logs, > like having external log service which enables to avoid application log url > to be a deadlink when NodeManager is not accessible. (Node decommissioned, > elastic nodes, etc.) > Spark can provide a new configuration for custom log url on YARN mode, which > end users can set it properly to point application log to external log > service. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718143#comment-16718143 ] ASF GitHub Bot commented on SPARK-26311: AmplabJenkins removed a comment on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#issuecomment-446394147 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/2/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [YARN] New feature: custom log URL for stdout/stderr > > > Key: SPARK-26311 > URL: https://issues.apache.org/jira/browse/SPARK-26311 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Spark has been setting static log URLs for YARN application, which points to > NodeManager webapp. Normally it would work for both running apps and finished > apps, but there're also other approaches on maintaining application logs, > like having external log service which enables to avoid application log url > to be a deadlink when NodeManager is not accessible. (Node decommissioned, > elastic nodes, etc.) > Spark can provide a new configuration for custom log url on YARN mode, which > end users can set it properly to point application log to external log > service. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718139#comment-16718139 ] ASF GitHub Bot commented on SPARK-26311: AmplabJenkins commented on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#issuecomment-446394143 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [YARN] New feature: custom log URL for stdout/stderr > > > Key: SPARK-26311 > URL: https://issues.apache.org/jira/browse/SPARK-26311 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Spark has been setting static log URLs for YARN application, which points to > NodeManager webapp. Normally it would work for both running apps and finished > apps, but there're also other approaches on maintaining application logs, > like having external log service which enables to avoid application log url > to be a deadlink when NodeManager is not accessible. (Node decommissioned, > elastic nodes, etc.) > Spark can provide a new configuration for custom log url on YARN mode, which > end users can set it properly to point application log to external log > service. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718137#comment-16718137 ] ASF GitHub Bot commented on SPARK-26311: SparkQA removed a comment on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#issuecomment-446389268 **[Test build #2 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/2/testReport)** for PR 23260 at commit [`4c865fd`](https://github.com/apache/spark/commit/4c865fdc06b327bc9fb5865d9b5cbd600a892539). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [YARN] New feature: custom log URL for stdout/stderr > > > Key: SPARK-26311 > URL: https://issues.apache.org/jira/browse/SPARK-26311 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Spark has been setting static log URLs for YARN application, which points to > NodeManager webapp. Normally it would work for both running apps and finished > apps, but there're also other approaches on maintaining application logs, > like having external log service which enables to avoid application log url > to be a deadlink when NodeManager is not accessible. (Node decommissioned, > elastic nodes, etc.) > Spark can provide a new configuration for custom log url on YARN mode, which > end users can set it properly to point application log to external log > service. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718136#comment-16718136 ] ASF GitHub Bot commented on SPARK-26311: SparkQA commented on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#issuecomment-446394066 **[Test build #2 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/2/testReport)** for PR 23260 at commit [`4c865fd`](https://github.com/apache/spark/commit/4c865fdc06b327bc9fb5865d9b5cbd600a892539). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [YARN] New feature: custom log URL for stdout/stderr > > > Key: SPARK-26311 > URL: https://issues.apache.org/jira/browse/SPARK-26311 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Spark has been setting static log URLs for YARN application, which points to > NodeManager webapp. Normally it would work for both running apps and finished > apps, but there're also other approaches on maintaining application logs, > like having external log service which enables to avoid application log url > to be a deadlink when NodeManager is not accessible. (Node decommissioned, > elastic nodes, etc.) > Spark can provide a new configuration for custom log url on YARN mode, which > end users can set it properly to point application log to external log > service. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11085) Add support for HTTP proxy
[ https://issues.apache.org/jira/browse/SPARK-11085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-11085. Resolution: Duplicate I believe SPARK-17568 should take care of this, since that allows you to use a custom Ivy config file. > Add support for HTTP proxy > --- > > Key: SPARK-11085 > URL: https://issues.apache.org/jira/browse/SPARK-11085 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, Spark Submit >Reporter: Dustin Cote >Priority: Minor > > Add a way to update ivysettings.xml for the spark-shell and spark-submit to > support proxy settings for clusters that need to access a remote repository > through an HTTP proxy. Typically this would be done like: > JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=proxy.host -Dhttp.proxyPort=8080 > -Dhttps.proxyHost=proxy.host.secure -Dhttps.proxyPort=8080" > Directly in the ivysettings.xml this would look like: > <setproxy proxyhost="proxy.host" proxyport="8080" > nonproxyhosts="nonproxy.host"/> > > Even better would be a way to customize the ivysettings.xml with command > options. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718112#comment-16718112 ] ASF GitHub Bot commented on SPARK-26311: HeartSaVioR commented on a change in pull request #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#discussion_r240817633 ## File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala ## @@ -246,13 +253,57 @@ private[yarn] class ExecutorRunnable( sys.env.get("SPARK_USER").foreach { user => val containerId = ConverterUtils.toString(c.getId) val address = c.getNodeHttpAddress -val baseUrl = s"$httpScheme$address/node/containerlogs/$containerId/$user" -env("SPARK_LOG_URL_STDERR") = s"$baseUrl/stderr?start=-4096" -env("SPARK_LOG_URL_STDOUT") = s"$baseUrl/stdout?start=-4096" +val customLogUrl = sparkConf.get(config.CUSTOM_LOG_URL) + +val envNameToFileNameMap = Map("SPARK_LOG_URL_STDERR" -> "stderr", + "SPARK_LOG_URL_STDOUT" -> "stdout") +val logUrls = ExecutorRunnable.buildLogUrls(customLogUrl, httpScheme, address, + clusterId, containerId, user, envNameToFileNameMap) +logUrls.foreach { case (envName, url) => + env(envName) = url +} } } env } } + +private[yarn] object ExecutorRunnable { + def buildLogUrls( +logUrlPattern: String, +httpScheme: String, +nodeHttpAddress: String, +clusterId: Option[String], +containerId: String, +user: String, +envNameToFileNameMap: Map[String, String]): Map[String, String] = { Review comment: Ah yes I just confused indent rule between method parameters and return... Nice catch. Will address. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [YARN] New feature: custom log URL for stdout/stderr > > > Key: SPARK-26311 > URL: https://issues.apache.org/jira/browse/SPARK-26311 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Spark has been setting static log URLs for YARN application, which points to > NodeManager webapp. Normally it would work for both running apps and finished > apps, but there're also other approaches on maintaining application logs, > like having external log service which enables to avoid application log url > to be a deadlink when NodeManager is not accessible. (Node decommissioned, > elastic nodes, etc.) > Spark can provide a new configuration for custom log url on YARN mode, which > end users can set it properly to point application log to external log > service. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718111#comment-16718111 ] ASF GitHub Bot commented on SPARK-26311: HeartSaVioR commented on a change in pull request #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#discussion_r240817431 ## File path: resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnLogUrlSuite.scala ## @@ -0,0 +1,83 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.deploy.yarn + +import org.apache.spark.SparkFunSuite + +class YarnLogUrlSuite extends SparkFunSuite { + + private val testHttpScheme = "https://; + private val testNodeHttpAddress = "nodeManager:1234" + private val testContainerId = "testContainer" + private val testUser = "testUser" + private val testEnvNameToFileNameMap = Map("TEST_ENV_STDOUT" -> "stdout", +"TEST_ENV_STDERR" -> "stderr") + + test("Custom log URL - leverage all patterns, all values for patterns are available") { +val logUrlPattern = "{{HttpScheme}}{{NodeHttpAddress}}/logs/clusters/{{ClusterId}}" + + "/containers/{{ContainerId}}/users/{{User}}/files/{{FileName}}" + +val clusterId = Some("testCluster") + +val logUrls = ExecutorRunnable.buildLogUrls(logUrlPattern, testHttpScheme, testNodeHttpAddress, + clusterId, testContainerId, testUser, testEnvNameToFileNameMap) + +val expectedLogUrls = testEnvNameToFileNameMap.map { case (envName, fileName) => + envName -> (s"$testHttpScheme$testNodeHttpAddress/logs/clusters/${clusterId.get}" + +s"/containers/$testContainerId/users/$testUser/files/$fileName") +} + +assert(logUrls === expectedLogUrls) + } + + test("Custom log URL - optional pattern is not used in log URL") { +// here {{ClusterId}} is excluded in this pattern +val logUrlPattern = "{{HttpScheme}}{{NodeHttpAddress}}/logs/containers/{{ContainerId}}" + + "/users/{{User}}/files/{{FileName}}" + +// suppose the value of {{ClusterId}} pattern is not available +val clusterId = None + +// This should not throw an exception: the value for optional pattern is not available +// but we also don't use the pattern in log URL. +val logUrls = ExecutorRunnable.buildLogUrls(logUrlPattern, testHttpScheme, testNodeHttpAddress, + clusterId, testContainerId, testUser, testEnvNameToFileNameMap) + +val expectedLogUrls = testEnvNameToFileNameMap.map { case (envName, fileName) => + envName -> (s"$testHttpScheme$testNodeHttpAddress/logs/containers/$testContainerId" + +s"/users/$testUser/files/$fileName") +} + +assert(logUrls === expectedLogUrls) + } + + test("Custom log URL - optional pattern is used in log URL but the value " + +"is not present") { Review comment: Will address. 
This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [YARN] New feature: custom log URL for stdout/stderr > > > Key: SPARK-26311 > URL: https://issues.apache.org/jira/browse/SPARK-26311 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Spark has been setting static log URLs for YARN application, which points to > NodeManager webapp. Normally it would work for both running apps and finished > apps, but there're also other approaches on maintaining application logs, > like having external log service which enables to avoid application log url > to be a deadlink when NodeManager is not accessible. (Node decommissioned, > elastic nodes, etc.) > Spark can provide a new configuration for custom log url on YARN mode, which > end users can set it properly to point application log to external log > service. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718107#comment-16718107 ] ASF GitHub Bot commented on SPARK-26311: HeartSaVioR commented on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#issuecomment-446387143 > My understanding is that this allows pointing the Spark UI directly at the history server (old JHS or new ATS) instead of hardcoding the NM URL and relying on the NM redirecting you, since the NM may not exist later on. Yes, exactly. That's one of issue this patch enables to deal with, and another one would be cluster awareness. The existence of `the clusterId of RM` represents that YARN opens the possibility of maintaining multiple YARN clusters and provides centralized services which operates with multiple YARN clusters. > when perhaps if there was a way to hook this up on the Spark history server side only, that may be more useful. > I think someone tried that in the past but the SHS change was very YARN-specific, which made it kinda sub-optimal. I agree the case is rather not against running applications but finished applications. Currently Spark just sets executor log urls in environment at resource manager side and uses them. The usages are broad, and not sure we can determine which resource manager the application is based on, and whether application is running or finished in all usages. (I'm not familiar with UI side.) So this patch tackles the easiest way to deal with. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [YARN] New feature: custom log URL for stdout/stderr > > > Key: SPARK-26311 > URL: https://issues.apache.org/jira/browse/SPARK-26311 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Spark has been setting static log URLs for YARN application, which points to > NodeManager webapp. Normally it would work for both running apps and finished > apps, but there're also other approaches on maintaining application logs, > like having external log service which enables to avoid application log url > to be a deadlink when NodeManager is not accessible. (Node decommissioned, > elastic nodes, etc.) > Spark can provide a new configuration for custom log url on YARN mode, which > end users can set it properly to point application log to external log > service. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718114#comment-16718114 ] ASF GitHub Bot commented on SPARK-26311: SparkQA commented on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#issuecomment-446389268 **[Test build #2 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/2/testReport)** for PR 23260 at commit [`4c865fd`](https://github.com/apache/spark/commit/4c865fdc06b327bc9fb5865d9b5cbd600a892539). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [YARN] New feature: custom log URL for stdout/stderr > > > Key: SPARK-26311 > URL: https://issues.apache.org/jira/browse/SPARK-26311 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Spark has been setting static log URLs for YARN application, which points to > NodeManager webapp. Normally it would work for both running apps and finished > apps, but there're also other approaches on maintaining application logs, > like having external log service which enables to avoid application log url > to be a deadlink when NodeManager is not accessible. (Node decommissioned, > elastic nodes, etc.) > Spark can provide a new configuration for custom log url on YARN mode, which > end users can set it properly to point application log to external log > service. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718109#comment-16718109 ] ASF GitHub Bot commented on SPARK-26311: HeartSaVioR edited a comment on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#issuecomment-446387143 @squito @vanzin > I assume for your setup, you are not using {{NodeHttpAddress}} and are replacing it with something else centralized? Yes, exactly. > My understanding is that this allows pointing the Spark UI directly at the history server (old JHS or new ATS) instead of hardcoding the NM URL and relying on the NM redirecting you, since the NM may not exist later on. Yes, exactly. That's one of issue this patch enables to deal with, and another one would be cluster awareness. The existence of `the clusterId of RM` represents that YARN opens the possibility of maintaining multiple YARN clusters and provides centralized services which operates with multiple YARN clusters. > when perhaps if there was a way to hook this up on the Spark history server side only, that may be more useful. > I think someone tried that in the past but the SHS change was very YARN-specific, which made it kinda sub-optimal. I agree the case is rather not against running applications but finished applications. Currently Spark just sets executor log urls in environment at resource manager side and uses them. The usages are broad, and not sure we can determine which resource manager the application is based on, and whether application is running or finished in all usages. (I'm not familiar with UI side.) So this patch tackles the easiest way to deal with. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [YARN] New feature: custom log URL for stdout/stderr > > > Key: SPARK-26311 > URL: https://issues.apache.org/jira/browse/SPARK-26311 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Spark has been setting static log URLs for YARN application, which points to > NodeManager webapp. Normally it would work for both running apps and finished > apps, but there're also other approaches on maintaining application logs, > like having external log service which enables to avoid application log url > to be a deadlink when NodeManager is not accessible. (Node decommissioned, > elastic nodes, etc.) > Spark can provide a new configuration for custom log url on YARN mode, which > end users can set it properly to point application log to external log > service. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718108#comment-16718108 ] ASF GitHub Bot commented on SPARK-26311: HeartSaVioR edited a comment on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#issuecomment-446387143 @squito @vanzin > My understanding is that this allows pointing the Spark UI directly at the history server (old JHS or new ATS) instead of hardcoding the NM URL and relying on the NM redirecting you, since the NM may not exist later on. Yes, exactly. That's one of issue this patch enables to deal with, and another one would be cluster awareness. The existence of `the clusterId of RM` represents that YARN opens the possibility of maintaining multiple YARN clusters and provides centralized services which operates with multiple YARN clusters. > when perhaps if there was a way to hook this up on the Spark history server side only, that may be more useful. > I think someone tried that in the past but the SHS change was very YARN-specific, which made it kinda sub-optimal. I agree the case is rather not against running applications but finished applications. Currently Spark just sets executor log urls in environment at resource manager side and uses them. The usages are broad, and not sure we can determine which resource manager the application is based on, and whether application is running or finished in all usages. (I'm not familiar with UI side.) So this patch tackles the easiest way to deal with. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [YARN] New feature: custom log URL for stdout/stderr > > > Key: SPARK-26311 > URL: https://issues.apache.org/jira/browse/SPARK-26311 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Spark has been setting static log URLs for YARN application, which points to > NodeManager webapp. Normally it would work for both running apps and finished > apps, but there're also other approaches on maintaining application logs, > like having external log service which enables to avoid application log url > to be a deadlink when NodeManager is not accessible. (Node decommissioned, > elastic nodes, etc.) > Spark can provide a new configuration for custom log url on YARN mode, which > end users can set it properly to point application log to external log > service. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26322) Simplify kafka delegation token sasl.mechanism configuration
[ https://issues.apache.org/jira/browse/SPARK-26322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718085#comment-16718085 ] ASF GitHub Bot commented on SPARK-26322: SparkQA commented on issue #23274: [SPARK-26322][SS] Add spark.kafka.sasl.token.mechanism to ease delegation token configuration. URL: https://github.com/apache/spark/pull/23274#issuecomment-446381197 **[Test build #1 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/1/testReport)** for PR 23274 at commit [`de35aa2`](https://github.com/apache/spark/commit/de35aa2479400c11a48658501a838bebd83553cd). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Simplify kafka delegation token sasl.mechanism configuration > > > Key: SPARK-26322 > URL: https://issues.apache.org/jira/browse/SPARK-26322 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > When Kafka delegation token obtained, SCRAM sasl.mechanism has to be > configured for authentication. This can be configured on the related > source/sink which is inconvenient from user perspective. Such granularity is > not required and this configuration can be implemented with one central > parameter. > Kafka now supports 2 SCRAM related sasl.mechanism: > - SCRAM-SHA-256 > - SCRAM-SHA-512 > and these are configured on brokers, which makes this configuration global. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
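To make the intent concrete: with a single global mechanism setting, a Structured Streaming job would carry the token mechanism once in its Spark configuration rather than on every Kafka source and sink. The sketch below uses the key from the newer PR title in this thread (spark.kafka.sasl.token.mechanism); the exact key name should be checked against the merged documentation, and the broker and topic are placeholders.

{code:scala}
import org.apache.spark.sql.SparkSession

// Placeholder values throughout; requires the spark-sql-kafka connector on
// the classpath and a secured cluster where Kafka delegation tokens are obtained.
val spark = SparkSession.builder()
  .appName("kafka-delegation-token-demo")
  .config("spark.kafka.sasl.token.mechanism", "SCRAM-SHA-512")  // assumed key name
  .getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()
{code}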
[jira] [Commented] (SPARK-26322) Simplify kafka delegation token sasl.mechanism configuration
[ https://issues.apache.org/jira/browse/SPARK-26322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718083#comment-16718083 ] ASF GitHub Bot commented on SPARK-26322: gaborgsomogyi commented on issue #23274: [SPARK-26322][SS] Add spark.kafka.token.sasl.mechanism to ease delegation token configuration. URL: https://github.com/apache/spark/pull/23274#issuecomment-446380248 retest this, please This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Simplify kafka delegation token sasl.mechanism configuration > > > Key: SPARK-26322 > URL: https://issues.apache.org/jira/browse/SPARK-26322 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > When Kafka delegation token obtained, SCRAM sasl.mechanism has to be > configured for authentication. This can be configured on the related > source/sink which is inconvenient from user perspective. Such granularity is > not required and this configuration can be implemented with one central > parameter. > Kafka now supports 2 SCRAM related sasl.mechanism: > - SCRAM-SHA-256 > - SCRAM-SHA-512 > and these are configured on brokers, which makes this configuration global. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25877) Put all feature-related code in the feature step itself
[ https://issues.apache.org/jira/browse/SPARK-25877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718071#comment-16718071 ] ASF GitHub Bot commented on SPARK-25877: vanzin commented on a change in pull request #23220: [SPARK-25877][k8s] Move all feature logic to feature classes. URL: https://github.com/apache/spark/pull/23220#discussion_r240805992 ## File path: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/PodBuilderSuite.scala ## @@ -0,0 +1,177 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s + +import java.io.File + +import io.fabric8.kubernetes.api.model.{Config => _, _} +import io.fabric8.kubernetes.client.KubernetesClient +import io.fabric8.kubernetes.client.dsl.{MixedOperation, PodResource} +import org.mockito.Matchers.any +import org.mockito.Mockito.{mock, never, verify, when} +import scala.collection.JavaConverters._ + +import org.apache.spark.{SparkConf, SparkException, SparkFunSuite} +import org.apache.spark.deploy.k8s._ +import org.apache.spark.internal.config.ConfigEntry + +abstract class PodBuilderSuite extends SparkFunSuite { + + protected def templateFileConf: ConfigEntry[_] + + protected def buildPod(sparkConf: SparkConf, client: KubernetesClient): SparkPod + + private val baseConf = new SparkConf(false) +.set(Config.CONTAINER_IMAGE, "spark-executor:latest") + + test("use empty initial pod if template is not specified") { +val client = mock(classOf[KubernetesClient]) +buildPod(baseConf.clone(), client) +verify(client, never()).pods() + } + + test("load pod template if specified") { +val client = mockKubernetesClient() +val sparkConf = baseConf.clone().set(templateFileConf.key, "template-file.yaml") +val pod = buildPod(sparkConf, client) +verifyPod(pod) + } + + test("complain about misconfigured pod template") { +val client = mockKubernetesClient( + new PodBuilder() +.withNewMetadata() +.addToLabels("test-label-key", "test-label-value") +.endMetadata() +.build()) +val sparkConf = baseConf.clone().set(templateFileConf.key, "template-file.yaml") +val exception = intercept[SparkException] { + buildPod(sparkConf, client) +} +assert(exception.getMessage.contains("Could not load pod from template file.")) + } + + private def mockKubernetesClient(pod: Pod = podWithSupportedFeatures()): KubernetesClient = { +val kubernetesClient = mock(classOf[KubernetesClient]) +val pods = + mock(classOf[MixedOperation[Pod, PodList, DoneablePod, PodResource[Pod, DoneablePod]]]) +val podResource = mock(classOf[PodResource[Pod, DoneablePod]]) +when(kubernetesClient.pods()).thenReturn(pods) +when(pods.load(any(classOf[File]))).thenReturn(podResource) +when(podResource.get()).thenReturn(pod) +kubernetesClient + } + + private def verifyPod(pod: SparkPod): Unit 
= { +val metadata = pod.pod.getMetadata +assert(metadata.getLabels.containsKey("test-label-key")) +assert(metadata.getAnnotations.containsKey("test-annotation-key")) +assert(metadata.getNamespace === "namespace") +assert(metadata.getOwnerReferences.asScala.exists(_.getName == "owner-reference")) +val spec = pod.pod.getSpec +assert(!spec.getContainers.asScala.exists(_.getName == "executor-container")) +assert(spec.getDnsPolicy === "dns-policy") +assert(spec.getHostAliases.asScala.exists(_.getHostnames.asScala.exists(_ == "hostname"))) Review comment: I'm not modifying this code, just moving it from its previous location. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Put all feature-related code in the feature step itself > --- > > Key: SPARK-25877 > URL:
[jira] [Commented] (SPARK-25877) Put all feature-related code in the feature step itself
[ https://issues.apache.org/jira/browse/SPARK-25877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718070#comment-16718070 ] ASF GitHub Bot commented on SPARK-25877: vanzin commented on a change in pull request #23220: [SPARK-25877][k8s] Move all feature logic to feature classes. URL: https://github.com/apache/spark/pull/23220#discussion_r240805965 ## File path: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/PodBuilderSuite.scala ## @@ -0,0 +1,177 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s + +import java.io.File + +import io.fabric8.kubernetes.api.model.{Config => _, _} +import io.fabric8.kubernetes.client.KubernetesClient +import io.fabric8.kubernetes.client.dsl.{MixedOperation, PodResource} +import org.mockito.Matchers.any +import org.mockito.Mockito.{mock, never, verify, when} +import scala.collection.JavaConverters._ + +import org.apache.spark.{SparkConf, SparkException, SparkFunSuite} +import org.apache.spark.deploy.k8s._ +import org.apache.spark.internal.config.ConfigEntry + +abstract class PodBuilderSuite extends SparkFunSuite { + + protected def templateFileConf: ConfigEntry[_] + + protected def buildPod(sparkConf: SparkConf, client: KubernetesClient): SparkPod + + private val baseConf = new SparkConf(false) +.set(Config.CONTAINER_IMAGE, "spark-executor:latest") + + test("use empty initial pod if template is not specified") { +val client = mock(classOf[KubernetesClient]) +buildPod(baseConf.clone(), client) +verify(client, never()).pods() + } + + test("load pod template if specified") { +val client = mockKubernetesClient() +val sparkConf = baseConf.clone().set(templateFileConf.key, "template-file.yaml") +val pod = buildPod(sparkConf, client) +verifyPod(pod) + } + + test("complain about misconfigured pod template") { +val client = mockKubernetesClient( + new PodBuilder() +.withNewMetadata() +.addToLabels("test-label-key", "test-label-value") +.endMetadata() +.build()) +val sparkConf = baseConf.clone().set(templateFileConf.key, "template-file.yaml") +val exception = intercept[SparkException] { + buildPod(sparkConf, client) +} +assert(exception.getMessage.contains("Could not load pod from template file.")) + } + + private def mockKubernetesClient(pod: Pod = podWithSupportedFeatures()): KubernetesClient = { +val kubernetesClient = mock(classOf[KubernetesClient]) +val pods = + mock(classOf[MixedOperation[Pod, PodList, DoneablePod, PodResource[Pod, DoneablePod]]]) +val podResource = mock(classOf[PodResource[Pod, DoneablePod]]) +when(kubernetesClient.pods()).thenReturn(pods) +when(pods.load(any(classOf[File]))).thenReturn(podResource) +when(podResource.get()).thenReturn(pod) +kubernetesClient + } + + private def verifyPod(pod: SparkPod): Unit 
= { +val metadata = pod.pod.getMetadata +assert(metadata.getLabels.containsKey("test-label-key")) +assert(metadata.getAnnotations.containsKey("test-annotation-key")) +assert(metadata.getNamespace === "namespace") +assert(metadata.getOwnerReferences.asScala.exists(_.getName == "owner-reference")) +val spec = pod.pod.getSpec +assert(!spec.getContainers.asScala.exists(_.getName == "executor-container")) Review comment: I'm not modifying this code, just moving it from its previous location. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Put all feature-related code in the feature step itself > --- > > Key: SPARK-25877 > URL: https://issues.apache.org/jira/browse/SPARK-25877 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >
[jira] [Commented] (SPARK-26239) Add configurable auth secret source in k8s backend
[ https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718068#comment-16718068 ] ASF GitHub Bot commented on SPARK-26239: asfgit closed pull request #23252: [SPARK-26239] File-based secret key loading for SASL. URL: https://github.com/apache/spark/pull/23252 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):
diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala b/core/src/main/scala/org/apache/spark/SecurityManager.scala
index 96e4b53b24181..15783c952c231 100644
--- a/core/src/main/scala/org/apache/spark/SecurityManager.scala
+++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala
@@ -17,8 +17,11 @@
 
 package org.apache.spark
 
+import java.io.File
 import java.net.{Authenticator, PasswordAuthentication}
 import java.nio.charset.StandardCharsets.UTF_8
+import java.nio.file.Files
+import java.util.Base64
 
 import org.apache.hadoop.io.Text
 import org.apache.hadoop.security.{Credentials, UserGroupInformation}
@@ -43,7 +46,8 @@ import org.apache.spark.util.Utils
  */
 private[spark] class SecurityManager(
     sparkConf: SparkConf,
-    val ioEncryptionKey: Option[Array[Byte]] = None)
+    val ioEncryptionKey: Option[Array[Byte]] = None,
+    authSecretFileConf: ConfigEntry[Option[String]] = AUTH_SECRET_FILE)
   extends Logging with SecretKeyHolder {
 
   import SecurityManager._
@@ -328,6 +332,7 @@ private[spark] class SecurityManager(
       .orElse(Option(secretKey))
       .orElse(Option(sparkConf.getenv(ENV_AUTH_SECRET)))
       .orElse(sparkConf.getOption(SPARK_AUTH_SECRET_CONF))
+      .orElse(secretKeyFromFile())
       .getOrElse {
         throw new IllegalArgumentException(
           s"A secret key must be specified via the $SPARK_AUTH_SECRET_CONF config")
@@ -348,7 +353,6 @@ private[spark] class SecurityManager(
    */
   def initializeAuth(): Unit = {
     import SparkMasterRegex._
-    val k8sRegex = "k8s.*".r
 
     if (!sparkConf.get(NETWORK_AUTH_ENABLED)) {
       return
     }
@@ -371,7 +375,14 @@ private[spark] class SecurityManager(
         return
     }
 
-    secretKey = Utils.createSecret(sparkConf)
+    if (sparkConf.get(AUTH_SECRET_FILE_DRIVER).isDefined !=
+        sparkConf.get(AUTH_SECRET_FILE_EXECUTOR).isDefined) {
+      throw new IllegalArgumentException(
+        "Invalid secret configuration: Secret files must be specified for both the driver and the" +
+          " executors, not only one or the other.")
+    }
+
+    secretKey = secretKeyFromFile().getOrElse(Utils.createSecret(sparkConf))
 
     if (storeInUgi) {
       val creds = new Credentials()
@@ -380,6 +391,22 @@ private[spark] class SecurityManager(
     }
   }
 
+  private def secretKeyFromFile(): Option[String] = {
+    sparkConf.get(authSecretFileConf).flatMap { secretFilePath =>
+      sparkConf.getOption(SparkLauncher.SPARK_MASTER).map {
+        case k8sRegex() =>
+          val secretFile = new File(secretFilePath)
+          require(secretFile.isFile, s"No file found containing the secret key at $secretFilePath.")
+          val base64Key = Base64.getEncoder.encodeToString(Files.readAllBytes(secretFile.toPath))
+          require(!base64Key.isEmpty, s"Secret key from file located at $secretFilePath is empty.")
+          base64Key
+        case _ =>
+          throw new IllegalArgumentException(
+            "Secret keys provided via files is only allowed in Kubernetes mode.")
+      }
+    }
+  }
+
   // Default SecurityManager only has a single secret key, so ignore appId.
   override def getSaslUser(appId: String): String = getSaslUser()
 
   override def getSecretKey(appId: String): String = getSecretKey()
@@ -387,6 +414,7 @@ private[spark] class SecurityManager(
 
 private[spark] object SecurityManager {
 
+  val k8sRegex = "k8s.*".r
   val SPARK_AUTH_CONF = NETWORK_AUTH_ENABLED.key
   val SPARK_AUTH_SECRET_CONF = "spark.authenticate.secret"
   // This is used to set auth secret to an executor's env variable. It should have the same
diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala b/core/src/main/scala/org/apache/spark/SparkEnv.scala
index 66038eeaea54f..de0c8579d9acc 100644
--- a/core/src/main/scala/org/apache/spark/SparkEnv.scala
+++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala
@@ -232,8 +232,8 @@ object SparkEnv extends Logging {
     if (isDriver) {
       assert(listenerBus != null, "Attempted to create driver SparkEnv with null listener bus!")
     }
-
-    val securityManager = new SecurityManager(conf, ioEncryptionKey)
+    val authSecretFileConf = if (isDriver) AUTH_SECRET_FILE_DRIVER else AUTH_SECRET_FILE_EXECUTOR
+    val securityManager = new SecurityManager(conf,
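The quoted diff is cut off by this digest, but the core of the change is small: when running against a Kubernetes master, the auth secret can be read from a mounted file instead of being auto-generated. A standalone sketch of that loading logic, simplified from the diff above (the master URL and file path below are placeholders, not values from the PR):
{code:scala}
import java.io.File
import java.nio.file.Files
import java.util.Base64

// Simplified sketch of file-based secret loading: read the mounted secret file,
// base64-encode its bytes, and refuse to use it outside Kubernetes mode.
def secretFromFile(master: String, secretFilePath: String): String = {
  require(master.startsWith("k8s"),
    "Secret keys provided via files is only allowed in Kubernetes mode.")
  val secretFile = new File(secretFilePath)
  require(secretFile.isFile, s"No file found containing the secret key at $secretFilePath.")
  val base64Key = Base64.getEncoder.encodeToString(Files.readAllBytes(secretFile.toPath))
  require(base64Key.nonEmpty, s"Secret key from file located at $secretFilePath is empty.")
  base64Key
}

// Hypothetical usage: the file would typically be a Kubernetes secret mounted into the pod.
// val key = secretFromFile("k8s://https://example:6443", "/var/run/secrets/spark/auth-secret")
{code}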
[jira] [Commented] (SPARK-26322) Simplify kafka delegation token sasl.mechanism configuration
[ https://issues.apache.org/jira/browse/SPARK-26322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718066#comment-16718066 ] ASF GitHub Bot commented on SPARK-26322: AmplabJenkins removed a comment on issue #23274: [SPARK-26322][SS] Add spark.kafka.token.sasl.mechanism to ease delegation token configuration. URL: https://github.com/apache/spark/pull/23274#issuecomment-446376257 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99986/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Simplify kafka delegation token sasl.mechanism configuration > > > Key: SPARK-26322 > URL: https://issues.apache.org/jira/browse/SPARK-26322 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > When Kafka delegation token obtained, SCRAM sasl.mechanism has to be > configured for authentication. This can be configured on the related > source/sink which is inconvenient from user perspective. Such granularity is > not required and this configuration can be implemented with one central > parameter. > Kafka now supports 2 SCRAM related sasl.mechanism: > - SCRAM-SHA-256 > - SCRAM-SHA-512 > and these are configured on brokers, which makes this configuration global. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
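As a rough illustration of the simplification this issue proposes from the user's side: only the spark.kafka.token.sasl.mechanism key comes from the PR title, the per-source option shown in the comment is the usual "kafka."-prefixed passthrough and should be treated as an assumption here:
{code:scala}
import org.apache.spark.sql.SparkSession

// Assumed sketch: one global setting instead of repeating the SCRAM mechanism on every source/sink.
val spark = SparkSession.builder()
  .appName("kafka-delegation-token-example")
  // Proposed central parameter (from the PR title); the value must match the broker-side SCRAM config.
  .config("spark.kafka.token.sasl.mechanism", "SCRAM-SHA-512")
  .getOrCreate()

// Without the central parameter, the mechanism would have to be repeated per source, e.g.
// .option("kafka.sasl.mechanism", "SCRAM-SHA-512") on every Kafka readStream/writeStream.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder address
  .option("subscribe", "events")                    // placeholder topic
  .load()
{code}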
[jira] [Commented] (SPARK-25877) Put all feature-related code in the feature step itself
[ https://issues.apache.org/jira/browse/SPARK-25877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718067#comment-16718067 ] ASF GitHub Bot commented on SPARK-25877: vanzin commented on a change in pull request #23220: [SPARK-25877][k8s] Move all feature logic to feature classes. URL: https://github.com/apache/spark/pull/23220#discussion_r240805744 ## File path: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/PodBuilderSuite.scala ## @@ -0,0 +1,177 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s + +import java.io.File + +import io.fabric8.kubernetes.api.model.{Config => _, _} +import io.fabric8.kubernetes.client.KubernetesClient +import io.fabric8.kubernetes.client.dsl.{MixedOperation, PodResource} +import org.mockito.Matchers.any +import org.mockito.Mockito.{mock, never, verify, when} +import scala.collection.JavaConverters._ + +import org.apache.spark.{SparkConf, SparkException, SparkFunSuite} +import org.apache.spark.deploy.k8s._ +import org.apache.spark.internal.config.ConfigEntry + +abstract class PodBuilderSuite extends SparkFunSuite { + + protected def templateFileConf: ConfigEntry[_] + + protected def buildPod(sparkConf: SparkConf, client: KubernetesClient): SparkPod + + private val baseConf = new SparkConf(false) +.set(Config.CONTAINER_IMAGE, "spark-executor:latest") + + test("use empty initial pod if template is not specified") { +val client = mock(classOf[KubernetesClient]) +buildPod(baseConf.clone(), client) +verify(client, never()).pods() + } + + test("load pod template if specified") { +val client = mockKubernetesClient() +val sparkConf = baseConf.clone().set(templateFileConf.key, "template-file.yaml") +val pod = buildPod(sparkConf, client) +verifyPod(pod) + } + + test("complain about misconfigured pod template") { +val client = mockKubernetesClient( + new PodBuilder() +.withNewMetadata() +.addToLabels("test-label-key", "test-label-value") +.endMetadata() +.build()) +val sparkConf = baseConf.clone().set(templateFileConf.key, "template-file.yaml") +val exception = intercept[SparkException] { + buildPod(sparkConf, client) +} +assert(exception.getMessage.contains("Could not load pod from template file.")) + } + + private def mockKubernetesClient(pod: Pod = podWithSupportedFeatures()): KubernetesClient = { +val kubernetesClient = mock(classOf[KubernetesClient]) +val pods = + mock(classOf[MixedOperation[Pod, PodList, DoneablePod, PodResource[Pod, DoneablePod]]]) +val podResource = mock(classOf[PodResource[Pod, DoneablePod]]) +when(kubernetesClient.pods()).thenReturn(pods) +when(pods.load(any(classOf[File]))).thenReturn(podResource) +when(podResource.get()).thenReturn(pod) +kubernetesClient + } + + private def verifyPod(pod: SparkPod): Unit 
= { Review comment: I actually did not write this test. I copy & pasted it with zero modifications from the previous class, and I'd prefer to keep it that way. That's also an argument for *not* restoring the mocks, which would go against what this change is doing. This test should account for modifications made by other steps, since if they modify something unexpected, that can change the semantics of the feature (pod template support). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Put all feature-related code in the feature step itself > --- > > Key: SPARK-25877 > URL: https://issues.apache.org/jira/browse/SPARK-25877 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > > This is a
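To make the shape of the refactor concrete: PodBuilderSuite above is abstract, and each backend is expected to bind templateFileConf and buildPod to its own builder. A hypothetical driver-side subclass might look like the sketch below; the class, config, and helper names are illustrative, not necessarily the ones in the PR:
{code:scala}
import io.fabric8.kubernetes.client.KubernetesClient

import org.apache.spark.SparkConf
import org.apache.spark.deploy.k8s.{Config, SparkPod}
import org.apache.spark.internal.config.ConfigEntry

// Hypothetical concrete suite: the two abstract hooks are the only backend-specific pieces.
class DriverPodBuilderSuite extends PodBuilderSuite {

  // Which config key points at the pod template file for this backend (assumed entry name).
  override protected def templateFileConf: ConfigEntry[_] =
    Config.KUBERNETES_DRIVER_PODTEMPLATE_FILE

  // How this backend turns a SparkConf plus a Kubernetes client into a SparkPod.
  override protected def buildPod(sparkConf: SparkConf, client: KubernetesClient): SparkPod =
    buildDriverPodSomehow(sparkConf, client) // placeholder for the real driver builder call

  private def buildDriverPodSomehow(conf: SparkConf, client: KubernetesClient): SparkPod = ???
}
{code}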
[jira] [Commented] (SPARK-19827) spark.ml R API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718051#comment-16718051 ] ASF GitHub Bot commented on SPARK-19827: dongjoon-hyun commented on a change in pull request #23072: [SPARK-19827][R]spark.ml R API for PIC URL: https://github.com/apache/spark/pull/23072#discussion_r240803593 ## File path: R/pkg/R/mllib_clustering.R ## @@ -610,3 +616,59 @@ setMethod("write.ml", signature(object = "LDAModel", path = "character"), function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + +#' PowerIterationClustering +#' +#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to +#' return a cluster assignment for each input vertex. +#' Review comment: I'll be more careful about this roxygen2 stuff. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > spark.ml R API for PIC > -- > > Key: SPARK-19827 > URL: https://issues.apache.org/jira/browse/SPARK-19827 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
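The R API in this PR wraps the existing spark.ml implementation. For readers unfamiliar with PIC, a small Scala sketch of the underlying API that spark.assignClusters mirrors; the column names and parameters follow the usual spark.ml conventions and should be read as an assumption here, not as the R signature:
{code:scala}
import org.apache.spark.ml.clustering.PowerIterationClustering
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pic-example").master("local[*]").getOrCreate()
import spark.implicits._

// A tiny similarity graph: (src, dst, weight) edges describing pairwise similarities.
val similarities = Seq(
  (0L, 1L, 1.0),
  (1L, 2L, 1.0),
  (3L, 4L, 1.0)
).toDF("src", "dst", "weight")

val pic = new PowerIterationClustering()
  .setK(2)       // number of clusters
  .setMaxIter(10)
  .setWeightCol("weight")

// assignClusters returns a DataFrame of (id, cluster) assignments, one row per vertex.
pic.assignClusters(similarities).show()
{code}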
[jira] [Commented] (SPARK-26322) Simplify kafka delegation token sasl.mechanism configuration
[ https://issues.apache.org/jira/browse/SPARK-26322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718064#comment-16718064 ] ASF GitHub Bot commented on SPARK-26322: AmplabJenkins removed a comment on issue #23274: [SPARK-26322][SS] Add spark.kafka.token.sasl.mechanism to ease delegation token configuration. URL: https://github.com/apache/spark/pull/23274#issuecomment-446376247 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Simplify kafka delegation token sasl.mechanism configuration > > > Key: SPARK-26322 > URL: https://issues.apache.org/jira/browse/SPARK-26322 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > When Kafka delegation token obtained, SCRAM sasl.mechanism has to be > configured for authentication. This can be configured on the related > source/sink which is inconvenient from user perspective. Such granularity is > not required and this configuration can be implemented with one central > parameter. > Kafka now supports 2 SCRAM related sasl.mechanism: > - SCRAM-SHA-256 > - SCRAM-SHA-512 > and these are configured on brokers, which makes this configuration global. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26322) Simplify kafka delegation token sasl.mechanism configuration
[ https://issues.apache.org/jira/browse/SPARK-26322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718061#comment-16718061 ] ASF GitHub Bot commented on SPARK-26322: AmplabJenkins commented on issue #23274: [SPARK-26322][SS] Add spark.kafka.token.sasl.mechanism to ease delegation token configuration. URL: https://github.com/apache/spark/pull/23274#issuecomment-446376247 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Simplify kafka delegation token sasl.mechanism configuration > > > Key: SPARK-26322 > URL: https://issues.apache.org/jira/browse/SPARK-26322 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > When Kafka delegation token obtained, SCRAM sasl.mechanism has to be > configured for authentication. This can be configured on the related > source/sink which is inconvenient from user perspective. Such granularity is > not required and this configuration can be implemented with one central > parameter. > Kafka now supports 2 SCRAM related sasl.mechanism: > - SCRAM-SHA-256 > - SCRAM-SHA-512 > and these are configured on brokers, which makes this configuration global. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26322) Simplify kafka delegation token sasl.mechanism configuration
[ https://issues.apache.org/jira/browse/SPARK-26322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718059#comment-16718059 ] ASF GitHub Bot commented on SPARK-26322: SparkQA removed a comment on issue #23274: [SPARK-26322][SS] Add spark.kafka.token.sasl.mechanism to ease delegation token configuration. URL: https://github.com/apache/spark/pull/23274#issuecomment-446303207 **[Test build #99986 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99986/testReport)** for PR 23274 at commit [`de35aa2`](https://github.com/apache/spark/commit/de35aa2479400c11a48658501a838bebd83553cd). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Simplify kafka delegation token sasl.mechanism configuration > > > Key: SPARK-26322 > URL: https://issues.apache.org/jira/browse/SPARK-26322 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > When Kafka delegation token obtained, SCRAM sasl.mechanism has to be > configured for authentication. This can be configured on the related > source/sink which is inconvenient from user perspective. Such granularity is > not required and this configuration can be implemented with one central > parameter. > Kafka now supports 2 SCRAM related sasl.mechanism: > - SCRAM-SHA-256 > - SCRAM-SHA-512 > and these are configured on brokers, which makes this configuration global. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26239) Add configurable auth secret source in k8s backend
[ https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718060#comment-16718060 ] ASF GitHub Bot commented on SPARK-26239: mccheah commented on issue #23252: [SPARK-26239] File-based secret key loading for SASL. URL: https://github.com/apache/spark/pull/23252#issuecomment-446376235 Oh I fixed it just now This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add configurable auth secret source in k8s backend > -- > > Key: SPARK-26239 > URL: https://issues.apache.org/jira/browse/SPARK-26239 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Assignee: Matt Cheah >Priority: Major > Fix For: 3.0.0 > > > This is a follow up to SPARK-26194, which aims to add auto-generated secrets > similar to the YARN backend. > There's a desire to support different ways to generate and propagate these > auth secrets (e.g. using things like Vault). Need to investigate: > - exposing configuration to support that > - changing SecurityManager so that it can delegate some of the > secret-handling logic to custom implementations > - figuring out whether this can also be used in client-mode, where the driver > is not created by the k8s backend in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26322) Simplify kafka delegation token sasl.mechanism configuration
[ https://issues.apache.org/jira/browse/SPARK-26322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718062#comment-16718062 ] ASF GitHub Bot commented on SPARK-26322: AmplabJenkins commented on issue #23274: [SPARK-26322][SS] Add spark.kafka.token.sasl.mechanism to ease delegation token configuration. URL: https://github.com/apache/spark/pull/23274#issuecomment-446376257 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99986/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Simplify kafka delegation token sasl.mechanism configuration > > > Key: SPARK-26322 > URL: https://issues.apache.org/jira/browse/SPARK-26322 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > When Kafka delegation token obtained, SCRAM sasl.mechanism has to be > configured for authentication. This can be configured on the related > source/sink which is inconvenient from user perspective. Such granularity is > not required and this configuration can be implemented with one central > parameter. > Kafka now supports 2 SCRAM related sasl.mechanism: > - SCRAM-SHA-256 > - SCRAM-SHA-512 > and these are configured on brokers, which makes this configuration global. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26239) Add configurable auth secret source in k8s backend
[ https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-26239: -- Assignee: Matt Cheah > Add configurable auth secret source in k8s backend > -- > > Key: SPARK-26239 > URL: https://issues.apache.org/jira/browse/SPARK-26239 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Assignee: Matt Cheah >Priority: Major > Fix For: 3.0.0 > > > This is a follow up to SPARK-26194, which aims to add auto-generated secrets > similar to the YARN backend. > There's a desire to support different ways to generate and propagate these > auth secrets (e.g. using things like Vault). Need to investigate: > - exposing configuration to support that > - changing SecurityManager so that it can delegate some of the > secret-handling logic to custom implementations > - figuring out whether this can also be used in client-mode, where the driver > is not created by the k8s backend in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26239) Add configurable auth secret source in k8s backend
[ https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26239. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23252 [https://github.com/apache/spark/pull/23252] > Add configurable auth secret source in k8s backend > -- > > Key: SPARK-26239 > URL: https://issues.apache.org/jira/browse/SPARK-26239 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Assignee: Matt Cheah >Priority: Major > Fix For: 3.0.0 > > > This is a follow up to SPARK-26194, which aims to add auto-generated secrets > similar to the YARN backend. > There's a desire to support different ways to generate and propagate these > auth secrets (e.g. using things like Vault). Need to investigate: > - exposing configuration to support that > - changing SecurityManager so that it can delegate some of the > secret-handling logic to custom implementations > - figuring out whether this can also be used in client-mode, where the driver > is not created by the k8s backend in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
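For reference, the merged change is exercised by pointing the driver and the executors at a mounted secret file. The config key names below are assumptions inferred from the AUTH_SECRET_FILE_DRIVER and AUTH_SECRET_FILE_EXECUTOR entries in the merged diff, so check the released documentation before relying on them:
{code:scala}
import org.apache.spark.SparkConf

// Assumed sketch: both the driver and the executor file configs must be set together
// (the merged code throws if only one of them is defined), and only in k8s mode.
val conf = new SparkConf()
  .set("spark.master", "k8s://https://kubernetes.example:6443") // placeholder API server
  .set("spark.authenticate", "true")
  // Assumed key names; the file would normally be a Kubernetes secret mounted into each pod.
  .set("spark.authenticate.secret.driver.file", "/var/run/secrets/spark/auth-secret")
  .set("spark.authenticate.secret.executor.file", "/var/run/secrets/spark/auth-secret")
{code}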
[jira] [Commented] (SPARK-26322) Simplify kafka delegation token sasl.mechanism configuration
[ https://issues.apache.org/jira/browse/SPARK-26322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718058#comment-16718058 ] ASF GitHub Bot commented on SPARK-26322: SparkQA commented on issue #23274: [SPARK-26322][SS] Add spark.kafka.token.sasl.mechanism to ease delegation token configuration. URL: https://github.com/apache/spark/pull/23274#issuecomment-446375942 **[Test build #99986 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99986/testReport)** for PR 23274 at commit [`de35aa2`](https://github.com/apache/spark/commit/de35aa2479400c11a48658501a838bebd83553cd). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Simplify kafka delegation token sasl.mechanism configuration > > > Key: SPARK-26322 > URL: https://issues.apache.org/jira/browse/SPARK-26322 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > When Kafka delegation token obtained, SCRAM sasl.mechanism has to be > configured for authentication. This can be configured on the related > source/sink which is inconvenient from user perspective. Such granularity is > not required and this configuration can be implemented with one central > parameter. > Kafka now supports 2 SCRAM related sasl.mechanism: > - SCRAM-SHA-256 > - SCRAM-SHA-512 > and these are configured on brokers, which makes this configuration global. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26203) Benchmark performance of In and InSet expressions
[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718056#comment-16718056 ] ASF GitHub Bot commented on SPARK-26203: dongjoon-hyun commented on a change in pull request #23291: [SPARK-26203][SQL] Benchmark performance of In and InSet expressions URL: https://github.com/apache/spark/pull/23291#discussion_r240804589 ## File path: sql/core/src/test/scala/org/apache/spark/sql/InSetBenchmark.scala ## @@ -0,0 +1,213 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.benchmark.Benchmark +import org.apache.spark.sql.catalyst.expressions.In +import org.apache.spark.sql.catalyst.expressions.InSet +import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark +import org.apache.spark.sql.functions.{array, struct} +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types._ + +/** + * A benchmark that compares the performance of [[In]] and [[InSet]] expressions. + * + * To run this benchmark: + * {{{ + * 1. without sbt: bin/spark-submit --class + * 2. build/sbt "sql/test:runMain " + * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain " + * Results will be written to "benchmarks/InSetBenchmark-results.txt". 
+ * }}} + */ +object InSetBenchmark extends SqlBasedBenchmark { + + import spark.implicits._ + + def byteBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = { +val name = s"$numItems bytes" +val values = (1 to numItems).map(v => s"CAST($v AS tinyint)") +val df = spark.range(1, numRows).select($"id".cast(ByteType)) +benchmark(name, df, values, numRows, minNumIters) + } + + def shortBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = { +val name = s"$numItems shorts" +val values = (1 to numItems).map(v => s"CAST($v AS smallint)") +val df = spark.range(1, numRows).select($"id".cast(ShortType)) +benchmark(name, df, values, numRows, minNumIters) + } + + def intBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = { +val name = s"$numItems ints" +val values = 1 to numItems +val df = spark.range(1, numRows).select($"id".cast(IntegerType)) +benchmark(name, df, values, numRows, minNumIters) + } + + def longBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = { +val name = s"$numItems longs" +val values = (1 to numItems).map(v => s"${v}L") +val df = spark.range(1, numRows).toDF("id") +benchmark(name, df, values, numRows, minNumIters) + } + + def floatBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = { +val name = s"$numItems floats" +val values = (1 to numItems).map(v => s"CAST($v AS float)") +val df = spark.range(1, numRows).select($"id".cast(FloatType)) +benchmark(name, df, values, numRows, minNumIters) + } + + def doubleBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = { +val name = s"$numItems doubles" +val values = 1.0 to numItems by 1.0 +val df = spark.range(1, numRows).select($"id".cast(DoubleType)) +benchmark(name, df, values, numRows, minNumIters) + } + + def smallDecimalBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = { +val name = s"$numItems small decimals" +val values = (1 to numItems).map(v => s"CAST($v AS decimal(12, 1))") +val df = spark.range(1, numRows).select($"id".cast(DecimalType(12, 1))) +benchmark(name, df, values, numRows, minNumIters) + } + + def largeDecimalBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = { +val name = s"$numItems large decimals" +val values = (1 to numItems).map(v => s"9223372036854775812.10539$v") +val df = spark.range(1, numRows).select($"id".cast(DecimalType(30, 7))) +benchmark(name, df, values, numRows, minNumIters) + } + + def stringBenchmark(numItems: Int, numRows: Long, minNumIters: Int): Benchmark = { +val name = s"$numItems strings" +val values = (1 to numItems).map(n => s"'$n'") +val df = spark.range(1,
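The benchmark listing above is cut off in this digest, but the behavior it measures is easy to demonstrate: a values list in a predicate is planned as an In expression, and the optimizer switches to a hash-based InSet once the list grows past a threshold. A hedged sketch follows; the threshold key spark.sql.optimizer.inSetConversionThreshold is the config commonly used for this conversion, but treat it as an assumption here:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("in-vs-inset").master("local[*]").getOrCreate()
import spark.implicits._

val df = spark.range(0, 1000000).toDF("id")

// Small list: stays an In expression (a chain of equality comparisons).
df.filter($"id".isin(1L to 5L: _*)).count()

// Lower the conversion threshold (assumed config key) and use a larger list,
// which is where the set-based InSet lookup is expected to pay off.
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "5")
df.filter($"id".isin(1L to 200L: _*)).count()

// explain() shows whether the predicate was planned as In or InSet.
df.filter($"id".isin(1L to 200L: _*)).explain()
{code}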
[jira] [Commented] (SPARK-26239) Add configurable auth secret source in k8s backend
[ https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718053#comment-16718053 ] ASF GitHub Bot commented on SPARK-26239: vanzin commented on a change in pull request #23252: [SPARK-26239] File-based secret key loading for SASL. URL: https://github.com/apache/spark/pull/23252#discussion_r240802787 ## File path: core/src/main/scala/org/apache/spark/SecurityManager.scala ## @@ -367,11 +371,18 @@ private[spark] class SecurityManager( case _ => require(sparkConf.contains(SPARK_AUTH_SECRET_CONF), - s"A secret key must be specified via the $SPARK_AUTH_SECRET_CONF config.") + s"A secret key must be specified via the $SPARK_AUTH_SECRET_CONF config") Review comment: Undo this change. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add configurable auth secret source in k8s backend > -- > > Key: SPARK-26239 > URL: https://issues.apache.org/jira/browse/SPARK-26239 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Priority: Major > > This is a follow up to SPARK-26194, which aims to add auto-generated secrets > similar to the YARN backend. > There's a desire to support different ways to generate and propagate these > auth secrets (e.g. using things like Vault). Need to investigate: > - exposing configuration to support that > - changing SecurityManager so that it can delegate some of the > secret-handling logic to custom implementations > - figuring out whether this can also be used in client-mode, where the driver > is not created by the k8s backend in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org