[jira] [Created] (SPARK-5408) MaxPermGen is ignored by ExecutorRunner and DriverRunner
Jacek Lewandowski created SPARK-5408:
-------------------------------------

             Summary: MaxPermGen is ignored by ExecutorRunner and DriverRunner
                 Key: SPARK-5408
                 URL: https://issues.apache.org/jira/browse/SPARK-5408
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.2.0
            Reporter: Jacek Lewandowski

ExecutorRunner and DriverRunner use CommandUtils to build the command which runs the executor or driver. The problem is that it hardcodes {{-XX:MaxPermGen=128m}} and uses it regardless of whether it is specified in extraJavaOpts.
[jira] [Updated] (SPARK-5408) MaxPermSize is ignored by ExecutorRunner and DriverRunner
[ https://issues.apache.org/jira/browse/SPARK-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacek Lewandowski updated SPARK-5408:
-------------------------------------
    Summary: MaxPermSize is ignored by ExecutorRunner and DriverRunner  (was: MaxPermGen is ignored by ExecutorRunner and DriverRunner)

> MaxPermSize is ignored by ExecutorRunner and DriverRunner
> ----------------------------------------------------------
>
>                 Key: SPARK-5408
>                 URL: https://issues.apache.org/jira/browse/SPARK-5408
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Jacek Lewandowski
>
> ExecutorRunner and DriverRunner use CommandUtils to build the command which
> runs the executor or driver. The problem is that it hardcodes
> {{-XX:MaxPermGen=128m}} and uses it regardless of whether it is specified in
> extraJavaOpts.
[jira] [Updated] (SPARK-5408) MaxPermSize is ignored by ExecutorRunner and DriverRunner
[ https://issues.apache.org/jira/browse/SPARK-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacek Lewandowski updated SPARK-5408:
-------------------------------------
    Description: ExecutorRunner and DriverRunner use CommandUtils to build the command which runs the executor or driver. The problem is that it hardcodes {{-XX:MaxPermSize=128m}} and uses it regardless of whether it is specified in extraJavaOpts.

  was: ExecutorRunner and DriverRunner use CommandUtils to build the command which runs the executor or driver. The problem is that it hardcodes {{-XX:MaxPermGen=128m}} and uses it regardless of whether it is specified in extraJavaOpts.
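A minimal sketch of the guard the reporter is asking for: only fall back to the default {{-XX:MaxPermSize=128m}} when extraJavaOpts does not already set it. The helper name and signature below are illustrative, not Spark's actual CommandUtils API:

{code}
// Minimal sketch, not Spark's actual CommandUtils: fall back to the default
// MaxPermSize only when the user has not configured one in extraJavaOpts.
object PermGenDefaults {
  val DefaultMaxPermSize = "-XX:MaxPermSize=128m"

  def withPermSizeDefault(extraJavaOpts: Seq[String]): Seq[String] =
    if (extraJavaOpts.exists(_.startsWith("-XX:MaxPermSize=")))
      extraJavaOpts // the user's setting wins
    else
      extraJavaOpts :+ DefaultMaxPermSize // fall back to the 128m default

  def main(args: Array[String]): Unit = {
    println(withPermSizeDefault(Seq("-Xmx2g", "-XX:MaxPermSize=256m")))
    // List(-Xmx2g, -XX:MaxPermSize=256m)
    println(withPermSizeDefault(Seq("-Xmx2g")))
    // List(-Xmx2g, -XX:MaxPermSize=128m)
  }
}
{code}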
[jira] [Commented] (SPARK-5408) MaxPermSize is ignored by ExecutorRunner and DriverRunner
[ https://issues.apache.org/jira/browse/SPARK-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291581#comment-14291581 ]

Apache Spark commented on SPARK-5408:
--------------------------------------

User 'jacek-lewandowski' has created a pull request for this issue:
https://github.com/apache/spark/pull/4202
[jira] [Updated] (SPARK-5408) MaxPermSize is ignored by ExecutorRunner and DriverRunner
[ https://issues.apache.org/jira/browse/SPARK-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacek Lewandowski updated SPARK-5408:
-------------------------------------
    Fix Version/s: 1.2.1
                   1.3.0
[jira] [Commented] (SPARK-5408) MaxPermSize is ignored by ExecutorRunner and DriverRunner
[ https://issues.apache.org/jira/browse/SPARK-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291583#comment-14291583 ]

Apache Spark commented on SPARK-5408:
--------------------------------------

User 'jacek-lewandowski' has created a pull request for this issue:
https://github.com/apache/spark/pull/4203
[jira] [Commented] (SPARK-5408) MaxPermSize is ignored by ExecutorRunner and DriverRunner
[ https://issues.apache.org/jira/browse/SPARK-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291585#comment-14291585 ]

Jacek Lewandowski commented on SPARK-5408:
-------------------------------------------

Can anybody take a look?
[jira] [Updated] (SPARK-5406) LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound
[ https://issues.apache.org/jira/browse/SPARK-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yuhao yang updated SPARK-5406:
------------------------------
    Description: 
In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. Yet Breeze's svd for dense matrices has a latent constraint. In its implementation (https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala):

val workSize = ( 3
  * scala.math.min(m, n)
  * scala.math.min(m, n)
  + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
  * scala.math.min(m, n) + 4 * scala.math.min(m, n))
)
val work = new Array[Double](workSize)

As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue, thus n < 17515. This jira is only the first step. If possible, I hope Spark can handle matrix computation up to 80K * 80K.

  was:
In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. Yet Breeze's svd for dense matrices has a latent constraint. In its implementation:

val workSize = ( 3
  * scala.math.min(m, n)
  * scala.math.min(m, n)
  + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
  * scala.math.min(m, n) + 4 * scala.math.min(m, n))
)
val work = new Array[Double](workSize)

As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue, thus n < 17515. This jira is only the first step. If possible, I hope Spark can handle matrix computation up to 80K * 80K.

> LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-5406
>                 URL: https://issues.apache.org/jira/browse/SPARK-5406
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.2.0
>         Environment: centos, others should be similar
>            Reporter: yuhao yang
>            Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. Yet Breeze's svd for dense matrices has a latent constraint. In its implementation (https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala):
>
> val workSize = ( 3
>   * scala.math.min(m, n)
>   * scala.math.min(m, n)
>   + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
>   * scala.math.min(m, n) + 4 * scala.math.min(m, n))
> )
> val work = new Array[Double](workSize)
>
> As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue, thus n < 17515. This jira is only the first step. If possible, I hope Spark can handle matrix computation up to 80K * 80K.
[jira] [Updated] (SPARK-5406) LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound
[ https://issues.apache.org/jira/browse/SPARK-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yuhao yang updated SPARK-5406:
------------------------------
    Description: 
In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. Yet Breeze's svd for dense matrices has a latent constraint. In its implementation ( https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala ):

val workSize = ( 3
  * scala.math.min(m, n)
  * scala.math.min(m, n)
  + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
  * scala.math.min(m, n) + 4 * scala.math.min(m, n))
)
val work = new Array[Double](workSize)

As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue, thus n < 17515. This jira is only the first step. If possible, I hope Spark can handle matrix computation up to 80K * 80K.

  was:
In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. Yet Breeze's svd for dense matrices has a latent constraint. In its implementation (https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala):

val workSize = ( 3
  * scala.math.min(m, n)
  * scala.math.min(m, n)
  + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
  * scala.math.min(m, n) + 4 * scala.math.min(m, n))
)
val work = new Array[Double](workSize)

As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue, thus n < 17515. This jira is only the first step. If possible, I hope Spark can handle matrix computation up to 80K * 80K.
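The n < 17515 figure follows directly from the snippet above: with m >= n we have min(m, n) = n, and for a roughly square matrix the max(...) branch is 4n^2 + 4n, so workSize = 3n^2 + 4n^2 + 4n = 7n^2 + 4n, which must stay below Int.MaxValue because the work array is indexed by Int. A quick self-contained check of that arithmetic (this is not Spark or Breeze code):

{code}
// Sketch reproducing the reporter's arithmetic, assuming m >= n and
// m <= 4n^2 + 4n so that workSize = 3n^2 + (4n^2 + 4n) = 7n^2 + 4n.
object SvdWorkSizeBound {
  def workSize(n: Long): Long = 7L * n * n + 4L * n

  def main(args: Array[String]): Unit = {
    val maxN = (1 to 20000).takeWhile(n => workSize(n) < Int.MaxValue).last
    println(maxN) // 17514 -- so n < 17515, matching the ticket
  }
}
{code}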
[jira] [Commented] (SPARK-5300) Spark loads file partitions in inconsistent order on native filesystems
[ https://issues.apache.org/jira/browse/SPARK-5300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291605#comment-14291605 ]

Apache Spark commented on SPARK-5300:
--------------------------------------

User 'ehiggs' has created a pull request for this issue:
https://github.com/apache/spark/pull/4204

> Spark loads file partitions in inconsistent order on native filesystems
> -------------------------------------------------------------------------
>
>                 Key: SPARK-5300
>                 URL: https://issues.apache.org/jira/browse/SPARK-5300
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 1.1.0, 1.2.0
>         Environment: Linux, EXT4, for example.
>            Reporter: Ewan Higgs
>
> Discussed on user list in April 2014:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html
> And on dev list January 2015:
> http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html
> When using a file system which isn't HDFS, file partitions ('part-0',
> 'part-1', etc.) are not guaranteed to load in the same order. This means
> previously sorted RDDs will be loaded out of order.
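To make the failure mode concrete: directory listings carry no ordering guarantee on ext4 and similar filesystems, so any reader relying on listing order must sort explicitly. A hedged sketch of that idea, assuming the usual zero-padded part-file names so lexicographic order matches numeric order (this is an illustrative helper, not the code in the pull request):

{code}
import java.io.File

// Illustrative sketch: impose a deterministic order on "part-" files
// explicitly before reading, since File.listFiles() order is unspecified.
object SortedPartitionFiles {
  def partitionFiles(dir: String): Seq[File] = {
    val all = Option(new File(dir).listFiles()).getOrElse(Array.empty[File])
    all.filter(_.getName.startsWith("part-")).sortBy(_.getName).toSeq
  }

  def main(args: Array[String]): Unit =
    // Hypothetical output directory of a previously saved, sorted RDD.
    partitionFiles("/tmp/my-rdd-output").foreach(f => println(f.getName))
}
{code}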
[jira] [Commented] (SPARK-3789) Python bindings for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291710#comment-14291710 ]

Kushal Datta commented on SPARK-3789:
--------------------------------------

Hi Ameet,

I have created the first pull request: https://github.com/apache/spark/pull/4205

This one does not contain the Pregel API for graphs, for which I will create a pull request within this week.

Thanks,
-Kushal.

> Python bindings for GraphX
> ---------------------------
>
>                 Key: SPARK-3789
>                 URL: https://issues.apache.org/jira/browse/SPARK-3789
>             Project: Spark
>          Issue Type: New Feature
>          Components: GraphX, PySpark
>            Reporter: Ameet Talwalkar
>            Assignee: Kushal Datta
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291724#comment-14291724 ]

Murat Eken commented on SPARK-2389:
------------------------------------

+1. We're using a Spark cluster as a real-time query engine, and unfortunately we're running into the same issues as Robert mentions. Although Spark provides a plethora of solutions when it comes to making its cluster fault-tolerant and resilient, we need the same resilience for the front layer, from where the Spark cluster is accessed; meaning multiple instances of Spark clients, hence multiple SparkContexts from those clients connecting to the same cluster with the same computing power.

Performance is crucial for us, hence our choice of caching the data in memory and utilizing the full hardware resources in the executors. Alternative solutions, such as using Tachyon for the data or restarting executors for each query, just don't give the same performance. We're looking into using https://github.com/spark-jobserver/spark-jobserver but that's not a proper solution, as we would still have the jobserver as a single point of failure in our setup, which is a problem for us.

I'd appreciate it if a Spark developer could give some information about the feasibility of this change request; if this turns out to be difficult or even impossible due to the choices made in the architecture, it would be good to know that so that we can consider our alternatives.

> globally shared SparkContext / shared Spark "application"
> ----------------------------------------------------------
>
>                 Key: SPARK-2389
>                 URL: https://issues.apache.org/jira/browse/SPARK-2389
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Robert Stupp
>
> The documentation (in Cluster Mode Overview) cites:
> bq. Each application gets its own executor processes, which *stay up for the duration of the whole application* and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that *data cannot be shared* across different Spark applications (instances of SparkContext) without writing it to an external storage system.
> IMO this is a limitation that should be lifted to support any number of --driver-- client processes sharing executors and (persistent / cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web app servers) that want to use Spark as a _big computing machine_. Most important is the fact that Spark is quite good at caching/persisting data in memory / on disk, thus removing load from backend data stores.
> Means: it would be really great to let different --driver-- client JVMs operate on the same RDDs and benefit from Spark's caching/persistence.
> It would however introduce some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of cpus, etc) on the fly
> * stop a shared context
> Even "conventional" batch MR applications would benefit if run frequently against the same data set.
> As an implicit requirement, RDD persistence could get a TTL for its materialized state.
> With such a feature the overall performance of today's web applications could then be increased by adding more web app servers, more spark nodes, more nosql nodes etc
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291737#comment-14291737 ]

Sean Owen commented on SPARK-2389:
-----------------------------------

Why can't N front-ends talk to a process built around one long-running Spark app? I think that's what the OP is talking about, and it can be done right now. One Spark app having many contexts doesn't quite make sense, as an app is a SparkContext. But [~meken], are you really talking about HA for the driver?
[jira] [Commented] (SPARK-5399) tree Losses strings should match loss names
[ https://issues.apache.org/jira/browse/SPARK-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291741#comment-14291741 ]

Apache Spark commented on SPARK-5399:
--------------------------------------

User 'Lewuathe' has created a pull request for this issue:
https://github.com/apache/spark/pull/4206

> tree Losses strings should match loss names
> ---------------------------------------------
>
>                 Key: SPARK-5399
>                 URL: https://issues.apache.org/jira/browse/SPARK-5399
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0, 1.2.1
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> tree.loss.Losses.fromString expects certain String names for losses. These do not match the names of the loss classes but should. I believe these strings were the original names of the losses, and we forgot to correct the strings when we renamed the losses.
> Currently:
> {code}
> case "leastSquaresError" => SquaredError
> case "leastAbsoluteError" => AbsoluteError
> case "logLoss" => LogLoss
> {code}
> Proposed:
> {code}
> case "SquaredError" => SquaredError
> case "AbsoluteError" => AbsoluteError
> case "LogLoss" => LogLoss
> {code}
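A minimal sketch of the mapping the ticket proposes. The sealed trait and case objects below merely stand in for MLlib's real loss singletons, and accepting the old strings as aliases is an assumption on my part, not something the ticket requires:

{code}
sealed trait Loss
case object SquaredError extends Loss
case object AbsoluteError extends Loss
case object LogLoss extends Loss

object Losses {
  // Match the class names, as proposed; keep the legacy strings as aliases.
  def fromString(name: String): Loss = name match {
    case "SquaredError" | "leastSquaresError"   => SquaredError
    case "AbsoluteError" | "leastAbsoluteError" => AbsoluteError
    case "LogLoss" | "logLoss"                  => LogLoss
    case other => throw new IllegalArgumentException(s"Unknown loss: $other")
  }
}
{code}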
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291755#comment-14291755 ]

Murat Eken commented on SPARK-2389:
------------------------------------

Yes [~sowen], it's about HA for the driver. Our approach is to have a single app that's responsible for initializing the cache at start-up (quite expensive) and then serving queries on that cached data (very fast). When you mention "N front-ends talking to a process built around one long-running Spark app, which can be done right now", are you referring to something like the spark-jobserver (or an alternative) I mentioned? If yes, the problem with that is the single point of failure, as we're just moving it from the driver to the jobserver instance. Or is there something else we've missed?
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291764#comment-14291764 ]

Robert Stupp commented on SPARK-2389:
--------------------------------------

[~srowen] that *one long-running* Spark app is the problem. It's a SPOF and it does not scale. It would be great to have *some* Spark apps share the same data set (thus reducing load on backend data stores and benefiting from RDD caching). The "real" clients could then talk (via REST or whatever the "real" application does) to these Spark apps.

I don't have anything in mind regarding "shared sessions" - I'd "just" like to have multiple Spark clients access the same RDDs. As [~meken] points out, you have to preload the cache first (which is expensive) before you can use it. (I don't mind having that cached data considered immutable.)
[jira] [Closed] (SPARK-5407) No 1.2 AMI available for ec2
[ https://issues.apache.org/jira/browse/SPARK-5407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Håkan Jonsson closed SPARK-5407.
--------------------------------
    Resolution: Invalid

Error on my side.

> No 1.2 AMI available for ec2
> -----------------------------
>
>                 Key: SPARK-5407
>                 URL: https://issues.apache.org/jira/browse/SPARK-5407
>             Project: Spark
>          Issue Type: Bug
>          Components: EC2
>    Affects Versions: 1.2.0
>            Reporter: Håkan Jonsson
>
> When I try to launch a standalone cluster on EC2 using the scripts in the ec2 directory for Spark 1.2 (./spark-ec2 -k spark -i k.pem launch my12), I get the following error:
> Could not resolve AMI at: https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm
> It seems there is not yet any AMI available on EC2 for Spark 1.2. This works well for Spark 1.1.
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291788#comment-14291788 ]

Sean Owen commented on SPARK-2389:
-----------------------------------

Yes, the SPOF problem makes sense. It doesn't seem to be what this JIRA was about, though; that is what the jobserver-style approach addresses. That aside, why doesn't it scale? Because of work that needs to be done on the driver? You can of course still run a bunch of drivers, just not one per client. The cache-preloading issue is what off-heap caching in Tachyon or HDFS is supposed to ameliorate.
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291795#comment-14291795 ]

Murat Eken commented on SPARK-2389:
------------------------------------

[~sowen], I think Robert is talking about fault tolerance when he mentions scalability. Anyway, as I mentioned in my original comment, Tachyon is not an option, at least for us, due to interprocess serialization/deserialization costs. Although we haven't tried HDFS, I would be surprised if it performed differently.
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291797#comment-14291797 ]

Robert Stupp commented on SPARK-2389:
--------------------------------------

bq. That aside, why doesn't it scale?

Simply because it's just a single Spark client. If that machine is at its limit for whatever reason (VM memory, OS resources, CPU, network, ...), that's it. Sure, you can run multiple drivers - but each has its own, private set of data.

IMO separate preloading is nice for some applications. But data is usually not immutable. For example:
* Imagine an application that provides offers for flights worldwide. It's a huge amount of data and a huge amount of processing. It cannot simply be preloaded - prices for tickets vary from minute to minute based on booking status etc.
* The overall data set is quite big.
* The overall load is too big for a single driver to handle - imagine thousands of offer requests per second.
* Failure of a single driver is an absolute no-go.
* All clients have to access the same set of data.
* Preloading is just impossible during runtime (only at initial deployment).

So a suitable approach would be to have:
* a Spark cluster holding all the RDDs and doing all offer- and booking-related operations
* a set of Spark clients to "abstract" Spark from the rest of the application
* a huge number of non-uniform frontend clients (could be web app servers, rich clients, SOAP / REST frontends)
* everything (except the data) stateless
[jira] [Commented] (SPARK-665) Create RPM packages for Spark
[ https://issues.apache.org/jira/browse/SPARK-665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291799#comment-14291799 ]

Christian Tzolov commented on SPARK-665:
-----------------------------------------

I've looked at the JRPM maven plugin, but unlike the jdeb one, JRPM depends on native rpm libraries (i.e. it is platform dependent).

My somewhat pragmatic approach to building a Spark RPM is to reuse the existing deb package and convert it into an rpm using the 'alien' tool. I've wrapped the spark-rpm build pipeline into a parameterizable docker container: https://registry.hub.docker.com/u/tzolov/apache-spark-build-pipeline

> Create RPM packages for Spark
> ------------------------------
>
>                 Key: SPARK-665
>                 URL: https://issues.apache.org/jira/browse/SPARK-665
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Matei Zaharia
>
> This could be doable with the JRPM Maven plugin, similar to how we make Debian packages now, but I haven't looked into it. The plugin is described at http://jrpm.sourceforge.net.
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291798#comment-14291798 ]

Robert Stupp commented on SPARK-2389:
--------------------------------------

bq. fault tolerance when he mentions scalability

both play well together in a stateless application ;)
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291804#comment-14291804 ]

Sean Owen commented on SPARK-2389:
-----------------------------------

Yes, makes sense. Maxing out one driver isn't an issue since you can have many drivers (or push work into the cluster). The issue is really that each driver then has its own RDDs, and if you need 100s of drivers to keep up, that just won't work. (Although then I'd question how so much work is being done on the Spark driver?) The redundancy of all those RDDs is what HDFS caching and Tachyon could in theory help with, although those help share outside Spark. Whether that works for a particular use case right now is a different question, although I suspect it makes more sense to make those work than to start yet another solution.

What you are describing -- mutating lots of shared in-memory state -- doesn't sound like a problem Spark helps solve per se. That is, it doesn't sound like work that has to live in a Spark driver program, even if it needs to ask a Spark driver-based service for some results. Naturally you know your problem better than I do, but I am wondering if the answer here isn't just using Spark differently, for what it's for.
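For readers unfamiliar with the off-heap option Sean mentions: in Spark 1.2 it is exposed as a storage level, so the cached blocks live in a Tachyon deployment rather than inside any one driver's executor JVMs. A minimal sketch, assuming a running Tachyon cluster configured via spark.tachyonStore.url (the HDFS path below is illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Sketch only: persist an RDD off-heap (Tachyon-backed in Spark 1.2) so the
// materialized blocks survive outside the executor JVMs.
object OffHeapCacheExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("off-heap-cache"))
    val lineLengths = sc.textFile("hdfs:///shared/dataset").map(_.length)
    lineLengths.persist(StorageLevel.OFF_HEAP) // stored in Tachyon, not on-heap
    println(lineLengths.count())
    sc.stop()
  }
}
{code}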
[jira] [Closed] (SPARK-5303) applySchema returns NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mauro Pirrone closed SPARK-5303.
--------------------------------
    Resolution: Not a Problem

> applySchema returns NullPointerException
> -----------------------------------------
>
>                 Key: SPARK-5303
>                 URL: https://issues.apache.org/jira/browse/SPARK-5303
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Mauro Pirrone
>
> The following code snippet returns NullPointerException:
>
> val result = .
>
> val rows = result.take(10)
> val rowRdd = SparkManager.getContext().parallelize(rows, 1)
> val schemaRdd = SparkManager.getSQLContext().applySchema(rowRdd, result.schema)
>
> java.lang.NullPointerException
> at org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:147)
> at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
> at scala.util.hashing.MurmurHash3.listHash(MurmurHash3.scala:168)
> at scala.util.hashing.MurmurHash3$.seqHash(MurmurHash3.scala:216)
> at scala.collection.LinearSeqLike$class.hashCode(LinearSeqLike.scala:53)
> at scala.collection.immutable.List.hashCode(List.scala:84)
> at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
> at scala.util.hashing.MurmurHash3.productHash(MurmurHash3.scala:63)
> at scala.util.hashing.MurmurHash3$.productHash(MurmurHash3.scala:210)
> at scala.runtime.ScalaRunTime$._hashCode(ScalaRunTime.scala:172)
> at org.apache.spark.sql.execution.LogicalRDD.hashCode(ExistingRDD.scala:58)
> at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
> at scala.collection.mutable.HashTable$HashUtils$class.elemHashCode(HashTable.scala:398)
> at scala.collection.mutable.HashMap.elemHashCode(HashMap.scala:39)
> at scala.collection.mutable.HashTable$class.findEntry(HashTable.scala:130)
> at scala.collection.mutable.HashMap.findEntry(HashMap.scala:39)
> at scala.collection.mutable.HashMap.get(HashMap.scala:69)
> at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:187)
> at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
> at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:329)
> at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327)
> at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105)
> at org.apache.spark.sql.catalyst.analysis.NewRelationInstances$.apply(MultiInstanceRelation.scala:44)
> at org.apache.spark.sql.catalyst.analysis.NewRelationInstances$.apply(MultiInstanceRelation.scala:40)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
> at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
> at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
> at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
> at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
> at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
> at org.apache.spark.sql.SchemaRDD.schema$lzycompute(SchemaRDD.scala:135)
> at org.apache.spark.sql.SchemaRDD.schema(SchemaRDD.scala:135)
[jira] [Created] (SPARK-5409) Broken link in documentation
Mauro Pirrone created SPARK-5409:
-------------------------------------

             Summary: Broken link in documentation
                 Key: SPARK-5409
                 URL: https://issues.apache.org/jira/browse/SPARK-5409
             Project: Spark
          Issue Type: Documentation
            Reporter: Mauro Pirrone
            Priority: Minor

https://spark.apache.org/docs/1.2.0/streaming-kafka-integration.html

"See the API docs and the example."

The link to the example is broken.
[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"
[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291818#comment-14291818 ]

Robert Stupp commented on SPARK-2389:
--------------------------------------

[~srowen] yes, the problem is that drivers cannot share RDDs. IMHO there are a lot of valid scenarios that can benefit from multiple drivers using shared RDDs.
[jira] [Commented] (SPARK-5409) Broken link in documentation
[ https://issues.apache.org/jira/browse/SPARK-5409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291823#comment-14291823 ]

Sean Owen commented on SPARK-5409:
-----------------------------------

Should just be https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala

Open a PR; this probably doesn't even need a JIRA.
[jira] [Commented] (SPARK-5324) Results of describe can't be queried
[ https://issues.apache.org/jira/browse/SPARK-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291987#comment-14291987 ]

Yanbo Liang commented on SPARK-5324:
-------------------------------------

[~marmbrus] I have opened a pull request for this issue, which will implement the "DESCRIBE [FORMATTED] [db_name.]table_name" command for SQLContext. Meanwhile, it needs to have minimal effect on the corresponding command output of HiveContext. And I think other metadata operation commands like "show databases/tables, analyze, explain" could also leverage this approach. Can you assign this to me?

> Results of describe can't be queried
> -------------------------------------
>
>                 Key: SPARK-5324
>                 URL: https://issues.apache.org/jira/browse/SPARK-5324
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Michael Armbrust
>
> {code}
> sql("DESCRIBE TABLE test").registerTempTable("describeTest")
> sql("SELECT * FROM describeTest").collect()
> {code}
[jira] [Resolved] (SPARK-4430) Apache RAT Checks fail spuriously on test files
[ https://issues.apache.org/jira/browse/SPARK-4430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4430. -- Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Sean Owen > Apache RAT Checks fail spuriously on test files > --- > > Key: SPARK-4430 > URL: https://issues.apache.org/jira/browse/SPARK-4430 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.1.0 >Reporter: Ryan Williams >Assignee: Sean Owen > Fix For: 1.3.0 > > > Several of my recent runs of {{./dev/run-tests}} have failed quickly due to > Apache RAT checks, e.g.: > {code} > $ ./dev/run-tests > = > Running Apache RAT checks > = > Could not find Apache license headers in the following files: > !? > /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/28 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/29 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/30 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/10 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/11 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/12 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/13 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/14 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/15 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/16 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/17 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/18 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/19 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/20 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/21 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/22 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/23 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/24 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/25 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/26 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/27 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/28 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/29 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/30 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/7 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/8 > !? > /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/9 > [error] Got a return code of 1 on line 114 of the run-tests script. > {code} > I think it's fair to say that these are not useful errors for {{run-tests}} > to crash on. Ideally we could tell the linter which files we care about > having it lint and which we don't. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5324) Results of describe can't be queried
[ https://issues.apache.org/jira/browse/SPARK-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292008#comment-14292008 ] Yanbo Liang commented on SPARK-5324: https://github.com/apache/spark/pull/4207 > Results of describe can't be queried > > > Key: SPARK-5324 > URL: https://issues.apache.org/jira/browse/SPARK-5324 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Michael Armbrust > > {code} > sql("DESCRIBE TABLE test").registerTempTable("describeTest") > sql("SELECT * FROM describeTest").collect() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3852) Document spark.driver.extra* configs
[ https://issues.apache.org/jira/browse/SPARK-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3852. -- Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Sean Owen Target Version/s: (was: 1.2.0) > Document spark.driver.extra* configs > > > Key: SPARK-3852 > URL: https://issues.apache.org/jira/browse/SPARK-3852 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Sean Owen > Fix For: 1.3.0 > > > They are not documented... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5410) Error parsing scientific notation in a select statement
Hugo Ferrira created SPARK-5410: --- Summary: Error parsing scientific notation in a select statement Key: SPARK-5410 URL: https://issues.apache.org/jira/browse/SPARK-5410 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Hugo Ferrira I am using the Cassandra DB and am attempting a select through the Spark SQL interface: SELECT * from key_value WHERE f2 < 2.2E10 and I get the following error (no error if I remove the E10): [info] - should be able to select a subset of applicable features *** FAILED *** [info] java.lang.RuntimeException: [1.39] failure: ``UNION'' expected but identifier E10 found [info] [info] SELECT * from key_value WHERE f2 < 2.2E10 [info] ^ [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
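As context, the error position (column 39) shows the 1.2 SQL parser stopping at E10 and treating it as an identifier rather than as part of the numeric literal. A minimal sketch of the repro and an interim workaround (not from the ticket; it assumes a SQLContext named sqlContext with a key_value table registered):
{code}
// Sketch, not from the ticket: repro and workaround, assuming a SQLContext
// `sqlContext` with a registered table `key_value`.
sqlContext.sql("SELECT * FROM key_value WHERE f2 < 2.2E10")      // fails: "E10" is lexed as an identifier
sqlContext.sql("SELECT * FROM key_value WHERE f2 < 22000000000") // same predicate with the literal expanded
{code}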
[jira] [Commented] (SPARK-5355) SparkConf is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-5355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292048#comment-14292048 ] Apache Spark commented on SPARK-5355: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4208 > SparkConf is not thread-safe > > > Key: SPARK-5355 > URL: https://issues.apache.org/jira/browse/SPARK-5355 > Project: Spark > Issue Type: Bug >Affects Versions: 1.2.0, 1.3.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.3.0, 1.2.1 > > > The SparkConf is not thread-safe, but is accessed by many threads. The > getAll() could return parts of the configs if another thread is accessing it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
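To make the race concrete: if the underlying map is mutated while getAll() iterates it, the returned array can contain only part of the settings. One straightforward remedy is to take a snapshot under a lock; this is an illustrative sketch only, not the fix in the linked PR:
{code}
// Illustrative sketch, not Spark's actual patch: copy the settings under a
// lock so a concurrent set() cannot interleave with an in-progress getAll().
import scala.collection.mutable

class SafeConf {
  private val settings = mutable.HashMap[String, String]()

  def set(key: String, value: String): this.type = settings.synchronized {
    settings(key) = value
    this
  }

  def getAll: Array[(String, String)] = settings.synchronized {
    settings.toArray  // snapshot taken atomically with respect to set()
  }
}
{code}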
[jira] [Commented] (SPARK-595) Document "local-cluster" mode
[ https://issues.apache.org/jira/browse/SPARK-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292052#comment-14292052 ] Vladimir Grigor commented on SPARK-595: --- +1 for reopen > Document "local-cluster" mode > - > > Key: SPARK-595 > URL: https://issues.apache.org/jira/browse/SPARK-595 > Project: Spark > Issue Type: New Feature > Components: Documentation >Affects Versions: 0.6.0 >Reporter: Josh Rosen >Priority: Minor > > The 'Spark Standalone Mode' guide describes how to manually launch a > standalone cluster, which can be done locally for testing, but it does not > mention SparkContext's `local-cluster` option. > What are the differences between these approaches? Which one should I prefer > for local testing? Can I still use the standalone web interface if I use > 'local-cluster' mode? > It would be useful to document this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5162) Python yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292069#comment-14292069 ] Vladimir Grigor commented on SPARK-5162: I second [~jared.holmb...@orchestro.com] [~lianhuiwang] thank you! I'm going to try your PR. Related issue: even with this PR, there will be a problem using Yarn in cluster mode on Amazon EMR. Normally one submits yarn "jobs" via the API or the aws command line utility, so paths to files are evaluated later on some remote host, and hence the files are not found. Currently Spark does not support non-local files. One idea would be to add support for non-local (python) files, e.g. if a file is not local, it would be downloaded and made available locally. Something similar to the "Distributed Cache" described at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-input-distributed-cache.html So the following code would work: {code} aws emr add-steps --cluster-id "j-XYWIXMD234" \ --steps Name=SparkPi,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,s3://mybucketat.amazonaws.com/tasks/main.py,main.py,param1],ActionOnFailure=CONTINUE {code} What do you think? What is your way to run batch python spark scripts on Yarn in Amazon? > Python yarn-cluster mode > > > Key: SPARK-5162 > URL: https://issues.apache.org/jira/browse/SPARK-5162 > Project: Spark > Issue Type: New Feature > Components: PySpark, YARN >Reporter: Dana Klassen > Labels: cluster, python, yarn > > Running pyspark in yarn is currently limited to ‘yarn-client’ mode. It would > be great to be able to submit python applications to the cluster and (just > like java classes) have the resource manager set up an AM on any node in the > cluster. Does anyone know the issues blocking this feature? I was snooping > around with enabling python apps: > Removing the logic stopping python and yarn-cluster from sparkSubmit.scala > ... > // The following modes are not supported or applicable > (clusterManager, deployMode) match { > ... > case (_, CLUSTER) if args.isPython => > printErrorAndExit("Cluster deploy mode is currently not supported for > python applications.") > ... > } > … > and submitting application via: > HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster > --num-executors 2 —-py-files {{insert location of egg here}} > --executor-cores 1 ../tools/canary.py > Everything looks to run alright, pythonRunner is picked up as main class, > resources get set up, yarn client gets launched but falls flat on its face: > 2015-01-08 18:48:03,444 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > DEBUG: FAILED { > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, > 1420742868009, FILE, null }, Resource > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed > on src filesystem (expected 1420742868009, was 1420742869284 > and > 2015-01-08 18:48:03,446 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource > {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(->/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py) > transitioned from DOWNLOADING to FAILED > Tracked this down to the apache hadoop code (FSDownload.java line 249) related > to container localization of files upon downloading. 
At this point I thought it > would be best to raise the issue here and get input. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5395) Large number of Python workers causing resource depletion
[ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292121#comment-14292121 ] Mark Khaitman commented on SPARK-5395: -- Having the same issue in standalone deployment mode. A single spark-submitted job is spawning a ton of pyspark.daemon instances and depleting the cluster memory even though the appropriate environment variables have been set. > Large number of Python workers causing resource depletion > - > > Key: SPARK-5395 > URL: https://issues.apache.org/jira/browse/SPARK-5395 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0 > Environment: AWS ElasticMapReduce >Reporter: Sven Krasser > > During job execution a large number of Python worker accumulates eventually > causing YARN to kill containers for being over their memory allocation (in > the case below that is about 8G for executors plus 6G for overhead per > container). > In this instance, at the time of killing the container 97 pyspark.daemon > processes had accumulated. > {noformat} > 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler > (Logging.scala:logInfo(59)) - Container marked as failed: > container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: > Container [pid=35211,containerID=container_1421692415636_0052_01_30] is > running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB > physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing > container. > Dump of the process-tree for container_1421692415636_0052_01_30 : > |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) > VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE > |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m > pyspark.daemon > |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m > pyspark.daemon > |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m > pyspark.daemon > [...] > {noformat} > The configuration used uses 64 containers with 2 cores each. > Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c > Mailinglist discussion: > https://www.mail-archive.com/user@spark.apache.org/msg20102.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture
[ https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292176#comment-14292176 ] Joseph K. Bradley commented on SPARK-5400: -- I agree this could be done either way: Algorithm[Model] or Model[Algorithm]. For users, exposing the model type may be easiest; a person who is new to ML and wants to do some clustering will know the name of a clustering model (KMeans, GMM) but may not want to worry about picking an optimization algorithm. So I'd vote for Model[Algorithm]. That said, internally, I agree that Algorithm[Model] would be handy for generalizing. We could do the combination by having an internal LearningState class: {code} class GaussianMixture { def setOptimizer // once we have more than 1 optimization method def run = { val opt = new EM(new GMMLearningState(this)) ... } } private[mllib] class GMMLearningState extends OurModelAbstraction { def this(gm: GaussianMixture) = this(...) } class EM(model: OurModelAbstraction) {code} > Rename GaussianMixtureEM to GaussianMixture > --- > > Key: SPARK-5400 > URL: https://issues.apache.org/jira/browse/SPARK-5400 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > > GaussianMixtureEM is following the old naming convention of including the > optimization algorithm name in the class title. We should probably rename it > to GaussianMixture so that it can use other optimization algorithms in the > future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
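Fleshed out slightly, the sketch in the comment above could compile as below; every name is hypothetical, carried over from the sketch rather than taken from MLlib's actual API:
{code}
// Hypothetical expansion of the sketch, not MLlib's real API: the public
// model delegates to an internal optimizer over an internal learning state.
trait OurModelAbstraction {
  def step(): Unit        // one optimization step over the learning state
  def converged: Boolean
}

class EM(model: OurModelAbstraction) {
  def run(): Unit = while (!model.converged) model.step()
}

class GaussianMixture {
  // def setOptimizer(...) would go here once more than one method exists
  def run(): Unit = new EM(new GMMLearningState(this)).run()
}

// private[mllib] in the sketch; public here so the example stands alone
class GMMLearningState(gm: GaussianMixture) extends OurModelAbstraction {
  def step(): Unit = { /* E-step and M-step over gm's parameters */ }
  def converged: Boolean = true  // placeholder so the example terminates
}
{code}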
[jira] [Commented] (SPARK-794) Remove sleep() in ClusterScheduler.stop
[ https://issues.apache.org/jira/browse/SPARK-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292303#comment-14292303 ] Brennon York commented on SPARK-794: [~joshrosen] How is this PR holding up? I haven't seen any issues on the dev board. Think we can close this JIRA ticket? Trying to help prune the JIRA tree :) > Remove sleep() in ClusterScheduler.stop > --- > > Key: SPARK-794 > URL: https://issues.apache.org/jira/browse/SPARK-794 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.9.0 >Reporter: Matei Zaharia > Labels: backport-needed > Fix For: 1.3.0 > > > This temporary change made a while back slows down the unit tests quite a bit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-595) Document "local-cluster" mode
[ https://issues.apache.org/jira/browse/SPARK-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reopened SPARK-595: -- I've re-opened this issue. Folks are using the API in the wild and we're not going to break compatibility for it, so we should document it. > Document "local-cluster" mode > - > > Key: SPARK-595 > URL: https://issues.apache.org/jira/browse/SPARK-595 > Project: Spark > Issue Type: New Feature > Components: Documentation >Affects Versions: 0.6.0 >Reporter: Josh Rosen >Priority: Minor > > The 'Spark Standalone Mode' guide describes how to manually launch a > standalone cluster, which can be done locally for testing, but it does not > mention SparkContext's `local-cluster` option. > What are the differences between these approaches? Which one should I prefer > for local testing? Can I still use the standalone web interface if I use > 'local-cluster' mode? > It would be useful to document this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
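For reference while the documentation is pending: the master URL under discussion takes the form local-cluster[numWorkers,coresPerWorker,memoryPerWorkerMB], as used in Spark's own test suites. A minimal sketch:
{code}
// Sketch: an in-process standalone cluster with 2 workers, 1 core and
// 512 MB each (the form Spark's test suites use).
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local-cluster[2,1,512]")
  .setAppName("local-cluster-demo")
val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100, 4).count())  // runs on the two local workers
sc.stop()
{code}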
[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
[ https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292370#comment-14292370 ] Imran Rashid commented on SPARK-3644: - [~joshrosen] Hi Josh, I've got time to implement this now. You can assign to me if you like (or let me know if there is something else in the works ...) > REST API for Spark application info (jobs / stages / tasks / storage info) > -- > > Key: SPARK-3644 > URL: https://issues.apache.org/jira/browse/SPARK-3644 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Reporter: Josh Rosen > > This JIRA is a forum to draft a design proposal for a REST interface for > accessing information about Spark applications, such as job / stage / task / > storage status. > There have been a number of proposals to serve JSON representations of the > information displayed in Spark's web UI. Given that we might redesign the > pages of the web UI (and possibly re-implement the UI as a client of a REST > API), the API endpoints and their responses should be independent of what we > choose to display on particular web UI pages / layouts. > Let's start a discussion of what a good REST API would look like from > first-principles. We can discuss what urls / endpoints expose access to > data, how our JSON responses will be formatted, how fields will be named, how > the API will be documented and tested, etc. > Some links for inspiration: > https://developer.github.com/v3/ > http://developer.netflix.com/docs/REST_API_Reference > https://helloreverb.com/developers/swagger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
[ https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-5236: -- Description: {code} 15/01/14 05:39:27 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 (TID 28, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file gs://pa-truven/20141205/parquet/P/part-r-1.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141) at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:241) at org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter.updateInt(ParquetConverter.scala:375) at org.apache.spark.sql.parquet.CatalystPrimitiveConverter.addInt(ParquetConverter.scala:434) at parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:237) at parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:353) at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:402) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194) ... 
27 more {code}
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292399#comment-14292399 ] Dmitriy Lyubimov commented on SPARK-5226: - All attempts to parallelize DBSCAN in the literature lately (or similar DeLiClu-type things) that I have read about involve partitioning the task into smaller subtasks, solving each individually, and merging it all back (see the MR.Scan paper for example). Merging is of course the new and tricky part. As far as I understand, they all pretty much limit their scope to euclidean distances and capitalize on the resulting notions of euclidean geometry in order to solve the partition and merge problems, which substantially reduces the attractiveness of the general algorithm. However, a naive straightforward port of the simple DBSCAN algorithm is not terribly practical for big data because of the total complexity of the problem (or the impracticality of building something like a huge distributed R-tree index system on shared-nothing programming models). > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. The first candidate, I think, is DBSCAN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently
[ https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292415#comment-14292415 ] Xuefu Zhang commented on SPARK-2688: #1 above is exactly what Hive needs badly. > Need a way to run multiple data pipeline concurrently > - > > Key: SPARK-2688 > URL: https://issues.apache.org/jira/browse/SPARK-2688 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.0.1 >Reporter: Xuefu Zhang > > Suppose we want to do the following data processing: > {code} > rdd1 -> rdd2 -> rdd3 >| -> rdd4 >| -> rdd5 >\ -> rdd6 > {code} > where -> represents a transformation. rdd3 to rdd6 are all derived from an > intermediate rdd2. We use foreach(fn) with a dummy function to trigger the > execution. However, rdd.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> > rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be > recomputed. This is very inefficient. Ideally, we should be able to trigger > the execution of the whole graph and reuse rdd2, but there doesn't seem to be a > way of doing so. Tez already realized the importance of this (TEZ-391), so I > think Spark should provide this too. > This is required for Hive to support multi-insert queries. HIVE-7292. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5339) build/mvn doesn't work because of invalid URL for maven's tgz.
[ https://issues.apache.org/jira/browse/SPARK-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5339. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Kousuke Saruta > build/mvn doesn't work because of invalid URL for maven's tgz. > -- > > Key: SPARK-5339 > URL: https://issues.apache.org/jira/browse/SPARK-5339 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Blocker > Fix For: 1.3.0 > > > build/mvn will automatically download a tarball of maven. But currently, the > URL is invalid. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently
[ https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292431#comment-14292431 ] Sean Owen commented on SPARK-2688: -- As [~irashid] says, #1 is just syntactic sugar on what you can do already in Spark. I'm not clear how something can need this functionality badly, then. Either it's not blocking anything, really, and let's see that, or let's discuss what beyond #1 is actually needed. What I think people want is a miniature "push-based evaluation" method inside of Spark's pull-based DAG evaluation: force evaluation of N children of 1 parent at once. The outcome of a sidebar I had with Sandy on this was that it's probably a) fraught with gotchas, given the push-vs-pull mismatch, but not impossible, and b) would force the children to be persisted in the general case, with possible optimizations in other special cases. Is that the kind of thing Hive on Spark needs, and if so can we hear a concrete elaboration of an example of this, so we can compare with what's possible now? I still sense there's a mismatch between the perception and reality of what's possible with the current API. Hence, there may be some really good news here. > Need a way to run multiple data pipeline concurrently > - > > Key: SPARK-2688 > URL: https://issues.apache.org/jira/browse/SPARK-2688 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.0.1 >Reporter: Xuefu Zhang > > Suppose we want to do the following data processing: > {code} > rdd1 -> rdd2 -> rdd3 >| -> rdd4 >| -> rdd5 >\ -> rdd6 > {code} > where -> represents a transformation. rdd3 to rdd6 are all derived from an > intermediate rdd2. We use foreach(fn) with a dummy function to trigger the > execution. However, rdd.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> > rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be > recomputed. This is very inefficient. Ideally, we should be able to trigger > the execution of the whole graph and reuse rdd2, but there doesn't seem to be a > way of doing so. Tez already realized the importance of this (TEZ-391), so I > think Spark should provide this too. > This is required for Hive to support multi-insert queries. HIVE-7292. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
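To ground the comparison, here is a minimal sketch (not from the ticket) of what the current API already allows, assuming an existing SparkContext sc: persist the shared parent, materialize it once, then run the child actions as concurrent jobs so rdd2 is computed a single time.
{code}
// Sketch of today's workaround: cache the shared parent, materialize it,
// then launch the child actions concurrently against the cached data.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val rdd1 = sc.parallelize(1 to 1000000, 8)
val rdd2 = rdd1.map(_ * 2).persist()
rdd2.count()  // one pass over rdd1 populates the cache

val children = Seq(              // stand-ins for rdd3..rdd6
  Future(rdd2.filter(_ % 3 == 0).count()),
  Future(rdd2.filter(_ % 5 == 0).count()),
  Future(rdd2.map(_ + 1).count()),
  Future(rdd2.distinct().count())
)
val results = Await.result(Future.sequence(children), Duration.Inf)
{code}
What this does not give you, and what the single-pass request below is about, is computing all the children without first materializing rdd2 into the cache, i.e. pushing each rdd2 element through every child transformation in one traversal.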
[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently
[ https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292448#comment-14292448 ] Xuefu Zhang commented on SPARK-2688: Yeah. We don't need syntactic sugar, but a transformation that just does one pass over the input RDD. This has performance implications for Hive's multi-insert use cases. > Need a way to run multiple data pipeline concurrently > - > > Key: SPARK-2688 > URL: https://issues.apache.org/jira/browse/SPARK-2688 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.0.1 >Reporter: Xuefu Zhang > > Suppose we want to do the following data processing: > {code} > rdd1 -> rdd2 -> rdd3 >| -> rdd4 >| -> rdd5 >\ -> rdd6 > {code} > where -> represents a transformation. rdd3 to rdd6 are all derived from an > intermediate rdd2. We use foreach(fn) with a dummy function to trigger the > execution. However, rdd.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> > rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be > recomputed. This is very inefficient. Ideally, we should be able to trigger > the execution of the whole graph and reuse rdd2, but there doesn't seem to be a > way of doing so. Tez already realized the importance of this (TEZ-391), so I > think Spark should provide this too. > This is required for Hive to support multi-insert queries. HIVE-7292. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3789) Python bindings for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292468#comment-14292468 ] Kushal Datta commented on SPARK-3789: - Hi Ameet, Sorry for asking this question again. What's the release plan for 1.3? -Kushal. > Python bindings for GraphX > -- > > Key: SPARK-3789 > URL: https://issues.apache.org/jira/browse/SPARK-3789 > Project: Spark > Issue Type: New Feature > Components: GraphX, PySpark >Reporter: Ameet Talwalkar >Assignee: Kushal Datta > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3789) Python bindings for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292477#comment-14292477 ] Reynold Xin commented on SPARK-3789: Unfortunately this is not going to make it into 1.3, given the code freeze deadline is in 1 week. [~kdatta1978] thanks for working on this. Can you write some high level design document for this change? > Python bindings for GraphX > -- > > Key: SPARK-3789 > URL: https://issues.apache.org/jira/browse/SPARK-3789 > Project: Spark > Issue Type: New Feature > Components: GraphX, PySpark >Reporter: Ameet Talwalkar >Assignee: Kushal Datta > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5411) Allow SparkListeners to be specified in SparkConf and loaded when creating SparkContext
Josh Rosen created SPARK-5411: - Summary: Allow SparkListeners to be specified in SparkConf and loaded when creating SparkContext Key: SPARK-5411 URL: https://issues.apache.org/jira/browse/SPARK-5411 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen It would be nice if there was a mechanism to allow SparkListeners to be registered through SparkConf settings. This would allow monitoring frameworks to be easily injected into Spark programs without having to modify those programs' code. I propose to introduce a new configuration option, {{spark.extraListeners}}, that allows SparkListeners to be specified in SparkConf and registered before the SparkContext is created. Here is the proposed documentation for the new option: {quote} A comma-separated list of classes that implement SparkListener; when initializing SparkContext, instances of these classes will be created and registered with Spark's listener bus. If a class has a single-argument constructor that accepts a SparkConf, that constructor will be called; otherwise, a zero-argument constructor will be called. If no valid constructor can be found, the SparkContext creation will fail with an exception. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
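The constructor-selection rule in the proposed documentation could be implemented roughly as follows; this is a sketch under the stated semantics, not the code in the eventual patch:
{code}
// Sketch of the documented lookup rule: prefer a SparkConf constructor,
// fall back to a zero-arg one, and let instantiation fail otherwise.
import org.apache.spark.SparkConf
import org.apache.spark.scheduler.SparkListener

def loadExtraListeners(conf: SparkConf): Seq[SparkListener] =
  conf.get("spark.extraListeners", "").split(',').filter(_.nonEmpty).map { name =>
    val clazz = Class.forName(name.trim)
    val instance = clazz.getConstructors
      .find(_.getParameterTypes.toSeq == Seq(classOf[SparkConf])) match {
        case Some(ctor) => ctor.newInstance(conf)  // single-SparkConf-arg constructor
        case None       => clazz.newInstance()     // throws if no zero-arg constructor exists
      }
    instance.asInstanceOf[SparkListener]
  }
{code}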
[jira] [Commented] (SPARK-5411) Allow SparkListeners to be specified in SparkConf and loaded when creating SparkContext
[ https://issues.apache.org/jira/browse/SPARK-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292494#comment-14292494 ] Apache Spark commented on SPARK-5411: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/4111 > Allow SparkListeners to be specified in SparkConf and loaded when creating > SparkContext > --- > > Key: SPARK-5411 > URL: https://issues.apache.org/jira/browse/SPARK-5411 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > It would be nice if there was a mechanism to allow SparkListeners to be > registered through SparkConf settings. This would allow monitoring > frameworks to be easily injected into Spark programs without having to modify > those programs' code. > I propose to introduce a new configuration option, {{spark.extraListeners}}, > that allows SparkListeners to be specified in SparkConf and registered before > the SparkContext is created. Here is the proposed documentation for the new > option: > {quote} > A comma-separated list of classes that implement SparkListener; when > initializing SparkContext, instances of these classes will be created and > registered with Spark's listener bus. If a class has a single-argument > constructor that accepts a SparkConf, that constructor will be called; > otherwise, a zero-argument constructor will be called. If no valid > constructor can be found, the SparkContext creation will fail with an > exception. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3789) Python bindings for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292499#comment-14292499 ] Kushal Datta commented on SPARK-3789: - Sure, I will write up the design document. @Ameet, do you think you can work from another branch which is not on 1.3? > Python bindings for GraphX > -- > > Key: SPARK-3789 > URL: https://issues.apache.org/jira/browse/SPARK-3789 > Project: Spark > Issue Type: New Feature > Components: GraphX, PySpark >Reporter: Ameet Talwalkar >Assignee: Kushal Datta > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4147) Reduce log4j dependency
[ https://issues.apache.org/jira/browse/SPARK-4147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4147: --- Assignee: Sean Owen > Reduce log4j dependency > --- > > Key: SPARK-4147 > URL: https://issues.apache.org/jira/browse/SPARK-4147 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Tobias Pfeiffer >Assignee: Sean Owen > > spark-core has a hard dependency on log4j, which shouldn't be necessary since > slf4j is used. I tried to exclude slf4j-log4j12 and log4j dependencies in my > sbt file. > Excluding org.slf4j.slf4j-log4j12 works fine if logback is on the classpath. > However, removing the log4j dependency fails because in > https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/Logging.scala#L121 > a static method of org.apache.log4j.LogManager is accessed *even if* log4j > is not in use. > I guess removing all dependencies on log4j may be a bigger task, but it would > be a great help if the access to LogManager would be done only if log4j use > was detected before. (This is a 2-line change.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
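The guard the reporter asks for can be sketched as follows (an illustration of the idea, not necessarily the committed patch): check which backend slf4j is bound to before touching any log4j class.
{code}
// Sketch: only reach into log4j's LogManager when slf4j is actually bound
// to the slf4j-log4j12 backend, so logback-only classpaths never load log4j.
import org.slf4j.LoggerFactory

val usingLog4j =
  LoggerFactory.getILoggerFactory.getClass.getName == "org.slf4j.impl.Log4jLoggerFactory"

if (usingLog4j) {
  // Safe to touch org.apache.log4j.LogManager here, e.g. to check whether
  // any appenders were configured before installing Spark's defaults.
  val initialized =
    org.apache.log4j.LogManager.getRootLogger.getAllAppenders.hasMoreElements
}
{code}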
[jira] [Created] (SPARK-5412) Cannot bind Master to a specific hostname as per the documentation
Alexis Seigneurin created SPARK-5412: Summary: Cannot bind Master to a specific hostname as per the documentation Key: SPARK-5412 URL: https://issues.apache.org/jira/browse/SPARK-5412 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Alexis Seigneurin Documentation on http://spark.apache.org/docs/latest/spark-standalone.html indicates: {quote} You can start a standalone master server by executing: ./sbin/start-master.sh ... the following configuration options can be passed to the master and worker: ... -h HOST, --host HOST Hostname to listen on {quote} The "\-h" or "--host" parameter actually doesn't work with the start-master.sh script. Instead, one has to set the "SPARK_MASTER_IP" variable prior to executing the script. Either the script or the documentation should be updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4147) Reduce log4j dependency
[ https://issues.apache.org/jira/browse/SPARK-4147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4147: --- Affects Version/s: 1.2.0 > Reduce log4j dependency > --- > > Key: SPARK-4147 > URL: https://issues.apache.org/jira/browse/SPARK-4147 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: Tobias Pfeiffer >Assignee: Sean Owen > > spark-core has a hard dependency on log4j, which shouldn't be necessary since > slf4j is used. I tried to exclude slf4j-log4j12 and log4j dependencies in my > sbt file. > Excluding org.slf4j.slf4j-log4j12 works fine if logback is on the classpath. > However, removing the log4j dependency fails because in > https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/Logging.scala#L121 > a static method of org.apache.log4j.LogManager is accessed *even if* log4j > is not in use. > I guess removing all dependencies on log4j may be a bigger task, but it would > be a great help if the access to LogManager would be done only if log4j use > was detected before. (This is a 2-line change.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4147) Reduce log4j dependency
[ https://issues.apache.org/jira/browse/SPARK-4147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4147. Resolution: Fixed > Reduce log4j dependency > --- > > Key: SPARK-4147 > URL: https://issues.apache.org/jira/browse/SPARK-4147 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: Tobias Pfeiffer >Assignee: Sean Owen > Fix For: 1.3.0, 1.2.1 > > > spark-core has a hard dependency on log4j, which shouldn't be necessary since > slf4j is used. I tried to exclude slf4j-log4j12 and log4j dependencies in my > sbt file. > Excluding org.slf4j.slf4j-log4j12 works fine if logback is on the classpath. > However, removing the log4j dependency fails because in > https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/Logging.scala#L121 > a static method of org.apache.log4j.LogManager is accessed *even if* log4j > is not in use. > I guess removing all dependencies on log4j may be a bigger task, but it would > be a great help if the access to LogManager would be done only if log4j use > was detected before. (This is a 2-line change.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4147) Reduce log4j dependency
[ https://issues.apache.org/jira/browse/SPARK-4147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4147: --- Fix Version/s: 1.2.1 1.3.0 > Reduce log4j dependency > --- > > Key: SPARK-4147 > URL: https://issues.apache.org/jira/browse/SPARK-4147 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: Tobias Pfeiffer >Assignee: Sean Owen > Fix For: 1.3.0, 1.2.1 > > > spark-core has a hard dependency on log4j, which shouldn't be necessary since > slf4j is used. I tried to exclude slf4j-log4j12 and log4j dependencies in my > sbt file. > Excluding org.slf4j.slf4j-log4j12 works fine if logback is on the classpath. > However, removing the log4j dependency fails because in > https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/Logging.scala#L121 > a static method of org.apache.log4j.LogManager is accessed *even if* log4j > is not in use. > I guess removing all dependencies on log4j may be a bigger task, but it would > be a great help if the access to LogManager would be done only if log4j use > was detected before. (This is a 2-line change.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-960) JobCancellationSuite "two jobs sharing the same stage" is broken
[ https://issues.apache.org/jira/browse/SPARK-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-960. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4180 [https://github.com/apache/spark/pull/4180] > JobCancellationSuite "two jobs sharing the same stage" is broken > > > Key: SPARK-960 > URL: https://issues.apache.org/jira/browse/SPARK-960 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.8.1, 0.9.0 >Reporter: Mark Hamstra > Fix For: 1.3.0 > > > This test doesn't work as it appears to be intended since the map tasks can > never acquire sem2. The simplest way to demonstrate this is to comment out > f1.cancel() in the future. I believe the intention is that f1 and f2 would > then complete normally; but they won't. Instead, both jobs block, waiting on > sem2. It doesn't look like closing over Semaphores works even in a Local > context, since sem2.hashCode() is different in each of f1, f2 and in the > future containing f1.cancel, so the map jobs never see the sem2.release(10) > in the future. > Instead, the test only completes because all of the stages (the two final > stages and the common dependent stage) get cancelled and aborted. When job > <--> stage dependencies are fully accounted for and job cancellation changed > so that f1.cancel does not abort the common stage, then this test can never > finish since it then becomes hung waiting on sem2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
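The serialization behavior described above can be reproduced with a small sketch (not the suite's code, and assuming a local-mode SparkContext sc): a Semaphore captured by a closure is serialized with the task and deserialized into a distinct copy per task, whereas a semaphore held in a JVM-global object is shared by every local task.
{code}
// Sketch, not from JobCancellationSuite: why a captured Semaphore is not shared.
import java.util.concurrent.Semaphore

val captured = new Semaphore(0)
sc.parallelize(1 to 4, 4).foreach { _ =>
  // `captured` was serialized into the task closure, so each task holds its
  // own copy; a release() on the driver's instance is never observed here.
  println(captured.hashCode)  // identity hash differs from the driver's
}

// One instance per JVM: local tasks resolve SharedSem.sem to the same object,
// so SharedSem.sem.release(n) on the driver actually unblocks them.
object SharedSem { val sem = new Semaphore(0) }
{code}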
[jira] [Updated] (SPARK-960) JobCancellationSuite "two jobs sharing the same stage" is broken
[ https://issues.apache.org/jira/browse/SPARK-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-960: - Assignee: Sean Owen > JobCancellationSuite "two jobs sharing the same stage" is broken > > > Key: SPARK-960 > URL: https://issues.apache.org/jira/browse/SPARK-960 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.8.1, 0.9.0 >Reporter: Mark Hamstra >Assignee: Sean Owen > Fix For: 1.3.0 > > > This test doesn't work as it appears to be intended since the map tasks can > never acquire sem2. The simplest way to demonstrate this is to comment out > f1.cancel() in the future. I believe the intention is that f1 and f2 would > then complete normally; but they won't. Instead, both jobs block, waiting on > sem2. It doesn't look like closing over Semaphores works even in a Local > context, since sem2.hashCode() is different in each of f1, f2 and in the > future containing f1.cancel, so the map jobs never see the sem2.release(10) > in the future. > Instead, the test only completes because all of the stages (the two final > stages and the common dependent stage) get cancelled and aborted. When job > <--> stage dependencies are fully accounted for and job cancellation changed > so that f1.cancel does not abort the common stage, then this test can never > finish since it then becomes hung waiting on sem2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-926) spark_ec2 script when ssh/scp-ing should pipe UserknowHostFile to /dev/null
[ https://issues.apache.org/jira/browse/SPARK-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292549#comment-14292549 ] Josh Rosen commented on SPARK-926: -- I think it is; there's now a PR to fix this, since it's also the cause of SPARK-5403: https://github.com/apache/spark/pull/4196 > spark_ec2 script when ssh/scp-ing should pipe UserknowHostFile to /dev/null > --- > > Key: SPARK-926 > URL: https://issues.apache.org/jira/browse/SPARK-926 > Project: Spark > Issue Type: New Feature > Components: EC2 >Affects Versions: 0.8.0 >Reporter: Shay Seng >Priority: Trivial > > The known hosts file on the local machine gets all kinds of crap in it after a few > cluster launches. When SSHing or SCPing, please add "-o > UserKnownHostsFile=/dev/null" > Also remove the -t option from SSH, and only add it in when necessary - to > reduce chatter on the console. > e.g. > # Copy a file to a given host through scp, throwing an exception if scp fails > def scp(host, opts, local_file, dest_file): > subprocess.check_call( > "scp -q -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i > %s '%s' '%s@%s:%s'" % > (opts.identity_file, local_file, opts.user, host, dest_file), > shell=True) > # Run a command on a host through ssh, retrying up to two times > # and then throwing an exception if ssh continues to fail. > def ssh(host, opts, command, sshopts=""): > tries = 0 > while True: > try: > # removed -t option from ssh command, not sure why it is required all > the time. > return subprocess.check_call( > "ssh %s -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null > -i %s %s@%s '%s'" % > (sshopts, opts.identity_file, opts.user, host, command), shell=True) > except subprocess.CalledProcessError as e: > if (tries > 2): > raise e > print "Couldn't connect to host {0}, waiting 30 seconds".format(e) > time.sleep(30) > tries = tries + 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3789) Python bindings for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292602#comment-14292602 ] Ameet Talwalkar commented on SPARK-3789: Unfortunately not -- I plan to use a standard release of Spark for my MOOC. > Python bindings for GraphX > -- > > Key: SPARK-3789 > URL: https://issues.apache.org/jira/browse/SPARK-3789 > Project: Spark > Issue Type: New Feature > Components: GraphX, PySpark >Reporter: Ameet Talwalkar >Assignee: Kushal Datta > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5395) Large number of Python workers causing resource depletion
[ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292630#comment-14292630 ] Sven Krasser commented on SPARK-5395: - [~mkman84], do you also see this for both spark.python.worker.reuse false/true? (FWIW, the run I pasted above had reuse disabled.) Also, do you happen to have a small job that can be used as a repro ([~davies] was asking for one, so far I only managed to trigger this condition using production data). > Large number of Python workers causing resource depletion > - > > Key: SPARK-5395 > URL: https://issues.apache.org/jira/browse/SPARK-5395 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0 > Environment: AWS ElasticMapReduce >Reporter: Sven Krasser > > During job execution a large number of Python worker accumulates eventually > causing YARN to kill containers for being over their memory allocation (in > the case below that is about 8G for executors plus 6G for overhead per > container). > In this instance, at the time of killing the container 97 pyspark.daemon > processes had accumulated. > {noformat} > 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler > (Logging.scala:logInfo(59)) - Container marked as failed: > container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: > Container [pid=35211,containerID=container_1421692415636_0052_01_30] is > running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB > physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing > container. > Dump of the process-tree for container_1421692415636_0052_01_30 : > |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) > VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE > |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m > pyspark.daemon > |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m > pyspark.daemon > |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m > pyspark.daemon > [...] > {noformat} > The configuration used uses 64 containers with 2 cores each. > Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c > Mailinglist discussion: > https://www.mail-archive.com/user@spark.apache.org/msg20102.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5395) Large number of Python workers causing resource depletion
[ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292668#comment-14292668 ] Mark Khaitman commented on SPARK-5395: -- [~skrasser], I actually only managed to have this reproduced using production data as well (so far). I'll try to write a simple version tomorrow but it seems that it's a mix of both python worker processes not being killed after it's no longer running (causing build up), as well as the python worker exceeding the allocated memory limit. I think it *may* be related to a couple of specific actions such as groupByKey/cogroup, though I'll still need to do some tests to be sure what's causing this. > Large number of Python workers causing resource depletion > - > > Key: SPARK-5395 > URL: https://issues.apache.org/jira/browse/SPARK-5395 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0 > Environment: AWS ElasticMapReduce >Reporter: Sven Krasser > > During job execution a large number of Python worker accumulates eventually > causing YARN to kill containers for being over their memory allocation (in > the case below that is about 8G for executors plus 6G for overhead per > container). > In this instance, at the time of killing the container 97 pyspark.daemon > processes had accumulated. > {noformat} > 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler > (Logging.scala:logInfo(59)) - Container marked as failed: > container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: > Container [pid=35211,containerID=container_1421692415636_0052_01_30] is > running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB > physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing > container. > Dump of the process-tree for container_1421692415636_0052_01_30 : > |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) > VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE > |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m > pyspark.daemon > |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m > pyspark.daemon > |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m > pyspark.daemon > [...] > {noformat} > The configuration used uses 64 containers with 2 cores each. > Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c > Mailinglist discussion: > https://www.mail-archive.com/user@spark.apache.org/msg20102.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5395) Large number of Python workers causing resource depletion
[ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292668#comment-14292668 ] Mark Khaitman edited comment on SPARK-5395 at 1/26/15 11:57 PM: [~skrasser], so far I have also only managed to reproduce this using production data. I'll try to write a simple version tomorrow, but it seems to be a mix of two problems: Python worker processes not being killed once they are no longer running (causing the build-up), and the Python workers exceeding the allocated memory limit. I think it *may* be related to a couple of specific operations such as groupByKey/cogroup, though I still need to run some tests to be sure what's causing this. I should also add that we haven't modified the spark.python.worker.reuse setting, so in our case it should be using the default of true. was (Author: mkman84): [~skrasser], so far I have also only managed to reproduce this using production data. I'll try to write a simple version tomorrow, but it seems to be a mix of two problems: Python worker processes not being killed once they are no longer running (causing the build-up), and the Python workers exceeding the allocated memory limit. I think it *may* be related to a couple of specific operations such as groupByKey/cogroup, though I still need to run some tests to be sure what's causing this. > Large number of Python workers causing resource depletion > - > > Key: SPARK-5395 > URL: https://issues.apache.org/jira/browse/SPARK-5395 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0 > Environment: AWS ElasticMapReduce >Reporter: Sven Krasser > > During job execution a large number of Python workers accumulates, eventually > causing YARN to kill containers for being over their memory allocation (in > the case below that is about 8G for executors plus 6G for overhead per > container). > In this instance, at the time of killing the container 97 pyspark.daemon > processes had accumulated. > {noformat} > 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler > (Logging.scala:logInfo(59)) - Container marked as failed: > container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: > Container [pid=35211,containerID=container_1421692415636_0052_01_30] is > running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB > physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing > container. > Dump of the process-tree for container_1421692415636_0052_01_30 : > |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) > VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE > |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m > pyspark.daemon > |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m > pyspark.daemon > |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m > pyspark.daemon > [...] > {noformat} > The configuration uses 64 containers with 2 cores each. > Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c > Mailinglist discussion: > https://www.mail-archive.com/user@spark.apache.org/msg20102.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5384) Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths
[ https://issues.apache.org/jira/browse/SPARK-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5384: - Priority: Minor (was: Critical) > Vectors.sqdist returns inconsistent results for sparse/dense vectors when the > vectors have different lengths > -- > > Key: SPARK-5384 > URL: https://issues.apache.org/jira/browse/SPARK-5384 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.2.1 > Environment: centos, others should be similar >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Minor > Fix For: 1.3.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > For two vectors of different lengths, Vectors.sqdist would return different > results when the vectors are represented as sparse and dense respectively. > Sample: > val s1 = new SparseVector(4, Array(0,1,2,3), Array(1.0, 2.0, 3.0, 4.0)) > val s2 = new SparseVector(1, Array(0), Array(9.0)) > val d1 = new DenseVector(Array(1.0, 2.0, 3.0, 4.0)) > val d2 = new DenseVector(Array(9.0)) > println(s1 == d1 && s2 == d2) > println(Vectors.sqdist(s1, s2)) > println(Vectors.sqdist(d1, d2)) > result: > true > 93.0 > 64.0 > More precisely, for the extra part, Vectors.sqdist would include it for > sparse vectors and exclude it for dense vectors. I'll send a PR and we can > have a more detailed discussion there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5384) Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths
[ https://issues.apache.org/jira/browse/SPARK-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5384: - Affects Version/s: (was: 1.2.1) 1.3.0 > Vectors.sqdist returns inconsistent results for sparse/dense vectors when the > vectors have different lengths > -- > > Key: SPARK-5384 > URL: https://issues.apache.org/jira/browse/SPARK-5384 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.0 > Environment: centos, others should be similar >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Minor > Fix For: 1.3.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > For two vectors of different lengths, Vectors.sqdist would return different > results when the vectors are represented as sparse and dense respectively. > Sample: > val s1 = new SparseVector(4, Array(0,1,2,3), Array(1.0, 2.0, 3.0, 4.0)) > val s2 = new SparseVector(1, Array(0), Array(9.0)) > val d1 = new DenseVector(Array(1.0, 2.0, 3.0, 4.0)) > val d2 = new DenseVector(Array(9.0)) > println(s1 == d1 && s2 == d2) > println(Vectors.sqdist(s1, s2)) > println(Vectors.sqdist(d1, d2)) > result: > true > 93.0 > 64.0 > More precisely, for the extra part, Vectors.sqdist would include it for > sparse vectors and exclude it for dense vectors. I'll send a PR and we can > have a more detailed discussion there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5384) Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths
[ https://issues.apache.org/jira/browse/SPARK-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5384: - Target Version/s: 1.3.0 (was: 1.3.0, 1.2.1) > Vectors.sqdist returns inconsistent results for sparse/dense vectors when the > vectors have different lengths > -- > > Key: SPARK-5384 > URL: https://issues.apache.org/jira/browse/SPARK-5384 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.0 > Environment: centos, others should be similar >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Minor > Fix For: 1.3.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > For two vectors of different lengths, Vectors.sqdist would return different > results when the vectors are represented as sparse and dense respectively. > Sample: > val s1 = new SparseVector(4, Array(0,1,2,3), Array(1.0, 2.0, 3.0, 4.0)) > val s2 = new SparseVector(1, Array(0), Array(9.0)) > val d1 = new DenseVector(Array(1.0, 2.0, 3.0, 4.0)) > val d2 = new DenseVector(Array(9.0)) > println(s1 == d1 && s2 == d2) > println(Vectors.sqdist(s1, s2)) > println(Vectors.sqdist(d1, d2)) > result: > true > 93.0 > 64.0 > More precisely, for the extra part, Vectors.sqdist would include it for > sparse vectors and exclude it for dense vectors. I'll send a PR and we can > have a more detailed discussion there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5384) Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths
[ https://issues.apache.org/jira/browse/SPARK-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5384: - Target Version/s: 1.3.0, 1.2.1 (was: 1.2.1) > Vectors.sqdist returns inconsistent results for sparse/dense vectors when the > vectors have different lengths > -- > > Key: SPARK-5384 > URL: https://issues.apache.org/jira/browse/SPARK-5384 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.0 > Environment: centos, others should be similar >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Minor > Fix For: 1.3.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > For two vectors of different lengths, Vectors.sqdist would return different > results when the vectors are represented as sparse and dense respectively. > Sample: > val s1 = new SparseVector(4, Array(0,1,2,3), Array(1.0, 2.0, 3.0, 4.0)) > val s2 = new SparseVector(1, Array(0), Array(9.0)) > val d1 = new DenseVector(Array(1.0, 2.0, 3.0, 4.0)) > val d2 = new DenseVector(Array(9.0)) > println(s1 == d1 && s2 == d2) > println(Vectors.sqdist(s1, s2)) > println(Vectors.sqdist(d1, d2)) > result: > true > 93.0 > 64.0 > More precisely, for the extra part, Vectors.sqdist would include it for > sparse vectors and exclude it for dense vectors. I'll send a PR and we can > have a more detailed discussion there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
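[Editor's note] To make the inconsistency above concrete: the sparse path behaves as if the shorter vector were padded with trailing zeros, while the dense path compares only the overlapping prefix. A minimal sketch of the zero-padding semantics (illustration only, not MLlib's implementation):
{code}
import org.apache.spark.mllib.linalg.Vector

// Compare two vectors as if the shorter one were padded with trailing
// zeros; this matches the sparse behavior in the description above:
// (1-9)^2 + 2^2 + 3^2 + 4^2 = 93.
def sqdistZeroPadded(a: Vector, b: Vector): Double = {
  val n = math.max(a.size, b.size)
  (0 until n).foldLeft(0.0) { (acc, i) =>
    val x = if (i < a.size) a(i) else 0.0
    val y = if (i < b.size) b(i) else 0.0
    val d = x - y
    acc + d * d
  }
}
{code}
Applied to either pair from the sample above (s1/s2 or d1/d2), this returns 93.0, matching the sparse result.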
[jira] [Commented] (SPARK-3439) Add Canopy Clustering Algorithm
[ https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292688#comment-14292688 ] Xiangrui Meng commented on SPARK-3439: -- [~angellandros] The public API and the complexity analysis are more important than implementation details. What is the expected input and output of the algorithm and how does it scale? > Add Canopy Clustering Algorithm > --- > > Key: SPARK-3439 > URL: https://issues.apache.org/jira/browse/SPARK-3439 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Yu Ishikawa >Assignee: Muhammad-Ali A'rabi >Priority: Minor > > The canopy clustering algorithm is an unsupervised pre-clustering algorithm. > It is often used as a preprocessing step for the K-means algorithm or the > Hierarchical clustering algorithm. It is intended to speed up clustering > operations on large data sets, where using another algorithm directly may be > impractical due to the size of the data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
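[Editor's note] For readers unfamiliar with the algorithm: canopy clustering typically consumes a set of points plus two distance thresholds T1 > T2 and emits a set of canopy centers. A purely illustrative sketch of the kind of API shape the question is asking about (hypothetical names, not an agreed design):
{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical API sketch: input is an RDD of points plus the loose (T1)
// and tight (T2) thresholds; output is the set of canopy centers.
class Canopy(private var t1: Double, private var t2: Double) {
  require(t1 > t2, "T1 must be strictly greater than T2")

  def setT1(value: Double): this.type = { t1 = value; this }
  def setT2(value: Double): this.type = { t2 = value; this }

  /** Returns the canopy centers; every input point falls into >= 1 canopy. */
  def run(data: RDD[Vector]): Array[Vector] =
    throw new NotImplementedError("API sketch only; implementation omitted")
}
{code}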
[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture
[ https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292695#comment-14292695 ] Xiangrui Meng commented on SPARK-5400: -- I like `GaussianMixture` better. I don't think it is worth making a generic EM algorithm. It is too general and we won't benefit much from code reuse. > Rename GaussianMixtureEM to GaussianMixture > --- > > Key: SPARK-5400 > URL: https://issues.apache.org/jira/browse/SPARK-5400 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > > GaussianMixtureEM is following the old naming convention of including the > optimization algorithm name in the class title. We should probably rename it > to GaussianMixture so that it can use other optimization algorithms in the > future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4587: - Assignee: Joseph K. Bradley > Model export/import > --- > > Key: SPARK-4587 > URL: https://issues.apache.org/jira/browse/SPARK-4587 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for one of the most requested features on the user > mailing list. Model export/import can be done via Java serialization. But it > doesn't work for models stored in a distributed fashion, e.g., ALS and LDA. Ideally, we > should provide save/load methods for every model. PMML is an option but it has > its limitations. There are a couple of things we need to discuss: 1) the data format, > 2) how to preserve partitioning, 3) data compatibility between versions and > language APIs, etc. > UPDATE: [Design doc for model import/export | > https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing] > This document sketches machine learning model import/export plans, including > goals, an API, and development plans. > The design doc proposes: > * Support our own Spark-specific format. > ** This is needed to (a) support distributed models and (b) get model > import/export support into Spark quickly (while avoiding new dependencies). > * Also support PMML. > ** This is needed since it is the only thing approaching an industry standard. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
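[Editor's note] As a strawman for the save/load methods mentioned above, a hedged sketch with illustrative names (not the final API):
{code}
import org.apache.spark.SparkContext

// Hypothetical shape of per-model persistence: a model mixes in Saveable,
// and its companion object implements Loader to reconstruct it.
trait Saveable {
  /** Persist this model under `path` in a Spark-specific format. */
  def save(sc: SparkContext, path: String): Unit
}

trait Loader[M <: Saveable] {
  /** Reload a model previously written by `save`. */
  def load(sc: SparkContext, path: String): M
}
{code}
Passing the SparkContext is what would let distributed models (ALS, LDA) write their factor/topic RDDs out in partitioned form rather than collecting them to the driver.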
[jira] [Updated] (SPARK-1856) Standardize MLlib interfaces
[ https://issues.apache.org/jira/browse/SPARK-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1856: - Target Version/s: (was: 1.3.0) > Standardize MLlib interfaces > > > Key: SPARK-1856 > URL: https://issues.apache.org/jira/browse/SPARK-1856 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Blocker > > Instead of expanding MLlib based on the current class naming scheme > (ProblemWithAlgorithm), we should standardize MLlib's interfaces that > clearly separate datasets, formulations, algorithms, parameter sets, and > models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1856) Standardize MLlib interfaces
[ https://issues.apache.org/jira/browse/SPARK-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1856: - Priority: Critical (was: Blocker) > Standardize MLlib interfaces > > > Key: SPARK-1856 > URL: https://issues.apache.org/jira/browse/SPARK-1856 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > Instead of expanding MLlib based on the current class naming scheme > (ProblemWithAlgorithm), we should standardize MLlib's interfaces that > clearly separate datasets, formulations, algorithms, parameter sets, and > models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1486) Support multi-model training in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1486: - Target Version/s: (was: 1.3.0) > Support multi-model training in MLlib > - > > Key: SPARK-1486 > URL: https://issues.apache.org/jira/browse/SPARK-1486 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Burak Yavuz >Priority: Critical > > It is rare in practice to train just one model with a given set of > parameters. Usually, this is done by training multiple models with different > sets of parameters and then select the best based on their performance on the > validation set. MLlib should provide native support for multi-model > training/scoring. It requires decoupling of concepts like problem, > formulation, algorithm, parameter set, and model, which are missing in MLlib > now. MLI implements similar concepts, which we can borrow. There are > different approaches for multi-model training: > 0) Keep one copy of the data, and train models one after another (or maybe in > parallel, depending on the scheduler). > 1) Keep one copy of the data, and train multiple models at the same time > (similar to `runs` in KMeans). > 2) Make multiple copies of the data (still stored distributively), and use > more cores to distribute the work. > 3) Collect the data, make the entire dataset available on workers, and train > one or more models on each worker. > Users should be able to choose which execution mode they want to use. Note > that 3) could cover many use cases in practice when the training data is not > huge, e.g., <1GB. > This task will be divided into sub-tasks and this JIRA is created to discuss > the design and track the overall progress. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4589) ML add-ons to SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-4589. Resolution: Duplicate I'm closing this JIRA in favor of the DataFrame API. > ML add-ons to SchemaRDD > --- > > Key: SPARK-4589 > URL: https://issues.apache.org/jira/browse/SPARK-4589 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib, SQL >Reporter: Xiangrui Meng > > One feedback we received from the Pipeline API (SPARK-3530) is about the > boilerplate code in the implementation. We can add more Scala DSL to simplify > the code for the operations we need in ML. Those operations could live under > spark.ml via implicit, or be added to SchemaRDD directly if they are also > useful for general purpose. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5094) Python API for gradient-boosted trees
[ https://issues.apache.org/jira/browse/SPARK-5094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5094: - Assignee: Kazuki Taniguchi > Python API for gradient-boosted trees > - > > Key: SPARK-5094 > URL: https://issues.apache.org/jira/browse/SPARK-5094 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Kazuki Taniguchi > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3717: - Target Version/s: (was: 1.3.0) > DecisionTree, RandomForest: Partition by feature > > > Key: SPARK-3717 > URL: https://issues.apache.org/jira/browse/SPARK-3717 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > h1. Summary > Currently, data are partitioned by row/instance for DecisionTree and > RandomForest. This JIRA argues for partitioning by feature for training deep > trees. This is especially relevant for random forests, which are often > trained to be deeper than single decision trees. > h1. Details > Dataset dimensions and the depth of the tree to be trained are the main > problem parameters determining whether it is better to partition features or > instances. For random forests (training many deep trees), partitioning > features could be much better. > Notation: > * P = # workers > * N = # instances > * M = # features > * D = depth of tree > h2. Partitioning Features > Algorithm sketch: > * Each worker stores: > ** a subset of columns (i.e., a subset of features). If a worker stores > feature j, then the worker stores the feature value for all instances (i.e., > the whole column). > ** all labels > * Train one level at a time. > * Invariants: > ** Each worker stores a mapping: instance → node in current level > * On each iteration: > ** Each worker: For each node in level, compute (best feature to split, info > gain). > ** Reduce (P x M) values to M values to find best split for each node. > ** Workers who have features used in best splits communicate left/right for > relevant instances. Gather total of N bits to master, then broadcast. > * Total communication: > ** Depth D iterations > ** On each iteration, reduce to M values (~8 bytes each), broadcast N values > (1 bit each). > ** Estimate: D * (M * 8 + N) > h2. Partitioning Instances > Algorithm sketch: > * Train one group of nodes at a time. > * Invariants: > * Each worker stores a mapping: instance → node > * On each iteration: > ** Each worker: For each instance, add to aggregate statistics. > ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) > *** (“# classes” is for classification. 3 for regression) > ** Reduce aggregate. > ** Master chooses best split for each node in group and broadcasts. > * Local training: Once all instances for a node fit on one machine, it can be > best to shuffle data and training subtrees locally. This can mean shuffling > the entire dataset for each tree trained. > * Summing over all iterations, reduce to total of: > ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) > ** Estimate: 2^D * M * B * C * 8 > h2. Comparing Partitioning Methods > Partitioning features cost < partitioning instances cost when: > * D * (M * 8 + N) < 2^D * M * B * C * 8 > * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the > right hand side) > * N < [ 2^D * M * B * C * 8 ] / D > Example: many instances: > * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = > 5) > * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 > * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
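[Editor's note] The cost comparison in the description is easy to sanity-check numerically; a quick back-of-the-envelope sketch using the example's numbers:
{code}
// 2M instances, 3500 features, 100 bins, 5 classes, depth 5 (6 levels).
val (n, m, bins, classes) = (2e6, 3500.0, 100.0, 5.0)
val depth  = 5
val levels = depth + 1

val featureCost  = levels * (m * 8 + n)                        // D * (M*8 + N)
val instanceCost = math.pow(2, depth) * m * bins * classes * 8 // 2^D * M*B*C*8

println(f"partition by feature:  $featureCost%.3g")   // ~1.2e7
println(f"partition by instance: $instanceCost%.3g")  // ~4.5e8
{code}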
[jira] [Updated] (SPARK-5321) Add transpose() method to Matrix
[ https://issues.apache.org/jira/browse/SPARK-5321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5321: - Assignee: Burak Yavuz > Add transpose() method to Matrix > > > Key: SPARK-5321 > URL: https://issues.apache.org/jira/browse/SPARK-5321 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Burak Yavuz >Assignee: Burak Yavuz > > While we are working on BlockMatrix, it will be nice to add the support to > transpose matrices. .transpose() will just modify a private flag in local > matrices. Operations that follow will be performed based on this flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
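[Editor's note] A simplified sketch of the flag-based transpose described above (field names are illustrative): no data is copied, only the index arithmetic flips.
{code}
// Column-major local matrix with an isTransposed flag, as the description
// suggests. transpose is O(1): swap the dimensions, flip the flag, and
// reuse the same values array.
class LocalDenseMatrix(
    val numRows: Int,
    val numCols: Int,
    val values: Array[Double],
    val isTransposed: Boolean = false) {

  /** Element (i, j), honoring the transpose flag. */
  def apply(i: Int, j: Int): Double =
    if (!isTransposed) values(i + j * numRows) // column-major lookup
    else values(j + i * numCols)               // same array, indices swapped

  def transpose: LocalDenseMatrix =
    new LocalDenseMatrix(numCols, numRows, values, !isTransposed)
}
{code}
Operations that follow (e.g. BlockMatrix multiplies) can then consult the flag instead of materializing a transposed copy.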
[jira] [Updated] (SPARK-5114) Should Evaluator be a PipelineStage
[ https://issues.apache.org/jira/browse/SPARK-5114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5114: - Summary: Should Evaluator be a PipelineStage (was: Should Evaluator by a PipelineStage) > Should Evaluator be a PipelineStage > --- > > Key: SPARK-5114 > URL: https://issues.apache.org/jira/browse/SPARK-5114 > Project: Spark > Issue Type: Question > Components: ML >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > Pipelines can currently contain Estimators and Transformers. > Question for debate: Should Pipelines be able to contain Evaluators? > Pros: > * Evaluators take input datasets with particular schema, which should perhaps > be checked before running a Pipeline. > Cons: > * Evaluators do not transform datasets. They produce a scalar (or a few > values), which makes it hard to say how they fit into a Pipeline or a > PipelineModel. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5413) Upgrade "metrics" dependency to 3.1.0
[ https://issues.apache.org/jira/browse/SPARK-5413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams updated SPARK-5413: - Description: Spark currently uses Coda Hale's metrics library version {{3.0.0}}. Version {{3.1.0}} includes some useful improvements, like [batching metrics in TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] and supporting Graphite's UDP interface. I'd like to bump Spark's version to take advantage of these; I'm playing with GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. was: Spark currently uses Coda Hale's metrics library version {{3.0.0}}. Version {{3.1.0}} includes some useful improvements, like batching metrics in TCP and supporting Graphite's UDP interface. I'd like to bump Spark's version to take advantage of these; I'm playing with GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. > Upgrade "metrics" dependency to 3.1.0 > - > > Key: SPARK-5413 > URL: https://issues.apache.org/jira/browse/SPARK-5413 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Ryan Williams > > Spark currently uses Coda Hale's metrics library version {{3.0.0}}. > Version {{3.1.0}} includes some useful improvements, like [batching metrics > in > TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] > and supporting Graphite's UDP interface. > I'd like to bump Spark's version to take advantage of these; I'm playing with > GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5413) Upgrade "metrics" dependency to 3.1.0
Ryan Williams created SPARK-5413: Summary: Upgrade "metrics" dependency to 3.1.0 Key: SPARK-5413 URL: https://issues.apache.org/jira/browse/SPARK-5413 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Spark currently uses Coda Hale's metrics library version {{3.0.0}}. Version {{3.1.0}} includes some useful improvements, like batching metrics in TCP and supporting Graphite's UDP interface. I'd like to bump Spark's version to take advantage of these; I'm playing with GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5413) Upgrade "metrics" dependency to 3.1.0
[ https://issues.apache.org/jira/browse/SPARK-5413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams updated SPARK-5413: - Description: Spark currently uses Coda Hale's metrics library version {{3.0.0}}. Version {{3.1.0}} includes some useful improvements, like [batching metrics in TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] and [supporting Graphite's UDP interface|https://github.com/dropwizard/metrics/blob/v3.1.0/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteUDP.java]. I'd like to bump Spark's version to take advantage of these; I'm playing with GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. was: Spark currently uses Coda Hale's metrics library version {{3.0.0}}. Version {{3.1.0}} includes some useful improvements, like [batching metrics in TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] and supporting Graphite's UDP interface. I'd like to bump Spark's version to take advantage of these; I'm playing with GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. > Upgrade "metrics" dependency to 3.1.0 > - > > Key: SPARK-5413 > URL: https://issues.apache.org/jira/browse/SPARK-5413 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Ryan Williams > > Spark currently uses Coda Hale's metrics library version {{3.0.0}}. > Version {{3.1.0}} includes some useful improvements, like [batching metrics > in > TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] > and [supporting Graphite's UDP > interface|https://github.com/dropwizard/metrics/blob/v3.1.0/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteUDP.java]. > I'd like to bump Spark's version to take advantage of these; I'm playing with > GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
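[Editor's note] For reference, the bump is essentially a dependency-coordinate change; a hedged build.sbt sketch (note that the library's Maven groupId moved from com.codahale.metrics to io.dropwizard.metrics as of 3.1.0, so the upgrade is slightly more than a version-number edit):
{code}
// Sketch of the coordinate change in sbt syntax; Spark's actual build
// declares these in its Maven POM.
libraryDependencies ++= Seq(
  "io.dropwizard.metrics" % "metrics-core"     % "3.1.0",
  "io.dropwizard.metrics" % "metrics-graphite" % "3.1.0"
)
{code}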
[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError: Requested array size exceeds VM limit"
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292718#comment-14292718 ] Xiangrui Meng commented on SPARK-4846: -- [~josephtang] Are you working on this issue? If not, do you mind if I send a PR that throws an exception when vectorSize is beyond the limit? > When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError: > Requested array size exceeds VM limit" > --- > > Key: SPARK-4846 > URL: https://issues.apache.org/jira/browse/SPARK-4846 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.1.1, 1.2.0 > Environment: Use Word2Vec to process a corpus (sized 3.5G) with one > partition. > The corpus contains about 300 million words and its vocabulary size is about > 10 million. >Reporter: Joseph Tang >Assignee: Joseph Tang >Priority: Minor > > Exception in thread "Driver" java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) > Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit > at java.util.Arrays.copyOf(Arrays.java:2271) > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) > at > java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) > at > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) > at > java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) > at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) > at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) > at > org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
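[Editor's note] A hedged sketch of the kind of fail-fast check being proposed (not the actual PR): the model's weight matrix has vocabSize * vectorSize entries and, per the stack trace, gets serialized into a single byte-array buffer, so at roughly 8 bytes per entry it must stay under the JVM's maximum array length or the job dies with this OOM.
{code}
// Illustrative guard; the exact threshold and message in the real PR may differ.
val vocabSize = 10000000 // ~10M words, as in this report
val vectorSize = 100
require(
  vocabSize.toLong * vectorSize * 8L < Int.MaxValue,
  s"vocabSize ($vocabSize) * vectorSize ($vectorSize) is too large to serialize; " +
    "reduce vectorSize or prune the vocabulary")
{code}
With the numbers from this report, 1e7 * 100 * 8 = 8e9 bytes, well past the ~2.1e9 array-length limit, so the check would fail fast with a clear message instead of the OOM above.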
[jira] [Created] (SPARK-5414) Add SparkListener implementation that allows users to receive all listener events in one method
Josh Rosen created SPARK-5414: - Summary: Add SparkListener implementation that allows users to receive all listener events in one method Key: SPARK-5414 URL: https://issues.apache.org/jira/browse/SPARK-5414 Project: Spark Issue Type: New Feature Reporter: Josh Rosen Assignee: Josh Rosen Currently, users don't have a very good way to write a SparkListener that receives all SparkListener events and which will be future-compatible (e.g. it will receive events introduced in newer versions of Spark without having to override new methods to process those events). Therefore, I think Spark should include a concrete SparkListener implementation that implements all of the message-handling methods and dispatches all of them to a single {{onEvent}} method. By putting this code in Spark, we isolate users from changes to the listener API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
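[Editor's note] A simplified Scala sketch of the idea (not the actual PR):
{code}
import org.apache.spark.scheduler._

// Every callback forwards to a single onEvent method, so user subclasses
// keep compiling, and keep receiving events, when new callback types are
// added to Spark in later versions.
abstract class AllEventsListener extends SparkListener {
  def onEvent(event: SparkListenerEvent): Unit

  override def onStageSubmitted(e: SparkListenerStageSubmitted): Unit = onEvent(e)
  override def onStageCompleted(e: SparkListenerStageCompleted): Unit = onEvent(e)
  override def onTaskStart(e: SparkListenerTaskStart): Unit = onEvent(e)
  override def onTaskEnd(e: SparkListenerTaskEnd): Unit = onEvent(e)
  override def onJobStart(e: SparkListenerJobStart): Unit = onEvent(e)
  override def onJobEnd(e: SparkListenerJobEnd): Unit = onEvent(e)
  // ... and likewise for the remaining SparkListener callbacks
}
{code}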
[jira] [Commented] (SPARK-5413) Upgrade "metrics" dependency to 3.1.0
[ https://issues.apache.org/jira/browse/SPARK-5413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292719#comment-14292719 ] Apache Spark commented on SPARK-5413: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/4209 > Upgrade "metrics" dependency to 3.1.0 > - > > Key: SPARK-5413 > URL: https://issues.apache.org/jira/browse/SPARK-5413 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Ryan Williams > > Spark currently uses Coda Hale's metrics library version {{3.0.0}}. > Version {{3.1.0}} includes some useful improvements, like [batching metrics > in > TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] > and [supporting Graphite's UDP > interface|https://github.com/dropwizard/metrics/blob/v3.1.0/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteUDP.java]. > I'd like to bump Spark's version to take advantage of these; I'm playing with > GraphiteSink and seeing Graphite struggle to ingest all executors' metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4979) Add streaming logistic regression
[ https://issues.apache.org/jira/browse/SPARK-4979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4979: - Target Version/s: (was: 1.3.0) > Add streaming logistic regression > - > > Key: SPARK-4979 > URL: https://issues.apache.org/jira/browse/SPARK-4979 > Project: Spark > Issue Type: New Feature > Components: MLlib, Streaming >Reporter: Jeremy Freeman >Priority: Minor > > We currently support streaming linear regression and k-means clustering. We > can add support for streaming logistic regression using a strategy similar to > that used in streaming linear regression, applying gradient updates to > batches of data from a DStream, and extending the existing mllib methods with > minor modifications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
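[Editor's note] A hypothetical usage sketch mirroring the existing StreamingLinearRegressionWithSGD API; the class name below does not exist yet (it is the shape this JIRA proposes), and numFeatures, trainingStream, and testStream are assumed placeholders with DStream[LabeledPoint] inputs:
{code}
import org.apache.spark.mllib.linalg.Vectors

val model = new StreamingLogisticRegressionWithSGD() // hypothetical class
  .setStepSize(0.5)
  .setNumIterations(10)
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(trainingStream) // one gradient update per mini-batch
model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()
{code}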
[jira] [Commented] (SPARK-5414) Add SparkListener implementation that allows users to receive all listener events in one method
[ https://issues.apache.org/jira/browse/SPARK-5414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292720#comment-14292720 ] Apache Spark commented on SPARK-5414: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/4210 > Add SparkListener implementation that allows users to receive all listener > events in one method > --- > > Key: SPARK-5414 > URL: https://issues.apache.org/jira/browse/SPARK-5414 > Project: Spark > Issue Type: New Feature >Reporter: Josh Rosen >Assignee: Josh Rosen > > Currently, users don't have a very good way to write a SparkListener that > receives all SparkListener events and which will be future-compatible (e.g. > it will receive events introduced in newer versions of Spark without having > to override new methods to process those events). > Therefore, I think Spark should include a concrete SparkListener > implementation that implements all of the message-handling methods and > dispatches all of them to a single {{onEvent}} method. By putting this code > in Spark, we isolate users from changes to the listener API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5415) Upgrade sbt to 0.13.7
Ryan Williams created SPARK-5415: Summary: Upgrade sbt to 0.13.7 Key: SPARK-5415 URL: https://issues.apache.org/jira/browse/SPARK-5415 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor Spark currently uses sbt {{0.13.6}}, which has a regression related to processing parent POMs in Maven projects. {{0.13.7}} does not have this issue (though it's unclear whether it was fixed intentionally), so I'd like to bump up one version. I ran into this while locally building a Spark assembly against a locally-built "metrics" JAR dependency; {{0.13.6}} could not build Spark but {{0.13.7}} worked fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5414) Add SparkListener implementation that allows users to receive all listener events in one method
[ https://issues.apache.org/jira/browse/SPARK-5414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5414: -- Component/s: Spark Core > Add SparkListener implementation that allows users to receive all listener > events in one method > --- > > Key: SPARK-5414 > URL: https://issues.apache.org/jira/browse/SPARK-5414 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > Currently, users don't have a very good way to write a SparkListener that > receives all SparkListener events and which will be future-compatible (e.g. > it will receive events introduced in newer versions of Spark without having > to override new methods to process those events). > Therefore, I think Spark should include a concrete SparkListener > implementation that implements all of the message-handling methods and > dispatches all of them to a single {{onEvent}} method. By putting this code > in Spark, we isolate users from changes to the listener API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5415) Upgrade sbt to 0.13.7
[ https://issues.apache.org/jira/browse/SPARK-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292723#comment-14292723 ] Sean Owen commented on SPARK-5415: -- FWIW I use 0.13.7 locally and have had no problems. > Upgrade sbt to 0.13.7 > - > > Key: SPARK-5415 > URL: https://issues.apache.org/jira/browse/SPARK-5415 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Ryan Williams >Priority: Minor > > Spark currently uses sbt {{0.13.6}}, which has a regression related to > processing parent POM's in Maven projects. > {{0.13.7}} does not have this issue (though it's unclear whether it was fixed > intentionally), so I'd like to bump up one version. > I ran into this while locally building a Spark assembly against a > locally-built "metrics" JAR dependency; {{0.13.6}} could not build Spark but > {{0.13.7}} worked fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5415) Upgrade sbt to 0.13.7
[ https://issues.apache.org/jira/browse/SPARK-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292722#comment-14292722 ] Apache Spark commented on SPARK-5415: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/4211 > Upgrade sbt to 0.13.7 > - > > Key: SPARK-5415 > URL: https://issues.apache.org/jira/browse/SPARK-5415 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Ryan Williams >Priority: Minor > > Spark currently uses sbt {{0.13.6}}, which has a regression related to > processing parent POM's in Maven projects. > {{0.13.7}} does not have this issue (though it's unclear whether it was fixed > intentionally), so I'd like to bump up one version. > I ran into this while locally building a Spark assembly against a > locally-built "metrics" JAR dependency; {{0.13.6}} could not build Spark but > {{0.13.7}} worked fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5409) Broken link in documentation
[ https://issues.apache.org/jira/browse/SPARK-5409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5409. -- Resolution: Duplicate Actually this was already fixed > Broken link in documentation > > > Key: SPARK-5409 > URL: https://issues.apache.org/jira/browse/SPARK-5409 > Project: Spark > Issue Type: Documentation >Reporter: Mauro Pirrone >Priority: Minor > > https://spark.apache.org/docs/1.2.0/streaming-kafka-integration.html > "See the API docs and the example." > Link to example is broken. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5261) In some cases, the value of a word's vector representation is too big
[ https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292750#comment-14292750 ] Kai Sasaki commented on SPARK-5261: --- [~gq] Can you provide us with the data set? I tried several numbers of partitions but could not reproduce it. > In some cases, the value of a word's vector representation is too big > --- > > Key: SPARK-5261 > URL: https://issues.apache.org/jira/browse/SPARK-5261 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Guoqiang Li > > {code} > val word2Vec = new Word2Vec() > word2Vec. > setVectorSize(100). > setSeed(42L). > setNumIterations(5). > setNumPartitions(36) > {code} > The average absolute value of the word's vector representation is 60731.8 > {code} > val word2Vec = new Word2Vec() > word2Vec. > setVectorSize(100). > setSeed(42L). > setNumIterations(5). > setNumPartitions(1) > {code} > The average absolute value of the word's vector representation is 0.13889 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5395) Large number of Python workers causing resource depletion
[ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292769#comment-14292769 ] Mark Khaitman commented on SPARK-5395: -- This may prove useful... I'm watching a currently running spark-submitted job along with its pyspark.daemon processes. The framework is permitted to use only 8 cores on each node, with the default Python worker memory of 512mb per node (not the executor memory, which is set higher than this). Ignoring the exact RDD operations for a moment, it looks like while transitioning from Stage 1 to Stage 2, it spawned 8-10 additional pyspark.daemon processes, making the box use more cores than it was even allowed to... A few seconds after that, the other 8 processes entered a sleeping state while still holding onto the physical memory they ate up in Stage 1. As soon as Stage 2 finished, practically all of the pyspark.daemon processes vanished and freed up the memory. I was keeping an eye on 2 random nodes, and the exact same thing occurred on both. It was also the only job executing at the time, so there was really no other interference/contention for resources. I will try to provide a bit more detail on the exact transformations/actions occurring between the 2 stages, although even without inspecting the spark-submitted code directly I know that at the very least a partitionBy and a cogroup are occurring. > Large number of Python workers causing resource depletion > - > > Key: SPARK-5395 > URL: https://issues.apache.org/jira/browse/SPARK-5395 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0 > Environment: AWS ElasticMapReduce >Reporter: Sven Krasser > > During job execution a large number of Python workers accumulates, eventually > causing YARN to kill containers for being over their memory allocation (in > the case below that is about 8G for executors plus 6G for overhead per > container). > In this instance, at the time of killing the container 97 pyspark.daemon > processes had accumulated. > {noformat} > 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler > (Logging.scala:logInfo(59)) - Container marked as failed: > container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: > Container [pid=35211,containerID=container_1421692415636_0052_01_30] is > running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB > physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing > container. > Dump of the process-tree for container_1421692415636_0052_01_30 : > |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) > VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE > |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m > pyspark.daemon > |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m > pyspark.daemon > |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m > pyspark.daemon > [...] > {noformat} > The configuration uses 64 containers with 2 cores each. > Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c > Mailinglist discussion: > https://www.mail-archive.com/user@spark.apache.org/msg20102.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5416) Initialize Executor.threadPool before ExecutorSource
Ryan Williams created SPARK-5416: Summary: Initialize Executor.threadPool before ExecutorSource Key: SPARK-5416 URL: https://issues.apache.org/jira/browse/SPARK-5416 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor I recently saw some NPEs from [{{ExecutorSource:44}}|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/ExecutorSource.scala#L44] in the first couple of seconds after my executors were initialized. I think that {{ExecutorSource}} was trying to report these metrics before its thread pool was initialized; there are a few LoC between the source being registered ([Executor.scala:82|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L82]) and the thread pool being initialized ([Executor.scala:106|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L106]). We should initialize the thread pool before the ExecutorSource is registered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
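[Editor's note] A minimal sketch of the failure mode (simplified from the linked lines, not the actual code): a gauge that dereferences a pool which has not been constructed yet will NPE on the first metrics poll, so the pool must exist before the source is registered.
{code}
import java.util.concurrent.ThreadPoolExecutor
import com.codahale.metrics.{Gauge, MetricRegistry}

// `pool` is by-name so the gauge re-reads it on every poll; if the source
// is registered while the underlying field is still null, getActiveCount
// throws the NPE described above.
class PoolSource(pool: => ThreadPoolExecutor, registry: MetricRegistry) {
  registry.register("threadpool.activeTasks", new Gauge[Int] {
    override def getValue: Int = pool.getActiveCount // NPE if pool not yet set
  })
}
{code}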
[jira] [Commented] (SPARK-5416) Initialize Executor.threadPool before ExecutorSource
[ https://issues.apache.org/jira/browse/SPARK-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292784#comment-14292784 ] Apache Spark commented on SPARK-5416: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/4212 > Initialize Executor.threadPool before ExecutorSource > > > Key: SPARK-5416 > URL: https://issues.apache.org/jira/browse/SPARK-5416 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Ryan Williams >Priority: Minor > > I recently saw some NPEs from > [{{ExecutorSource:44}}|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/ExecutorSource.scala#L44] > in the first couple of seconds after my executors were initialized. > I think that {{ExecutorSource}} was trying to report these metrics before its > thread pool was initialized; there are a few LoC between the source being > registered > ([Executor.scala:82|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L82]) > and the thread pool being initialized > ([Executor.scala:106|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L106]). > We should initialize the thread pool before the ExecutorSource is registered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5417) Remove redundant executor-ID set() call
Ryan Williams created SPARK-5417: Summary: Remove redundant executor-ID set() call Key: SPARK-5417 URL: https://issues.apache.org/jira/browse/SPARK-5417 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor {{spark.executor.id}} no longer [needs to be set in Executor.scala|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L79], as of [#4194|https://github.com/apache/spark/pull/4194]; it is set upstream in [SparkEnv|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/SparkEnv.scala#L332]. Might as well remove the redundant set() in Executor.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5417) Remove redundant executor-ID set() call
[ https://issues.apache.org/jira/browse/SPARK-5417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292790#comment-14292790 ] Apache Spark commented on SPARK-5417: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/4213 > Remove redundant executor-ID set() call > --- > > Key: SPARK-5417 > URL: https://issues.apache.org/jira/browse/SPARK-5417 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Ryan Williams >Priority: Minor > > {{spark.executor.id}} no longer [needs to be set in > Executor.scala|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L79], > as of [#4194|https://github.com/apache/spark/pull/4194]; it is set upstream > in > [SparkEnv|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/SparkEnv.scala#L332]. > Might as well remove the redundant set() in Executor.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan updated SPARK-3880: --- Attachment: (was: SparkSQLOnHBase_v2.docx) > HBase as data source to SparkSQL > > > Key: SPARK-3880 > URL: https://issues.apache.org/jira/browse/SPARK-3880 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Yan >Assignee: Yan > Attachments: HBaseOnSpark.docx > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3562) Periodic cleanup of event logs
[ https://issues.apache.org/jira/browse/SPARK-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292818#comment-14292818 ] Apache Spark commented on SPARK-3562: - User 'viper-kun' has created a pull request for this issue: https://github.com/apache/spark/pull/4214 > Periodic cleanup of event logs > --- > > Key: SPARK-3562 > URL: https://issues.apache.org/jira/browse/SPARK-3562 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: xukun > > If we run Spark applications frequently, many Spark event logs will be written > into spark.eventLog.dir. After a long time, spark.eventLog.dir will contain many > event logs that we no longer care about. Periodic cleanup will ensure that logs > older than a configured duration are forgotten, so there is no need to clean up > logs by hand. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
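[Editor's note] An illustrative local-filesystem sketch of the cleanup idea (the linked PR targets the real event-log storage, e.g. HDFS; names here are assumptions, not the PR's code):
{code}
import java.io.File

// Forget event logs older than maxAgeMs. A real implementation would
// recursively delete the matched directories and log each deletion.
def cleanEventLogs(eventLogDir: File, maxAgeMs: Long): Unit = {
  val cutoff = System.currentTimeMillis() - maxAgeMs
  Option(eventLogDir.listFiles()).getOrElse(Array.empty[File])
    .filter(_.lastModified() < cutoff)
    .foreach(f => println(s"stale event log: ${f.getPath}"))
}
{code}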
[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan updated SPARK-3880: --- Attachment: SparkSQLOnHBase_v2.0.docx > HBase as data source to SparkSQL > > > Key: SPARK-3880 > URL: https://issues.apache.org/jira/browse/SPARK-3880 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Yan >Assignee: Yan > Attachments: HBaseOnSpark.docx, SparkSQLOnHBase_v2.0.docx > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org