[jira] [Created] (SPARK-5408) MaxPermGen is ignored by ExecutorRunner and DriverRunner

2015-01-26 Thread Jacek Lewandowski (JIRA)
Jacek Lewandowski created SPARK-5408:


 Summary: MaxPermGen is ignored by ExecutorRunner and DriverRunner
 Key: SPARK-5408
 URL: https://issues.apache.org/jira/browse/SPARK-5408
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Jacek Lewandowski


ExecutorRunner and DriverRunner use CommandUtils to build the command which 
runs the executor or driver. The problem is that CommandUtils hardcodes 
{{-XX:MaxPermGen=128m}} and uses it regardless of whether the option is 
specified in extraJavaOpts or not. 







[jira] [Updated] (SPARK-5408) MaxPermSize is ignored by ExecutorRunner and DriverRunner

2015-01-26 Thread Jacek Lewandowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Lewandowski updated SPARK-5408:
-
Summary: MaxPermSize is ignored by ExecutorRunner and DriverRunner  (was: 
MaxPermGen is ignored by ExecutorRunner and DriverRunner)

> MaxPermSize is ignored by ExecutorRunner and DriverRunner
> -
>
> Key: SPARK-5408
> URL: https://issues.apache.org/jira/browse/SPARK-5408
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Jacek Lewandowski
>
> ExecutorRunner and DriverRunner use CommandUtils to build the command which 
> runs the executor or driver. The problem is that CommandUtils hardcodes 
> {{-XX:MaxPermGen=128m}} and uses it regardless of whether the option is 
> specified in extraJavaOpts or not. 






[jira] [Updated] (SPARK-5408) MaxPermSize is ignored by ExecutorRunner and DriverRunner

2015-01-26 Thread Jacek Lewandowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Lewandowski updated SPARK-5408:
-
Description: 
ExecutorRunner and DriverRunner use CommandUtils to build the command which 
runs the executor or driver. The problem is that CommandUtils hardcodes 
{{-XX:MaxPermSize=128m}} and uses it regardless of whether the option is 
specified in extraJavaOpts or not. 


  was:
ExecutorRunner and DriverRunner use CommandUtils to build the command which 
runs the executor or driver. The problem is that CommandUtils hardcodes 
{{-XX:MaxPermGen=128m}} and uses it regardless of whether the option is 
specified in extraJavaOpts or not. 



> MaxPermSize is ignored by ExecutorRunner and DriverRunner
> -
>
> Key: SPARK-5408
> URL: https://issues.apache.org/jira/browse/SPARK-5408
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Jacek Lewandowski
>
> ExecutorRunner and DriverRunner use CommandUtils to build the command which 
> runs the executor or driver. The problem is that CommandUtils hardcodes 
> {{-XX:MaxPermSize=128m}} and uses it regardless of whether the option is 
> specified in extraJavaOpts or not. 
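
For reference, a minimal sketch of the intended behaviour (assumed shape only, 
not the actual CommandUtils code): the default permgen setting should be 
appended only when the user has not already supplied one via extraJavaOpts.

{code}
// Sketch only: permGenOpt is a hypothetical helper, not the real CommandUtils API.
def permGenOpt(extraJavaOpts: Seq[String]): Seq[String] = {
  if (extraJavaOpts.exists(_.contains("-XX:MaxPermSize"))) {
    Seq.empty                    // respect the user-supplied setting
  } else {
    Seq("-XX:MaxPermSize=128m")  // fall back to the current default
  }
}
{code}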






[jira] [Commented] (SPARK-5408) MaxPermSize is ignored by ExecutorRunner and DriverRunner

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291581#comment-14291581
 ] 

Apache Spark commented on SPARK-5408:
-

User 'jacek-lewandowski' has created a pull request for this issue:
https://github.com/apache/spark/pull/4202

> MaxPermSize is ignored by ExecutorRunner and DriverRunner
> -
>
> Key: SPARK-5408
> URL: https://issues.apache.org/jira/browse/SPARK-5408
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Jacek Lewandowski
>
> ExecutorRunner and DriverRunner use CommandUtils to build the command which 
> runs the executor or driver. The problem is that CommandUtils hardcodes 
> {{-XX:MaxPermSize=128m}} and uses it regardless of whether the option is 
> specified in extraJavaOpts or not. 






[jira] [Updated] (SPARK-5408) MaxPermSize is ignored by ExecutorRunner and DriverRunner

2015-01-26 Thread Jacek Lewandowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Lewandowski updated SPARK-5408:
-
Fix Version/s: 1.2.1
   1.3.0

> MaxPermSize is ignored by ExecutorRunner and DriverRunner
> -
>
> Key: SPARK-5408
> URL: https://issues.apache.org/jira/browse/SPARK-5408
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Jacek Lewandowski
> Fix For: 1.3.0, 1.2.1
>
>
> ExecutorRunner and DriverRunner use CommandUtils to build the command which 
> runs the executor or driver. The problem is that CommandUtils hardcodes 
> {{-XX:MaxPermSize=128m}} and uses it regardless of whether the option is 
> specified in extraJavaOpts or not. 






[jira] [Commented] (SPARK-5408) MaxPermSize is ignored by ExecutorRunner and DriverRunner

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291583#comment-14291583
 ] 

Apache Spark commented on SPARK-5408:
-

User 'jacek-lewandowski' has created a pull request for this issue:
https://github.com/apache/spark/pull/4203

> MaxPermSize is ignored by ExecutorRunner and DriverRunner
> -
>
> Key: SPARK-5408
> URL: https://issues.apache.org/jira/browse/SPARK-5408
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Jacek Lewandowski
> Fix For: 1.3.0, 1.2.1
>
>
> ExecutorRunner and DriverRunner use CommandUtils to build the command which 
> runs the executor or driver. The problem is that CommandUtils hardcodes 
> {{-XX:MaxPermSize=128m}} and uses it regardless of whether the option is 
> specified in extraJavaOpts or not. 






[jira] [Commented] (SPARK-5408) MaxPermSize is ignored by ExecutorRunner and DriverRunner

2015-01-26 Thread Jacek Lewandowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291585#comment-14291585
 ] 

Jacek Lewandowski commented on SPARK-5408:
--

Can anybody take a look?

> MaxPermSize is ignored by ExecutorRunner and DriverRunner
> -
>
> Key: SPARK-5408
> URL: https://issues.apache.org/jira/browse/SPARK-5408
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Jacek Lewandowski
> Fix For: 1.3.0, 1.2.1
>
>
> ExecutorRunner and DriverRunner use CommandUtils to build the command which 
> runs the executor or driver. The problem is that CommandUtils hardcodes 
> {{-XX:MaxPermSize=128m}} and uses it regardless of whether the option is 
> specified in extraJavaOpts or not. 






[jira] [Updated] (SPARK-5406) LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound

2015-01-26 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-5406:
--
Description: 
In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. 
Yet breeze's svd for dense matrices has a latent constraint. In its implementation
(https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala):

  val workSize = ( 3
* scala.math.min(m, n)
* scala.math.min(m, n)
+ scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
  * scala.math.min(m, n) + 4 * scala.math.min(m, n))
  )
  val work = new Array[Double](workSize)

As a result, the column count n must satisfy 7 * n * n + 4 * n < Int.MaxValue, 
thus n < 17515.

This JIRA is only the first step. If possible, I hope Spark can handle matrix 
computation up to 80K * 80K.


  was:
In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. 
Yet breeze's svd for dense matrices has a latent constraint. In its implementation:

  val workSize = ( 3
* scala.math.min(m, n)
* scala.math.min(m, n)
+ scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
  * scala.math.min(m, n) + 4 * scala.math.min(m, n))
  )
  val work = new Array[Double](workSize)

As a result, the column count n must satisfy 7 * n * n + 4 * n < Int.MaxValue, 
thus n < 17515.

This JIRA is only the first step. If possible, I hope Spark can handle matrix 
computation up to 80K * 80K.



> LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound
> -
>
> Key: SPARK-5406
> URL: https://issues.apache.org/jira/browse/SPARK-5406
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: centos, others should be similar
>Reporter: yuhao yang
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes 
> brzSvd. Yet breeze's svd for dense matrices has a latent constraint. In its 
> implementation
> (https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala):
>   val workSize = ( 3
> * scala.math.min(m, n)
> * scala.math.min(m, n)
> + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
>   * scala.math.min(m, n) + 4 * scala.math.min(m, n))
>   )
>   val work = new Array[Double](workSize)
> As a result, the column count n must satisfy 7 * n * n + 4 * n < Int.MaxValue, 
> thus n < 17515.
> This JIRA is only the first step. If possible, I hope Spark can handle matrix 
> computation up to 80K * 80K.
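
A quick sanity check of that bound (a standalone sketch; it simply evaluates 
the description's reduced expression 7 * n * n + 4 * n against Int.MaxValue, 
the limit on the length of a JVM Array[Double]):

{code}
// Largest n for which the breeze work array size still fits in an Int.
val maxN = (1 to 20000).takeWhile(n => 7L * n * n + 4L * n < Int.MaxValue).last
println(maxN)  // 17514, i.e. the column count must be < 17515 as stated above
{code}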






[jira] [Updated] (SPARK-5406) LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound

2015-01-26 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-5406:
--
Description: 
In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. 
Yet breeze's svd for dense matrices has a latent constraint. In its implementation
( 
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala
 ):

  val workSize = ( 3
* scala.math.min(m, n)
* scala.math.min(m, n)
+ scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
  * scala.math.min(m, n) + 4 * scala.math.min(m, n))
  )
  val work = new Array[Double](workSize)

As a result, the column count n must satisfy 7 * n * n + 4 * n < Int.MaxValue, 
thus n < 17515.

This JIRA is only the first step. If possible, I hope Spark can handle matrix 
computation up to 80K * 80K.


  was:
In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. 
Yet breeze's svd for dense matrices has a latent constraint. In its implementation
(https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala):

  val workSize = ( 3
* scala.math.min(m, n)
* scala.math.min(m, n)
+ scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
  * scala.math.min(m, n) + 4 * scala.math.min(m, n))
  )
  val work = new Array[Double](workSize)

As a result, the column count n must satisfy 7 * n * n + 4 * n < Int.MaxValue, 
thus n < 17515.

This JIRA is only the first step. If possible, I hope Spark can handle matrix 
computation up to 80K * 80K.



> LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound
> -
>
> Key: SPARK-5406
> URL: https://issues.apache.org/jira/browse/SPARK-5406
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: centos, others should be similar
>Reporter: yuhao yang
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes 
> brzSvd. Yet breeze's svd for dense matrices has a latent constraint. In its 
> implementation
> ( 
> https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala
> ):
>   val workSize = ( 3
> * scala.math.min(m, n)
> * scala.math.min(m, n)
> + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
>   * scala.math.min(m, n) + 4 * scala.math.min(m, n))
>   )
>   val work = new Array[Double](workSize)
> As a result, the column count n must satisfy 7 * n * n + 4 * n < Int.MaxValue, 
> thus n < 17515.
> This JIRA is only the first step. If possible, I hope Spark can handle matrix 
> computation up to 80K * 80K.






[jira] [Commented] (SPARK-5300) Spark loads file partitions in inconsistent order on native filesystems

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291605#comment-14291605
 ] 

Apache Spark commented on SPARK-5300:
-

User 'ehiggs' has created a pull request for this issue:
https://github.com/apache/spark/pull/4204

> Spark loads file partitions in inconsistent order on native filesystems
> ---
>
> Key: SPARK-5300
> URL: https://issues.apache.org/jira/browse/SPARK-5300
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.1.0, 1.2.0
> Environment: Linux, EXT4, for example.
>Reporter: Ewan Higgs
>
> Discussed on user list in April 2014:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html
> And on dev list January 2015:
> http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html
> When using a file system which isn't HDFS, file partitions ('part-0, 
> part-1', etc.) are not guaranteed to load in the same order. This means 
> previously sorted RDDs will be loaded out of order. 
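
As an illustration of the workaround side of this (a sketch assuming a 
Hadoop-visible directory of part files, not necessarily the approach taken in 
the pull request): listing the part files and loading them in sorted name 
order restores a deterministic partition order.

{code}
// Sketch only: load part files in lexicographic order so partition order is
// deterministic on filesystems whose directory listing order is arbitrary.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

def loadInSortedOrder(sc: SparkContext, dir: String) = {
  val fs = FileSystem.get(sc.hadoopConfiguration)
  val parts = fs.listStatus(new Path(dir))
    .map(_.getPath.toString)
    .filter(_.contains("part-"))
    .sorted                                  // part-00000, part-00001, ...
  parts.map(sc.textFile(_)).reduce(_ ++ _)   // union in the sorted order
}
{code}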






[jira] [Commented] (SPARK-3789) Python bindings for GraphX

2015-01-26 Thread Kushal Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291710#comment-14291710
 ] 

Kushal Datta commented on SPARK-3789:
-

Hi Ameet,

I have created the first pull request: https://github.com/apache/spark/pull/4205

This one does not contain the Pregel API for graphs, for which I will create a 
pull request within this week.

Thanks,
-Kushal.

> Python bindings for GraphX
> --
>
> Key: SPARK-3789
> URL: https://issues.apache.org/jira/browse/SPARK-3789
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, PySpark
>Reporter: Ameet Talwalkar
>Assignee: Kushal Datta
>







[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Murat Eken (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291724#comment-14291724
 ] 

Murat Eken commented on SPARK-2389:
---

+1. We're using a Spark cluster as a real-time query engine, and unfortunately 
we're running into the same issues as Robert mentions. Although Spark provides 
a plethora of solutions when it comes to making its cluster fault-tolerant and 
resilient, we need the same resilience for the front layer, from where the 
Spark cluster is accessed; meaning multiple instances of Spark clients, hence 
multiple SparkContexts from those clients connecting to the same cluster with 
the same computing power.

Performance is crucial for us, hence our choice for caching the data in memory 
and utilizing the full hardware resources in the executors. Alternative 
solutions, such as using Tachyon for the data, and restarting executors for 
each query just don't give the same performance. We're looking into using 
https://github.com/spark-jobserver/spark-jobserver but that's not a proper 
solution as we still would have the jobserver as a single point of failure in 
our setup, which is a problem for us.

I'd appreciate it if a Spark developer could give some information about the 
feasibility of this change request; if this turns out to be difficult or even 
impossible due to the choices made in the architecture, it would be good to 
know that so that we can consider our alternatives.

> globally shared SparkContext / shared Spark "application"
> -
>
> Key: SPARK-2389
> URL: https://issues.apache.org/jira/browse/SPARK-2389
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Robert Stupp
>
> The documentation (in Cluster Mode Overview) cites:
> bq. Each application gets its own executor processes, which *stay up for the 
> duration of the whole application* and run tasks in multiple threads. This 
> has the benefit of isolating applications from each other, on both the 
> scheduling side (each driver schedules its own tasks) and executor side 
> (tasks from different applications run in different JVMs). However, it also 
> means that *data cannot be shared* across different Spark applications 
> (instances of SparkContext) without writing it to an external storage system.
> IMO this is a limitation that should be lifted to support any number of 
> --driver-- client processes to share executors and to share (persistent / 
> cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web 
> app servers) that want to use Spark as a _big computing machine_. Most 
> important is the fact that Spark is quite good in caching/persisting data in 
> memory / on disk thus removing load from backend data stores.
> Means: it would be really great to let different --driver-- client JVMs 
> operate on the same RDDs and benefit from Spark's caching/persistence.
> It would however introduce some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of cpus, etc) on 
> the fly
> * stop a shared context
> Even "conventional" batch MR applications would benefit if run frequently 
> against the same data set.
> As an implicit requirement, RDD persistence could get a TTL for its 
> materialized state.
> With such a feature the overall performance of today's web applications could 
> then be increased by adding more web app servers, more spark nodes, more 
> nosql nodes etc






[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291737#comment-14291737
 ] 

Sean Owen commented on SPARK-2389:
--

Why can't N front-ends talk to a process built around one long-running Spark 
app? I think that's what the OP is talking about, and can be done right now. 
One Spark app having many contexts doesn't quite make sense as an app is a 
SparkContext.

But [~meken] are you really talking about HA for the driver?

> globally shared SparkContext / shared Spark "application"
> -
>
> Key: SPARK-2389
> URL: https://issues.apache.org/jira/browse/SPARK-2389
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Robert Stupp
>
> The documentation (in Cluster Mode Overview) cites:
> bq. Each application gets its own executor processes, which *stay up for the 
> duration of the whole application* and run tasks in multiple threads. This 
> has the benefit of isolating applications from each other, on both the 
> scheduling side (each driver schedules its own tasks) and executor side 
> (tasks from different applications run in different JVMs). However, it also 
> means that *data cannot be shared* across different Spark applications 
> (instances of SparkContext) without writing it to an external storage system.
> IMO this is a limitation that should be lifted to support any number of 
> --driver-- client processes to share executors and to share (persistent / 
> cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web 
> app servers) that want to use Spark as a _big computing machine_. Most 
> important is the fact that Spark is quite good in caching/persisting data in 
> memory / on disk thus removing load from backend data stores.
> Means: it would be really great to let different --driver-- client JVMs 
> operate on the same RDDs and benefit from Spark's caching/persistence.
> It would however introduce some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of cpus, etc) on 
> the fly
> * stop a shared context
> Even "conventional" batch MR applications would benefit if run frequently 
> against the same data set.
> As an implicit requirement, RDD persistence could get a TTL for its 
> materialized state.
> With such a feature the overall performance of today's web applications could 
> then be increased by adding more web app servers, more spark nodes, more 
> nosql nodes etc






[jira] [Commented] (SPARK-5399) tree Losses strings should match loss names

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291741#comment-14291741
 ] 

Apache Spark commented on SPARK-5399:
-

User 'Lewuathe' has created a pull request for this issue:
https://github.com/apache/spark/pull/4206

> tree Losses strings should match loss names
> ---
>
> Key: SPARK-5399
> URL: https://issues.apache.org/jira/browse/SPARK-5399
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0, 1.2.1
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> tree.loss.Losses.fromString expects certain String names for losses.  These 
> do not match the names of the loss classes but should.  I believe these 
> strings were the original names of the losses, and we forgot to correct the 
> strings when we renamed the losses.
> Currently:
> {code}
> case "leastSquaresError" => SquaredError
> case "leastAbsoluteError" => AbsoluteError
> case "logLoss" => LogLoss
> {code}
> Proposed:
> {code}
> case "SquaredError" => SquaredError
> case "AbsoluteError" => AbsoluteError
> case "LogLoss" => LogLoss
> {code}
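
A minimal sketch of how the rename could be handled without breaking existing 
callers (assumed shape only, not necessarily what the linked pull request 
does): keep the legacy strings as aliases for the new names.

{code}
// Sketch only: accept both the proposed class-style names and the legacy strings.
import org.apache.spark.mllib.tree.loss.{AbsoluteError, LogLoss, Loss, SquaredError}

def fromString(name: String): Loss = name match {
  case "SquaredError" | "leastSquaresError"   => SquaredError
  case "AbsoluteError" | "leastAbsoluteError" => AbsoluteError
  case "LogLoss" | "logLoss"                  => LogLoss
  case other => throw new IllegalArgumentException(s"Did not recognize loss name: $other")
}
{code}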






[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Murat Eken (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291755#comment-14291755
 ] 

Murat Eken commented on SPARK-2389:
---

Yes [~sowen], it's about HA for the driver. Our approach is to have a single 
app that's responsible for initializing the cache at start-up (quite expensive) 
and then serving queries on that cached data (very fast).

When you mention  "N front-ends talking to a process built around one long 
running Spark app that can be done right now", are you referring to something 
like the spark-jobserver (or any alternative) I mentioned? If yes, the problem 
with that is the single point of failure, as we're moving that from the driver 
to the jobserver instance. Or is there something else we've missed?

> globally shared SparkContext / shared Spark "application"
> -
>
> Key: SPARK-2389
> URL: https://issues.apache.org/jira/browse/SPARK-2389
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Robert Stupp
>
> The documentation (in Cluster Mode Overview) cites:
> bq. Each application gets its own executor processes, which *stay up for the 
> duration of the whole application* and run tasks in multiple threads. This 
> has the benefit of isolating applications from each other, on both the 
> scheduling side (each driver schedules its own tasks) and executor side 
> (tasks from different applications run in different JVMs). However, it also 
> means that *data cannot be shared* across different Spark applications 
> (instances of SparkContext) without writing it to an external storage system.
> IMO this is a limitation that should be lifted to support any number of 
> --driver-- client processes to share executors and to share (persistent / 
> cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web 
> app servers) that want to use Spark as a _big computing machine_. Most 
> important is the fact that Spark is quite good in caching/persisting data in 
> memory / on disk thus removing load from backend data stores.
> Means: it would be really great to let different --driver-- client JVMs 
> operate on the same RDDs and benefit from Spark's caching/persistence.
> It would however introduce some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of cpus, etc) on 
> the fly
> * stop a shared context
> Even "conventional" batch MR applications would benefit if run frequently 
> against the same data set.
> As an implicit requirement, RDD persistence could get a TTL for its 
> materialized state.
> With such a feature the overall performance of today's web applications could 
> then be increased by adding more web app servers, more spark nodes, more 
> nosql nodes etc






[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291764#comment-14291764
 ] 

Robert Stupp commented on SPARK-2389:
-

[~srowen] that *one long-running* Spark app is the problem. It's a SPOF and it 
does not scale.

It would be great to have *some* Spark apps sharing the same data set (thus 
reducing load on backend data stores and benefiting from RDD caching).
The "real" clients could then talk (via REST or whatever the "real" application 
does) to these Spark apps.
I don't have anything in mind regarding "shared sessions" - I'd just like to 
have multiple Spark clients access the same RDDs.

As [~meken] points out, you have to preload the cache first (which is 
expensive) before you can use it.
(I don't mind having that cached data considered immutable.)

> globally shared SparkContext / shared Spark "application"
> -
>
> Key: SPARK-2389
> URL: https://issues.apache.org/jira/browse/SPARK-2389
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Robert Stupp
>
> The documentation (in Cluster Mode Overview) cites:
> bq. Each application gets its own executor processes, which *stay up for the 
> duration of the whole application* and run tasks in multiple threads. This 
> has the benefit of isolating applications from each other, on both the 
> scheduling side (each driver schedules its own tasks) and executor side 
> (tasks from different applications run in different JVMs). However, it also 
> means that *data cannot be shared* across different Spark applications 
> (instances of SparkContext) without writing it to an external storage system.
> IMO this is a limitation that should be lifted to support any number of 
> --driver-- client processes to share executors and to share (persistent / 
> cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web 
> app servers) that want to use Spark as a _big computing machine_. Most 
> important is the fact that Spark is quite good in caching/persisting data in 
> memory / on disk thus removing load from backend data stores.
> Means: it would be really great to let different --driver-- client JVMs 
> operate on the same RDDs and benefit from Spark's caching/persistence.
> It would however introduce some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of cpus, etc) on 
> the fly
> * stop a shared context
> Even "conventional" batch MR applications would benefit if run frequently 
> against the same data set.
> As an implicit requirement, RDD persistence could get a TTL for its 
> materialized state.
> With such a feature the overall performance of today's web applications could 
> then be increased by adding more web app servers, more spark nodes, more 
> nosql nodes etc






[jira] [Closed] (SPARK-5407) No 1.2 AMI available for ec2

2015-01-26 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-5407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Håkan Jonsson closed SPARK-5407.

Resolution: Invalid

Error on my side.

> No 1.2 AMI available for ec2
> 
>
> Key: SPARK-5407
> URL: https://issues.apache.org/jira/browse/SPARK-5407
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.2.0
>Reporter: Håkan Jonsson
>
> When I try to launch a standalone cluster on EC2 using the scripts in the ec2 
> directory for Spark 1.2, (./spark-ec2 -k spark -i k.pem launch my12), I get 
> the following error: 
> Could not resolve AMI at: 
> https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm
> It seems there is not yet any AMI available on EC2 for Spark 1.2.
> This works well for Spark 1.1.






[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291788#comment-14291788
 ] 

Sean Owen commented on SPARK-2389:
--

Yes, the SPOF problem makes sense. It doesn't seem to be what this JIRA was 
about though, which seems to be what the jobserver-style approach addresses.

That aside, why doesn't it scale? Because of work that needs to be done on the 
driver? You can of course still run a bunch of drivers, just not one per client.

The preloading cache issue is what off-heap caching in Tachyon or HDFS is 
supposed to ameliorate.
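
For context, a minimal sketch of the off-heap option mentioned here (Spark 
1.2-era API; the input path is hypothetical): persisting with 
StorageLevel.OFF_HEAP keeps the cached blocks in Tachyon rather than in the 
executor heap, at the cost of the serialization overhead discussed above.

{code}
// Sketch only: off-heap (Tachyon-backed) caching of an RDD in Spark 1.2.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("offheap-cache-sketch"))
val data = sc.textFile("hdfs:///shared/input")   // hypothetical path
data.persist(StorageLevel.OFF_HEAP)
data.count()                                     // materializes the off-heap cache
{code}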

> globally shared SparkContext / shared Spark "application"
> -
>
> Key: SPARK-2389
> URL: https://issues.apache.org/jira/browse/SPARK-2389
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Robert Stupp
>
> The documentation (in Cluster Mode Overview) cites:
> bq. Each application gets its own executor processes, which *stay up for the 
> duration of the whole application* and run tasks in multiple threads. This 
> has the benefit of isolating applications from each other, on both the 
> scheduling side (each driver schedules its own tasks) and executor side 
> (tasks from different applications run in different JVMs). However, it also 
> means that *data cannot be shared* across different Spark applications 
> (instances of SparkContext) without writing it to an external storage system.
> IMO this is a limitation that should be lifted to support any number of 
> --driver-- client processes to share executors and to share (persistent / 
> cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web 
> app servers) that want to use Spark as a _big computing machine_. Most 
> important is the fact that Spark is quite good in caching/persisting data in 
> memory / on disk thus removing load from backend data stores.
> Means: it would be really great to let different --driver-- client JVMs 
> operate on the same RDDs and benefit from Spark's caching/persistence.
> It would however introduce some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of cpus, etc) on 
> the fly
> * stop a shared context
> Even "conventional" batch MR applications would benefit if run frequently 
> against the same data set.
> As an implicit requirement, RDD persistence could get a TTL for its 
> materialized state.
> With such a feature the overall performance of today's web applications could 
> then be increased by adding more web app servers, more spark nodes, more 
> nosql nodes etc






[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Murat Eken (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291795#comment-14291795
 ] 

Murat Eken commented on SPARK-2389:
---

[~sowen], I think Robert is talking about fault tolerance when he mentions 
scalability. Anyway, as I mentioned in my original comment, Tachyon is not an 
option, at least for us, due to interprocess serialization/deserialization 
costs. We haven't tried HDFS, but I would be surprised if it performed 
differently.

> globally shared SparkContext / shared Spark "application"
> -
>
> Key: SPARK-2389
> URL: https://issues.apache.org/jira/browse/SPARK-2389
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Robert Stupp
>
> The documentation (in Cluster Mode Overview) cites:
> bq. Each application gets its own executor processes, which *stay up for the 
> duration of the whole application* and run tasks in multiple threads. This 
> has the benefit of isolating applications from each other, on both the 
> scheduling side (each driver schedules its own tasks) and executor side 
> (tasks from different applications run in different JVMs). However, it also 
> means that *data cannot be shared* across different Spark applications 
> (instances of SparkContext) without writing it to an external storage system.
> IMO this is a limitation that should be lifted to support any number of 
> --driver-- client processes to share executors and to share (persistent / 
> cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web 
> app servers) that want to use Spark as a _big computing machine_. Most 
> important is the fact that Spark is quite good in caching/persisting data in 
> memory / on disk thus removing load from backend data stores.
> Means: it would be really great to let different --driver-- client JVMs 
> operate on the same RDDs and benefit from Spark's caching/persistence.
> It would however introduce some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of cpus, etc) on 
> the fly
> * stop a shared context
> Even "conventional" batch MR applications would benefit if run frequently 
> against the same data set.
> As an implicit requirement, RDD persistence could get a TTL for its 
> materialized state.
> With such a feature the overall performance of today's web applications could 
> then be increased by adding more web app servers, more spark nodes, more 
> nosql nodes etc






[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291797#comment-14291797
 ] 

Robert Stupp commented on SPARK-2389:
-

bq. That aside, why doesn't it scale?

Simply because it's just a single Spark client. If that machine's at its limit 
for whatever reason (VM memory, OS resources, CPU, network, ...), that's it.

Sure, you can run multiple drivers - but each has its own, private set of data.

IMO separate preloading is nice for some applications. But data is usually not 
immutable. For example:
* Imagine an application that provides offers for flights worldwide. It's a 
huge amount of data and a huge amount of processing. It cannot simply be 
preloaded - prices for tickets vary from minute to minute based on booking 
status, and so on
* The overall data set is quite big
* The overall load is too big for a single driver to handle - imagine thousands of 
offer requests per second
* Failure of a single driver is an absolute no-go
* All clients have to access the same set of data
* Preloading is only possible at initial deployment, not during runtime

So - a suitable approach would be to have:
* a Spark cluster holding all the RDDs and doing all offer and booking related 
operations
* a set of Spark clients to "abstract" Spark from the rest of the application
* a huge number of non-uniform frontend clients (could be web app servers, rich 
clients, SOAP / REST frontends)
* everything (except the data) stateless

> globally shared SparkContext / shared Spark "application"
> -
>
> Key: SPARK-2389
> URL: https://issues.apache.org/jira/browse/SPARK-2389
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Robert Stupp
>
> The documentation (in Cluster Mode Overview) cites:
> bq. Each application gets its own executor processes, which *stay up for the 
> duration of the whole application* and run tasks in multiple threads. This 
> has the benefit of isolating applications from each other, on both the 
> scheduling side (each driver schedules its own tasks) and executor side 
> (tasks from different applications run in different JVMs). However, it also 
> means that *data cannot be shared* across different Spark applications 
> (instances of SparkContext) without writing it to an external storage system.
> IMO this is a limitation that should be lifted to support any number of 
> --driver-- client processes to share executors and to share (persistent / 
> cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web 
> app servers) that want to use Spark as a _big computing machine_. Most 
> important is the fact that Spark is quite good in caching/persisting data in 
> memory / on disk thus removing load from backend data stores.
> Means: it would be really great to let different --driver-- client JVMs 
> operate on the same RDDs and benefit from Spark's caching/persistence.
> It would however introduce some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of cpus, etc) on 
> the fly
> * stop a shared context
> Even "conventional" batch MR applications would benefit if run frequently 
> against the same data set.
> As an implicit requirement, RDD persistence could get a TTL for its 
> materialized state.
> With such a feature the overall performance of today's web applications could 
> then be increased by adding more web app servers, more spark nodes, more 
> nosql nodes etc






[jira] [Commented] (SPARK-665) Create RPM packages for Spark

2015-01-26 Thread Christian Tzolov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291799#comment-14291799
 ] 

Christian Tzolov commented on SPARK-665:


I've looked at the JRPM Maven plugin, but unlike the jdeb one, JRPM depends on 
native rpm libraries (i.e., it is platform-dependent).

My somewhat pragmatic approach to building a Spark RPM is to reuse the existing 
deb package and convert it into an rpm using the 'alien' tool. 

I've wrapped the spark-rpm build pipeline into a parameterizable Docker 
container: 
https://registry.hub.docker.com/u/tzolov/apache-spark-build-pipeline




> Create RPM packages for Spark
> -
>
> Key: SPARK-665
> URL: https://issues.apache.org/jira/browse/SPARK-665
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>
> This could be doable with the JRPM Maven plugin, similar to how we make 
> Debian packages now, but I haven't looked into it. The plugin is described at 
> http://jrpm.sourceforge.net.






[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291798#comment-14291798
 ] 

Robert Stupp commented on SPARK-2389:
-

bq. fault tolerance when he mentions scalability

both play well together in a stateless application ;)

> globally shared SparkContext / shared Spark "application"
> -
>
> Key: SPARK-2389
> URL: https://issues.apache.org/jira/browse/SPARK-2389
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Robert Stupp
>
> The documentation (in Cluster Mode Overview) cites:
> bq. Each application gets its own executor processes, which *stay up for the 
> duration of the whole application* and run tasks in multiple threads. This 
> has the benefit of isolating applications from each other, on both the 
> scheduling side (each driver schedules its own tasks) and executor side 
> (tasks from different applications run in different JVMs). However, it also 
> means that *data cannot be shared* across different Spark applications 
> (instances of SparkContext) without writing it to an external storage system.
> IMO this is a limitation that should be lifted to support any number of 
> --driver-- client processes to share executors and to share (persistent / 
> cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web 
> app servers) that want to use Spark as a _big computing machine_. Most 
> important is the fact that Spark is quite good in caching/persisting data in 
> memory / on disk thus removing load from backend data stores.
> Means: it would be really great to let different --driver-- client JVMs 
> operate on the same RDDs and benefit from Spark's caching/persistence.
> It would however introduce some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of cpus, etc) on 
> the fly
> * stop a shared context
> Even "conventional" batch MR applications would benefit if run frequently 
> against the same data set.
> As an implicit requirement, RDD persistence could get a TTL for its 
> materialized state.
> With such a feature the overall performance of today's web applications could 
> then be increased by adding more web app servers, more spark nodes, more 
> nosql nodes etc






[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291804#comment-14291804
 ] 

Sean Owen commented on SPARK-2389:
--

Yes, makes sense. Maxing out one driver isn't an issue since you can have many 
drivers (or push work into the cluster). The issue is really that each driver 
then has its own RDDs, and if you need 100s of drivers to keep up, that just 
won't work. (Although then I'd question how so much work is being done on the 
Spark driver?)

The redundancy of all those RDDs is, in theory, what HDFS caching and Tachyon 
could help with, although those help share data outside Spark. Whether 
that works for a particular use case right now is a different question, 
although I suspect it makes more sense to make those work than start yet 
another solution.

What you are describing -- mutating lots of shared in-memory state -- doesn't 
sound like a problem Spark helps solve per se. That is, it doesn't sound like 
work that has to live in a Spark driver program, even if it needs to ask a 
Spark driver-based service for some results. Naturally you know your problem 
better than I, but I am wondering if the answer here isn't just using Spark 
differently, for what it's for.

> globally shared SparkContext / shared Spark "application"
> -
>
> Key: SPARK-2389
> URL: https://issues.apache.org/jira/browse/SPARK-2389
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Robert Stupp
>
> The documentation (in Cluster Mode Overview) cites:
> bq. Each application gets its own executor processes, which *stay up for the 
> duration of the whole application* and run tasks in multiple threads. This 
> has the benefit of isolating applications from each other, on both the 
> scheduling side (each driver schedules its own tasks) and executor side 
> (tasks from different applications run in different JVMs). However, it also 
> means that *data cannot be shared* across different Spark applications 
> (instances of SparkContext) without writing it to an external storage system.
> IMO this is a limitation that should be lifted to support any number of 
> --driver-- client processes to share executors and to share (persistent / 
> cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web 
> app servers) that want to use Spark as a _big computing machine_. Most 
> important is the fact that Spark is quite good in caching/persisting data in 
> memory / on disk thus removing load from backend data stores.
> Means: it would be really great to let different --driver-- client JVMs 
> operate on the same RDDs and benefit from Spark's caching/persistence.
> It would however introduce some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of cpus, etc) on 
> the fly
> * stop a shared context
> Even "conventional" batch MR applications would benefit if run frequently 
> against the same data set.
> As an implicit requirement, RDD persistence could get a TTL for its 
> materialized state.
> With such a feature the overall performance of today's web applications could 
> then be increased by adding more web app servers, more spark nodes, more 
> nosql nodes etc






[jira] [Closed] (SPARK-5303) applySchema returns NullPointerException

2015-01-26 Thread Mauro Pirrone (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mauro Pirrone closed SPARK-5303.

Resolution: Not a Problem

> applySchema returns NullPointerException
> 
>
> Key: SPARK-5303
> URL: https://issues.apache.org/jira/browse/SPARK-5303
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Mauro Pirrone
>
> The following code snippet returns NullPointerException:
> val result = .
>   
> val rows = result.take(10)
> val rowRdd = SparkManager.getContext().parallelize(rows, 1)
> val schemaRdd = SparkManager.getSQLContext().applySchema(rowRdd, 
> result.schema)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:147)
>   at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
>   at scala.util.hashing.MurmurHash3.listHash(MurmurHash3.scala:168)
>   at scala.util.hashing.MurmurHash3$.seqHash(MurmurHash3.scala:216)
>   at scala.collection.LinearSeqLike$class.hashCode(LinearSeqLike.scala:53)
>   at scala.collection.immutable.List.hashCode(List.scala:84)
>   at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
>   at scala.util.hashing.MurmurHash3.productHash(MurmurHash3.scala:63)
>   at scala.util.hashing.MurmurHash3$.productHash(MurmurHash3.scala:210)
>   at scala.runtime.ScalaRunTime$._hashCode(ScalaRunTime.scala:172)
>   at 
> org.apache.spark.sql.execution.LogicalRDD.hashCode(ExistingRDD.scala:58)
>   at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
>   at 
> scala.collection.mutable.HashTable$HashUtils$class.elemHashCode(HashTable.scala:398)
>   at scala.collection.mutable.HashMap.elemHashCode(HashMap.scala:39)
>   at 
> scala.collection.mutable.HashTable$class.findEntry(HashTable.scala:130)
>   at scala.collection.mutable.HashMap.findEntry(HashMap.scala:39)
>   at scala.collection.mutable.HashMap.get(HashMap.scala:69)
>   at 
> scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:187)
>   at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
>   at 
> scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:329)
>   at 
> scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327)
>   at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.NewRelationInstances$.apply(MultiInstanceRelation.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.NewRelationInstances$.apply(MultiInstanceRelation.scala:40)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
>   at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
>   at org.apache.spark.sql.SchemaRDD.schema$lzycompute(SchemaRDD.scala:135)
>   at org.apache.spark.sql.SchemaRDD.schema(SchemaRDD.scala:135)






[jira] [Created] (SPARK-5409) Broken link in documentation

2015-01-26 Thread Mauro Pirrone (JIRA)
Mauro Pirrone created SPARK-5409:


 Summary: Broken link in documentation
 Key: SPARK-5409
 URL: https://issues.apache.org/jira/browse/SPARK-5409
 Project: Spark
  Issue Type: Documentation
Reporter: Mauro Pirrone
Priority: Minor


https://spark.apache.org/docs/1.2.0/streaming-kafka-integration.html

"See the API docs and the example."

The link to the example is broken.






[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2015-01-26 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291818#comment-14291818
 ] 

Robert Stupp commented on SPARK-2389:
-

[~srowen] yes, the problem is that drivers cannot share RDDs.
IMHO there are a lot of valid scenarios that can benefit from multiple drivers 
using shared RDDs.

> globally shared SparkContext / shared Spark "application"
> -
>
> Key: SPARK-2389
> URL: https://issues.apache.org/jira/browse/SPARK-2389
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Robert Stupp
>
> The documentation (in Cluster Mode Overview) cites:
> bq. Each application gets its own executor processes, which *stay up for the 
> duration of the whole application* and run tasks in multiple threads. This 
> has the benefit of isolating applications from each other, on both the 
> scheduling side (each driver schedules its own tasks) and executor side 
> (tasks from different applications run in different JVMs). However, it also 
> means that *data cannot be shared* across different Spark applications 
> (instances of SparkContext) without writing it to an external storage system.
> IMO this is a limitation that should be lifted to support any number of 
> --driver-- client processes to share executors and to share (persistent / 
> cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web 
> app servers) that want to use Spark as a _big computing machine_. Most 
> important is the fact that Spark is quite good in caching/persisting data in 
> memory / on disk thus removing load from backend data stores.
> Means: it would be really great to let different --driver-- client JVMs 
> operate on the same RDDs and benefit from Spark's caching/persistence.
> It would however introduce some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of cpus, etc) on 
> the fly
> * stop a shared context
> Even "conventional" batch MR applications would benefit if run frequently 
> against the same data set.
> As an implicit requirement, RDD persistence could get a TTL for its 
> materialized state.
> With such a feature the overall performance of today's web applications could 
> then be increased by adding more web app servers, more spark nodes, more 
> nosql nodes etc






[jira] [Commented] (SPARK-5409) Broken link in documentation

2015-01-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291823#comment-14291823
 ] 

Sean Owen commented on SPARK-5409:
--

Should just be

https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala

Open a PR; this probably doesn't even need a JIRA.

> Broken link in documentation
> 
>
> Key: SPARK-5409
> URL: https://issues.apache.org/jira/browse/SPARK-5409
> Project: Spark
>  Issue Type: Documentation
>Reporter: Mauro Pirrone
>Priority: Minor
>
> https://spark.apache.org/docs/1.2.0/streaming-kafka-integration.html
> "See the API docs and the example."
> Link to example is broken.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5324) Results of describe can't be queried

2015-01-26 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291987#comment-14291987
 ] 

Yanbo Liang commented on SPARK-5324:


[~marmbrus]
I have opened a pull request for this issue which implements the "DESCRIBE 
[FORMATTED] [db_name.]table_name" command for SQLContext.
Meanwhile, it keeps the effect on the corresponding command output of HiveContext 
to a minimum.
I think other metadata commands like "show databases/tables, analyze, explain" 
could also take the same approach.
Can you assign this to me?

> Results of describe can't be queried
> 
>
> Key: SPARK-5324
> URL: https://issues.apache.org/jira/browse/SPARK-5324
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Michael Armbrust
>
> {code}
> sql("DESCRIBE TABLE test").registerTempTable("describeTest")
> sql("SELECT * FROM describeTest").collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4430) Apache RAT Checks fail spuriously on test files

2015-01-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4430.
--
   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Sean Owen

> Apache RAT Checks fail spuriously on test files
> ---
>
> Key: SPARK-4430
> URL: https://issues.apache.org/jira/browse/SPARK-4430
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
>Reporter: Ryan Williams
>Assignee: Sean Owen
> Fix For: 1.3.0
>
>
> Several of my recent runs of {{./dev/run-tests}} have failed quickly due to 
> Apache RAT checks, e.g.:
> {code}
> $ ./dev/run-tests
> =
> Running Apache RAT checks
> =
> Could not find Apache license headers in the following files:
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/28
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/29
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/30
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/10
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/11
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/12
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/13
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/14
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/15
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/16
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/17
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/18
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/19
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/20
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/21
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/22
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/23
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/24
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/25
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/26
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/27
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/28
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/29
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/30
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/7
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/8
>  !? 
> /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/9
> [error] Got a return code of 1 on line 114 of the run-tests script.
> {code}
> I think it's fair to say that these are not useful errors for {{run-tests}} 
> to crash on. Ideally we could tell the linter which files we care about 
> having it lint and which we don't.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5324) Results of describe can't be queried

2015-01-26 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292008#comment-14292008
 ] 

Yanbo Liang commented on SPARK-5324:


https://github.com/apache/spark/pull/4207

> Results of describe can't be queried
> 
>
> Key: SPARK-5324
> URL: https://issues.apache.org/jira/browse/SPARK-5324
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Michael Armbrust
>
> {code}
> sql("DESCRIBE TABLE test").registerTempTable("describeTest")
> sql("SELECT * FROM describeTest").collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3852) Document spark.driver.extra* configs

2015-01-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3852.
--
  Resolution: Fixed
   Fix Version/s: 1.3.0
Assignee: Sean Owen
Target Version/s:   (was: 1.2.0)

> Document spark.driver.extra* configs
> 
>
> Key: SPARK-3852
> URL: https://issues.apache.org/jira/browse/SPARK-3852
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Sean Owen
> Fix For: 1.3.0
>
>
> They are not documented...
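For reference, a minimal sketch of the settings this ticket refers to; the config 
keys are the standard spark.driver.extra* names, but the values below are purely 
illustrative:

{code}
import org.apache.spark.SparkConf

// Illustrative values only -- the point is which keys need documenting.
val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions", "-XX:MaxPermSize=256m")  // extra JVM options for the driver
  .set("spark.driver.extraClassPath", "/opt/libs/custom.jar")    // prepended to the driver classpath
  .set("spark.driver.extraLibraryPath", "/opt/native")           // native library path for the driver
{code}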



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5410) Error parsing scientific notation in a select statement

2015-01-26 Thread Hugo Ferrira (JIRA)
Hugo Ferrira created SPARK-5410:
---

 Summary: Error parsing scientific notation in a select statement
 Key: SPARK-5410
 URL: https://issues.apache.org/jira/browse/SPARK-5410
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Hugo Ferrira


I am using the Cassandra DB and am attempting a select through the Spark SQL 
interface.

SELECT * from key_value WHERE f2 < 2.2E10

I get the following error (there is no error if I remove the E10):

[info] - should be able to select a subset of applicable features *** FAILED ***
[info]   java.lang.RuntimeException: [1.39] failure: ``UNION'' expected but 
identifier E10 found
[info] 
[info] SELECT * from key_value WHERE f2 < 2.2E10
[info]   ^
[info]   at scala.sys.package$.error(package.scala:27)
[info]   at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
[info]   at 
org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
[info]   at 
org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   ...
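A possible workaround until the parser accepts scientific notation, sketched under 
the assumption that the table is registered and a {{sqlContext}} is available: 
expand the literal so no exponent appears in the SQL string.

{code}
// Hypothetical workaround sketch: avoid the "2.2E10" literal the 1.2.0 parser
// rejects by expanding it to a plain integer literal before building the query.
val threshold = 2.2e10.toLong   // 22000000000
val result = sqlContext.sql(s"SELECT * FROM key_value WHERE f2 < $threshold")
result.collect().foreach(println)
{code}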




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5355) SparkConf is not thread-safe

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292048#comment-14292048
 ] 

Apache Spark commented on SPARK-5355:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4208

> SparkConf is not thread-safe
> 
>
> Key: SPARK-5355
> URL: https://issues.apache.org/jira/browse/SPARK-5355
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.3.0, 1.2.1
>
>
> The SparkConf is not thread-safe, but is accessed by many threads. The 
> getAll() could return parts of the configs if another thread is accessing it.
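A minimal sketch of the kind of race being described (not taken from the PR): one 
thread mutates the conf while another iterates it via getAll(), which with a 
non-synchronized backing map can yield a partial snapshot or a 
ConcurrentModificationException.

{code}
import org.apache.spark.SparkConf

object SparkConfRaceSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(false)  // don't load defaults from system properties
    val writer = new Thread(new Runnable {
      def run(): Unit = for (i <- 1 to 100000) conf.set(s"spark.test.key$i", i.toString)
    })
    val reader = new Thread(new Runnable {
      def run(): Unit = for (_ <- 1 to 1000) conf.getAll.length  // may observe a partial view
    })
    writer.start(); reader.start()
    writer.join(); reader.join()
  }
}
{code}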



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-595) Document "local-cluster" mode

2015-01-26 Thread Vladimir Grigor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292052#comment-14292052
 ] 

Vladimir Grigor commented on SPARK-595:
---

+1 for reopen

> Document "local-cluster" mode
> -
>
> Key: SPARK-595
> URL: https://issues.apache.org/jira/browse/SPARK-595
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation
>Affects Versions: 0.6.0
>Reporter: Josh Rosen
>Priority: Minor
>
> The 'Spark Standalone Mode' guide describes how to manually launch a 
> standalone cluster, which can be done locally for testing, but it does not 
> mention SparkContext's `local-cluster` option.
> What are the differences between these approaches?  Which one should I prefer 
> for local testing?  Can I still use the standalone web interface if I use 
> 'local-cluster' mode?
> It would be useful to document this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5162) Python yarn-cluster mode

2015-01-26 Thread Vladimir Grigor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292069#comment-14292069
 ] 

Vladimir Grigor commented on SPARK-5162:


I second [~jared.holmb...@orchestro.com]
[~lianhuiwang] thank you! I'm going to try your PR.

Related issue: even with this PR, there will be a problem using YARN in cluster 
mode on Amazon EMR.

Normally one submits YARN "jobs" via the API or the aws command-line utility, so 
paths to files are evaluated later on some remote host, and hence the files are 
not found. Currently Spark does not support non-local files. One idea would be to 
add support for non-local (Python) files, e.g. if a file is not local it would be 
downloaded and made available locally, something similar to the "Distributed 
Cache" described at 
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-input-distributed-cache.html

So the following code would work:
{code}
aws emr add-steps --cluster-id "j-XYWIXMD234" \
--steps 
Name=SparkPi,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,s3://mybucketat.amazonaws.com/tasks/main.py,main.py,param1],ActionOnFailure=CONTINUE
{code}

What do you think? How do you run batch Python Spark scripts on YARN in Amazon?

> Python yarn-cluster mode
> 
>
> Key: SPARK-5162
> URL: https://issues.apache.org/jira/browse/SPARK-5162
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, YARN
>Reporter: Dana Klassen
>  Labels: cluster, python, yarn
>
> Running pyspark in yarn is currently limited to ‘yarn-client’ mode. It would 
> be great to be able to submit python applications to the cluster and (just 
> like java classes) have the resource manager setup an AM on any node in the 
> cluster. Does anyone know the issues blocking this feature? I was snooping 
> around with enabling python apps:
> Removing the logic stopping python and yarn-cluster from sparkSubmit.scala
> ...
> // The following modes are not supported or applicable
> (clusterManager, deployMode) match {
>   ...
>   case (_, CLUSTER) if args.isPython =>
> printErrorAndExit("Cluster deploy mode is currently not supported for 
> python applications.")
>   ...
> }
> …
> and submitting application via:
> HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster 
> --num-executors 2  —-py-files {{insert location of egg here}} 
> --executor-cores 1  ../tools/canary.py
> Everything looks to run alright, pythonRunner is picked up as main class, 
> resources get setup, yarn client gets launched but falls flat on its face:
> 2015-01-08 18:48:03,444 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  DEBUG: FAILED { 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, 
> 1420742868009, FILE, null }, Resource 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed 
> on src filesystem (expected 1420742868009, was 1420742869284
> and
> 2015-01-08 18:48:03,446 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
>  Resource 
> {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(->/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py)
>  transitioned from DOWNLOADING to FAILED
> Tracked this down to the apache hadoop code(FSDownload.java line 249) related 
> to container localization of files upon downloading. At this point thought it 
> would be best to raise the issue here and get input.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5395) Large number of Python workers causing resource depletion

2015-01-26 Thread Mark Khaitman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292121#comment-14292121
 ] 

Mark Khaitman commented on SPARK-5395:
--

Having the same issue in standalone deployment mode. A single spark-submitted 
job is spawning a ton of pyspark.daemon instances and depleting the cluster 
memory even though the appropriate environment variables have been set.
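For anyone hitting this, the settings most often mentioned for bounding the Python 
side are sketched below; the keys are standard Spark 1.2 configuration names, but 
the values are only illustrative and do not fix the accumulation described in this 
ticket.

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.python.worker.reuse", "true")    // reuse daemons instead of forking new ones per task
  .set("spark.python.worker.memory", "512m")   // per-worker memory threshold before spilling to disk
{code}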

> Large number of Python workers causing resource depletion
> -
>
> Key: SPARK-5395
> URL: https://issues.apache.org/jira/browse/SPARK-5395
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
> Environment: AWS ElasticMapReduce
>Reporter: Sven Krasser
>
> During job execution a large number of Python workers accumulates, eventually 
> causing YARN to kill containers for being over their memory allocation (in 
> the case below that is about 8G for executors plus 6G for overhead per 
> container). 
> In this instance, at the time of killing the container 97 pyspark.daemon 
> processes had accumulated.
> {noformat}
> 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler 
> (Logging.scala:logInfo(59)) - Container marked as failed: 
> container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: 
> Container [pid=35211,containerID=container_1421692415636_0052_01_30] is 
> running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB 
> physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing 
> container.
> Dump of the process-tree for container_1421692415636_0052_01_30 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) 
> VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m 
> pyspark.daemon
> |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m 
> pyspark.daemon
> |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m 
> pyspark.daemon
>   [...]
> {noformat}
> The configuration used uses 64 containers with 2 cores each.
> Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c
> Mailinglist discussion: 
> https://www.mail-archive.com/user@spark.apache.org/msg20102.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture

2015-01-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292176#comment-14292176
 ] 

Joseph K. Bradley commented on SPARK-5400:
--

I agree this could be done either way: Algorithm[Model] or Model[Algorithm].  
For users, exposing the model type may be easiest; a person who is new to ML 
and wants to do some clustering will know the name of a clustering model 
(KMeans, GMM) but may not want to worry about picking an optimization 
algorithm.  So I'd vote for Model[Algorithm].

That said, internally, I agree that Algorithm[Model] would be handy for 
generalizing.  We could do the combination by having an internal LearningState 
class:

{code}
class GaussianMixture {
  def setOptimizer // once we have more than 1 optimization method

  def run = {
val opt = new EM(new GMMLearningState(this))
...
  }
}

private[mllib] class GMMLearningState extends OurModelAbstraction {
  def this(gm: GaussianMixture) = this(...)  // build the internal state from the public config
}

class EM(model: OurModelAbstraction)
{code}


> Rename GaussianMixtureEM to GaussianMixture
> ---
>
> Key: SPARK-5400
> URL: https://issues.apache.org/jira/browse/SPARK-5400
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> GaussianMixtureEM is following the old naming convention of including the 
> optimization algorithm name in the class title.  We should probably rename it 
> to GaussianMixture so that it can use other optimization algorithms in the 
> future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-794) Remove sleep() in ClusterScheduler.stop

2015-01-26 Thread Brennon York (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292303#comment-14292303
 ] 

Brennon York commented on SPARK-794:


[~joshrosen] How is this PR holding up? I haven't seen any issues on the dev 
board. Think we can close this JIRA ticket? Trying to help prune the JIRA tree 
:)

> Remove sleep() in ClusterScheduler.stop
> ---
>
> Key: SPARK-794
> URL: https://issues.apache.org/jira/browse/SPARK-794
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Matei Zaharia
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> This temporary change made a while back slows down the unit tests quite a bit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-595) Document "local-cluster" mode

2015-01-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reopened SPARK-595:
--

I've re-opened this issue.  Folks are using the API in the wild and we're not 
going to break compatibility for it, so we should document it.
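For context, a minimal sketch of the mode in question, assuming the usual 
{{local-cluster[numWorkers,coresPerWorker,memoryPerWorkerMB]}} master URL format:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Spins up 2 in-process workers with 1 core and 512 MB each -- the option the
// standalone guide currently does not mention.
val conf = new SparkConf()
  .setAppName("local-cluster-demo")
  .setMaster("local-cluster[2,1,512]")
val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100, 4).count())
sc.stop()
{code}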

> Document "local-cluster" mode
> -
>
> Key: SPARK-595
> URL: https://issues.apache.org/jira/browse/SPARK-595
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation
>Affects Versions: 0.6.0
>Reporter: Josh Rosen
>Priority: Minor
>
> The 'Spark Standalone Mode' guide describes how to manually launch a 
> standalone cluster, which can be done locally for testing, but it does not 
> mention SparkContext's `local-cluster` option.
> What are the differences between these approaches?  Which one should I prefer 
> for local testing?  Can I still use the standalone web interface if I use 
> 'local-cluster' mode?
> It would be useful to document this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)

2015-01-26 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292370#comment-14292370
 ] 

Imran Rashid commented on SPARK-3644:
-

[~joshrosen] Hi Josh, I've got time to implement this now.  You can assign to 
me if you like (or let me know if there is something else in the works ...)

> REST API for Spark application info (jobs / stages / tasks / storage info)
> --
>
> Key: SPARK-3644
> URL: https://issues.apache.org/jira/browse/SPARK-3644
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Reporter: Josh Rosen
>
> This JIRA is a forum to draft a design proposal for a REST interface for 
> accessing information about Spark applications, such as job / stage / task / 
> storage status.
> There have been a number of proposals to serve JSON representations of the 
> information displayed in Spark's web UI.  Given that we might redesign the 
> pages of the web UI (and possibly re-implement the UI as a client of a REST 
> API), the API endpoints and their responses should be independent of what we 
> choose to display on particular web UI pages / layouts.
> Let's start a discussion of what a good REST API would look like from 
> first-principles.  We can discuss what urls / endpoints expose access to 
> data, how our JSON responses will be formatted, how fields will be named, how 
> the API will be documented and tested, etc.
> Some links for inspiration:
> https://developer.github.com/v3/
> http://developer.netflix.com/docs/REST_API_Reference
> https://helloreverb.com/developers/swagger



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt

2015-01-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-5236:
--
Description: 
{code}
15/01/14 05:39:27 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 
(TID 28, localhost): parquet.io.ParquetDecodingException: Can not read value at 
0 in block 0 in file gs://pa-truven/20141205/parquet/P/part-r-1.parquet
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
at 
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
at 
org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableInt
at 
org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:241)
at 
org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter.updateInt(ParquetConverter.scala:375)
at 
org.apache.spark.sql.parquet.CatalystPrimitiveConverter.addInt(ParquetConverter.scala:434)
at 
parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:237)
at 
parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:353)
at 
parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:402)
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194)
... 27 more
{code}

  was:
15/01/14 05:39:27 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 
(TID 28, localhost): parquet.io.ParquetDecodingException: Can not read value at 
0 in block 0 in file gs://pa-truven/20141205/parquet/P/part-r-1.parquet
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
at 
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.Abstr

[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-01-26 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292399#comment-14292399
 ] 

Dmitriy Lyubimov commented on SPARK-5226:
-

All recent attempts to parallelize DBSCAN in the literature (or similar 
DeLiClu-type approaches) that I have read about partition the task into smaller 
subtasks, solve each one individually, and merge it all back (see the MR.Scan 
paper for example). Merging is of course the new and tricky part.

As far as I understand, they all pretty much limit their scope to Euclidean 
distances and capitalize on the notions of Euclidean geometry that follow from 
that in order to solve the partition and merge problems, which substantially 
reduces the attractiveness of a general algorithm. However, a naive, 
straightforward port of the simple DBSCAN algorithm is not terribly practical for 
big data because of the total complexity of the problem (or the impracticality of 
building something like a huge distributed R-tree index on shared-nothing 
programming models).

> Add DBSCAN Clustering Algorithm to MLlib
> 
>
> Key: SPARK-5226
> URL: https://issues.apache.org/jira/browse/SPARK-5226
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Muhammad-Ali A'rabi
>Priority: Minor
>  Labels: DBSCAN
>
> MLlib is all k-means now, and I think we should add some new clustering 
> algorithms to it. First candidate is DBSCAN as I think.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently

2015-01-26 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292415#comment-14292415
 ] 

Xuefu Zhang commented on SPARK-2688:


#1 above is exactly what Hive needs badly.

> Need a way to run multiple data pipeline concurrently
> -
>
> Key: SPARK-2688
> URL: https://issues.apache.org/jira/browse/SPARK-2688
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.0.1
>Reporter: Xuefu Zhang
>
> Suppose we want to do the following data processing: 
> {code}
> rdd1 -> rdd2 -> rdd3
>| -> rdd4
>| -> rdd5
>\ -> rdd6
> {code}
> where -> represents a transformation. rdd3 to rdd6 are all derived from an 
> intermediate rdd2. We use foreach(fn) with a dummy function to trigger the 
> execution. However, rdd.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> 
> rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be 
> recomputed. This is very inefficient. Ideally, we should be able to trigger 
> the execution of the whole graph and reuse rdd2, but there doesn't seem to be a 
> way of doing so. Tez already realized the importance of this (TEZ-391), so I 
> think Spark should provide this too.
> This is required for Hive to support multi-insert queries. HIVE-7292.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5339) build/mvn doesn't work because of invalid URL for maven's tgz.

2015-01-26 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5339.

   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Kousuke Saruta

> build/mvn doesn't work because of invalid URL for maven's tgz.
> --
>
> Key: SPARK-5339
> URL: https://issues.apache.org/jira/browse/SPARK-5339
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Blocker
> Fix For: 1.3.0
>
>
> build/mvn will automatically download tarball of maven. But currently, the 
> URL is invalid. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently

2015-01-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292431#comment-14292431
 ] 

Sean Owen commented on SPARK-2688:
--

As [~irashid] says, #1 is just syntactic sugar on what you can do already in 
Spark. I'm not clear how something can need this functionality badly, then. 
Either it's not blocking anything, really, and let's see that, or let's discuss 
what beyond #1 is actually needed.

What I think people want is a miniature "push-based evaluation" method inside 
of Spark's pull-based DAG evaluation: force evaluation of N children of 1 
parent at once. The outcome of a sidebar I had with Sandy on this was that it's 
probably a) fraught with gotchas, given the push-vs-pull mismatch, but not 
impossible, and b) would force the children to be persisted in the general 
case, with possible optimizations in other special cases.

Is that the kind of thing Hive on Spark needs, and if so can we hear a concrete 
elaboration of an example of this, so we can compare with what's possible now? 
I still sense there's a mismatch between the perception and reality of what's 
possible with the current API. Hence, there may be some really good news here.
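To make the comparison concrete, here is a sketch of what is already possible with 
the current API, assuming an existing SparkContext {{sc}} (the RDDs and 
transformations are placeholders for the pipeline in the description): persist the 
shared parent and submit the child actions concurrently so the parent is computed 
only once.

{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.storage.StorageLevel

val rdd1 = sc.parallelize(1 to 1000000)
val rdd2 = rdd1.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)  // shared intermediate
val children = Seq(rdd2.filter(_ % 3 == 0), rdd2.filter(_ % 5 == 0), rdd2.map(_ + 1))
// Submit the child jobs concurrently; rdd2 is materialized once and reused from the cache.
val jobs = children.map(child => Future { child.count() })
Await.result(Future.sequence(jobs), Duration.Inf)
rdd2.unpersist()
{code}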

> Need a way to run multiple data pipeline concurrently
> -
>
> Key: SPARK-2688
> URL: https://issues.apache.org/jira/browse/SPARK-2688
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.0.1
>Reporter: Xuefu Zhang
>
> Suppose we want to do the following data processing: 
> {code}
> rdd1 -> rdd2 -> rdd3
>| -> rdd4
>| -> rdd5
>\ -> rdd6
> {code}
> where -> represents a transformation. rdd3 to rdd6 are all derived from an 
> intermediate rdd2. We use foreach(fn) with a dummy function to trigger the 
> execution. However, rdd.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> 
> rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be 
> recomputed. This is very inefficient. Ideally, we should be able to trigger 
> the execution of the whole graph and reuse rdd2, but there doesn't seem to be a 
> way of doing so. Tez already realized the importance of this (TEZ-391), so I 
> think Spark should provide this too.
> This is required for Hive to support multi-insert queries. HIVE-7292.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently

2015-01-26 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292448#comment-14292448
 ] 

Xuefu Zhang commented on SPARK-2688:


Yeah. We don't need syntactic sugar, but a transformation that does just one 
pass over the input RDD. This has performance implications for Hive's 
multi-insert use cases.

> Need a way to run multiple data pipeline concurrently
> -
>
> Key: SPARK-2688
> URL: https://issues.apache.org/jira/browse/SPARK-2688
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.0.1
>Reporter: Xuefu Zhang
>
> Suppose we want to do the following data processing: 
> {code}
> rdd1 -> rdd2 -> rdd3
>| -> rdd4
>| -> rdd5
>\ -> rdd6
> {code}
> where -> represents a transformation. rdd3 to rdd6 are all derived from an 
> intermediate rdd2. We use foreach(fn) with a dummy function to trigger the 
> execution. However, rdd.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> 
> rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be 
> recomputed. This is very inefficient. Ideally, we should be able to trigger 
> the execution of the whole graph and reuse rdd2, but there doesn't seem to be a 
> way of doing so. Tez already realized the importance of this (TEZ-391), so I 
> think Spark should provide this too.
> This is required for Hive to support multi-insert queries. HIVE-7292.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3789) Python bindings for GraphX

2015-01-26 Thread Kushal Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292468#comment-14292468
 ] 

Kushal Datta commented on SPARK-3789:
-

Hi Ameet,

Sorry for asking this question again.
What's the release plan for 1.3?

-Kushal.

> Python bindings for GraphX
> --
>
> Key: SPARK-3789
> URL: https://issues.apache.org/jira/browse/SPARK-3789
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, PySpark
>Reporter: Ameet Talwalkar
>Assignee: Kushal Datta
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3789) Python bindings for GraphX

2015-01-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292477#comment-14292477
 ] 

Reynold Xin commented on SPARK-3789:


Unfortunately this is not going to make it into 1.3, given the code freeze 
deadline is in 1 week.

[~kdatta1978] thanks for working on this. Can you write some high level design 
document for this change?

> Python bindings for GraphX
> --
>
> Key: SPARK-3789
> URL: https://issues.apache.org/jira/browse/SPARK-3789
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, PySpark
>Reporter: Ameet Talwalkar
>Assignee: Kushal Datta
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5411) Allow SparkListeners to be specified in SparkConf and loaded when creating SparkContext

2015-01-26 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-5411:
-

 Summary: Allow SparkListeners to be specified in SparkConf and 
loaded when creating SparkContext
 Key: SPARK-5411
 URL: https://issues.apache.org/jira/browse/SPARK-5411
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen


It would be nice if there was a mechanism to allow SparkListeners to be 
registered through SparkConf settings.  This would allow monitoring frameworks 
to be easily injected into Spark programs without having to modify those 
programs' code.

I propose to introduce a new configuration option, {{spark.extraListeners}}, 
that allows SparkListeners to be specified in SparkConf and registered before 
the SparkContext is created.  Here is the proposed documentation for the new 
option:

{quote}
A comma-separated list of classes that implement SparkListener; when 
initializing SparkContext, instances of these classes will be created and 
registered with Spark's listener bus. If a class has a single-argument 
constructor that accepts a SparkConf, that constructor will be called; 
otherwise, a zero-argument constructor will be called. If no valid constructor 
can be found, the SparkContext creation will fail with an exception.
{quote}
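To illustrate how this might look from user code, a sketch only; the listener 
class and values below are hypothetical and not part of the proposal:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

// A listener with the single SparkConf-argument constructor described above.
class JobEndLogger(conf: SparkConf) extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} ended in app ${conf.get("spark.app.name", "unknown")}")
}

object ExtraListenerDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("extra-listener-demo")
      .setMaster("local[2]")
      .set("spark.extraListeners", "JobEndLogger")  // registered before any job can run
    val sc = new SparkContext(conf)
    sc.parallelize(1 to 10).count()
    sc.stop()
  }
}
{code}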



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5411) Allow SparkListeners to be specified in SparkConf and loaded when creating SparkContext

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292494#comment-14292494
 ] 

Apache Spark commented on SPARK-5411:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4111

> Allow SparkListeners to be specified in SparkConf and loaded when creating 
> SparkContext
> ---
>
> Key: SPARK-5411
> URL: https://issues.apache.org/jira/browse/SPARK-5411
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> It would be nice if there was a mechanism to allow SparkListeners to be 
> registered through SparkConf settings.  This would allow monitoring 
> frameworks to be easily injected into Spark programs without having to modify 
> those programs' code.
> I propose to introduce a new configuration option, {{spark.extraListeners}}, 
> that allows SparkListeners to be specified in SparkConf and registered before 
> the SparkContext is created.  Here is the proposed documentation for the new 
> option:
> {quote}
> A comma-separated list of classes that implement SparkListener; when 
> initializing SparkContext, instances of these classes will be created and 
> registered with Spark's listener bus. If a class has a single-argument 
> constructor that accepts a SparkConf, that constructor will be called; 
> otherwise, a zero-argument constructor will be called. If no valid 
> constructor can be found, the SparkContext creation will fail with an 
> exception.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3789) Python bindings for GraphX

2015-01-26 Thread Kushal Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292499#comment-14292499
 ] 

Kushal Datta commented on SPARK-3789:
-

Sure, I will write up the design document.
@Ameet, do you think you can work from another branch which is not on 1.3?



> Python bindings for GraphX
> --
>
> Key: SPARK-3789
> URL: https://issues.apache.org/jira/browse/SPARK-3789
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, PySpark
>Reporter: Ameet Talwalkar
>Assignee: Kushal Datta
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4147) Reduce log4j dependency

2015-01-26 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4147:
---
Assignee: Sean Owen

> Reduce log4j dependency
> ---
>
> Key: SPARK-4147
> URL: https://issues.apache.org/jira/browse/SPARK-4147
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Tobias Pfeiffer
>Assignee: Sean Owen
>
> spark-core has a hard dependency on log4j, which shouldn't be necessary since 
> slf4j is used. I tried to exclude slf4j-log4j12 and log4j dependencies in my 
> sbt file.
> Excluding org.slf4j.slf4j-log4j12 works fine if logback is on the classpath. 
> However, removing the log4j dependency fails because in 
> https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/Logging.scala#L121
>  a static method of org.apache.log4j.LogManager is accessed *even if* log4j 
> is not in use.
> I guess removing all dependencies on log4j may be a bigger task, but it would 
> be a great help if the access to LogManager would be done only if log4j use 
> was detected before. (This is a 2-line change.)
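A sketch of the kind of guard the last paragraph asks for (an assumption about the 
shape of the fix, not the actual patch): consult slf4j's binding before ever 
touching {{org.apache.log4j.LogManager}}.

{code}
import org.slf4j.impl.StaticLoggerBinder

// Only initialize log4j defaults when slf4j is actually bound to log4j; with logback
// (or another backend) on the classpath this branch is skipped entirely.
val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
val usingLog4j = "org.slf4j.impl.Log4jLoggerFactory" == binderClass
if (usingLog4j) {
  val rootLogger = org.apache.log4j.LogManager.getRootLogger
  if (!rootLogger.getAllAppenders.hasMoreElements) {
    // load a default log4j.properties here, as Logging.scala does today
  }
}
{code}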



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5412) Cannot bind Master to a specific hostname as per the documentation

2015-01-26 Thread Alexis Seigneurin (JIRA)
Alexis Seigneurin created SPARK-5412:


 Summary: Cannot bind Master to a specific hostname as per the 
documentation
 Key: SPARK-5412
 URL: https://issues.apache.org/jira/browse/SPARK-5412
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Alexis Seigneurin


Documentation on http://spark.apache.org/docs/latest/spark-standalone.html 
indicates:

{quote}
You can start a standalone master server by executing:
./sbin/start-master.sh
...
the following configuration options can be passed to the master and worker:
...
-h HOST, --host HOSTHostname to listen on
{quote}

The "\-h" or "--host" parameter actually doesn't work with the start-master.sh 
script. Instead, one has to set the "SPARK_MASTER_IP" variable prior to 
executing the script.

Either the script or the documentation should be updated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4147) Reduce log4j dependency

2015-01-26 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4147:
---
Affects Version/s: 1.2.0

> Reduce log4j dependency
> ---
>
> Key: SPARK-4147
> URL: https://issues.apache.org/jira/browse/SPARK-4147
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Tobias Pfeiffer
>Assignee: Sean Owen
>
> spark-core has a hard dependency on log4j, which shouldn't be necessary since 
> slf4j is used. I tried to exclude slf4j-log4j12 and log4j dependencies in my 
> sbt file.
> Excluding org.slf4j.slf4j-log4j12 works fine if logback is on the classpath. 
> However, removing the log4j dependency fails because in 
> https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/Logging.scala#L121
>  a static method of org.apache.log4j.LogManager is accessed *even if* log4j 
> is not in use.
> I guess removing all dependencies on log4j may be a bigger task, but it would 
> be a great help if the access to LogManager would be done only if log4j use 
> was detected before. (This is a 2-line change.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4147) Reduce log4j dependency

2015-01-26 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4147.

Resolution: Fixed

> Reduce log4j dependency
> ---
>
> Key: SPARK-4147
> URL: https://issues.apache.org/jira/browse/SPARK-4147
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Tobias Pfeiffer
>Assignee: Sean Owen
> Fix For: 1.3.0, 1.2.1
>
>
> spark-core has a hard dependency on log4j, which shouldn't be necessary since 
> slf4j is used. I tried to exclude slf4j-log4j12 and log4j dependencies in my 
> sbt file.
> Excluding org.slf4j.slf4j-log4j12 works fine if logback is on the classpath. 
> However, removing the log4j dependency fails because in 
> https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/Logging.scala#L121
>  a static method of org.apache.log4j.LogManager is accessed *even if* log4j 
> is not in use.
> I guess removing all dependencies on log4j may be a bigger task, but it would 
> be a great help if the access to LogManager would be done only if log4j use 
> was detected before. (This is a 2-line change.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4147) Reduce log4j dependency

2015-01-26 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4147:
---
Fix Version/s: 1.2.1
   1.3.0

> Reduce log4j dependency
> ---
>
> Key: SPARK-4147
> URL: https://issues.apache.org/jira/browse/SPARK-4147
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Tobias Pfeiffer
>Assignee: Sean Owen
> Fix For: 1.3.0, 1.2.1
>
>
> spark-core has a hard dependency on log4j, which shouldn't be necessary since 
> slf4j is used. I tried to exclude slf4j-log4j12 and log4j dependencies in my 
> sbt file.
> Excluding org.slf4j.slf4j-log4j12 works fine if logback is on the classpath. 
> However, removing the log4j dependency fails because in 
> https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/Logging.scala#L121
>  a static method of org.apache.log4j.LogManager is accessed *even if* log4j 
> is not in use.
> I guess removing all dependencies on log4j may be a bigger task, but it would 
> be a great help if the access to LogManager would be done only if log4j use 
> was detected before. (This is a 2-line change.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-960) JobCancellationSuite "two jobs sharing the same stage" is broken

2015-01-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-960.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4180
[https://github.com/apache/spark/pull/4180]

> JobCancellationSuite "two jobs sharing the same stage" is broken
> 
>
> Key: SPARK-960
> URL: https://issues.apache.org/jira/browse/SPARK-960
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.8.1, 0.9.0
>Reporter: Mark Hamstra
> Fix For: 1.3.0
>
>
> This test doesn't work as it appears to be intended since the map tasks can 
> never acquire sem2.  The simplest way to demonstrate this is to comment out 
> f1.cancel() in the future.  I believe the intention is that f1 and f2 would 
> then complete normally; but they won't.  Instead, both jobs block, waiting on 
> sem2.  It doesn't look like closing over Semaphores works even in a Local 
> context, since sem2.hashCode() is different in each of f1, f2 and in the 
> future containing f1.cancel, so the map jobs never see the sem2.release(10) 
> in the future.
> Instead, the test only completes because all of the stages (the two final 
> stages and the common dependent stage) get cancelled and aborted.  When job 
> <--> stage dependencies are fully accounted for and job cancellation changed 
> so that f1.cancel does not abort the common stage, then this test can never 
> finish since it then becomes hung waiting on sem2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-960) JobCancellationSuite "two jobs sharing the same stage" is broken

2015-01-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-960:
-
Assignee: Sean Owen

> JobCancellationSuite "two jobs sharing the same stage" is broken
> 
>
> Key: SPARK-960
> URL: https://issues.apache.org/jira/browse/SPARK-960
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.8.1, 0.9.0
>Reporter: Mark Hamstra
>Assignee: Sean Owen
> Fix For: 1.3.0
>
>
> This test doesn't work as it appears to be intended since the map tasks can 
> never acquire sem2.  The simplest way to demonstrate this is to comment out 
> f1.cancel() in the future.  I believe the intention is that f1 and f2 would 
> then complete normally; but they won't.  Instead, both jobs block, waiting on 
> sem2.  It doesn't look like closing over Semaphores works even in a Local 
> context, since sem2.hashCode() is different in each of f1, f2 and in the 
> future containing f1.cancel, so the map jobs never see the sem2.release(10) 
> in the future.
> Instead, the test only completes because all of the stages (the two final 
> stages and the common dependent stage) get cancelled and aborted.  When job 
> <--> stage dependencies are fully accounted for and job cancellation changed 
> so that f1.cancel does not abort the common stage, then this test can never 
> finish since it then becomes hung waiting on sem2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-926) spark_ec2 script when ssh/scp-ing should pipe UserknowHostFile to /dev/null

2015-01-26 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292549#comment-14292549
 ] 

Josh Rosen commented on SPARK-926:
--

I think it is; there's now a PR to fix this, since it's also the cause of 
SPARK-5403: https://github.com/apache/spark/pull/4196

> spark_ec2 script when ssh/scp-ing should pipe UserknowHostFile to /dev/null
> ---
>
> Key: SPARK-926
> URL: https://issues.apache.org/jira/browse/SPARK-926
> Project: Spark
>  Issue Type: New Feature
>  Components: EC2
>Affects Versions: 0.8.0
>Reporter: Shay Seng
>Priority: Trivial
>
> The known hosts file on the local machine accumulates all kinds of crap after a 
> few cluster launches. When SSHing or SCPing, please add "-o 
> UserKnownHostsFile=/dev/null".
> Also remove the -t option from SSH, and only add it in when necessary, to 
> reduce chatter on the console. 
> e.g.
> # Copy a file to a given host through scp, throwing an exception if scp fails
> def scp(host, opts, local_file, dest_file):
>   subprocess.check_call(
>   "scp -q -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i 
> %s '%s' '%s@%s:%s'" %
>   (opts.identity_file, local_file, opts.user, host, dest_file), 
> shell=True)
> # Run a command on a host through ssh, retrying up to two times
> # and then throwing an exception if ssh continues to fail.
> def ssh(host, opts, command, sshopts=""):
>   tries = 0
>   while True:
> try:
>   # removed -t option from ssh command, not sure why it is required all 
> the time.
>   return subprocess.check_call(
> "ssh %s -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null 
> -i %s %s@%s '%s'" %
> (sshopts, opts.identity_file, opts.user, host, command), shell=True)
> except subprocess.CalledProcessError as e:
>   if (tries > 2):
> raise e
>   print "Couldn't connect to host {0}, waiting 30 seconds".format(e)
>   time.sleep(30)
>   tries = tries + 1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3789) Python bindings for GraphX

2015-01-26 Thread Ameet Talwalkar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292602#comment-14292602
 ] 

Ameet Talwalkar commented on SPARK-3789:


Unfortunately not -- I plan to use a standard release of Spark for my MOOC.




> Python bindings for GraphX
> --
>
> Key: SPARK-3789
> URL: https://issues.apache.org/jira/browse/SPARK-3789
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, PySpark
>Reporter: Ameet Talwalkar
>Assignee: Kushal Datta
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5395) Large number of Python workers causing resource depletion

2015-01-26 Thread Sven Krasser (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292630#comment-14292630
 ] 

Sven Krasser commented on SPARK-5395:
-

[~mkman84], do you also see this for both spark.python.worker.reuse false/true? 
(FWIW, the run I pasted above had reuse disabled.)

Also, do you happen to have a small job that can be used as a repro? ([~davies] 
was asking for one; so far I have only managed to trigger this condition using 
production data.)

> Large number of Python workers causing resource depletion
> -
>
> Key: SPARK-5395
> URL: https://issues.apache.org/jira/browse/SPARK-5395
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
> Environment: AWS ElasticMapReduce
>Reporter: Sven Krasser
>
> During job execution a large number of Python workers accumulates, eventually 
> causing YARN to kill containers for being over their memory allocation (in 
> the case below that is about 8G for executors plus 6G for overhead per 
> container). 
> In this instance, at the time of killing the container 97 pyspark.daemon 
> processes had accumulated.
> {noformat}
> 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler 
> (Logging.scala:logInfo(59)) - Container marked as failed: 
> container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: 
> Container [pid=35211,containerID=container_1421692415636_0052_01_30] is 
> running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB 
> physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing 
> container.
> Dump of the process-tree for container_1421692415636_0052_01_30 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) 
> VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m 
> pyspark.daemon
> |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m 
> pyspark.daemon
> |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m 
> pyspark.daemon
>   [...]
> {noformat}
> The configuration uses 64 containers with 2 cores each.
> Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c
> Mailinglist discussion: 
> https://www.mail-archive.com/user@spark.apache.org/msg20102.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5395) Large number of Python workers causing resource depletion

2015-01-26 Thread Mark Khaitman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292668#comment-14292668
 ] 

Mark Khaitman commented on SPARK-5395:
--

[~skrasser], I have also only managed to reproduce this using production data 
(so far). I'll try to write a simple version tomorrow, but it seems to be a mix 
of python worker processes not being killed once they are no longer needed 
(causing a build-up), and python workers exceeding the allocated memory 
limit. 

I think it *may* be related to a couple of specific actions such as 
groupByKey/cogroup, though I'll still need to do some tests to be sure what's 
causing this.

> Large number of Python workers causing resource depletion
> -
>
> Key: SPARK-5395
> URL: https://issues.apache.org/jira/browse/SPARK-5395
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
> Environment: AWS ElasticMapReduce
>Reporter: Sven Krasser
>
> During job execution a large number of Python workers accumulates, eventually 
> causing YARN to kill containers for being over their memory allocation (in 
> the case below that is about 8G for executors plus 6G for overhead per 
> container). 
> In this instance, at the time of killing the container 97 pyspark.daemon 
> processes had accumulated.
> {noformat}
> 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler 
> (Logging.scala:logInfo(59)) - Container marked as failed: 
> container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: 
> Container [pid=35211,containerID=container_1421692415636_0052_01_30] is 
> running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB 
> physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing 
> container.
> Dump of the process-tree for container_1421692415636_0052_01_30 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) 
> VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m 
> pyspark.daemon
> |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m 
> pyspark.daemon
> |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m 
> pyspark.daemon
>   [...]
> {noformat}
> The configuration uses 64 containers with 2 cores each.
> Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c
> Mailinglist discussion: 
> https://www.mail-archive.com/user@spark.apache.org/msg20102.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5395) Large number of Python workers causing resource depletion

2015-01-26 Thread Mark Khaitman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292668#comment-14292668
 ] 

Mark Khaitman edited comment on SPARK-5395 at 1/26/15 11:57 PM:


[~skrasser], I have also only managed to reproduce this using production data 
(so far). I'll try to write a simple version tomorrow, but it seems to be a mix 
of python worker processes not being killed once they are no longer needed 
(causing a build-up), and python workers exceeding the allocated memory 
limit. 

I think it *may* be related to a couple of specific actions such as 
groupByKey/cogroup, though I'll still need to do some tests to be sure what's 
causing this.

I should also add that we haven't modified the default for the 
python.worker.reuse variable, so in our case it should be using the default of 
True.


was (Author: mkman84):
[~skrasser], I actually only managed to have this reproduced using production 
data as well (so far). I'll try to write a simple version tomorrow but it seems 
that it's a mix of both python worker processes not being killed after it's no 
longer running (causing build up), as well as the python worker exceeding the 
allocated memory limit. 

I think it *may* be related to a couple of specific actions such as 
groupByKey/cogroup, though I'll still need to do some tests to be sure what's 
causing this.

> Large number of Python workers causing resource depletion
> -
>
> Key: SPARK-5395
> URL: https://issues.apache.org/jira/browse/SPARK-5395
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
> Environment: AWS ElasticMapReduce
>Reporter: Sven Krasser
>
> During job execution a large number of Python workers accumulates, eventually 
> causing YARN to kill containers for being over their memory allocation (in 
> the case below that is about 8G for executors plus 6G for overhead per 
> container). 
> In this instance, at the time of killing the container 97 pyspark.daemon 
> processes had accumulated.
> {noformat}
> 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler 
> (Logging.scala:logInfo(59)) - Container marked as failed: 
> container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: 
> Container [pid=35211,containerID=container_1421692415636_0052_01_30] is 
> running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB 
> physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing 
> container.
> Dump of the process-tree for container_1421692415636_0052_01_30 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) 
> VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m 
> pyspark.daemon
> |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m 
> pyspark.daemon
> |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m 
> pyspark.daemon
>   [...]
> {noformat}
> The configuration uses 64 containers with 2 cores each.
> Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c
> Mailinglist discussion: 
> https://www.mail-archive.com/user@spark.apache.org/msg20102.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5384) Vectors.sqdist return inconsistent result for sparse/dense vectors when the vectors have different lengths

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5384:
-
Priority: Minor  (was: Critical)

> Vectors.sqdist return inconsistent result for sparse/dense vectors when the 
> vectors have different lengths
> --
>
> Key: SPARK-5384
> URL: https://issues.apache.org/jira/browse/SPARK-5384
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.1
> Environment: centos, others should be similar
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 1.3.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> For two vectors of different lengths, Vectors.sqdist would return different 
> results when the vectors are represented as sparse and dense, respectively. 
> Sample:   
> val s1 = new SparseVector(4, Array(0,1,2,3), Array(1.0, 2.0, 3.0, 4.0))
> val s2 = new SparseVector(1, Array(0), Array(9.0))
> val d1 = new DenseVector(Array(1.0, 2.0, 3.0, 4.0))
> val d2 = new DenseVector(Array(9.0))
> println(s1 == d1 && s2 == d2)
> println(Vectors.sqdist(s1, s2))
> println(Vectors.sqdist(d1, d2))
> result:
>  true
>  93.0
>  64.0
> More precisely, Vectors.sqdist includes the extra dimensions for sparse 
> vectors but excludes them for dense vectors. I'll send a PR and we can have 
> a more detailed discussion there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5384) Vectors.sqdist return inconsistent result for sparse/dense vectors when the vectors have different lengths

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5384:
-
Affects Version/s: (was: 1.2.1)
   1.3.0

> Vectors.sqdist return inconsistent result for sparse/dense vectors when the 
> vectors have different lengths
> --
>
> Key: SPARK-5384
> URL: https://issues.apache.org/jira/browse/SPARK-5384
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
> Environment: centos, others should be similar
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 1.3.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> For two vectors of different lengths, Vectors.sqdist would return different 
> results when the vectors are represented as sparse and dense, respectively. 
> Sample:   
> val s1 = new SparseVector(4, Array(0,1,2,3), Array(1.0, 2.0, 3.0, 4.0))
> val s2 = new SparseVector(1, Array(0), Array(9.0))
> val d1 = new DenseVector(Array(1.0, 2.0, 3.0, 4.0))
> val d2 = new DenseVector(Array(9.0))
> println(s1 == d1 && s2 == d2)
> println(Vectors.sqdist(s1, s2))
> println(Vectors.sqdist(d1, d2))
> result:
>  true
>  93.0
>  64.0
> More precisely, Vectors.sqdist includes the extra dimensions for sparse 
> vectors but excludes them for dense vectors. I'll send a PR and we can have 
> a more detailed discussion there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5384) Vectors.sqdist return inconsistent result for sparse/dense vectors when the vectors have different lengths

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5384:
-
Target Version/s: 1.3.0  (was: 1.3.0, 1.2.1)

> Vectors.sqdist return inconsistent result for sparse/dense vectors when the 
> vectors have different lengths
> --
>
> Key: SPARK-5384
> URL: https://issues.apache.org/jira/browse/SPARK-5384
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
> Environment: centos, others should be similar
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 1.3.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> For two vectors of different lengths, Vectors.sqdist would return different 
> results when the vectors are represented as sparse and dense, respectively. 
> Sample:   
> val s1 = new SparseVector(4, Array(0,1,2,3), Array(1.0, 2.0, 3.0, 4.0))
> val s2 = new SparseVector(1, Array(0), Array(9.0))
> val d1 = new DenseVector(Array(1.0, 2.0, 3.0, 4.0))
> val d2 = new DenseVector(Array(9.0))
> println(s1 == d1 && s2 == d2)
> println(Vectors.sqdist(s1, s2))
> println(Vectors.sqdist(d1, d2))
> result:
>  true
>  93.0
>  64.0
> More precisely, Vectors.sqdist includes the extra dimensions for sparse 
> vectors but excludes them for dense vectors. I'll send a PR and we can have 
> a more detailed discussion there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5384) Vectors.sqdist return inconsistent result for sparse/dense vectors when the vectors have different lengths

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5384:
-
Target Version/s: 1.3.0, 1.2.1  (was: 1.2.1)

> Vectors.sqdist return inconsistent result for sparse/dense vectors when the 
> vectors have different lengths
> --
>
> Key: SPARK-5384
> URL: https://issues.apache.org/jira/browse/SPARK-5384
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
> Environment: centos, others should be similar
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 1.3.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> For two vectors of different lengths, Vectors.sqdist would return different 
> results when the vectors are represented as sparse and dense, respectively. 
> Sample:   
> val s1 = new SparseVector(4, Array(0,1,2,3), Array(1.0, 2.0, 3.0, 4.0))
> val s2 = new SparseVector(1, Array(0), Array(9.0))
> val d1 = new DenseVector(Array(1.0, 2.0, 3.0, 4.0))
> val d2 = new DenseVector(Array(9.0))
> println(s1 == d1 && s2 == d2)
> println(Vectors.sqdist(s1, s2))
> println(Vectors.sqdist(d1, d2))
> result:
>  true
>  93.0
>  64.0
> More precisely, Vectors.sqdist includes the extra dimensions for sparse 
> vectors but excludes them for dense vectors. I'll send a PR and we can have 
> a more detailed discussion there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3439) Add Canopy Clustering Algorithm

2015-01-26 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292688#comment-14292688
 ] 

Xiangrui Meng commented on SPARK-3439:
--

[~angellandros] The public API and the complexity analysis are more important 
than implementation details. What is the expected input and output of the 
algorithm and how does it scale?

> Add Canopy Clustering Algorithm
> ---
>
> Key: SPARK-3439
> URL: https://issues.apache.org/jira/browse/SPARK-3439
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Yu Ishikawa
>Assignee: Muhammad-Ali A'rabi
>Priority: Minor
>
> The canopy clustering algorithm is an unsupervised pre-clustering algorithm. 
> It is often used as a preprocessing step for the K-means algorithm or the 
> Hierarchical clustering algorithm. It is intended to speed up clustering 
> operations on large data sets, where using another algorithm directly may be 
> impractical due to the size of the data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture

2015-01-26 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292695#comment-14292695
 ] 

Xiangrui Meng commented on SPARK-5400:
--

I like `GaussianMixture` better. I don't think it is worth making a generic EM 
algorithm. It is too general and we won't benefit much from code reuse.

> Rename GaussianMixtureEM to GaussianMixture
> ---
>
> Key: SPARK-5400
> URL: https://issues.apache.org/jira/browse/SPARK-5400
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> GaussianMixtureEM is following the old naming convention of including the 
> optimization algorithm name in the class title.  We should probably rename it 
> to GaussianMixture so that it can use other optimization algorithms in the 
> future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4587) Model export/import

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4587:
-
Assignee: Joseph K. Bradley

> Model export/import
> ---
>
> Key: SPARK-4587
> URL: https://issues.apache.org/jira/browse/SPARK-4587
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This is an umbrella JIRA for one of the most requested features on the user 
> mailing list. Model export/import can be done via Java serialization. But it 
> doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we 
> should provide save/load methods to every model. PMML is an option but it has 
> its limitations. There are a couple of things we need to discuss: 1) data format, 
> 2) how to preserve partitioning, 3) data compatibility between versions and 
> language APIs, etc.
> UPDATE: [Design doc for model import/export | 
> https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing]
> This document sketches machine learning model import/export plans, including 
> goals, an API, and development plans.
> The design doc proposes:
> * Support our own Spark-specific format.
> ** This is needed to (a) support distributed models and (b) get model 
> import/export support into Spark quickly (while avoiding new dependencies).
> * Also support PMML
> ** This is needed since it is the only thing approaching an industry standard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1856) Standardize MLlib interfaces

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1856:
-
Target Version/s:   (was: 1.3.0)

> Standardize MLlib interfaces
> 
>
> Key: SPARK-1856
> URL: https://issues.apache.org/jira/browse/SPARK-1856
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Blocker
>
> Instead of expanding MLlib based on the current class naming scheme 
> (ProblemWithAlgorithm),  we should standardize MLlib's interfaces that 
> clearly separate datasets, formulations, algorithms, parameter sets, and 
> models.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1856) Standardize MLlib interfaces

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1856:
-
Priority: Critical  (was: Blocker)

> Standardize MLlib interfaces
> 
>
> Key: SPARK-1856
> URL: https://issues.apache.org/jira/browse/SPARK-1856
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> Instead of expanding MLlib based on the current class naming scheme 
> (ProblemWithAlgorithm),  we should standardize MLlib's interfaces that 
> clearly separate datasets, formulations, algorithms, parameter sets, and 
> models.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1486) Support multi-model training in MLlib

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1486:
-
Target Version/s:   (was: 1.3.0)

> Support multi-model training in MLlib
> -
>
> Key: SPARK-1486
> URL: https://issues.apache.org/jira/browse/SPARK-1486
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>Priority: Critical
>
> It is rare in practice to train just one model with a given set of 
> parameters. Usually, this is done by training multiple models with different 
> sets of parameters and then selecting the best based on their performance on the 
> validation set. MLlib should provide native support for multi-model 
> training/scoring. It requires decoupling of concepts like problem, 
> formulation, algorithm, parameter set, and model, which are missing in MLlib 
> now. MLI implements similar concepts, which we can borrow. There are 
> different approaches for multi-model training:
> 0) Keep one copy of the data, and train models one after another (or maybe in 
> parallel, depending on the scheduler).
> 1) Keep one copy of the data, and train multiple models at the same time 
> (similar to `runs` in KMeans).
> 2) Make multiple copies of the data (still stored distributively), and use 
> more cores to distribute the work.
> 3) Collect the data, make the entire dataset available on workers, and train 
> one or more models on each worker.
> Users should be able to choose which execution mode they want to use. Note 
> that 3) could cover many use cases in practice when the training data is not 
> huge, e.g., <1GB.
> This task will be divided into sub-tasks and this JIRA is created to discuss 
> the design and track the overall progress.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4589) ML add-ons to SchemaRDD

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-4589.

Resolution: Duplicate

I'm closing this JIRA in favor of the DataFrame API.

> ML add-ons to SchemaRDD
> ---
>
> Key: SPARK-4589
> URL: https://issues.apache.org/jira/browse/SPARK-4589
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib, SQL
>Reporter: Xiangrui Meng
>
> One feedback we received from the Pipeline API (SPARK-3530) is about the 
> boilerplate code in the implementation. We can add more Scala DSL to simplify 
> the code for the operations we need in ML. Those operations could live under 
> spark.ml via implicit, or be added to SchemaRDD directly if they are also 
> useful for general purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5094) Python API for gradient-boosted trees

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5094:
-
Assignee: Kazuki Taniguchi

> Python API for gradient-boosted trees
> -
>
> Key: SPARK-5094
> URL: https://issues.apache.org/jira/browse/SPARK-5094
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Kazuki Taniguchi
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3717:
-
Target Version/s:   (was: 1.3.0)

> DecisionTree, RandomForest: Partition by feature
> 
>
> Key: SPARK-3717
> URL: https://issues.apache.org/jira/browse/SPARK-3717
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> h1. Summary
> Currently, data are partitioned by row/instance for DecisionTree and 
> RandomForest.  This JIRA argues for partitioning by feature for training deep 
> trees.  This is especially relevant for random forests, which are often 
> trained to be deeper than single decision trees.
> h1. Details
> Dataset dimensions and the depth of the tree to be trained are the main 
> problem parameters determining whether it is better to partition features or 
> instances.  For random forests (training many deep trees), partitioning 
> features could be much better.
> Notation:
> * P = # workers
> * N = # instances
> * M = # features
> * D = depth of tree
> h2. Partitioning Features
> Algorithm sketch:
> * Each worker stores:
> ** a subset of columns (i.e., a subset of features).  If a worker stores 
> feature j, then the worker stores the feature value for all instances (i.e., 
> the whole column).
> ** all labels
> * Train one level at a time.
> * Invariants:
> ** Each worker stores a mapping: instance → node in current level
> * On each iteration:
> ** Each worker: For each node in level, compute (best feature to split, info 
> gain).
> ** Reduce (P x M) values to M values to find best split for each node.
> ** Workers who have features used in best splits communicate left/right for 
> relevant instances.  Gather total of N bits to master, then broadcast.
> * Total communication:
> ** Depth D iterations
> ** On each iteration, reduce to M values (~8 bytes each), broadcast N values 
> (1 bit each).
> ** Estimate: D * (M * 8 + N)
> h2. Partitioning Instances
> Algorithm sketch:
> * Train one group of nodes at a time.
> * Invariants:
> ** Each worker stores a mapping: instance → node
> * On each iteration:
> ** Each worker: For each instance, add to aggregate statistics.
> ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes)
> *** (“# classes” is for classification.  3 for regression)
> ** Reduce aggregate.
> ** Master chooses best split for each node in group and broadcasts.
> * Local training: Once all instances for a node fit on one machine, it can be 
> best to shuffle the data and train subtrees locally.  This can mean shuffling 
> the entire dataset for each tree trained.
> * Summing over all iterations, reduce to total of:
> ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each)
> ** Estimate: 2^D * M * B * C * 8
> h2. Comparing Partitioning Methods
> Partitioning features cost < partitioning instances cost when:
> * D * (M * 8 + N) < 2^D * M * B * C * 8
> * D * N < 2^D * M * B * C * 8  (assuming D * M * 8 is small compared to the 
> right hand side)
> * N < [ 2^D * M * B * C * 8 ] / D
> Example: many instances:
> * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 
> 5)
> * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7
> * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8
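
(As a sanity check, the example arithmetic above can be reproduced with a few lines of plain Scala; no Spark is involved and the names simply mirror the notation in the description.)

{code}
val levels = 6                                   // 6 levels trained (depth D = 5)
val (n, m, b, c) = (2e6, 3500.0, 100.0, 5.0)     // instances, features, bins, classes

val featureCost  = levels * (m * 8 + n)              // ~1.2e7
val instanceCost = math.pow(2, 5) * m * b * c * 8    // 2^D * M * B * C * 8 ~ 4.5e8

println(f"features: $featureCost%.2e, instances: $instanceCost%.2e")
{code}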



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5321) Add transpose() method to Matrix

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5321:
-
Assignee: Burak Yavuz

> Add transpose() method to Matrix
> 
>
> Key: SPARK-5321
> URL: https://issues.apache.org/jira/browse/SPARK-5321
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>
> While we are working on BlockMatrix, it will be nice to add the support to 
> transpose matrices. .transpose() will just modify a private flag in local 
> matrices. Operations that follow will be performed based on this flag.
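
(A hedged sketch of the flag-based design -- the class and field names below are illustrative, not Spark's actual Matrix API: transpose only flips a flag and swaps the logical dimensions, and element lookup consults the flag.)

{code}
// Column-major local matrix; transpose copies no data.
class LocalMatrix(val numRows: Int, val numCols: Int,
                  val values: Array[Double], val isTransposed: Boolean = false) {

  // (i, j) lookup swaps the index arithmetic when the transposed flag is set.
  def apply(i: Int, j: Int): Double =
    if (isTransposed) values(j + i * numCols) else values(i + j * numRows)

  // O(1): swap the logical dimensions and flip the flag.
  def transpose: LocalMatrix =
    new LocalMatrix(numCols, numRows, values, !isTransposed)
}
{code}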



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5114) Should Evaluator be a PipelineStage

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5114:
-
Summary: Should Evaluator be a PipelineStage  (was: Should Evaluator by a 
PipelineStage)

> Should Evaluator be a PipelineStage
> ---
>
> Key: SPARK-5114
> URL: https://issues.apache.org/jira/browse/SPARK-5114
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> Pipelines can currently contain Estimators and Transformers.
> Question for debate: Should Pipelines be able to contain Evaluators?
> Pros:
> * Evaluators take input datasets with particular schema, which should perhaps 
> be checked before running a Pipeline.
> Cons:
> * Evaluators do not transform datasets.   They produce a scalar (or a few 
> values), which makes it hard to say how they fit into a Pipeline or a 
> PipelineModel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5413) Upgrade "metrics" dependency to 3.1.0

2015-01-26 Thread Ryan Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Williams updated SPARK-5413:
-
Description: 
Spark currently uses Coda Hale's metrics library version {{3.0.0}}.

Version {{3.1.0}} includes some useful improvements, like [batching metrics in 
TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] and 
supporting Graphite's UDP interface.

I'd like to bump Spark's version to take advantage of these; I'm playing with 
GraphiteSink and seeing Graphite struggle to ingest all executors' metrics.

  was:
Spark currently uses Coda Hale's metrics library version {{3.0.0}}.

Version {{3.1.0}} includes some useful improvements, like batching metrics in 
TCP and supporting Graphite's UDP interface.

I'd like to bump Spark's version to take advantage of these; I'm playing with 
GraphiteSink and seeing Graphite struggle to ingest all executors' metrics.


> Upgrade "metrics" dependency to 3.1.0
> -
>
> Key: SPARK-5413
> URL: https://issues.apache.org/jira/browse/SPARK-5413
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>
> Spark currently uses Coda Hale's metrics library version {{3.0.0}}.
> Version {{3.1.0}} includes some useful improvements, like [batching metrics 
> in 
> TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] 
> and supporting Graphite's UDP interface.
> I'd like to bump Spark's version to take advantage of these; I'm playing with 
> GraphiteSink and seeing Graphite struggle to ingest all executors' metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5413) Upgrade "metrics" dependency to 3.1.0

2015-01-26 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-5413:


 Summary: Upgrade "metrics" dependency to 3.1.0
 Key: SPARK-5413
 URL: https://issues.apache.org/jira/browse/SPARK-5413
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ryan Williams


Spark currently uses Coda Hale's metrics library version {{3.0.0}}.

Version {{3.1.0}} includes some useful improvements, like batching metrics in 
TCP and supporting Graphite's UDP interface.

I'd like to bump Spark's version to take advantage of these; I'm playing with 
GraphiteSink and seeing Graphite struggle to ingest all executors' metrics.
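
(In sbt terms the bump would look roughly like the snippet below; the coordinates assume the library's move to the io.dropwizard.metrics group in 3.1.x, and Spark's real build manages this through Maven properties instead.)

{code}
// Hypothetical sbt snippet, for illustration only.
libraryDependencies += "io.dropwizard.metrics" % "metrics-core"     % "3.1.0"
libraryDependencies += "io.dropwizard.metrics" % "metrics-graphite" % "3.1.0"
{code}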



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5413) Upgrade "metrics" dependency to 3.1.0

2015-01-26 Thread Ryan Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Williams updated SPARK-5413:
-
Description: 
Spark currently uses Coda Hale's metrics library version {{3.0.0}}.

Version {{3.1.0}} includes some useful improvements, like [batching metrics in 
TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] and 
[supporting Graphite's UDP 
interface|https://github.com/dropwizard/metrics/blob/v3.1.0/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteUDP.java].

I'd like to bump Spark's version to take advantage of these; I'm playing with 
GraphiteSink and seeing Graphite struggle to ingest all executors' metrics.

  was:
Spark currently uses Coda Hale's metrics library version {{3.0.0}}.

Version {{3.1.0}} includes some useful improvements, like [batching metrics in 
TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] and 
supporting Graphite's UDP interface.

I'd like to bump Spark's version to take advantage of these; I'm playing with 
GraphiteSink and seeing Graphite struggle to ingest all executors' metrics.


> Upgrade "metrics" dependency to 3.1.0
> -
>
> Key: SPARK-5413
> URL: https://issues.apache.org/jira/browse/SPARK-5413
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>
> Spark currently uses Coda Hale's metrics library version {{3.0.0}}.
> Version {{3.1.0}} includes some useful improvements, like [batching metrics 
> in 
> TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] 
> and [supporting Graphite's UDP 
> interface|https://github.com/dropwizard/metrics/blob/v3.1.0/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteUDP.java].
> I'd like to bump Spark's version to take advantage of these; I'm playing with 
> GraphiteSink and seeing Graphite struggle to ingest all executors' metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError: Requested array size exceeds VM limit"

2015-01-26 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292718#comment-14292718
 ] 

Xiangrui Meng commented on SPARK-4846:
--

[~josephtang] Are you working on this issue? If not, do you mind me sending a 
PR that throws an exception if vectorSize is beyond the limit?
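
(For what it's worth, the guard could be as small as the sketch below; the method name and message are illustrative, not the actual Word2Vec code.)

{code}
def checkModelSize(vocabSize: Long, vectorSize: Long): Unit = {
  // Fail fast when the flattened vocabulary matrix cannot fit in one JVM array.
  val entries = vocabSize * vectorSize
  require(entries > 0 && entries <= Int.MaxValue,
    s"vocabSize * vectorSize = $entries exceeds the maximum JVM array length; " +
      "reduce vectorSize or prune rare words from the vocabulary.")
}
{code}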

> When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError: 
> Requested array size exceeds VM limit"
> ---
>
> Key: SPARK-4846
> URL: https://issues.apache.org/jira/browse/SPARK-4846
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.1, 1.2.0
> Environment: Use Word2Vec to process a corpus(sized 3.5G) with one 
> partition.
> The corpus contains about 300 million words and its vocabulary size is about 
> 10 million.
>Reporter: Joseph Tang
>Assignee: Joseph Tang
>Priority: Minor
>
> Exception in thread "Driver" java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
> Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
> at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
> at 
> org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5414) Add SparkListener implementation that allows users to receive all listener events in one method

2015-01-26 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-5414:
-

 Summary: Add SparkListener implementation that allows users to 
receive all listener events in one method
 Key: SPARK-5414
 URL: https://issues.apache.org/jira/browse/SPARK-5414
 Project: Spark
  Issue Type: New Feature
Reporter: Josh Rosen
Assignee: Josh Rosen


Currently, users don't have a very good way to write a SparkListener that 
receives all SparkListener events and which will be future-compatible (e.g. it 
will receive events introduced in newer versions of Spark without having to 
override new methods to process those events).

Therefore, I think Spark should include a concrete SparkListener implementation 
that implements all of the message-handling methods and dispatches all of them 
to a single {{onEvent}} method.  By putting this code in Spark, we isolate 
users from changes to the listener API.
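
A rough sketch of the idea (the trait below is a simplified stand-in for the real SparkListener interface, which has many more callbacks and richer event types):

{code}
// Simplified stand-in for SparkListener, for illustration only.
trait MiniListener {
  def onJobStart(event: AnyRef): Unit = {}
  def onJobEnd(event: AnyRef): Unit = {}
  def onTaskEnd(event: AnyRef): Unit = {}
}

// Proposed adapter: every callback funnels into a single onEvent method, so user
// code keeps working even when new callbacks are added in later Spark versions.
abstract class MiniFiringListener extends MiniListener {
  def onEvent(event: AnyRef): Unit

  override def onJobStart(event: AnyRef): Unit = onEvent(event)
  override def onJobEnd(event: AnyRef): Unit = onEvent(event)
  override def onTaskEnd(event: AnyRef): Unit = onEvent(event)
}

// Example user listener that just counts events of any kind.
class CountingListener extends MiniFiringListener {
  var count = 0
  override def onEvent(event: AnyRef): Unit = { count += 1 }
}
{code}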



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5413) Upgrade "metrics" dependency to 3.1.0

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292719#comment-14292719
 ] 

Apache Spark commented on SPARK-5413:
-

User 'ryan-williams' has created a pull request for this issue:
https://github.com/apache/spark/pull/4209

> Upgrade "metrics" dependency to 3.1.0
> -
>
> Key: SPARK-5413
> URL: https://issues.apache.org/jira/browse/SPARK-5413
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>
> Spark currently uses Coda Hale's metrics library version {{3.0.0}}.
> Version {{3.1.0}} includes some useful improvements, like [batching metrics 
> in 
> TCP|https://github.com/dropwizard/metrics/issues/660#issuecomment-55626888] 
> and [supporting Graphite's UDP 
> interface|https://github.com/dropwizard/metrics/blob/v3.1.0/metrics-graphite/src/main/java/com/codahale/metrics/graphite/GraphiteUDP.java].
> I'd like to bump Spark's version to take advantage of these; I'm playing with 
> GraphiteSink and seeing Graphite struggle to ingest all executors' metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4979) Add streaming logistic regression

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4979:
-
Target Version/s:   (was: 1.3.0)

> Add streaming logistic regression
> -
>
> Key: SPARK-4979
> URL: https://issues.apache.org/jira/browse/SPARK-4979
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, Streaming
>Reporter: Jeremy Freeman
>Priority: Minor
>
> We currently support streaming linear regression and k-means clustering. We 
> can add support for streaming logistic regression using a strategy similar to 
> that used in streaming linear regression, applying gradient updates to 
> batches of data from a DStream, and extending the existing mllib methods with 
> minor modifications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5414) Add SparkListener implementation that allows users to receive all listener events in one method

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292720#comment-14292720
 ] 

Apache Spark commented on SPARK-5414:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4210

> Add SparkListener implementation that allows users to receive all listener 
> events in one method
> ---
>
> Key: SPARK-5414
> URL: https://issues.apache.org/jira/browse/SPARK-5414
> Project: Spark
>  Issue Type: New Feature
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Currently, users don't have a very good way to write a SparkListener that 
> receives all SparkListener events and which will be future-compatible (e.g. 
> it will receive events introduced in newer versions of Spark without having 
> to override new methods to process those events).
> Therefore, I think Spark should include a concrete SparkListener 
> implementation that implements all of the message-handling methods and 
> dispatches all of them to a single {{onEvent}} method.  By putting this code 
> in Spark, we isolate users from changes to the listener API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5415) Upgrade sbt to 0.13.7

2015-01-26 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-5415:


 Summary: Upgrade sbt to 0.13.7
 Key: SPARK-5415
 URL: https://issues.apache.org/jira/browse/SPARK-5415
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ryan Williams
Priority: Minor


Spark currently uses sbt {{0.13.6}}, which has a regression related to 
processing parent POMs in Maven projects.

{{0.13.7}} does not have this issue (though it's unclear whether it was fixed 
intentionally), so I'd like to bump up one version.

I ran into this while locally building a Spark assembly against a locally-built 
"metrics" JAR dependency; {{0.13.6}} could not build Spark but {{0.13.7}} 
worked fine. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5414) Add SparkListener implementation that allows users to receive all listener events in one method

2015-01-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5414:
--
Component/s: Spark Core

> Add SparkListener implementation that allows users to receive all listener 
> events in one method
> ---
>
> Key: SPARK-5414
> URL: https://issues.apache.org/jira/browse/SPARK-5414
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Currently, users don't have a very good way to write a SparkListener that 
> receives all SparkListener events and which will be future-compatible (e.g. 
> it will receive events introduced in newer versions of Spark without having 
> to override new methods to process those events).
> Therefore, I think Spark should include a concrete SparkListener 
> implementation that implements all of the message-handling methods and 
> dispatches all of them to a single {{onEvent}} method.  By putting this code 
> in Spark, we isolate users from changes to the listener API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5415) Upgrade sbt to 0.13.7

2015-01-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292723#comment-14292723
 ] 

Sean Owen commented on SPARK-5415:
--

FWIW I use 0.13.7 locally and have had no problems.

> Upgrade sbt to 0.13.7
> -
>
> Key: SPARK-5415
> URL: https://issues.apache.org/jira/browse/SPARK-5415
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> Spark currently uses sbt {{0.13.6}}, which has a regression related to 
> processing parent POMs in Maven projects.
> {{0.13.7}} does not have this issue (though it's unclear whether it was fixed 
> intentionally), so I'd like to bump up one version.
> I ran into this while locally building a Spark assembly against a 
> locally-built "metrics" JAR dependency; {{0.13.6}} could not build Spark but 
> {{0.13.7}} worked fine. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5415) Upgrade sbt to 0.13.7

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292722#comment-14292722
 ] 

Apache Spark commented on SPARK-5415:
-

User 'ryan-williams' has created a pull request for this issue:
https://github.com/apache/spark/pull/4211

> Upgrade sbt to 0.13.7
> -
>
> Key: SPARK-5415
> URL: https://issues.apache.org/jira/browse/SPARK-5415
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> Spark currently uses sbt {{0.13.6}}, which has a regression related to 
> processing parent POMs in Maven projects.
> {{0.13.7}} does not have this issue (though it's unclear whether it was fixed 
> intentionally), so I'd like to bump up one version.
> I ran into this while locally building a Spark assembly against a 
> locally-built "metrics" JAR dependency; {{0.13.6}} could not build Spark but 
> {{0.13.7}} worked fine. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5409) Broken link in documentation

2015-01-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5409.
--
Resolution: Duplicate

Actually, this was already fixed.

> Broken link in documentation
> 
>
> Key: SPARK-5409
> URL: https://issues.apache.org/jira/browse/SPARK-5409
> Project: Spark
>  Issue Type: Documentation
>Reporter: Mauro Pirrone
>Priority: Minor
>
> https://spark.apache.org/docs/1.2.0/streaming-kafka-integration.html
> "See the API docs and the example."
> Link to example is broken.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5261) In some cases ,The value of word's vector representation is too big

2015-01-26 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292750#comment-14292750
 ] 

Kai Sasaki commented on SPARK-5261:
---

[~gq] Can you provide us with the data set? I tried with several numbers of 
partitions but cannot reproduce it. 

> In some cases ,The value of word's vector representation is too big
> ---
>
> Key: SPARK-5261
> URL: https://issues.apache.org/jira/browse/SPARK-5261
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Guoqiang Li
>
> {code}
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(36)
> {code}
> The average absolute value of the word's vector representation is 60731.8
> {code}
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(1)
> {code}
> The average  absolute value of the word's vector representation is 0.13889



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5395) Large number of Python workers causing resource depletion

2015-01-26 Thread Mark Khaitman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292769#comment-14292769
 ] 

Mark Khaitman commented on SPARK-5395:
--

This may prove to be useful...

I'm watching a currently running spark-submitted job while watching the 
pyspark.daemon processes. The framework is permitted to use only 8 cores on 
each node, with the default python worker memory of 512 MB per node (not the 
executor memory, which is set higher than this).

Ignoring the exact RDD actions for a moment, it looks like while transitioning 
from Stage 1 -> Stage 2 it spawned 8-10 additional pyspark.daemon processes, 
making the box use more cores than it was even allowed to... A few seconds 
after that, the other 8 processes entered a sleeping state while still holding 
onto the physical memory they ate up in Stage 1. As soon as Stage 2 finished, 
practically all of the pyspark.daemons vanished and freed up the memory. 
I was keeping an eye on 2 random nodes and the exact same thing occurred on 
both. It was also the only executing job at the time, so there was really no 
other interference/contention for resources.

I will try to provide a bit more detail on the exact transformations/actions 
occurring between the 2 stages, although even without inspecting the 
spark-submitted code directly I know that at the very least a partitionBy and 
a cogroup are occurring.

> Large number of Python workers causing resource depletion
> -
>
> Key: SPARK-5395
> URL: https://issues.apache.org/jira/browse/SPARK-5395
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
> Environment: AWS ElasticMapReduce
>Reporter: Sven Krasser
>
> During job execution a large number of Python workers accumulates, eventually 
> causing YARN to kill containers for being over their memory allocation (in 
> the case below that is about 8G for executors plus 6G for overhead per 
> container). 
> In this instance, at the time of killing the container 97 pyspark.daemon 
> processes had accumulated.
> {noformat}
> 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler 
> (Logging.scala:logInfo(59)) - Container marked as failed: 
> container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: 
> Container [pid=35211,containerID=container_1421692415636_0052_01_30] is 
> running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB 
> physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing 
> container.
> Dump of the process-tree for container_1421692415636_0052_01_30 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) 
> VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m 
> pyspark.daemon
> |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m 
> pyspark.daemon
> |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m 
> pyspark.daemon
>   [...]
> {noformat}
> The configuration uses 64 containers with 2 cores each.
> Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c
> Mailinglist discussion: 
> https://www.mail-archive.com/user@spark.apache.org/msg20102.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5416) Initialize Executor.threadPool before ExecutorSource

2015-01-26 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-5416:


 Summary: Initialize Executor.threadPool before ExecutorSource
 Key: SPARK-5416
 URL: https://issues.apache.org/jira/browse/SPARK-5416
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ryan Williams
Priority: Minor


I recently saw some NPEs from 
[{{ExecutorSource:44}}|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/ExecutorSource.scala#L44]
 in the first couple seconds of my executors' being initialized.

I think that {{ExecutorSource}} was trying to report these metrics before its 
threadpool was initialized; there are a few LoC between the source being 
registered 
([Executor.scala:82|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L82])
 and the threadpool being initialized 
([Executor.scala:106|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L106]).

We should initialize the threadpool before the ExecutorSource is registered.
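
A minimal Scala sketch of the ordering fix, using illustrative stand-in classes 
rather than Spark's actual Executor/ExecutorSource (assume a metrics poll may 
arrive at any time once a source is registered):

{code:scala}
import java.util.concurrent.{Executors, ThreadPoolExecutor}

// Stand-in for a metrics source whose gauge dereferences the executor's pool.
class ThreadPoolSourceLike(pool: () => ThreadPoolExecutor) {
  // If this is polled before the pool exists, it throws an NPE.
  def activeTasks: Int = pool().getActiveCount
}

class ExecutorLike {
  // Fix: create the threadpool first...
  val threadPool: ThreadPoolExecutor =
    Executors.newCachedThreadPool().asInstanceOf[ThreadPoolExecutor]

  // ...and only then register anything that reads it.
  val source = new ThreadPoolSourceLike(() => threadPool)
}

object InitOrderDemo extends App {
  val exec = new ExecutorLike
  println(exec.source.activeTasks) // 0, and no NPE, thanks to the ordering above
}
{code}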



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5416) Initialize Executor.threadPool before ExecutorSource

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292784#comment-14292784
 ] 

Apache Spark commented on SPARK-5416:
-

User 'ryan-williams' has created a pull request for this issue:
https://github.com/apache/spark/pull/4212

> Initialize Executor.threadPool before ExecutorSource
> 
>
> Key: SPARK-5416
> URL: https://issues.apache.org/jira/browse/SPARK-5416
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> I recently saw some NPEs from 
> [{{ExecutorSource:44}}|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/ExecutorSource.scala#L44]
>  in the first couple seconds of my executors' being initialized.
> I think that {{ExecutorSource}} was trying to report these metrics before its 
> threadpool was initialized; there are a few LoC between the source being 
> registered 
> ([Executor.scala:82|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L82])
>  and the threadpool being initialized 
> ([Executor.scala:106|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L106]).
> We should initialize the threadpool before the ExecutorSource is registered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5417) Remove redundant executor-ID set() call

2015-01-26 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-5417:


 Summary: Remove redundant executor-ID set() call
 Key: SPARK-5417
 URL: https://issues.apache.org/jira/browse/SPARK-5417
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ryan Williams
Priority: Minor


{{spark.executor.id}} no longer [needs to be set in 
Executor.scala|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L79],
 as of [#4194|https://github.com/apache/spark/pull/4194]; it is set upstream in 
[SparkEnv|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/SparkEnv.scala#L332].
 Might as well remove the redundant set() in Executor.scala.
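
For illustration, a hedged sketch of why the second set() adds nothing (the key 
name is the real one from this ticket; the executor id value and the standalone 
program are made up, not Spark's actual code path):

{code:scala}
import org.apache.spark.SparkConf

object RedundantSetSketch extends App {
  val conf = new SparkConf(loadDefaults = false)

  // What SparkEnv already does when creating the executor environment:
  conf.set("spark.executor.id", "executor-1")

  // The duplicate call this ticket removes from Executor.scala; it just
  // overwrites the key with the same value.
  conf.set("spark.executor.id", "executor-1")

  println(conf.get("spark.executor.id")) // executor-1 either way
}
{code}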



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5417) Remove redundant executor-ID set() call

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292790#comment-14292790
 ] 

Apache Spark commented on SPARK-5417:
-

User 'ryan-williams' has created a pull request for this issue:
https://github.com/apache/spark/pull/4213

> Remove redundant executor-ID set() call
> ---
>
> Key: SPARK-5417
> URL: https://issues.apache.org/jira/browse/SPARK-5417
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> {{spark.executor.id}} no longer [needs to be set in 
> Executor.scala|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L79],
>  as of [#4194|https://github.com/apache/spark/pull/4194]; it is set upstream 
> in 
> [SparkEnv|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/SparkEnv.scala#L332].
>  Might as well remove the redundant set() in Executor.scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL

2015-01-26 Thread Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan updated SPARK-3880:
---
Attachment: (was: SparkSQLOnHBase_v2.docx)

> HBase as data source to SparkSQL
> 
>
> Key: SPARK-3880
> URL: https://issues.apache.org/jira/browse/SPARK-3880
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Yan
>Assignee: Yan
> Attachments: HBaseOnSpark.docx
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3562) Periodic cleanup event logs

2015-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292818#comment-14292818
 ] 

Apache Spark commented on SPARK-3562:
-

User 'viper-kun' has created a pull request for this issue:
https://github.com/apache/spark/pull/4214

> Periodic cleanup event logs
> ---
>
> Key: SPARK-3562
> URL: https://issues.apache.org/jira/browse/SPARK-3562
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: xukun
>
> If Spark applications are run frequently, many event logs are written into 
> spark.eventLog.dir. After a long time, the directory accumulates many event 
> logs that we no longer care about. Periodic cleanups would ensure that logs 
> older than a configured duration are forgotten, so there is no need to clean 
> up the logs by hand.
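
A minimal Scala sketch of the kind of periodic cleanup being asked for, under 
assumptions that are not Spark's actual configuration: a flat local event-log 
directory and a hard-coded retention window (a real implementation would need 
to respect spark.eventLog.dir, HDFS paths, and a configurable maximum age):

{code:scala}
import java.io.File
import java.util.concurrent.{Executors, TimeUnit}

object EventLogCleanerSketch {
  // Hypothetical settings; the real ones would come from Spark configuration.
  val eventLogDir = new File("/tmp/spark-events")
  val maxAgeMs: Long = 7L * 24 * 60 * 60 * 1000 // keep one week of logs

  def cleanOnce(): Unit = {
    val cutoff = System.currentTimeMillis() - maxAgeMs
    Option(eventLogDir.listFiles()).getOrElse(Array.empty[File])
      .filter(f => f.isFile && f.lastModified() < cutoff)
      .foreach(_.delete()) // forget logs older than the retention window
  }

  def main(args: Array[String]): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    // Run the cleanup on a schedule so nobody has to clean the directory by hand.
    scheduler.scheduleAtFixedRate(new Runnable { def run(): Unit = cleanOnce() },
      0L, 1L, TimeUnit.HOURS)
  }
}
{code}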



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL

2015-01-26 Thread Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan updated SPARK-3880:
---
Attachment: SparkSQLOnHBase_v2.0.docx

> HBase as data source to SparkSQL
> 
>
> Key: SPARK-3880
> URL: https://issues.apache.org/jira/browse/SPARK-3880
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Yan
>Assignee: Yan
> Attachments: HBaseOnSpark.docx, SparkSQLOnHBase_v2.0.docx
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


