[jira] [Updated] (SPARK-3526) Docs section on data locality

2014-09-15 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-3526:
--
Summary: Docs section on data locality  (was: Section on data locality)

 Docs section on data locality
 -

 Key: SPARK-3526
 URL: https://issues.apache.org/jira/browse/SPARK-3526
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.0.2
Reporter: Andrew Ash

 Several threads on the mailing list have been about data locality and how to 
 interpret PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, etc.  Let's get some more 
 details in the docs on this concept so we can point future questions there.
 A couple people appreciated the below description of locality so it could be 
 a good starting point:
 {quote}
 The locality is how close the data is to the code that's processing it.  
 PROCESS_LOCAL means data is in the same JVM as the code that's running, so 
 it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the same 
 node, or in another executor on the same node, so is a little slower because 
 the data has to travel across an IPC connection.  RACK_LOCAL is even slower 
 -- data is on a different server so needs to be sent over the network.
 Spark switches to lower locality levels when there's no unprocessed data on a 
 node that has idle CPUs.  In that situation you have two options: wait until 
 the busy CPUs free up so you can start another task that uses data on that 
 server, or start a new task on a farther away server that needs to bring data 
 from that remote place.  What Spark typically does is wait a bit in the hopes 
 that a busy CPU frees up.  Once that timeout expires, it starts moving the 
 data from far away to the free CPU.
 {quote}
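The "wait a bit, then fall back" behavior described above is governed by the locality wait settings, so the new docs section could point at them directly. A minimal conf/spark-defaults.conf sketch (values illustrative; 3000 ms is, as far as I recall, the shipped default):

{noformat}
# how long the scheduler waits for a slot at a better locality level before
# falling back one level (milliseconds in current releases)
spark.locality.wait          3000
# optional per-level overrides; they fall back to spark.locality.wait when unset
spark.locality.wait.process  3000
spark.locality.wait.node     3000
spark.locality.wait.rack     3000
{noformat}

Setting spark.locality.wait to 0 effectively disables the wait, so tasks launch immediately at whatever locality level is available.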






[jira] [Commented] (SPARK-3526) Docs section on data locality

2014-09-15 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133662#comment-14133662
 ] 

Andrew Ash commented on SPARK-3526:
---

Note: reports from users that reading from {{file://}} may be logged as 
{{PROCESS_LOCAL}} ?

 Docs section on data locality
 -

 Key: SPARK-3526
 URL: https://issues.apache.org/jira/browse/SPARK-3526
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.0.2
Reporter: Andrew Ash

 Several threads on the mailing list have been about data locality and how to 
 interpret PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, etc.  Let's get some more 
 details in the docs on this concept so we can point future questions there.
 A couple people appreciated the below description of locality so it could be 
 a good starting point:
 {quote}
 The locality is how close the data is to the code that's processing it.  
 PROCESS_LOCAL means data is in the same JVM as the code that's running, so 
 it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the same 
 node, or in another executor on the same node, so is a little slower because 
 the data has to travel across an IPC connection.  RACK_LOCAL is even slower 
 -- data is on a different server so needs to be sent over the network.
 Spark switches to lower locality levels when there's no unprocessed data on a 
 node that has idle CPUs.  In that situation you have two options: wait until 
 the busy CPUs free up so you can start another task that uses data on that 
 server, or start a new task on a farther away server that needs to bring data 
 from that remote place.  What Spark typically does is wait a bit in the hopes 
 that a busy CPU frees up.  Once that timeout expires, it starts moving the 
 data from far away to the free CPU.
 {quote}






[jira] [Created] (SPARK-3527) Strip the physical plan message margin

2014-09-15 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-3527:


 Summary: Strip the physical plan message margin
 Key: SPARK-3527
 URL: https://issues.apache.org/jira/browse/SPARK-3527
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Trivial









[jira] [Commented] (SPARK-3527) Strip the physical plan message margin

2014-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133669#comment-14133669
 ] 

Apache Spark commented on SPARK-3527:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/2392

 Strip the physical plan message margin
 --

 Key: SPARK-3527
 URL: https://issues.apache.org/jira/browse/SPARK-3527
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Trivial








[jira] [Created] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL

2014-09-15 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-3528:
-

 Summary: Reading data from file:/// should be called NODE_LOCAL 
not PROCESS_LOCAL
 Key: SPARK-3528
 URL: https://issues.apache.org/jira/browse/SPARK-3528
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Ash


Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task

{noformat}
14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
localhost, PROCESS_LOCAL, 1191 bytes)
14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
localhost, PROCESS_LOCAL, 1191 bytes)
14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
14/09/15 00:59:13 INFO HadoopRDD: Input split: 
file:/Users/aash/git/spark/pom.xml:20862+20863
14/09/15 00:59:13 INFO HadoopRDD: Input split: 
file:/Users/aash/git/spark/pom.xml:0+20862
{noformat}

There is an outstanding TODO in {{HadoopRDD.scala}} that may be related:

{noformat}
  override def getPreferredLocations(split: Partition): Seq[String] = {
    // TODO: Filtering out "localhost" in case of file:// URLs
    val hadoopSplit = split.asInstanceOf[HadoopPartition]
    hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
  }
{noformat}
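A quick way to observe this from a shell (a sketch; the path is hypothetical): run a count over any {{file://}} URL and look at where the tasks land, either in driver log lines like the ones above or in the Locality Level column on the stage page of the web UI.

{noformat}
scala> val rdd = sc.textFile("file:///tmp/pom.xml")   // any local file:// path
scala> rdd.count()
// then grep the driver log for "Starting task" lines, or open http://localhost:4040
// and check the "Locality Level" column for the stage
{noformat}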






[jira] [Comment Edited] (SPARK-3526) Docs section on data locality

2014-09-15 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133662#comment-14133662
 ] 

Andrew Ash edited comment on SPARK-3526 at 9/15/14 8:14 AM:


Note: reports from users that reading from {{file://}} may be logged as 
{{PROCESS_LOCAL}} ?

Edit: repro'd and filed as SPARK-3528


was (Author: aash):
Note: reports from users that reading from {{file://}} may be logged as 
{{PROCESS_LOCAL}} ?

 Docs section on data locality
 -

 Key: SPARK-3526
 URL: https://issues.apache.org/jira/browse/SPARK-3526
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.0.2
Reporter: Andrew Ash

 Several threads on the mailing list have been about data locality and how to 
 interpret PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, etc.  Let's get some more 
 details in the docs on this concept so we can point future questions there.
 A couple people appreciated the below description of locality so it could be 
 a good starting point:
 {quote}
 The locality is how close the data is to the code that's processing it.  
 PROCESS_LOCAL means data is in the same JVM as the code that's running, so 
 it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the same 
 node, or in another executor on the same node, so is a little slower because 
 the data has to travel across an IPC connection.  RACK_LOCAL is even slower 
 -- data is on a different server so needs to be sent over the network.
 Spark switches to lower locality levels when there's no unprocessed data on a 
 node that has idle CPUs.  In that situation you have two options: wait until 
 the busy CPUs free up so you can start another task that uses data on that 
 server, or start a new task on a farther away server that needs to bring data 
 from that remote place.  What Spark typically does is wait a bit in the hopes 
 that a busy CPU frees up.  Once that timeout expires, it starts moving the 
 data from far away to the free CPU.
 {quote}






[jira] [Updated] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL

2014-09-15 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-3528:
--
Description: 
Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task

{noformat}
spark> sc.textFile("pom.xml").count
...
14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
localhost, PROCESS_LOCAL, 1191 bytes)
14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
localhost, PROCESS_LOCAL, 1191 bytes)
14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
14/09/15 00:59:13 INFO HadoopRDD: Input split: 
file:/Users/aash/git/spark/pom.xml:20862+20863
14/09/15 00:59:13 INFO HadoopRDD: Input split: 
file:/Users/aash/git/spark/pom.xml:0+20862
{noformat}

There is an outstanding TODO in {{HadoopRDD.scala}} that may be related:

{noformat}
  override def getPreferredLocations(split: Partition): Seq[String] = {
    // TODO: Filtering out "localhost" in case of file:// URLs
    val hadoopSplit = split.asInstanceOf[HadoopPartition]
    hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
  }
{noformat}

  was:
Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task

{noformat}
14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
localhost, PROCESS_LOCAL, 1191 bytes)
14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
localhost, PROCESS_LOCAL, 1191 bytes)
14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
14/09/15 00:59:13 INFO HadoopRDD: Input split: 
file:/Users/aash/git/spark/pom.xml:20862+20863
14/09/15 00:59:13 INFO HadoopRDD: Input split: 
file:/Users/aash/git/spark/pom.xml:0+20862
{noformat}

There is an outstanding TODO in {{HadoopRDD.scala}} that may be related:

{noformat}
  override def getPreferredLocations(split: Partition): Seq[String] = {
    // TODO: Filtering out "localhost" in case of file:// URLs
    val hadoopSplit = split.asInstanceOf[HadoopPartition]
    hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
  }
{noformat}


 Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL
 

 Key: SPARK-3528
 URL: https://issues.apache.org/jira/browse/SPARK-3528
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Ash

 Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task
 {noformat}
 spark> sc.textFile("pom.xml").count
 ...
 14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
 localhost, PROCESS_LOCAL, 1191 bytes)
 14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
 localhost, PROCESS_LOCAL, 1191 bytes)
 14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
 14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
 14/09/15 00:59:13 INFO HadoopRDD: Input split: 
 file:/Users/aash/git/spark/pom.xml:20862+20863
 14/09/15 00:59:13 INFO HadoopRDD: Input split: 
 file:/Users/aash/git/spark/pom.xml:0+20862
 {noformat}
 There is an outstanding TODO in {{HadoopRDD.scala}} that may be related:
 {noformat}
   override def getPreferredLocations(split: Partition): Seq[String] = {
     // TODO: Filtering out "localhost" in case of file:// URLs
     val hadoopSplit = split.asInstanceOf[HadoopPartition]
     hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
   }
 {noformat}






[jira] [Created] (SPARK-3529) Delete the temporal files after test exit

2014-09-15 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-3529:


 Summary: Delete the temporal files after test exit
 Key: SPARK-3529
 URL: https://issues.apache.org/jira/browse/SPARK-3529
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Minor









[jira] [Commented] (SPARK-3529) Delete the temporal files after test exit

2014-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133683#comment-14133683
 ] 

Apache Spark commented on SPARK-3529:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/2393

 Delete the temporal files after test exit
 -

 Key: SPARK-3529
 URL: https://issues.apache.org/jira/browse/SPARK-3529
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Minor








[jira] [Created] (SPARK-3530) Pipeline and Parameters

2014-09-15 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-3530:


 Summary: Pipeline and Parameters
 Key: SPARK-3530
 URL: https://issues.apache.org/jira/browse/SPARK-3530
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical


This part of the design doc is for pipelines and parameters. I put the design 
doc at

https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing

I will copy the proposed interfaces to this JIRA later. Some sample code can be 
viewed at: https://github.com/mengxr/spark-ml/

Please help review the design and post your comments here. Thanks!






[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-09-15 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133694#comment-14133694
 ] 

Saisai Shao commented on SPARK-2926:


Hey [~rxin], here is the branch rebased on your code 
(https://github.com/jerryshao/apache-spark/tree/sort-shuffle-read-new-netty), 
mind taking a look at it? Thanks a lot.

 Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
 --

 Key: SPARK-2926
 URL: https://issues.apache.org/jira/browse/SPARK-2926
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 1.1.0
Reporter: Saisai Shao
 Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test 
 Report(contd).pdf, Spark Shuffle Test Report.pdf


 Currently Spark has already integrated a sort-based shuffle write, which 
 greatly improves IO performance and reduces memory consumption when the 
 number of reducers is very large. On the reducer side, however, it still 
 adopts the hash-based shuffle reader implementation, which neglects the 
 ordering attributes of the map output data in some situations.
 Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
 shuffle to further improve its performance.
 Work-in-progress code and a performance test report will be posted later 
 when some unit test bugs are fixed.
 Any comments would be greatly appreciated. 
 Thanks a lot.
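For readers skimming the issue, the core of an MR-style read is a k-way merge over map outputs that are already sorted by key. A minimal sketch of that idea (illustration only, not the proposed SortShuffleReader; {{mergeSorted}} is a made-up helper):

{code}
import scala.collection.mutable

// Merge k iterators, each already sorted by key, into one sorted iterator.
def mergeSorted[K, V](parts: Seq[Iterator[(K, V)]])(implicit ord: Ordering[K]): Iterator[(K, V)] = {
  // min-heap keyed on the next (head) key of each non-empty input
  val heap = mutable.PriorityQueue.empty[BufferedIterator[(K, V)]](
    Ordering.by[BufferedIterator[(K, V)], K](_.head._1).reverse)
  parts.map(_.buffered).filter(_.hasNext).foreach(heap.enqueue(_))
  new Iterator[(K, V)] {
    def hasNext: Boolean = heap.nonEmpty
    def next(): (K, V) = {
      val it = heap.dequeue()
      val kv = it.next()
      if (it.hasNext) heap.enqueue(it)   // re-insert with its new head key
      kv
    }
  }
}
{code}

A merge like this only needs to buffer one record per map output at a time, which is roughly where the memory savings over a hash-based read would come from.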






[jira] [Closed] (SPARK-3521) Missing modules in 1.1.0 source distribution - cant be build with maven

2014-09-15 Thread Radim Kolar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radim Kolar closed SPARK-3521.
--
   Resolution: Not a Problem
Fix Version/s: 1.1.1

The compile problem is fixed on the GitHub branch-1.1.

 Missing modules in 1.1.0 source distribution - cant be build with maven
 ---

 Key: SPARK-3521
 URL: https://issues.apache.org/jira/browse/SPARK-3521
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Radim Kolar
Priority: Minor
 Fix For: 1.1.1


 Modules {{bagel}}, {{mllib}}, {{flume-sink}} and {{flume}} are missing from 
 the source code distro, so Spark can't be built with Maven. It can't be built 
 by {{sbt/sbt}} either, due to another bug (_java.lang.IllegalStateException: 
 impossible to get artifacts when data has not been loaded. IvyNode = 
 org.slf4j#slf4j-api;1.6.1_).
 (hsn@sanatana:pts/6):work/spark-1.1.0% mvn -Pyarn -Phadoop-2.4 
 -Dhadoop.version=2.4.1 -DskipTests clean package
 [INFO] Scanning for projects...
 [ERROR] The build could not read 1 project - [Help 1]
 [ERROR]   
 [ERROR]   The project org.apache.spark:spark-parent:1.1.0 
 (/home/hsn/myports/spark11/work/spark-1.1.0/pom.xml) has 4 errors
 [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/bagel of 
 /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist
 [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/mllib of 
 /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist
 [ERROR] Child module 
 /home/hsn/myports/spark11/work/spark-1.1.0/external/flume of 
 /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist
 [ERROR] Child module 
 /home/hsn/myports/spark11/work/spark-1.1.0/external/flume-sink/pom.xml of 
 /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist






[jira] [Created] (SPARK-3531) select null from table would throw a MatchError

2014-09-15 Thread Adrian Wang (JIRA)
Adrian Wang created SPARK-3531:
--

 Summary: select null from table would throw a MatchError
 Key: SPARK-3531
 URL: https://issues.apache.org/jira/browse/SPARK-3531
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang


select null from src limit 1 will lead to a scala.MatchError
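A minimal repro sketch (assumptions: {{sc}} is an existing SparkContext and a Hive table named {{src}} exists, as in the report):

{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// reportedly throws scala.MatchError instead of returning a single NULL row
hiveContext.sql("select null from src limit 1").collect()
{code}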






[jira] [Commented] (SPARK-3531) select null from table would throw a MatchError

2014-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133728#comment-14133728
 ] 

Apache Spark commented on SPARK-3531:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/2396

 select null from table would throw a MatchError
 ---

 Key: SPARK-3531
 URL: https://issues.apache.org/jira/browse/SPARK-3531
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang

 select null from src limit 1 will lead to a scala.MatchError






[jira] [Commented] (SPARK-3530) Pipeline and Parameters

2014-09-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133760#comment-14133760
 ] 

Sean Owen commented on SPARK-3530:
--

A few high-level questions:

Is this a rewrite of MLlib? I see the old code will be deprecated. I assume the 
algorithms will come along, but in a fairly different form. I think that's 
actually a good thing. But is this targeted at a 2.x release, or sooner?

How does this relate to MLI and MLbase? I had thought they would in theory 
handle things like grid-search, but haven't seen activity or mention of these 
in a while. Is this at all a merge of the two or is MLlib going to take over 
these concerns?

I don't think you will need or want to use this code, but the oryx project 
already has an implementation of grid search on Spark. At least another take on 
the API for such a thing to consider. 
https://github.com/OryxProject/oryx/tree/master/oryx-ml/src/main/java/com/cloudera/oryx/ml/param

Big +1 for parameter tuning. That belongs as a first-class citizen. I'm also 
intrigued by doing better than trying every possible combination of parameters 
separately, and maybe sharing partial results to speed up several models' 
training. Is this realistic for any parameters besides things like the number of 
iterations, which isn't really a hyperparameter? I don't know, for example, of ways to 
build N models with N different overfitting params and share some work. I would 
love to know that's possible. Good to design for it anyway.

I see mention of a Dataset abstraction, which I'm assuming contains some type 
information, like distinguishing categorical and numeric features. I think 
that's very good!

I've always found the 'pipeline' part hard to build. It's tempting to construct 
a framework for feature extraction. To some degree you can by providing 
transformations, 1-hot encoding, etc. But I think that a framework for 
understanding arbitrary databases and fields and so on quickly becomes too 
endlessly large a scope. Spark Core to me is already the right abstraction for 
upstream ETL of data before entering an ML framework.  I mention it just 
because it's in the first picture, but I don't see discussion of actually doing 
user/product attribute selection later. So maybe it's not meant to be part of 
the proposal. 

I'd certainly like to keep up more with your work here. This is a big step 
forward in making MLlib more relevant to production deployments rather than 
just pure algorithm implementations.
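To make the grid-search point concrete, a naive baseline in plain Scala might look like the sketch below (illustration only, not the proposed spark.ml API; {{trainModel}} and {{evaluate}} are dummy stand-ins for a real estimator and metric):

{code}
// Stand-ins for whatever estimator and evaluation metric would plug in here.
def trainModel(regParam: Double, stepSize: Double): (Double, Double) = (regParam, stepSize)
def evaluate(model: (Double, Double)): Double = -(model._1 + model._2)  // dummy score

// Enumerate the full Cartesian product, train one model per combination,
// and keep the best by validation score.
val regParams = Seq(0.01, 0.1, 1.0)
val stepSizes = Seq(0.1, 1.0)
val results = for (r <- regParams; s <- stepSizes) yield {
  val model = trainModel(r, s)
  ((r, s), evaluate(model))
}
val ((bestReg, bestStep), bestScore) = results.maxBy(_._2)
{code}

The interesting design question is how much better a pipeline API can do than this, for example by sharing work across combinations where the optimizer allows it.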

 Pipeline and Parameters
 ---

 Key: SPARK-3530
 URL: https://issues.apache.org/jira/browse/SPARK-3530
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical

 This part of the design doc is for pipelines and parameters. I put the design 
 doc at
 https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
 I will copy the proposed interfaces to this JIRA later. Some sample code can 
 be viewed at: https://github.com/mengxr/spark-ml/
 Please help review the design and post your comments here. Thanks!






[jira] [Commented] (SPARK-2594) Add CACHE TABLE name AS SELECT ...

2014-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133776#comment-14133776
 ] 

Apache Spark commented on SPARK-2594:
-

User 'ravipesala' has created a pull request for this issue:
https://github.com/apache/spark/pull/2397

 Add CACHE TABLE name AS SELECT ...
 

 Key: SPARK-2594
 URL: https://issues.apache.org/jira/browse/SPARK-2594
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical








[jira] [Created] (SPARK-3532) Spark On FreeBSD. Snappy used by torrent broadcast fails to load native libs.

2014-09-15 Thread Prashant Sharma (JIRA)
Prashant Sharma created SPARK-3532:
--

 Summary: Spark On FreeBSD. Snappy used by torrent broadcast fails 
to load native libs.
 Key: SPARK-3532
 URL: https://issues.apache.org/jira/browse/SPARK-3532
 Project: Spark
  Issue Type: Bug
Reporter: Prashant Sharma
Priority: Minor


While trying out Spark on FreeBSD, this seemed like the first blocker.

Workaround: in conf/spark-defaults.conf, set spark.broadcast.compress to false.







[jira] [Updated] (SPARK-3532) Spark On FreeBSD. Snappy used by torrent broadcast fails to load native libs.

2014-09-15 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-3532:
---
Description: 
While trying out Spark on FreeBSD, this seemed like the first blocker.

Workaround: in conf/spark-defaults.conf, set spark.broadcast.compress to false.
An even better workaround is to set:
spark.io.compression.codec  lzf
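In conf/spark-defaults.conf form, the two workarounds would look roughly like this (a sketch; pick one):

{noformat}
# disable compressed broadcasts entirely
spark.broadcast.compress      false

# or keep compression but avoid the Snappy codec
spark.io.compression.codec    lzf
{noformat}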


  was:
While trying out spark on freebsd, this seemed like first blocker. 

Workaround: In conf/spark-defaults.conf, Set spark.broadcast.compress  false



 Spark On FreeBSD. Snappy used by torrent broadcast fails to load native libs.
 -

 Key: SPARK-3532
 URL: https://issues.apache.org/jira/browse/SPARK-3532
 Project: Spark
  Issue Type: Bug
Reporter: Prashant Sharma
Priority: Minor

 While trying out Spark on FreeBSD, this seemed like the first blocker.
 Workaround: in conf/spark-defaults.conf, set spark.broadcast.compress to
 false.
 An even better workaround is to set:
 spark.io.compression.codec  lzf






[jira] [Commented] (SPARK-3532) Spark On FreeBSD. Snappy used by torrent broadcast fails to load native libs.

2014-09-15 Thread Radim Kolar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133821#comment-14133821
 ] 

Radim Kolar commented on SPARK-3532:


You need to grab the Snappy native library from the _snappy-java_ FreeBSD port; it's not 
included in the Maven Central JAR.

{quote}
(hsn@sanatana:pts/8):~% pkg info -l snappyjava
snappyjava-1.0.4.1_1:
/usr/local/lib/libsnappyjava.so
/usr/local/share/java/classes/snappy-java.jar
{quote}
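One way to make that port-provided native library visible to Spark (an untested assumption on my part, not verified on FreeBSD) is to put its directory on the library path in conf/spark-defaults.conf:

{noformat}
spark.driver.extraLibraryPath    /usr/local/lib
spark.executor.extraLibraryPath  /usr/local/lib
{noformat}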

 Spark On FreeBSD. Snappy used by torrent broadcast fails to load native libs.
 -

 Key: SPARK-3532
 URL: https://issues.apache.org/jira/browse/SPARK-3532
 Project: Spark
  Issue Type: Bug
Reporter: Prashant Sharma
Priority: Minor

 While trying out Spark on FreeBSD, this seemed like the first blocker.
 Workaround: in conf/spark-defaults.conf, set spark.broadcast.compress to
 false.
 An even better workaround is to set:
 spark.io.compression.codec  lzf






[jira] [Resolved] (SPARK-3410) The priority of shutdownhook for ApplicationMaster should not be integer literal

2014-09-15 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-3410.
--
   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Kousuke Saruta

 The priority of shutdownhook for ApplicationMaster should not be integer 
 literal
 

 Key: SPARK-3410
 URL: https://issues.apache.org/jira/browse/SPARK-3410
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Minor
 Fix For: 1.2.0


 In ApplicationMaster, the priority of the shutdown hook is set to the integer 
 literal 30, which is expected to be higher than the priority of 
 o.a.h.FileSystem's shutdown hook.
 In FileSystem, the priority of the shutdown hook is exposed as a public constant 
 named SHUTDOWN_HOOK_PRIORITY, so I think it's better to use this constant 
 for the priority of ApplicationMaster's shutdown hook.
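 A sketch of the suggested change (not the actual patch; the hook body is a placeholder), registering relative to Hadoop's constant instead of the literal 30:

{code}
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.util.ShutdownHookManager

ShutdownHookManager.get().addShutdownHook(new Runnable {
  override def run(): Unit = {
    // placeholder for the ApplicationMaster's actual cleanup (e.g. deleting the staging dir)
    println("shutting down")
  }
}, FileSystem.SHUTDOWN_HOOK_PRIORITY + 1)  // always just above FileSystem's own hook
{code}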






[jira] [Commented] (SPARK-3396) Change LogisticRegressionWithSGD's default regType to L2

2014-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133917#comment-14133917
 ] 

Apache Spark commented on SPARK-3396:
-

User 'BigCrunsh' has created a pull request for this issue:
https://github.com/apache/spark/pull/2398

 Change LogisticRegressionWithSGD's default regType to L2
 -

 Key: SPARK-3396
 URL: https://issues.apache.org/jira/browse/SPARK-3396
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.1.0
Reporter: Xiangrui Meng
Assignee: Christoph Sawade

 The default updater is SimpleUpdater, which doesn't add any regularization.
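 For context, a sketch of how L2 has to be requested explicitly today against the 1.1-era API (assuming {{training}} is an RDD[LabeledPoint] that already exists):

{code}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.optimization.SquaredL2Updater

val lr = new LogisticRegressionWithSGD()
lr.optimizer
  .setNumIterations(100)
  .setRegParam(0.1)
  .setUpdater(new SquaredL2Updater)  // L2; the current default SimpleUpdater adds no regularization
val model = lr.run(training)
{code}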






[jira] [Commented] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL

2014-09-15 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133927#comment-14133927
 ] 

Nicholas Chammas commented on SPARK-3528:
-

[~aash] - How about for data read from S3? I see that being marked as 
{{PROCESS_LOCAL}} as well. 

{code}
 sc.textFile('s3n://...').count()
14/09/15 10:12:20 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
localhost, PROCESS_LOCAL, 1242 bytes)
14/09/15 10:12:20 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
localhost, PROCESS_LOCAL, 1242 bytes)
{code}


 Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL
 

 Key: SPARK-3528
 URL: https://issues.apache.org/jira/browse/SPARK-3528
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Ash

 Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task
 {noformat}
 spark> sc.textFile("pom.xml").count
 ...
 14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
 localhost, PROCESS_LOCAL, 1191 bytes)
 14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
 localhost, PROCESS_LOCAL, 1191 bytes)
 14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
 14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
 14/09/15 00:59:13 INFO HadoopRDD: Input split: 
 file:/Users/aash/git/spark/pom.xml:20862+20863
 14/09/15 00:59:13 INFO HadoopRDD: Input split: 
 file:/Users/aash/git/spark/pom.xml:0+20862
 {noformat}
 There is an outstanding TODO in {{HadoopRDD.scala}} that may be related:
 {noformat}
   override def getPreferredLocations(split: Partition): Seq[String] = {
     // TODO: Filtering out "localhost" in case of file:// URLs
     val hadoopSplit = split.asInstanceOf[HadoopPartition]
     hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
   }
 {noformat}






[jira] [Commented] (SPARK-3526) Docs section on data locality

2014-09-15 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133935#comment-14133935
 ] 

Nicholas Chammas commented on SPARK-3526:
-

FYI: Looks like the valid localities are [enumerated 
here|https://github.com/apache/spark/blob/cc14644460872efb344e8d895859d70213a40840/core/src/main/scala/org/apache/spark/scheduler/TaskLocality.scala#L25].
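From memory, that enumeration is roughly the following (the linked file is authoritative):

{code}
object TaskLocality extends Enumeration {
  // ordered from most local to least local
  val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value
}
{code}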

 Docs section on data locality
 -

 Key: SPARK-3526
 URL: https://issues.apache.org/jira/browse/SPARK-3526
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.0.2
Reporter: Andrew Ash
Assignee: Andrew Ash

 Several threads on the mailing list have been about data locality and how to 
 interpret PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, etc.  Let's get some more 
 details in the docs on this concept so we can point future questions there.
 A couple people appreciated the below description of locality so it could be 
 a good starting point:
 {quote}
 The locality is how close the data is to the code that's processing it.  
 PROCESS_LOCAL means data is in the same JVM as the code that's running, so 
 it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the same 
 node, or in another executor on the same node, so is a little slower because 
 the data has to travel across an IPC connection.  RACK_LOCAL is even slower 
 -- data is on a different server so needs to be sent over the network.
 Spark switches to lower locality levels when there's no unprocessed data on a 
 node that has idle CPUs.  In that situation you have two options: wait until 
 the busy CPUs free up so you can start another task that uses data on that 
 server, or start a new task on a farther away server that needs to bring data 
 from that remote place.  What Spark typically does is wait a bit in the hopes 
 that a busy CPU frees up.  Once that timeout expires, it starts moving the 
 data from far away to the free CPU.
 {quote}






[jira] [Resolved] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable

2014-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3470.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

 Have JavaSparkContext implement Closeable/AutoCloseable
 ---

 Key: SPARK-3470
 URL: https://issues.apache.org/jira/browse/SPARK-3470
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Shay Rojansky
Priority: Minor
 Fix For: 1.2.0


 After discussion in SPARK-2972, it seems like a good idea to allow Java 
 developers to use Java 7 automatic resource management with JavaSparkContext, 
 like so:
 {code:java}
 try (JavaSparkContext ctx = new JavaSparkContext(...)) {
return br.readLine();
 }
 {code}






[jira] [Commented] (SPARK-1895) Run tests on windows

2014-09-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133964#comment-14133964
 ] 

Sean Owen commented on SPARK-1895:
--

Can anyone still reproduce this? I know test temp file cleanup was improved in 
1.0.x, and am not sure I have heard of this since.

 Run tests on windows
 

 Key: SPARK-1895
 URL: https://issues.apache.org/jira/browse/SPARK-1895
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Windows
Affects Versions: 0.9.1
 Environment: spark-0.9.1-bin-hadoop1
Reporter: stribog
Priority: Trivial

 bin\pyspark python\pyspark\rdd.py
 Sometimes tests complete without error _.
 Last tests fail log:
 
 14/05/21 18:31:40 INFO Executor: Running task ID 321
 14/05/21 18:31:40 INFO Executor: Running task ID 324
 14/05/21 18:31:40 INFO Executor: Running task ID 322
 14/05/21 18:31:40 INFO Executor: Running task ID 323
 14/05/21 18:31:40 INFO PythonRDD: Times: total = 241, boot = 240, init = 1, 
 finish = 0
 14/05/21 18:31:40 INFO Executor: Serialized size of result for 324 is 607
 14/05/21 18:31:40 INFO Executor: Sending result for 324 directly to driver
 14/05/21 18:31:40 INFO Executor: Finished task ID 324
 14/05/21 18:31:40 INFO TaskSetManager: Finished TID 324 in 248 ms on 
 localhost (progress: 1/4)
 14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 3)
 14/05/21 18:31:40 INFO PythonRDD: Times: total = 518, boot = 516, init = 2, 
 finish = 0
 14/05/21 18:31:40 INFO Executor: Serialized size of result for 323 is 607
 14/05/21 18:31:40 INFO Executor: Sending result for 323 directly to driver
 14/05/21 18:31:40 INFO Executor: Finished task ID 323
 14/05/21 18:31:40 INFO TaskSetManager: Finished TID 323 in 528 ms on 
 localhost (progress: 2/4)
 14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 2)
 14/05/21 18:31:41 INFO PythonRDD: Times: total = 776, boot = 774, init = 2, 
 finish = 0
 14/05/21 18:31:41 INFO Executor: Serialized size of result for 322 is 607
 14/05/21 18:31:41 INFO Executor: Sending result for 322 directly to driver
 14/05/21 18:31:41 INFO Executor: Finished task ID 322
 14/05/21 18:31:41 INFO TaskSetManager: Finished TID 322 in 785 ms on 
 localhost (progress: 3/4)
 14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 1)
 14/05/21 18:31:41 INFO PythonRDD: Times: total = 1043, boot = 1042, init = 1, 
 finish = 0
 14/05/21 18:31:41 INFO Executor: Serialized size of result for 321 is 607
 14/05/21 18:31:41 INFO Executor: Sending result for 321 directly to driver
 14/05/21 18:31:41 INFO Executor: Finished task ID 321
 14/05/21 18:31:41 INFO TaskSetManager: Finished TID 321 in 1049 ms on 
 localhost (progress: 4/4)
 14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 0)
 14/05/21 18:31:41 INFO TaskSchedulerImpl: Removed TaskSet 80.0, whose tasks 
 have all completed, from pool
 14/05/21 18:31:41 INFO DAGScheduler: Stage 80 (top at doctest 
 __main__.RDD.top[0]:1) finished in 1,051 s
 14/05/21 18:31:41 INFO SparkContext: Job finished: top at doctest 
 __main__.RDD.top[0]:1, took 1.053832912 s
 14/05/21 18:31:41 INFO SparkContext: Starting job: top at doctest 
 __main__.RDD.top[1]:1
 14/05/21 18:31:41 INFO DAGScheduler: Got job 63 (top at doctest 
 __main__.RDD.top[1]:1) with 4 output partitions (allowLocal=false)
 14/05/21 18:31:41 INFO DAGScheduler: Final stage: Stage 81 (top at doctest 
 __main__.RDD.top[1]:1)
 14/05/21 18:31:41 INFO DAGScheduler: Parents of final stage: List()
 14/05/21 18:31:41 INFO DAGScheduler: Missing parents: List()
 14/05/21 18:31:41 INFO DAGScheduler: Submitting Stage 81 (PythonRDD[213] at 
 top at doctest __main__.RDD.top[1]:1), which has no missing parents
 14/05/21 18:31:41 INFO DAGScheduler: Submitting 4 missing tasks from Stage 81 
 (PythonRDD[213] at top at doctest __main__.RDD.top[1]:1)
 14/05/21 18:31:41 INFO TaskSchedulerImpl: Adding task set 81.0 with 4 tasks
 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:0 as TID 325 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:0 as 2594 bytes 
 in 0 ms
 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:1 as TID 326 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:1 as 2594 bytes 
 in 0 ms
 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:2 as TID 327 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:2 as 2594 bytes 
 in 0 ms
 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:3 as TID 328 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:3 as 2609 bytes 
 in 1 ms
 14/05/21 18:31:41 INFO Executor: Running task ID 326
 14/05/21 18:31:41 INFO Executor: 

[jira] [Resolved] (SPARK-1258) RDD.countByValue optimization

2014-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1258.
--
Resolution: Won't Fix

I'm taking the liberty of closing this, since this refers to an optimization 
using fastutil classes, which were removed from Spark. An equivalent 
optimization is employed now, using Spark's OpenHashMap.

 RDD.countByValue optimization
 -

 Key: SPARK-1258
 URL: https://issues.apache.org/jira/browse/SPARK-1258
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Jaroslav Kamenik
Priority: Trivial

 Class Object2LongOpenHashMap has a method add(key, incr) (addTo in the new 
 version) for incrementing the value assigned to the key. It should be faster 
 than the currently used map.put(v, map.getLong(v) + 1L).
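 To illustrate the pattern being compared (fastutil has since been removed from Spark, so this is only a sketch of the idea, not current Spark code):

{code}
import it.unimi.dsi.fastutil.objects.Object2LongOpenHashMap

val counts = new Object2LongOpenHashMap[String]()
counts.addTo("spark", 1L)                          // one lookup, increments in place
counts.put("spark", counts.getLong("spark") + 1L)  // the get-then-put pattern this issue wants to avoid
{code}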






[jira] [Commented] (SPARK-3506) 1.1.0-SNAPSHOT in docs for 1.1.0 under docs/latest

2014-09-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133976#comment-14133976
 ] 

Sean Owen commented on SPARK-3506:
--

Yeah, I imagine that can be touched up right now. For the future, I imagine the 
issue was just that the site was built from the branch before the release 
plugin upped the version and created the artifacts? So the site might be better 
built from the final released source artifact.

I imagine it's a release-process doc change, but I don't know where that lives.

 1.1.0-SNAPSHOT in docs for 1.1.0 under docs/latest
 --

 Key: SPARK-3506
 URL: https://issues.apache.org/jira/browse/SPARK-3506
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.1.0
Reporter: Jacek Laskowski
Priority: Trivial

 In https://spark.apache.org/docs/latest/ there are references to 
 1.1.0-SNAPSHOT:
 * This documentation is for Spark version 1.1.0-SNAPSHOT.
 * For the Scala API, Spark 1.1.0-SNAPSHOT uses Scala 2.10.
 It should be version 1.1.0, since that's the latest released version and the 
 header says so, too.






[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce

2014-09-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134096#comment-14134096
 ] 

Sean Owen commented on SPARK-2620:
--

FWIW, here is a mailing list comment that suggests 1.1 works with these case 
classes, although this is not a case where the REPL is being used:

http://apache-spark-user-list.1001560.n3.nabble.com/Compiler-issues-for-multiple-map-on-RDD-td14248.html

 case class cannot be used as key for reduce
 ---

 Key: SPARK-2620
 URL: https://issues.apache.org/jira/browse/SPARK-2620
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: reproduced on spark-shell local[4]
Reporter: Gerard Maas
Priority: Critical
  Labels: case-class, core

 Using a case class as a key doesn't seem to work properly on Spark 1.0.0
 A minimal example:
 case class P(name: String)
 val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
 sc.parallelize(ps).map(x => (x,1)).reduceByKey((x,y) => x+y).collect
 [Spark shell local mode] res: Array[(P, Int)] = Array((P(bob),1), 
 (P(bob),1), (P(abe),1), (P(charly),1))
 In contrast to the expected behavior, that should be equivalent to:
 sc.parallelize(ps).map(x => (x.name,1)).reduceByKey((x,y) => x+y).collect
 Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
 groupByKey and distinct also present the same behavior.






[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce

2014-09-15 Thread Daniel Siegmann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134101#comment-14134101
 ] 

Daniel Siegmann commented on SPARK-2620:


I have tested the case in spark-shell on Spark 1.1.0. It is still broken.

 case class cannot be used as key for reduce
 ---

 Key: SPARK-2620
 URL: https://issues.apache.org/jira/browse/SPARK-2620
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: reproduced on spark-shell local[4]
Reporter: Gerard Maas
Priority: Critical
  Labels: case-class, core

 Using a case class as a key doesn't seem to work properly on Spark 1.0.0
 A minimal example:
 case class P(name: String)
 val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
 sc.parallelize(ps).map(x => (x,1)).reduceByKey((x,y) => x+y).collect
 [Spark shell local mode] res: Array[(P, Int)] = Array((P(bob),1), 
 (P(bob),1), (P(abe),1), (P(charly),1))
 In contrast to the expected behavior, that should be equivalent to:
 sc.parallelize(ps).map(x => (x.name,1)).reduceByKey((x,y) => x+y).collect
 Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
 groupByKey and distinct also present the same behavior.






[jira] [Comment Edited] (SPARK-2984) FileNotFoundException on _temporary directory

2014-09-15 Thread Gregory Phillips (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134073#comment-14134073
 ] 

Gregory Phillips edited comment on SPARK-2984 at 9/15/14 4:43 PM:
--

I'm running into this as well. But to respond to this theory:

{quote}
I think this may be related to spark.speculation. I think the error condition 
might manifest in this circumstance:
1) task T starts on an executor E1
2) it takes a long time, so task T' is started on another executor E2
3) T finishes in E1 so moves its data from _temporary to the final destination 
and deletes the _temporary directory during cleanup
4) T' finishes in E2 and attempts to move its data from _temporary, but those 
files no longer exist! exception
{quote}

Speculation is not necessary for this to occur. I am consistently running into 
this while testing some code against local without speculation where I am 
trying to download, manipulate and merge 2 sets of data from S3 and serialize 
the resulting RDD using saveAsTextFile back to S3:

{code}
Job aborted due to stage failure: Task 3.0:754 failed 1 times, most recent 
failure: Exception failure in TID 762 on host localhost: 
java.io.FileNotFoundException: 
s3n://bucket/_temporary/_attempt_201409151537__m_000754_762/part-00754.deflate:
 No such file or directory. 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:340)
 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:165)
 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
 
org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
 org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109) 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:786)
 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:769)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) 
org.apache.spark.scheduler.Task.run(Task.scala:51) 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
java.lang.Thread.run(Thread.java:744) Driver stacktrace:
{code}

I'm happy to provide more information or help investigate further to figure 
this one out.

Edit: I forgot to mention that the file in question actually did exist on S3 
when I checked after receiving this exception.


was (Author: gphil):
I'm running into this as well. But to respond to this theory:

{quote}
I think this may be related to spark.speculation. I think the error condition 
might manifest in this circumstance:
1) task T starts on an executor E1
2) it takes a long time, so task T' is started on another executor E2
3) T finishes in E1 so moves its data from _temporary to the final destination 
and deletes the _temporary directory during cleanup
4) T' finishes in E2 and attempts to move its data from _temporary, but those 
files no longer exist! exception
{quote}

Speculation is not necessary for this to occur. I am consistently running into 
this while testing some code against local without speculation where I am 
trying to download, manipulate and merge 2 sets of data from S3 and serialize 
the resulting RDD using saveAsTextFile back to S3:

{code}
Job aborted due to stage failure: Task 3.0:754 failed 1 times, most recent 
failure: Exception failure in TID 762 on host localhost: 
java.io.FileNotFoundException: 
s3n://bucket/_temporary/_attempt_201409151537__m_000754_762/part-00754.deflate:
 No such file or directory. 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:340)
 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:165)
 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
 
org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
 org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109) 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:786)
 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:769)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) 
org.apache.spark.scheduler.Task.run(Task.scala:51) 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
java.lang.Thread.run(Thread.java:744) Driver stacktrace:
{code}

I'm happy to provide more information or help investigate further to figure 
this one out.

 FileNotFoundException on _temporary directory
 

[jira] [Commented] (SPARK-2932) Move MasterFailureTest out of main source directory

2014-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134116#comment-14134116
 ] 

Apache Spark commented on SPARK-2932:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/2399

 Move MasterFailureTest out of main source directory
 -

 Key: SPARK-2932
 URL: https://issues.apache.org/jira/browse/SPARK-2932
 Project: Spark
  Issue Type: Task
  Components: Streaming
Reporter: Marcelo Vanzin
Priority: Trivial

 Currently, MasterFailureTest.scala lives in streaming/src/main, which means 
 it ends up in the published streaming jar.
 It's only used by other test code, and although it also provides a main() 
 entry point, that's also only usable for testing, so the code should probably 
 be moved to the test directory.






[jira] [Updated] (SPARK-1895) Run tests on windows

2014-09-15 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-1895:
--
Description: 
bin\pyspark python\pyspark\rdd.py

Sometimes tests complete without error _.
Last tests fail log:

{noformat}

14/05/21 18:31:40 INFO Executor: Running task ID 321
14/05/21 18:31:40 INFO Executor: Running task ID 324
14/05/21 18:31:40 INFO Executor: Running task ID 322
14/05/21 18:31:40 INFO Executor: Running task ID 323
14/05/21 18:31:40 INFO PythonRDD: Times: total = 241, boot = 240, init = 1, 
finish = 0
14/05/21 18:31:40 INFO Executor: Serialized size of result for 324 is 607
14/05/21 18:31:40 INFO Executor: Sending result for 324 directly to driver
14/05/21 18:31:40 INFO Executor: Finished task ID 324
14/05/21 18:31:40 INFO TaskSetManager: Finished TID 324 in 248 ms on localhost 
(progress: 1/4)
14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 3)
14/05/21 18:31:40 INFO PythonRDD: Times: total = 518, boot = 516, init = 2, 
finish = 0
14/05/21 18:31:40 INFO Executor: Serialized size of result for 323 is 607
14/05/21 18:31:40 INFO Executor: Sending result for 323 directly to driver
14/05/21 18:31:40 INFO Executor: Finished task ID 323
14/05/21 18:31:40 INFO TaskSetManager: Finished TID 323 in 528 ms on localhost 
(progress: 2/4)
14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 2)
14/05/21 18:31:41 INFO PythonRDD: Times: total = 776, boot = 774, init = 2, 
finish = 0
14/05/21 18:31:41 INFO Executor: Serialized size of result for 322 is 607
14/05/21 18:31:41 INFO Executor: Sending result for 322 directly to driver
14/05/21 18:31:41 INFO Executor: Finished task ID 322
14/05/21 18:31:41 INFO TaskSetManager: Finished TID 322 in 785 ms on localhost 
(progress: 3/4)
14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 1)
14/05/21 18:31:41 INFO PythonRDD: Times: total = 1043, boot = 1042, init = 1, 
finish = 0
14/05/21 18:31:41 INFO Executor: Serialized size of result for 321 is 607
14/05/21 18:31:41 INFO Executor: Sending result for 321 directly to driver
14/05/21 18:31:41 INFO Executor: Finished task ID 321
14/05/21 18:31:41 INFO TaskSetManager: Finished TID 321 in 1049 ms on localhost 
(progress: 4/4)
14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 0)
14/05/21 18:31:41 INFO TaskSchedulerImpl: Removed TaskSet 80.0, whose tasks 
have all completed, from pool
14/05/21 18:31:41 INFO DAGScheduler: Stage 80 (top at doctest 
__main__.RDD.top[0]:1) finished in 1,051 s
14/05/21 18:31:41 INFO SparkContext: Job finished: top at doctest 
__main__.RDD.top[0]:1, took 1.053832912 s
14/05/21 18:31:41 INFO SparkContext: Starting job: top at doctest 
__main__.RDD.top[1]:1
14/05/21 18:31:41 INFO DAGScheduler: Got job 63 (top at doctest 
__main__.RDD.top[1]:1) with 4 output partitions (allowLocal=false)
14/05/21 18:31:41 INFO DAGScheduler: Final stage: Stage 81 (top at doctest 
__main__.RDD.top[1]:1)
14/05/21 18:31:41 INFO DAGScheduler: Parents of final stage: List()
14/05/21 18:31:41 INFO DAGScheduler: Missing parents: List()
14/05/21 18:31:41 INFO DAGScheduler: Submitting Stage 81 (PythonRDD[213] at top 
at doctest __main__.RDD.top[1]:1), which has no missing parents
14/05/21 18:31:41 INFO DAGScheduler: Submitting 4 missing tasks from Stage 81 
(PythonRDD[213] at top at doctest __main__.RDD.top[1]:1)
14/05/21 18:31:41 INFO TaskSchedulerImpl: Adding task set 81.0 with 4 tasks
14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:0 as TID 325 on 
executor localhost: localhost (PROCESS_LOCAL)
14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:0 as 2594 bytes in 
0 ms
14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:1 as TID 326 on 
executor localhost: localhost (PROCESS_LOCAL)
14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:1 as 2594 bytes in 
0 ms
14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:2 as TID 327 on 
executor localhost: localhost (PROCESS_LOCAL)
14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:2 as 2594 bytes in 
0 ms
14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:3 as TID 328 on 
executor localhost: localhost (PROCESS_LOCAL)
14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:3 as 2609 bytes in 
1 ms
14/05/21 18:31:41 INFO Executor: Running task ID 326
14/05/21 18:31:41 INFO Executor: Running task ID 328
14/05/21 18:31:41 INFO Executor: Running task ID 327
14/05/21 18:31:41 INFO Executor: Running task ID 325
14/05/21 18:31:41 INFO CacheManager: Partition rdd_212_3 not found, computing it
14/05/21 18:31:41 INFO MemoryStore: ensureFreeSpace(152) called with 
curMem=1120, maxMem=311387750
14/05/21 18:31:41 INFO MemoryStore: Block rdd_212_3 stored as values to memory 
(estimated size 152.0 B, free 297.0 MB)
14/05/21 18:31:41 INFO BlockManagerMasterActor$BlockManagerInfo: Added 
rdd_212_3 in memory on stribog-pc:37187 (size: 152.0 B, free: 297.0 MB)
14/05/21 18:31:41 INFO 

[jira] [Updated] (SPARK-1764) EOF reached before Python server acknowledged

2014-09-15 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-1764:
--
Description: 
I'm getting EOF reached before Python server acknowledged while using PySpark 
on Mesos. The error manifests itself in multiple ways. One is:

{noformat}
14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed 
due to the error EOF reached before Python server acknowledged; shutting down 
SparkContext
{noformat}

And the other has a full stacktrace:

{noformat}
14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server 
acknowledged
org.apache.spark.SparkException: EOF reached before Python server acknowledged
at 
org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416)
at 
org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387)
at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71)
at 
org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279)
at 
org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.Accumulators$.add(Accumulators.scala:277)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{noformat}

This error causes the SparkContext to shut down. I have not been able to 
reliably reproduce this bug; it seems to happen randomly, but if you run enough 
tasks on a SparkContext it will happen eventually.

  was:
I'm getting EOF reached before Python server acknowledged while using PySpark 
on Mesos. The error manifests itself in multiple ways. One is:

14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed 
due to the error EOF reached before Python server acknowledged; shutting down 
SparkContext

And the other has a full stacktrace:

14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server 
acknowledged
org.apache.spark.SparkException: EOF reached before Python server acknowledged
at 
org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416)
at 
org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387)
at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71)
at 
org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279)
at 
org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.Accumulators$.add(Accumulators.scala:277)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at 

[jira] [Updated] (SPARK-2586) Lack of information to figure out connection to Tachyon master is inactive/ down

2014-09-15 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-2586:
--
Description: 
When running Spark with Tachyon, if the connection to the Tachyon master is 
down (due to a network problem or because the master node is down), there is no 
clear log or error message to indicate it.

Here is a sample log from running the SparkTachyonPi example while it connects 
to Tachyon:

{noformat}
14/07/15 16:43:10 INFO Utils: Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties
14/07/15 16:43:10 WARN Utils: Your hostname, henry-pivotal.local resolves to a 
loopback address: 127.0.0.1; using 10.64.5.148 instead (on interface en5)
14/07/15 16:43:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
14/07/15 16:43:11 INFO SecurityManager: Changing view acls to: hsaputra
14/07/15 16:43:11 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hsaputra)
14/07/15 16:43:11 INFO Slf4jLogger: Slf4jLogger started
14/07/15 16:43:11 INFO Remoting: Starting remoting
14/07/15 16:43:11 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203]
14/07/15 16:43:11 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203]
14/07/15 16:43:11 INFO SparkEnv: Registering MapOutputTracker
14/07/15 16:43:11 INFO SparkEnv: Registering BlockManagerMaster
14/07/15 16:43:11 INFO DiskBlockManager: Created local directory at 
/var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-local-20140715164311-e63c
14/07/15 16:43:11 INFO ConnectionManager: Bound socket to port 53204 with id = 
ConnectionManagerId(office-5-148.pa.gopivotal.com,53204)
14/07/15 16:43:11 INFO MemoryStore: MemoryStore started with capacity 2.1 GB
14/07/15 16:43:11 INFO BlockManagerMaster: Trying to register BlockManager
14/07/15 16:43:11 INFO BlockManagerMasterActor: Registering block manager 
office-5-148.pa.gopivotal.com:53204 with 2.1 GB RAM
14/07/15 16:43:11 INFO BlockManagerMaster: Registered BlockManager
14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server
14/07/15 16:43:11 INFO HttpBroadcast: Broadcast server started at 
http://10.64.5.148:53205
14/07/15 16:43:11 INFO HttpFileServer: HTTP File server directory is 
/var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-b2fb12ae-4608-4833-87b6-b335da00738e
14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server
14/07/15 16:43:12 INFO SparkUI: Started SparkUI at 
http://office-5-148.pa.gopivotal.com:4040
2014-07-15 16:43:12.210 java[39068:1903] Unable to load realm info from 
SCDynamicStore
14/07/15 16:43:12 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/07/15 16:43:12 INFO SparkContext: Added JAR 
examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar at 
http://10.64.5.148:53206/jars/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar 
with timestamp 1405467792813
14/07/15 16:43:12 INFO AppClient$ClientActor: Connecting to master 
spark://henry-pivotal.local:7077...
14/07/15 16:43:12 INFO SparkContext: Starting job: reduce at 
SparkTachyonPi.scala:43
14/07/15 16:43:12 INFO DAGScheduler: Got job 0 (reduce at 
SparkTachyonPi.scala:43) with 2 output partitions (allowLocal=false)
14/07/15 16:43:12 INFO DAGScheduler: Final stage: Stage 0(reduce at 
SparkTachyonPi.scala:43)
14/07/15 16:43:12 INFO DAGScheduler: Parents of final stage: List()
14/07/15 16:43:12 INFO DAGScheduler: Missing parents: List()
14/07/15 16:43:12 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at 
SparkTachyonPi.scala:39), which has no missing parents
14/07/15 16:43:13 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 
(MappedRDD[1] at map at SparkTachyonPi.scala:39)
14/07/15 16:43:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Connected to Spark cluster 
with app ID app-20140715164313-
14/07/15 16:43:13 INFO AppClient$ClientActor: Executor added: 
app-20140715164313-/0 on 
worker-20140715164009-office-5-148.pa.gopivotal.com-52519 
(office-5-148.pa.gopivotal.com:52519) with 8 cores
14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20140715164313-/0 on hostPort office-5-148.pa.gopivotal.com:52519 with 
8 cores, 512.0 MB RAM
14/07/15 16:43:13 INFO AppClient$ClientActor: Executor updated: 
app-20140715164313-/0 is now RUNNING
14/07/15 16:43:15 INFO SparkDeploySchedulerBackend: Registered executor: 
Actor[akka.tcp://sparkexecu...@office-5-148.pa.gopivotal.com:53213/user/Executor#-423405256]
 with ID 0
14/07/15 16:43:15 INFO TaskSetManager: Re-computing pending task lists.
14/07/15 16:43:15 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 
0: office-5-148.pa.gopivotal.com (PROCESS_LOCAL)
14/07/15 16:43:15 INFO 

[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles

2014-09-15 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134160#comment-14134160
 ] 

Andrew Ash commented on SPARK-1239:
---

For large statuses, would we expect that to exceed {{spark.akka.frameSize}} and 
cause the below exception?

{noformat}
2014-09-14T01:34:21.305 ERROR [spark-akka.actor.default-dispatcher-4] 
org.apache.spark.MapOutputTrackerMasterActor - Map output statuses were 
13920119 bytes which exceeds spark.akka.frameSize (10485760 bytes).
{noformat}

 Don't fetch all map output statuses at each reducer during shuffles
 ---

 Key: SPARK-1239
 URL: https://issues.apache.org/jira/browse/SPARK-1239
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: Patrick Wendell
Assignee: Kostas Sakellis

 Instead we should modify the way we fetch map output statuses to take both a 
 mapper and a reducer - or we should just piggyback the statuses on each task. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles

2014-09-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134198#comment-14134198
 ] 

Patrick Wendell commented on SPARK-1239:


Yes, the current state of the art is to just increase the frame size.
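
For reference, a minimal sketch of that workaround (assuming a Spark 1.x 
application, where the value is interpreted in MB):

{code}
// Hypothetical sketch: raise spark.akka.frameSize (in MB) so that large map
// output status messages fit inside a single Akka frame.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("large-shuffle-job")      // hypothetical app name
  .set("spark.akka.frameSize", "64")    // default is 10 (MB); raise as needed

val sc = new SparkContext(conf)
{code}

The same property can also be set with {{--conf}} on spark-submit or in 
spark-defaults.conf.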

 Don't fetch all map output statuses at each reducer during shuffles
 ---

 Key: SPARK-1239
 URL: https://issues.apache.org/jira/browse/SPARK-1239
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: Patrick Wendell
Assignee: Kostas Sakellis

 Instead we should modify the way we fetch map output statuses to take both a 
 mapper and a reducer - or we should just piggyback the statuses on each task. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3425) OpenJDK - when run with jvm 1.8, should not set MaxPermSize

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3425.

Resolution: Fixed

Fixed by:
https://github.com/apache/spark/pull/2301
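
Per the issue title, the idea is to skip the flag on Java 8 and later.  A 
hypothetical sketch of such a version check (the actual change is in the 
linked PR):

{code}
// Hypothetical sketch: only pass -XX:MaxPermSize to JVMs older than 1.8,
// where the permanent generation still exists.
val spec = sys.props.getOrElse("java.specification.version", "1.7")  // e.g. "1.7", "1.8"
val permGenOpt =
  if (spec.toDouble < 1.8) Seq("-XX:MaxPermSize=128m") else Seq.empty[String]
{code}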

 OpenJDK - when run with jvm 1.8, should not set MaxPermSize
 ---

 Key: SPARK-3425
 URL: https://issues.apache.org/jira/browse/SPARK-3425
 Project: Spark
  Issue Type: Improvement
Reporter: Matthew Farrellee
Assignee: Matthew Farrellee
Priority: Minor
 Fix For: 1.2.0


 In JVM 1.8.0, MaxPermSize is no longer supported.
 In spark stderr output, there would be a line of
 Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; 
 support was removed in 8.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2377) Create a Python API for Spark Streaming

2014-09-15 Thread Jyotiska NK (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134210#comment-14134210
 ] 

Jyotiska NK edited comment on SPARK-2377 at 9/15/14 6:00 PM:
-

I have been watching the work going on in PR #11 for a while. Is there any way 
to contribute to this?


was (Author: jyotiska):
I have been watching the work going on PR #11. Is there any way to contribute 
to this?

 Create a Python API for Spark Streaming
 ---

 Key: SPARK-2377
 URL: https://issues.apache.org/jira/browse/SPARK-2377
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Streaming
Reporter: Nicholas Chammas
Assignee: Kenichi Takagiwa

 [Spark 
 Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html]
  currently offers APIs in Scala and Java. It would be great feature add to 
 have a Python API as well.
 This is probably a large task that will span many issues if undertaken. This 
 ticket should provide some place to track overall progress towards an initial 
 Python API for Spark Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2377) Create a Python API for Spark Streaming

2014-09-15 Thread Jyotiska NK (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134210#comment-14134210
 ] 

Jyotiska NK commented on SPARK-2377:


I have been watching the work going on PR #11. Is there any way to contribute 
to this?

 Create a Python API for Spark Streaming
 ---

 Key: SPARK-2377
 URL: https://issues.apache.org/jira/browse/SPARK-2377
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Streaming
Reporter: Nicholas Chammas
Assignee: Kenichi Takagiwa

 [Spark 
 Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html]
  currently offers APIs in Scala and Java. It would be great feature add to 
 have a Python API as well.
 This is probably a large task that will span many issues if undertaken. This 
 ticket should provide some place to track overall progress towards an initial 
 Python API for Spark Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2377) Create a Python API for Spark Streaming

2014-09-15 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134225#comment-14134225
 ] 

Matthew Farrellee commented on SPARK-2377:
--

It's a little tricky: you need to clone tdas's or giwa's repository, make 
changes on its master branch (it's far from the current Spark master), and 
submit pull requests to giwa or tdas.

IMHO, it'd be much simpler if the PR were tagged [WIP] and directed toward the 
apache/spark repo! (Please!)

 Create a Python API for Spark Streaming
 ---

 Key: SPARK-2377
 URL: https://issues.apache.org/jira/browse/SPARK-2377
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Streaming
Reporter: Nicholas Chammas
Assignee: Kenichi Takagiwa

 [Spark 
 Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html]
  currently offers APIs in Scala and Java. It would be great feature add to 
 have a Python API as well.
 This is probably a large task that will span many issues if undertaken. This 
 ticket should provide some place to track overall progress towards an initial 
 Python API for Spark Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134246#comment-14134246
 ] 

Derrick Burns commented on SPARK-2308:
--

I have implemented MiniBatch KMeans in Spark.  

We do not need a special iterator type or random access to get the advantages 
of MiniBatch because the gain comes primarily from decreasing the number of 
distance calculations, and not from decreasing the number of points that are 
touched.  

MiniBatch is good if the number of points is dramatically larger than the 
number of clusters.  In that case, any sampling of points will impact a large 
number of clusters, leading to faster convergence. MiniBatch is less useful 
when the number of desired clusters is large.
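
For context, a rough sketch of one mini-batch round as described in this 
ticket (hypothetical code, assuming squared Euclidean distance and an existing 
RDD of points; a full implementation would typically update centers with a 
per-center learning rate rather than a plain mean):

{code}
// Hypothetical sketch: one mini-batch k-means round. Sample a fraction of the
// points, assign each sampled point to its nearest center, then replace each
// center with the mean of the sampled points assigned to it.
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object MiniBatchSketch extends Serializable {

  def squaredDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def miniBatchStep(points: RDD[Array[Double]],
                    centers: Array[Array[Double]],
                    fraction: Double,
                    seed: Long): Array[Array[Double]] = {
    val sums = points.sample(withReplacement = false, fraction, seed)
      .map(p => (centers.indices.minBy(i => squaredDist(p, centers(i))), (p, 1L)))
      .reduceByKey { case ((s1, n1), (s2, n2)) =>
        (s1.zip(s2).map { case (x, y) => x + y }, n1 + n2)
      }
      .collectAsMap()

    centers.indices.map { i =>
      sums.get(i) match {
        case Some((sum, n)) => sum.map(_ / n)  // mean of this center's mini-batch
        case None           => centers(i)      // no sampled point hit this cluster
      }
    }.toArray
  }
}
{code}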

A better approach is to track which clusters are dirty and which points are 
assigned to which clusters. Using this information, one can eliminate more and 
more distance calculations per round.  This leads to shorter and shorter 
rounds, and consequently faster convergence. 


 Add KMeans MiniBatch clustering algorithm to MLlib
 --

 Key: SPARK-2308
 URL: https://issues.apache.org/jira/browse/SPARK-2308
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: RJ Nowling
Priority: Minor
 Attachments: many_small_centers.pdf, uneven_centers.pdf


 Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
 data points in each iteration instead of the full set of data points, 
 improving performance (and in some cases, accuracy).  The mini-batch version 
 is compatible with the KMeans|| initialization algorithm currently 
 implemented in MLlib.
 I suggest adding KMeans Mini-batch as an alternative.
 I'd like this to be assigned to me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-09-15 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134264#comment-14134264
 ] 

RJ Nowling commented on SPARK-2308:
---

It is true that we will save on the distance calculations for high dimensional 
data sets.  There is also work under way to improve sampling in Spark, so this 
will also benefit further from that.

Are you planning on creating a PR for your implementation?  It would be 
valuable for the community.  I closed mine due to the sampling issues.  But I'd 
be happy to review and test yours.

 Add KMeans MiniBatch clustering algorithm to MLlib
 --

 Key: SPARK-2308
 URL: https://issues.apache.org/jira/browse/SPARK-2308
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: RJ Nowling
Priority: Minor
 Attachments: many_small_centers.pdf, uneven_centers.pdf


 Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
 data points in each iteration instead of the full set of data points, 
 improving performance (and in some cases, accuracy).  The mini-batch version 
 is compatible with the KMeans|| initialization algorithm currently 
 implemented in MLlib.
 I suggest adding KMeans Mini-batch as an alternative.
 I'd like this to be assigned to me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3534) Avoid running MLlib and Streaming tests when testing SQL PRs

2014-09-15 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-3534:
---

 Summary: Avoid running MLlib and Streaming tests when testing SQL 
PRs
 Key: SPARK-3534
 URL: https://issues.apache.org/jira/browse/SPARK-3534
 Project: Spark
  Issue Type: Bug
  Components: Project Infra, SQL
Reporter: Michael Armbrust
Priority: Blocker


We are bumping up against the 120 minute time limit for tests pretty regularly 
now.  Since we have decreased the number of shuffle partitions and upped the 
parallelism, I don't think there is much low hanging fruit to speed up the SQL 
tests. (The tests that are listed as taking 2-3 minutes are actually 100s of 
tests that I think are valuable).  Instead I propose we avoid running tests 
that we don't need to.

This will have the added benefit of eliminating failures in SQL due to flaky 
streaming tests.

Note that this won't fix the full builds that are run for every commit.  There 
I think we should just up the test timeout.

cc: [~joshrosen] [~pwendell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3534) Avoid running MLlib and Streaming tests when testing SQL PRs

2014-09-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134290#comment-14134290
 ] 

Josh Rosen commented on SPARK-3534:
---

Looks like this has been proposed before: SPARK-1455

 Avoid running MLlib and Streaming tests when testing SQL PRs
 

 Key: SPARK-3534
 URL: https://issues.apache.org/jira/browse/SPARK-3534
 Project: Spark
  Issue Type: Bug
  Components: Project Infra, SQL
Reporter: Michael Armbrust
Priority: Blocker

 We are bumping up against the 120 minute time limit for tests pretty 
 regularly now.  Since we have decreased the number of shuffle partitions and 
 up-ed the parallelism I don't think there is much low hanging fruit to speed 
 up the SQL tests. (The tests that are listed as taking 2-3 minutes are 
 actually 100s of tests that I think are valuable).  Instead I propose we 
 avoid running tests that we don't need to.
 This will have the added benefit of eliminating failures in SQL due to flaky 
 streaming tests.
 Note that this won't fix the full builds that are run for every commit.  
 There I think we just just up the test timeout.
 cc: [~joshrosen] [~pwendell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134291#comment-14134291
 ] 

Derrick Burns commented on SPARK-2308:
--

I have submitted several issues regarding the Spark K-Means clustering
implementation.  I would like to contribute my new K-Means implementation
to Spark. However, my implementation is a re-write, and not simply a
trivial pull-request.

Why did I re-write it?

The initial reason for rewriting the code is that the 1.0.2 (and 1.1.0)
implementation does not support the distance function that I wanted.  I use
the KL-Divergence measure, whereas yours uses the squared Euclidean
distance.  Unfortunately, the Euclidean metric is so deeply ingrained in
the 1.0.2 implementation that a major re-write was required.

My implementation has a pluggable distance function.  I wrote an
implementation of the squared Euclidean distance function (actually I wrote
two versions), and an implementation of KL-divergence. With this
abstraction, the core K-Means algorithm knows nothing about the distance
function.

The second reason is that I noticed that the 1.0.2 implementation is
rather slow because it recomputes all distances to all points on all
rounds.  Further, in my experiments on millions of points and thousands of
clusters with 100s of dimensions, the K-Means algorithm is CPU bound with
very little memory being used.

I observed that we can eliminate the vast majority of distance calculations
by maintaining, for each point, its closest cluster, the distance to that
cluster, and the round in which either of those values last changed, and, for
each cluster, the round in which it last moved. This is a modest amount of
additional space, but it results in a dramatic improvement in running time.
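
A rough sketch of that bookkeeping (hypothetical names, and only the 
reassignment step for a single point):

{code}
// Hypothetical sketch of the per-point bookkeeping described above.
// lastMoved(c) is the round in which center c last moved; a.round is the round
// in which this point's cached assignment was last refreshed.
case class Assignment(cluster: Int, dist: Double, round: Int)

def reassign(p: Array[Double],
             a: Assignment,
             centers: Array[Array[Double]],
             lastMoved: Array[Int],
             round: Int,
             distance: (Array[Double], Array[Double]) => Double): Assignment = {
  val curMoved = lastMoved(a.cluster) > a.round
  val curDist  = if (curMoved) distance(p, centers(a.cluster)) else a.dist
  // If the assigned center moved away from the point, any center may now be
  // closer, so scan them all; otherwise only centers that have moved since the
  // last refresh can possibly beat the cached distance.
  val scanAll  = curMoved && curDist > a.dist
  var best     = curDist
  var bestIdx  = a.cluster
  var i = 0
  while (i < centers.length) {
    if (i != a.cluster && (scanAll || lastMoved(i) > a.round)) {
      val d = distance(p, centers(i))
      if (d < best) { best = d; bestIdx = i }
    }
    i += 1
  }
  Assignment(bestIdx, best, round)
}
{code}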

I have tested my new implementation extensively on EC2 using  up to 16
c3.8xlarge worker machines.  The application is in financial services, so I
need the results to be right. :)

I call my implementation Tracking K-Means.  If you would entertain
including my implementation in a future Spark release, please let me know.




 Add KMeans MiniBatch clustering algorithm to MLlib
 --

 Key: SPARK-2308
 URL: https://issues.apache.org/jira/browse/SPARK-2308
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: RJ Nowling
Priority: Minor
 Attachments: many_small_centers.pdf, uneven_centers.pdf


 Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
 data points in each iteration instead of the full set of data points, 
 improving performance (and in some cases, accuracy).  The mini-batch version 
 is compatible with the KMeans|| initialization algorithm currently 
 implemented in MLlib.
 I suggest adding KMeans Mini-batch as an alternative.
 I'd like this to be assigned to me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds

2014-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-1517:
--
Fix Version/s: (was: 1.1.0)
   1.2.0

We should revisit this for the 1.2.0 release cycle, since this would have 
solved the issue that we ran into with bumping the SNAPSHOT version before the 
1.1 artifacts were published on Maven.

 Publish nightly snapshots of documentation, maven artifacts, and binary builds
 --

 Key: SPARK-1517
 URL: https://issues.apache.org/jira/browse/SPARK-1517
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Patrick Wendell
 Fix For: 1.2.0


 Should be pretty easy to do with Jenkins. The only thing I can think of that 
 would be tricky is to set up credentials so that jenkins can publish this 
 stuff somewhere on apache infra.
 Ideally we don't want to have to put a private key on every jenkins box 
 (since they are otherwise pretty stateless). One idea is to encrypt these 
 credentials with a passphrase and post them somewhere publicly visible. Then 
 the jenkins build can download the credentials provided we set a passphrase 
 in an environment variable in jenkins. There may be simpler solutions as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3104) Jenkins failing to test some PRs when asked to

2014-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3104.
---
Resolution: Cannot Reproduce

Resolving this as cannot reproduce for now, since Jenkins seems to have 
become responsive again.  Feel free to re-open if it begins experiencing 
issues.  We have an alternative to the pull request builder plugin ready to 
deploy if it breaks again.

 Jenkins failing to test some PRs when asked to
 --

 Key: SPARK-3104
 URL: https://issues.apache.org/jira/browse/SPARK-3104
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Erik Erlandson

 I've seen a few PRs where Jenkins does not appear to be testing when 
 requested:
 https://github.com/apache/spark/pull/1964
 https://github.com/apache/spark/pull/1254
 https://github.com/apache/spark/pull/1839
 Maybe the Jenkins logs have a record of what's going wrong?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-09-15 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134301#comment-14134301
 ] 

RJ Nowling commented on SPARK-2308:
---

I'm not a committer but [~mengxr] is.  That said, I'm very happy to help in any 
way I can.

The issue of different distance metrics has come up on the mailing list -- a 
much-requested feature.  If you provide it as a PR, maybe others who are more 
familiar with the work to add additional distance metrics can comment and we, 
as a community, can move forward to get it included.

 Add KMeans MiniBatch clustering algorithm to MLlib
 --

 Key: SPARK-2308
 URL: https://issues.apache.org/jira/browse/SPARK-2308
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: RJ Nowling
Priority: Minor
 Attachments: many_small_centers.pdf, uneven_centers.pdf


 Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
 data points in each iteration instead of the full set of data points, 
 improving performance (and in some cases, accuracy).  The mini-batch version 
 is compatible with the KMeans|| initialization algorithm currently 
 implemented in MLlib.
 I suggest adding KMeans Mini-batch as an alternative.
 I'd like this to be assigned to me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2232) Fix Jenkins tests in Maven

2014-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2232.
---
Resolution: Fixed

This has been fixed; the Maven builds have now been green for a few days.

 Fix Jenkins tests in Maven
 --

 Key: SPARK-2232
 URL: https://issues.apache.org/jira/browse/SPARK-2232
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Patrick Wendell
Priority: Blocker

 It appears Maven tests are failing under the newer Hadoop configurations. We 
 need to go through and make sure all the Spark master build configurations 
 are passing.
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs

2014-09-15 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-3533:

Affects Version/s: 1.1.0

 Add saveAsTextFileByKey() method to RDDs
 

 Key: SPARK-3533
 URL: https://issues.apache.org/jira/browse/SPARK-3533
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Nicholas Chammas

 Users often have a single RDD of key-value pairs that they want to save to 
 multiple locations based on the keys.
 For example, say I have an RDD like this:
 {code}
 >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 
 'Frankie']).keyBy(lambda x: x[0])
 >>> a.collect()
 [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
 >>> a.keys().distinct().collect()
 ['B', 'F', 'N']
 {code}
 Now I want to write the RDD out to different paths depending on the keys, so 
 that I have one output directory per distinct key. Each output directory 
 could potentially have multiple {{part-}} files, one per RDD partition.
 So the output would look something like:
 {code}
 /path/prefix/B [/part-1, /part-2, etc]
 /path/prefix/F [/part-1, /part-2, etc]
 /path/prefix/N [/part-1, /part-2, etc]
 {code}
 Though it may be possible to do this with some combination of 
 {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
 {{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
 It's not clear if it's even possible at all in PySpark.
 Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs 
 that makes it easy to save RDDs out to multiple locations at once.
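
For reference, the {{MultipleTextOutputFormat}} route mentioned above can be 
sketched roughly as follows (hypothetical, assuming an existing SparkContext 
{{sc}}; the RDDMultipleTextOutputFormat class name is made up):

{code}
// Hypothetical sketch of the saveAsHadoopFile + MultipleTextOutputFormat
// approach: route each record into a subdirectory named after its key.
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.SparkContext._

class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Drop the key from the output line; it is already encoded in the path.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  // e.g. key "N" and partition file "part-00000" become "N/part-00000"
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String] + "/" + name
}

val pairs = sc.parallelize(Seq("Nick", "Nancy", "Bob", "Ben", "Frankie"))
  .keyBy(_.substring(0, 1))

pairs.saveAsHadoopFile("/path/prefix", classOf[String], classOf[String],
  classOf[RDDMultipleTextOutputFormat])
{code}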



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns commented on SPARK-3219:
--

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function Trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, such as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation.

  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:14 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function Trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, such as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation.

{code}
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function Trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, such as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation.

  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:16 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function Trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, such as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation. 

Additionally, one can abstract the Distance (Float or Double) and the user data 
point T. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function Trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, such as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation. 

Additionally, one can abstract the Distance (Float or Double) and the user data 
point T. 

{code}

type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:16 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function Trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, such as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation. 

Additionally, one can abstract the Distance (Float or Double) and the user data 
point T. 

{code}

type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function Trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, such as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation. 

Additionally, one can abstract the Distance (Float or Double) and the user data 
point T. 

{code}
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:15 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function Trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, such as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation. 

Additionally, one can abstract the Distance (Float or Double) and the user data 
point T. 

{code}
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function Trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, such as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation.

{code}
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:23 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values as is done with the Fast Euclidean distance 
which pre-computes magnitudes.  (With more complex distance functions such as 
the Kullback-Leibler function, one can pre-compute the log of the points.)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}
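As a rough illustration of the pre-computation this enables (a sketch only, with made-up names, not the actual implementation): a point or center representation can cache its squared norm so that the squared Euclidean distance reduces to a single dot product.
{code}
object FastEuclideanSketch {
  type Distance = Double

  // The squared norm is computed once at construction time and reused in every distance call.
  case class FastPoint(raw: Array[Distance], weight: Distance) {
    val normSquared: Distance = raw.map(x => x * x).sum
  }
  case class FastCenter(raw: Array[Distance], weight: Distance) {
    val normSquared: Distance = raw.map(x => x * x).sum
  }

  // ||p - c||^2 = ||p||^2 + ||c||^2 - 2 * (p dot c); the upperBound parameter could be
  // used for early termination, but this sketch simply ignores it.
  def distance(p: FastPoint, c: FastCenter, upperBound: Distance): Distance = {
    var dot = 0.0
    var i = 0
    while (i < p.raw.length) { dot += p.raw(i) * c.raw(i); i += 1 }
    math.max(0.0, p.normSquared + c.normSquared - 2.0 * dot)
  }
}
{code}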


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values as is done with the Fast Euclidean distance 
which pre-computes magnitudes.  (With more complex distance functions such as 
the Kullback-Leibler function, one can pre-compute the log of the points.)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:23 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values as is done with the Fast Euclidean distance 
which pre-computes magnitudes.  (With more complex distance functions such as 
the Kullback-Leibler function, one can pre-compute the log of the points.)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation. 

Additionally, one can abstract the Distance (Float or Double) and the user data 
point T. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:23 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points.)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values as is done with the Fast Euclidean distance 
which pre-computes magnitudes.  (With more complex distance functions such as 
the Kullback-Leibler function, one can pre-compute the log of the points.)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:24 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points, which is too expensive to re-compute in the distance 
calculation!)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points.)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:26 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (the type used to 
represent distance, such as Float or Double), T (the data type used for a point 
by the K-Means client), P (the data type used for a point by the distance 
function), C (the data type used for a cluster center by the distance 
function), and Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points, which is too expensive to re-compute in the distance 
calculation!)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points, which is too expensive to re-compute in the distance 
calculation!)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:26 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (the type used to 
represent distance, such as Float or Double), T (the data type used for a point 
by the K-Means client), P (the data type used for a point by the distance 
function), C (the data type used for a cluster center by the distance 
function), and Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points, which is too expensive to re-compute in the distance 
calculation!)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }

  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (the type used to 
represent distance, such as Float or Double), T (the data type used for a point 
by the K-Means client), P (the data type used for a point by the distance 
function), C (the data type used for a cluster center by the distance 
function), and Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points, which is too expensive to re-compute in the distance 
calculation!)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120325#comment-14120325
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:26 PM:
---

Great!

You can find my work here:
https://github.com/derrickburns/generalized-kmeans-clustering.git.

I should warn you that I rewrote much of the original Spark clusterer
because the original is too tightly coupled to using the Euclidean norm and
does not
allow one to identify efficiently which points belong to which clusters.  I
have tested this version extensively.

You will notice a package called com.rincaro.clusterer.metrics.  Please take
a look at the two files EuOps.scala and FastEuclideansOps.scala.   They
both implement the Euclidean norm. However, one is much faster than the
other by using the same algebraic transformations that the Spark version
uses.  This demonstrates that
it is possible to be efficient while not being tightly coupled.   One could
easily re-implement FastEuclideanOps using Breeze or Blas without affecting
the core Kmeans implementation.

Not included in this project is another distance function that I have
implemented: the Kullback-Leibler distance function, a.k.a. relative
entropy.  In my implementation, I also perform algebraic transformations to
expedite the computation, resulting in a distance computation that is even
faster than the fast Euclidean norm.

Let me know if this is useful to you.
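A small sketch of the log pre-computation idea mentioned here (illustrative only; this is not the code from the repository above):
{code}
object KLSketch {
  // Cache log(x_i) once per point so the inner distance loop never calls math.log.
  case class LogPoint(raw: Array[Double]) {
    require(raw.forall(_ > 0.0), "KL divergence assumes strictly positive coordinates")
    val logs: Array[Double] = raw.map(math.log)
  }

  // KL(p || c) = sum_i p_i * (log p_i - log c_i)
  def kl(p: LogPoint, c: LogPoint): Double = {
    var d = 0.0
    var i = 0
    while (i < p.raw.length) {
      d += p.raw(i) * (p.logs(i) - c.logs(i))
      i += 1
    }
    d
  }
}
{code}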






was (Author: derrickburns):
Great!

You can find my work here:
https://github.com/derrickburns/generalized-kmeans-clustering.git.

I should warn you that I rewrote much of the original Spark clusterer
because the original is too tightly coupled to using the Euclidean norm and
does not
allow one to identify efficiently which points belong to which clusters.  I
have tested this version extensively.

You will notice a package called com.rincaro.clusterer.metrics.  Please take
a look at the two files EuOps.scala and FastEuclideansOps.scala.   They
both implement the Euclidean norm. However, one is much faster than the
other by using the same algebraic transformations that the Spark version
uses.  This demonstrates that
it is possible to be efficient while not being tightly coupled.   One could
easily re-implement FastEuclideanOps using Breeze or Blas without affecting
the core Kmeans implementation.

Not included in this project is another distance function that I have
implemented: the Kullback-Leibler distance function, a.k.a. relative
entropy.  In my implementation, I also perform algebraic transformations to
expedite the computation, resulting in a distance computation that is even
faster than the fast Euclidean norm.

Let me know if this is useful to you.





 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3308) Ability to read JSON Arrays as tables

2014-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134385#comment-14134385
 ] 

Apache Spark commented on SPARK-3308:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/2400

 Ability to read JSON Arrays as tables
 -

 Key: SPARK-3308
 URL: https://issues.apache.org/jira/browse/SPARK-3308
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Yin Huai
Priority: Critical

 Right now we can only read JSON where each object is on its own line.  It 
 would be nice to be able to read top-level JSON arrays where each element 
 maps to a row.
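For illustration, a sketch of the two input shapes (the file name is hypothetical; {{jsonFile}} is the existing SQLContext method that expects the line-delimited form):
{code}
// Supported today: one JSON object per line
//   {"name": "a", "age": 1}
//   {"name": "b", "age": 2}
// Requested: a single top-level array whose elements each map to a row
//   [{"name": "a", "age": 1}, {"name": "b", "age": 2}]
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("json-demo").setMaster("local"))
val sqlContext = new SQLContext(sc)
val people = sqlContext.jsonFile("people.json")  // only works with the line-delimited form
people.printSchema()
{code}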



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3534) Avoid running MLlib and Streaming tests when testing SQL PRs

2014-09-15 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134399#comment-14134399
 ] 

Michael Armbrust commented on SPARK-3534:
-

Ah, yeah.  Good catch Josh.  I'll settle for a simpler version where we only 
fix SQL (since that is where the pain is being felt now).  Bonus points for 
implementing the full solution :)

 Avoid running MLlib and Streaming tests when testing SQL PRs
 

 Key: SPARK-3534
 URL: https://issues.apache.org/jira/browse/SPARK-3534
 Project: Spark
  Issue Type: Bug
  Components: Project Infra, SQL
Reporter: Michael Armbrust
Priority: Blocker

 We are bumping up against the 120 minute time limit for tests pretty 
 regularly now.  Since we have decreased the number of shuffle partitions and 
 upped the parallelism, I don't think there is much low-hanging fruit to speed 
 up the SQL tests. (The tests that are listed as taking 2-3 minutes are 
 actually 100s of tests that I think are valuable).  Instead I propose we 
 avoid running tests that we don't need to.
 This will have the added benefit of eliminating failures in SQL due to flaky 
 streaming tests.
 Note that this won't fix the full builds that are run for every commit.  
 There I think we should just up the test timeout.  
 cc: [~joshrosen] [~pwendell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-09-15 Thread Pedro Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134445#comment-14134445
 ] 

Pedro Rodriguez commented on SPARK-1405:


Hi All. Just wanted to quickly introduce myself. I am an undergrad at UC 
Berkeley working in the AMPLab, in particular on LDA (a continuation of a 
grad class final project from last spring).

Generally speaking my focus will be to use one LDA implementation as a baseline 
(probably Joey's since it is fully distributed in all parts, particularly the 
token-topic matrix), write unit tests + test cases, and benchmark it at scale.

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Xusen Yin
  Labels: features
   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms 
 in MLlib, instead of using optimization algorithms such as gradient descent, 
 LDA uses expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
 and a Gibbs sampling core.
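For reference, a generic collapsed Gibbs sampling update for a single token might look like the sketch below (illustrative only, with made-up names; this is not the code in the PR):
{code}
import scala.util.Random

object GibbsSketch {
  // Resample the topic of one (doc, word) token given the current count matrices.
  // docTopicCounts: D x K, topicWordCounts: K x V, topicTotals: K.
  def sampleTopic(doc: Int, word: Int, oldTopic: Int,
                  docTopicCounts: Array[Array[Int]],
                  topicWordCounts: Array[Array[Int]],
                  topicTotals: Array[Int],
                  alpha: Double, beta: Double, vocabSize: Int,
                  rng: Random): Int = {
    val k = topicTotals.length
    // Remove the token's current assignment from the counts.
    docTopicCounts(doc)(oldTopic) -= 1
    topicWordCounts(oldTopic)(word) -= 1
    topicTotals(oldTopic) -= 1

    // Full conditional: p(z = t | rest) proportional to
    // (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta)
    val weights = Array.tabulate(k) { t =>
      (docTopicCounts(doc)(t) + alpha) *
        (topicWordCounts(t)(word) + beta) / (topicTotals(t) + vocabSize * beta)
    }

    // Draw a new topic proportionally to the weights.
    var u = rng.nextDouble() * weights.sum
    var newTopic = 0
    while (newTopic < k - 1 && u > weights(newTopic)) {
      u -= weights(newTopic)
      newTopic += 1
    }

    // Add the new assignment back into the counts.
    docTopicCounts(doc)(newTopic) += 1
    topicWordCounts(newTopic)(word) += 1
    topicTotals(newTopic) += 1
    newTopic
  }
}
{code}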



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3535) Spark on Mesos not correctly setting heap overhead

2014-09-15 Thread Brenden Matthews (JIRA)
Brenden Matthews created SPARK-3535:
---

 Summary: Spark on Mesos not correctly setting heap overhead
 Key: SPARK-3535
 URL: https://issues.apache.org/jira/browse/SPARK-3535
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.1.0
Reporter: Brenden Matthews


Spark on Mesos does not account for any memory overhead.  The result is that tasks 
are OOM killed nearly 95% of the time.

Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the 
executor memory for JVM overhead.

For example, see: 
https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63
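A rough sketch of the kind of calculation being proposed (the constants and names are illustrative, mirroring the Hadoop-on-Mesos ResourcePolicy linked above; this is not what the eventual patch does):
{code}
object MemoryOverheadSketch {
  val OverheadFraction = 0.15  // set aside 15% for stack, GC, and native allocations
  val MinOverheadMb = 384

  // Total memory to request from Mesos for an executor with the given heap size.
  def executorMemoryWithOverhead(executorMemoryMb: Int): Int =
    executorMemoryMb + math.max(MinOverheadMb, (executorMemoryMb * OverheadFraction).toInt)
}
{code}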



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3535) Spark on Mesos not correctly setting heap overhead

2014-09-15 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134499#comment-14134499
 ] 

Timothy St. Clair commented on SPARK-3535:
--

Are you seeing this under fine-grained, coarse-grained, or both?  

 Spark on Mesos not correctly setting heap overhead
 --

 Key: SPARK-3535
 URL: https://issues.apache.org/jira/browse/SPARK-3535
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.1.0
Reporter: Brenden Matthews

 Spark on Mesos does not account for any memory overhead.  The result is that 
 tasks are OOM killed nearly 95% of the time.
 Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the 
 executor memory for JVM overhead.
 For example, see: 
 https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3535) Spark on Mesos not correctly setting heap overhead

2014-09-15 Thread Brenden Matthews (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134524#comment-14134524
 ] 

Brenden Matthews commented on SPARK-3535:
-

I'm seeing this in fine-grained mode. I confirmed that Spark sets the heap size 
equal to the task memory.

 Spark on Mesos not correctly setting heap overhead
 --

 Key: SPARK-3535
 URL: https://issues.apache.org/jira/browse/SPARK-3535
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.1.0
Reporter: Brenden Matthews

 Spark on Mesos does not account for any memory overhead.  The result is that 
 tasks are OOM killed nearly 95% of the time.
 Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the 
 executor memory for JVM overhead.
 For example, see: 
 https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3536) SELECT on empty parquet table throws exception

2014-09-15 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-3536:
---

 Summary: SELECT on empty parquet table throws exception
 Key: SPARK-3536
 URL: https://issues.apache.org/jira/browse/SPARK-3536
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust


Reported by [~matei].  Reproduce as follows:
{code}
scala> case class Data(i: Int)
defined class Data

scala> createParquetFile[Data]("testParquet")
scala> parquetFile("testParquet").count()
14/09/15 14:34:17 WARN scheduler.DAGScheduler: Creating new stage failed due to 
exception - job: 0
java.lang.NullPointerException
at 
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:438)
at 
parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344)
at 
org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3535) Spark on Mesos not correctly setting heap overhead

2014-09-15 Thread Brenden Matthews (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134533#comment-14134533
 ] 

Brenden Matthews commented on SPARK-3535:
-

I wrote a patch:

https://github.com/apache/spark/pull/2401

I'm not very familiar with Spark internals or conventions, so feel free to rip 
this apart.

 Spark on Mesos not correctly setting heap overhead
 --

 Key: SPARK-3535
 URL: https://issues.apache.org/jira/browse/SPARK-3535
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.1.0
Reporter: Brenden Matthews

 Spark on Mesos does not account for any memory overhead.  The result is that 
 tasks are OOM killed nearly 95% of the time.
 Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the 
 executor memory for JVM overhead.
 For example, see: 
 https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3537) Statistics for cached RDDs

2014-09-15 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-3537:
---

 Summary: Statistics for cached RDDs
 Key: SPARK-3537
 URL: https://issues.apache.org/jira/browse/SPARK-3537
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust


Right now we only have limited statistics for hive tables.  We could easily 
collect this data when caching an RDD as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3538) Provide way for workers to log messages to driver's out/err

2014-09-15 Thread Matthew Farrellee (JIRA)
Matthew Farrellee created SPARK-3538:


 Summary: Provide way for workers to log messages to driver's 
out/err
 Key: SPARK-3538
 URL: https://issues.apache.org/jira/browse/SPARK-3538
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core, Spark Shell
Reporter: Matthew Farrellee
Priority: Minor


As part of SPARK-927 we encountered a use case for code running on a worker 
to be able to emit messages back to the driver. The communication channel is 
intended for trace/debug messages directed at the application's user (shell or app).
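One rough workaround sketch with the existing API (an accumulator that collects strings and is printed on the driver; note that re-run tasks may add duplicate messages, so this is only suitable for trace/debug output, not a real logging channel):
{code}
import org.apache.spark.{AccumulatorParam, SparkContext}

object WorkerLogSketch {
  object StringListAccum extends AccumulatorParam[List[String]] {
    def zero(initialValue: List[String]): List[String] = Nil
    def addInPlace(a: List[String], b: List[String]): List[String] = a ++ b
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "worker-log-demo")
    val messages = sc.accumulator(List.empty[String])(StringListAccum)
    sc.parallelize(1 to 4).foreach { i => messages += List("worker saw " + i) }
    // Accumulated messages are only visible on the driver, after the action completes.
    messages.value.foreach(println)
    sc.stop()
  }
}
{code}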



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3535) Spark on Mesos not correctly setting heap overhead

2014-09-15 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134592#comment-14134592
 ] 

Andrew Ash commented on SPARK-3535:
---

Why does the task need extra memory if the heap size equals the available task 
memory?  Filesystem cache?

 Spark on Mesos not correctly setting heap overhead
 --

 Key: SPARK-3535
 URL: https://issues.apache.org/jira/browse/SPARK-3535
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.1.0
Reporter: Brenden Matthews

 Spark on Mesos does not account for any memory overhead.  The result is that 
 tasks are OOM killed nearly 95% of the time.
 Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the 
 executor memory for JVM overhead.
 For example, see: 
 https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3535) Spark on Mesos not correctly setting heap overhead

2014-09-15 Thread Brenden Matthews (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134599#comment-14134599
 ] 

Brenden Matthews commented on SPARK-3535:
-

The JVM heap size does not include the stack size, GC overhead, or anything 
malloc'd by other libraries linked into the JVM.

 Spark on Mesos not correctly setting heap overhead
 --

 Key: SPARK-3535
 URL: https://issues.apache.org/jira/browse/SPARK-3535
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.1.0
Reporter: Brenden Matthews

 Spark on Mesos does not account for any memory overhead.  The result is that 
 tasks are OOM killed nearly 95% of the time.
 Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the 
 executor memory for JVM overhead.
 For example, see: 
 https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3518) Remove useless statement in JsonProtocol

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3518.

   Resolution: Fixed
Fix Version/s: 1.2.0
   1.1.1
 Assignee: Kousuke Saruta

Fixed by:
https://github.com/apache/spark/pull/2380

 Remove useless statement in JsonProtocol
 

 Key: SPARK-3518
 URL: https://issues.apache.org/jira/browse/SPARK-3518
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Minor
 Fix For: 1.1.1, 1.2.0


 In org.apache.spark.util.JsonProtocol#taskInfoToJson, a variable named 
 accumUpdateMap is created as follows.
 {code}
 val accumUpdateMap = taskInfo.accumulables
 {code}
 But accumUpdateMap is never used and there is a 2nd invocation of 
 taskInfo.accumulables as follows.
 {code}
 ("Accumulables" -> 
 JArray(taskInfo.accumulables.map(accumulableInfoToJson).toList))
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2505) Weighted Regularizer

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2505:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Weighted Regularizer
 

 Key: SPARK-2505
 URL: https://issues.apache.org/jira/browse/SPARK-2505
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: DB Tsai
 Fix For: 1.2.0


 The current implementation of regularization in the linear models uses 
 `Updater`, and this design has a couple of issues:
 1) It penalizes all the weights, including the intercept. In a machine learning 
 training process, people typically don't penalize the intercept. 
 2) The `Updater` contains the logic of adaptive step sizes for gradient descent, 
 and we would like to clean it up by separating the regularization logic out 
 of the updater into a regularizer, so that in the LBFGS optimizer we don't need 
 the trick for getting the loss and gradient of the objective function.
 In this work, a weighted regularizer will be implemented, and users can 
 exclude the intercept or any weight from regularization by setting that term's 
 penalty weight to zero. Since the regularizer will return a tuple of loss 
 and gradient, the adaptive step size logic and the soft thresholding for L1 in 
 Updater will be moved to the SGD optimizer.
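A minimal sketch of the proposed interface (names are illustrative, not the actual MLlib API): each coefficient gets its own penalty weight, so setting the intercept's weight to zero excludes it, and the regularizer returns its (loss, gradient) contribution directly.
{code}
trait WeightedRegularizer {
  // Returns the (loss, gradient) contribution of the regularization term.
  def compute(coefficients: Array[Double]): (Double, Array[Double])
}

class WeightedL2(lambda: Double, penaltyWeights: Array[Double]) extends WeightedRegularizer {
  override def compute(coefficients: Array[Double]): (Double, Array[Double]) = {
    var loss = 0.0
    val gradient = new Array[Double](coefficients.length)
    var i = 0
    while (i < coefficients.length) {
      val w = coefficients(i)
      loss += 0.5 * lambda * penaltyWeights(i) * w * w
      gradient(i) = lambda * penaltyWeights(i) * w
      i += 1
    }
    (loss, gradient)
  }
}
{code}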



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2314) RDD actions are only overridden in Scala, not java or python

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2314:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 RDD actions are only overridden in Scala, not java or python
 

 Key: SPARK-2314
 URL: https://issues.apache.org/jira/browse/SPARK-2314
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0, 1.0.1
Reporter: Michael Armbrust
Assignee: Aaron Staple
  Labels: starter
 Fix For: 1.2.0, 1.0.3


 For example, collect and take().  We should keep these two in sync, or move 
 this code to SchemaRDDLike if possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3403:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
 -

 Key: SPARK-3403
 URL: https://issues.apache.org/jira/browse/SPARK-3403
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
 Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
 described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
 MinGW64 precompiled dlls.
Reporter: Alexander Ulanov
 Fix For: 1.2.0

 Attachments: NativeNN.scala


 Code:
 val model = NaiveBayes.train(train)
 val predictionAndLabels = test.map { point =>
   val score = model.predict(point.features)
   (score, point.label)
 }
 predictionAndLabels.foreach(println)
 Result: 
 program crashes with: Process finished with exit code -1073741819 
 (0xC0000005) after displaying the first prediction



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2703) Make Tachyon related unit tests execute without deploying a Tachyon system locally.

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2703:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Make Tachyon related unit tests execute without deploying a Tachyon system 
 locally.
 ---

 Key: SPARK-2703
 URL: https://issues.apache.org/jira/browse/SPARK-2703
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Haoyuan Li
Assignee: Rong Gu
 Fix For: 1.2.0


 Use the LocalTachyonCluster class in tachyon-test.jar from the 0.5.0 release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3038) delete history server logs when there are too many logs

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3038:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 delete history server logs when there are too many logs 
 

 Key: SPARK-3038
 URL: https://issues.apache.org/jira/browse/SPARK-3038
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: wangfei
 Fix For: 1.2.0


 Enhance the history server to delete logs automatically:
 1. use spark.history.deletelogs.enable to enable this function
 2. when the number of app logs is greater than spark.history.maxsavedapplication, delete 
 the older logs 
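A sketch of the proposed cleanup (the config names above are the ones proposed in this issue, not existing Spark settings; this code is illustrative only):
{code}
import java.io.File

object LogCleanupSketch {
  // Keep only the newest maxSavedApplications log files/directories in logDir.
  def cleanOldLogs(logDir: File, maxSavedApplications: Int): Unit = {
    def deleteRecursively(f: File): Unit = {
      Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
      f.delete()
    }
    val logs = Option(logDir.listFiles()).getOrElse(Array.empty[File]).sortBy(_.lastModified())
    logs.dropRight(maxSavedApplications).foreach(deleteRecursively)
  }
}
{code}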



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1832) Executor UI improvement suggestions

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1832:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Executor UI improvement suggestions
 ---

 Key: SPARK-1832
 URL: https://issues.apache.org/jira/browse/SPARK-1832
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.0.0
Reporter: Thomas Graves
 Fix For: 1.2.0


 I received some suggestions from a user for the /executors UI page to make it 
 more helpful. This gets more important when you have a really large number of 
 executors.
  Fill some of the cells with color in order to make it easier to absorb 
 the info, e.g.
 RED if Failed Tasks greater than 0 (maybe the more failed, the more intense 
 the red)
 GREEN if Active Tasks greater than 0 (maybe more intense the larger the 
 number)
 Possibly color code COMPLETE TASKS using various shades of blue (e.g., based 
 on the log(# completed))
 - if dark blue then write the value in white (same for the RED and GREEN above)
 Maybe mark the MASTER task somehow
  
 Report the TOTALS in each column (do this at the TOP so no need to scroll 
 to the bottom, or print both at top and bottom).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2167) spark-submit should return exit code based on failure/success

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2167:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 spark-submit should return exit code based on failure/success
 -

 Key: SPARK-2167
 URL: https://issues.apache.org/jira/browse/SPARK-2167
 Project: Spark
  Issue Type: New Feature
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Thomas Graves
Assignee: Guoqiang Li
 Fix For: 1.2.0


 spark-submit script and Java class should exit with 0 for success and 
 non-zero with failure so that other command line tools and workflow managers 
 (like oozie) can properly tell if the spark app succeeded or failed.
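A hedged sketch of the behavior being asked for (not the actual SparkSubmit code): run the application and turn any failure into a non-zero process exit code.
{code}
object ExitCodeSketch {
  // Run the user's application body and exit 0 on success, 1 on any failure,
  // so shell scripts and workflow managers can detect the outcome.
  def runAndExit(app: () => Unit): Unit = {
    try {
      app()
      System.exit(0)
    } catch {
      case e: Throwable =>
        e.printStackTrace(System.err)
        System.exit(1)
    }
  }

  def main(args: Array[String]): Unit = runAndExit(() => println("application body here"))
}
{code}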



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2754) Document standalone-cluster mode now that it's working

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2754:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Document standalone-cluster mode now that it's working
 --

 Key: SPARK-2754
 URL: https://issues.apache.org/jira/browse/SPARK-2754
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.0.1
Reporter: Andrew Or
 Fix For: 1.2.0


 This was previously broken before SPARK-2260, so we (attempted to) remove all 
 documentation related to this mode. We should add it back now that we have 
 fixed it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2947) DAGScheduler resubmit the stage into an infinite loop

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2947:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 DAGScheduler resubmit the stage into an infinite loop
 -

 Key: SPARK-2947
 URL: https://issues.apache.org/jira/browse/SPARK-2947
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.2
Reporter: Guoqiang Li
Priority: Blocker
 Fix For: 1.2.0, 1.0.3


 The stage is resubmitted more than 5 times.
 This seems to be caused by {{FetchFailed.bmAddress}} being null.
 I don't know how to reproduce it.
 master log:
 {noformat}
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as 
 TID 52334 on executor 82: sanshan (PROCESS_LOCAL)
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 
 3060 bytes in 0 ms
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as 
 TID 52335 on executor 78: tuan231 (PROCESS_LOCAL)
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 
 3060 bytes in 0 ms
 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 
 1.189:141)
 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
 failure from null
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
 DealCF.scala:215) for resubmision due to a fetch failure
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
 Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
 failure from null
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
 DealCF.scala:215) for resubmision due to a fetch failure
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
 Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
  -- 5 times ---
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
 DealCF.scala:215) for resubmision due to a fetch failure
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
 Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 
 1.189, whose tasks have all completed, from pool 
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Finished TID 1869 in 87398 
 ms on jilin (progress: 280/280)
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(2, 
 269)
 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 
 2.1, whose tasks have all completed, from pool 
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Stage 2 (flatMap at 
 DealCF.scala:207) finished in 129.544 s
 {noformat}
 worker log:
 {noformat}
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_57 not found, 
 computing it
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_191 not found, 
 computing it
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18017
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18017
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18151
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18151
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_86 not found, 
 computing it
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_220 not found, 
 computing it
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18285
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18285
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18419
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18419
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 {noformat}
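 One way to break a loop like this is to cap how many times a stage may be 
 resubmitted for fetch failures and abort the job once the cap is hit. The sketch 
 below is illustrative only and not the DAGScheduler's actual code; the threshold 
 and helper names are made up:
 {code}
 import scala.collection.mutable

 val maxStageResubmissions = 5   // hypothetical threshold
 val fetchFailureCounts = mutable.Map.empty[Int, Int].withDefaultValue(0)

 def onFetchFailure(stageId: Int, resubmit: Int => Unit, abortJob: String => Unit): Unit = {
   fetchFailureCounts(stageId) += 1
   if (fetchFailureCounts(stageId) > maxStageResubmissions) {
     abortJob(s"Stage $stageId hit more than $maxStageResubmissions fetch failures; aborting")
   } else {
     resubmit(stageId)
   }
 }
 {code}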
 

[jira] [Updated] (SPARK-2638) Improve concurrency of fetching Map outputs

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2638:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Improve concurrency of fetching Map outputs
 ---

 Key: SPARK-2638
 URL: https://issues.apache.org/jira/browse/SPARK-2638
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: All
Reporter: Stephen Boesch
Assignee: Josh Rosen
Priority: Minor
  Labels: MapOutput, concurrency
 Fix For: 1.2.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 This issue was noticed while perusing the MapOutputTracker source code. 
 Notice that the synchronization is on the containing {{fetching}} collection, 
 which makes ALL fetches wait if any fetch is occurring.
 The fix is to synchronize instead on the shuffleId (interned as a string to 
 ensure JVM-wide visibility).
 {code}
   def getServerStatuses(shuffleId: Int, reduceId: Int): Array[(BlockManagerId, Long)] = {
     val statuses = mapStatuses.get(shuffleId).orNull
     if (statuses == null) {
       logInfo("Don't have map outputs for shuffle " + shuffleId + ", fetching them")
       var fetchedStatuses: Array[MapStatus] = null
       fetching.synchronized {   // This is existing code
       // shuffleId.toString.intern.synchronized {  // New Code
         if (fetching.contains(shuffleId)) {
           // Someone else is fetching it; wait for them to be done
           while (fetching.contains(shuffleId)) {
             try {
               fetching.wait()
             } catch {
               case e: InterruptedException =>
             }
           }
 {code}
 This is only a small code change, but the test cases to prove (a) proper 
 functionality and (b) proper performance improvement are not so trivial.  
 For (b) it is not worthwhile to add a test case to the codebase. Instead I 
 have added a git project that demonstrates the concurrency/performance 
 improvement using the fine-grained approach. The GitHub project is at 
 https://github.com/javadba/scalatesting.git . Simply run {{sbt test}}. Note: 
 it is unclear how/where to include this ancillary testing/verification 
 information that will not be included in the git PR; I am open to any 
 suggestions, even as far as simply removing references to it.
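 A minimal sketch of the finer-grained locking idea (illustrative only; it uses a 
 per-shuffle lock object rather than interning the id as a String, which is an 
 alternative to the reporter's suggestion):
 {code}
 import java.util.concurrent.ConcurrentHashMap

 object ShuffleFetchLocks {
   private val locks = new ConcurrentHashMap[Int, AnyRef]()

   private def lockFor(shuffleId: Int): AnyRef = {
     val fresh = new AnyRef
     val existing = locks.putIfAbsent(shuffleId, fresh)
     if (existing == null) fresh else existing
   }

   // Fetches for the same shuffle still wait on each other, but fetches for
   // different shuffles no longer contend on one shared collection.
   def withShuffleLock[A](shuffleId: Int)(doFetch: => A): A =
     lockFor(shuffleId).synchronized(doFetch)
 }
 {code}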



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1911) Warn users if their assembly jars are not built with Java 6

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1911:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Warn users if their assembly jars are not built with Java 6
 ---

 Key: SPARK-1911
 URL: https://issues.apache.org/jira/browse/SPARK-1911
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.1.0
Reporter: Andrew Or
 Fix For: 1.2.0


 The root cause of the problem is detailed in: 
 https://issues.apache.org/jira/browse/SPARK-1520.
 In short, an assembly jar built with Java 7+ is not always accessible by 
 Python or other versions of Java (especially Java 6). If the assembly jar is 
 not built on the cluster itself, this problem may manifest itself in strange 
 exceptions that are not trivial to debug. This is an issue especially for 
 PySpark on YARN, which relies on the python files included within the 
 assembly jar.
 Currently we warn users only in make-distribution.sh, but most users build 
 the jars directly. At the very least we need to emphasize this in the docs 
 (currently missing entirely). The next step is to add a warning prompt in the 
 mvn scripts whenever Java 7+ is detected.
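 A sketch of the kind of check being proposed (the real warning would live in the 
 Maven/SBT build scripts, not in application code; treat this as illustration only):
 {code}
 object JavaVersionCheck {
   def main(args: Array[String]): Unit = {
     // java.specification.version is "1.6" on Java 6, "1.7" on Java 7, etc.
     val spec = sys.props.getOrElse("java.specification.version", "unknown")
     if (spec != "1.6") {
       Console.err.println(
         s"WARNING: building with Java $spec. Assemblies built with Java 7+ may not be " +
           "readable by Java 6 or by PySpark on YARN (see SPARK-1520).")
     }
   }
 }
 {code}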



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2793) Correctly lock directory creation in DiskBlockManager.getFile

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2793:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Correctly lock directory creation in DiskBlockManager.getFile
 -

 Key: SPARK-2793
 URL: https://issues.apache.org/jira/browse/SPARK-2793
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Matei Zaharia
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2069) MIMA false positives (umbrella)

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2069:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 MIMA false positives (umbrella)
 ---

 Key: SPARK-2069
 URL: https://issues.apache.org/jira/browse/SPARK-2069
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0
Reporter: Patrick Wendell
Assignee: Prashant Sharma
Priority: Critical
 Fix For: 1.2.0


 Since we started using MIMA more actively in core, we've been running into 
 situations where we get false positives. We should address these ASAP, as they 
 require adding manual excludes to our build files, which is pretty tedious.
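 For reference, a manual exclude looks roughly like the following (the problem 
 type and member name below are made-up placeholders, not a real entry):
 {code}
 import com.typesafe.tools.mima.core._

 // One entry in the list of excludes fed to MIMA's binary-compatibility checks.
 val excludes = Seq(
   ProblemFilters.exclude[MissingMethodProblem](
     "org.apache.spark.SomeClass.someRemovedMethod")
 )
 {code}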



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1830) Deploy failover, Make Persistence engine and LeaderAgent Pluggable.

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1830:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Deploy failover, Make Persistence engine and LeaderAgent Pluggable.
 ---

 Key: SPARK-1830
 URL: https://issues.apache.org/jira/browse/SPARK-1830
 Project: Spark
  Issue Type: New Feature
  Components: Deploy
Reporter: Prashant Sharma
 Fix For: 1.0.1, 1.2.0


 With the current code base it is difficult to plug in an external, 
 user-specified persistence engine or leader election agent. It would be good 
 to expose these as pluggable APIs.
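 A hypothetical sketch of what such an API could look like (names and signatures 
 here are illustrative, not Spark's actual deploy interfaces):
 {code}
 // A user-supplied implementation would persist and recover Master state.
 trait PersistenceEngine {
   def persist(name: String, obj: AnyRef): Unit
   def unpersist(name: String): Unit
   def read[T](prefix: String): Seq[T]
 }

 // A user-supplied implementation would notify the Master of leadership changes.
 trait LeaderElectionAgent {
   def electedLeader(): Unit
   def revokedLeadership(): Unit
 }
 {code}
 An implementation could then be selected through configuration, for example via an 
 assumed setting such as spark.deploy.recoveryMode pointing at a custom class.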



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2795) Improve DiskBlockObjectWriter API

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2795:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Improve DiskBlockObjectWriter API
 -

 Key: SPARK-2795
 URL: https://issues.apache.org/jira/browse/SPARK-2795
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Matei Zaharia
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1706) Allow multiple executors per worker in Standalone mode

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1706:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Allow multiple executors per worker in Standalone mode
 --

 Key: SPARK-1706
 URL: https://issues.apache.org/jira/browse/SPARK-1706
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: Patrick Wendell
Assignee: Nan Zhu
 Fix For: 1.2.0


 Right now, if people want to launch multiple executors on each machine, they 
 need to start multiple standalone workers. This is not too difficult, but it 
 means you have extra JVMs sitting around.
 We should just allow users to set a number of cores they want per-executor in 
 standalone mode and then allow packing multiple executors on each node. This 
 would make standalone mode more consistent with YARN in the way you request 
 resources.
 It's not too big of a change as far as I can see. You'd need to:
 1. Introduce a configuration for how many cores you want per executor.
 2. Change the scheduling logic in Master.scala to take this into account.
 3. Change CoarseGrainedSchedulerBackend to not assume a 1-1 correspondence 
 between hosts and executors.
 And maybe modify a few other places.
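 A sketch of how the requested setting might look from the application side (the 
 per-executor-cores property is the proposal here and does not exist yet in 
 standalone mode; only the total-cores setting does):
 {code}
 import org.apache.spark.SparkConf

 val conf = new SparkConf()
   .setMaster("spark://master:7077")
   .setAppName("multi-executor-demo")
   .set("spark.cores.max", "16")      // total cores for the application
   .set("spark.executor.cores", "4")  // proposed: cores per executor, so the master
                                      // could pack several executors onto one worker
 {code}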



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1989) Exit executors faster if they get into a cycle of heavy GC

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1989:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Exit executors faster if they get into a cycle of heavy GC
 --

 Key: SPARK-1989
 URL: https://issues.apache.org/jira/browse/SPARK-1989
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Matei Zaharia
 Fix For: 1.2.0


 I've seen situations where an application is allocating too much memory 
 across its tasks + cache to proceed, but Java gets into a cycle where it 
 repeatedly runs full GCs, frees up a bit of the heap, and continues instead 
 of giving up. This then leads to timeouts and confusing error messages. It 
 would be better to crash with OOM sooner. The JVM has options to support 
 this: http://java.dzone.com/articles/tracking-excessive-garbage.
 The right solution would probably be:
 - Add some config options used by spark-submit to set -XX:GCTimeLimit and 
 -XX:GCHeapFreeLimit, with more conservative values than the defaults (e.g. 90% 
 time limit, 5% free limit)
 - Make sure we pass these into the Java options for executors in each 
 deployment mode
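 Expressed as executor JVM options, the proposed defaults would look roughly like 
 this (a sketch only; the values are the examples given above):
 {code}
 import org.apache.spark.SparkConf

 val conf = new SparkConf()
   // Make the JVM throw OutOfMemoryError once it spends ~90% of its time in GC while
   // recovering less than ~5% of the heap, instead of limping along indefinitely.
   .set("spark.executor.extraJavaOptions",
        "-XX:+UseGCOverheadLimit -XX:GCTimeLimit=90 -XX:GCHeapFreeLimit=5")
 {code}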



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1860:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Standalone Worker cleanup should not clean up running executors
 ---

 Key: SPARK-1860
 URL: https://issues.apache.org/jira/browse/SPARK-1860
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Priority: Critical
 Fix For: 1.2.0


 With its default values, the standalone worker cleanup code cleans up all 
 application data every 7 days. This includes jars that were added to any 
 executors that happen to be running for longer than 7 days, hitting streaming 
 jobs especially hard.
 Executors' log/data folders should not be cleaned up while they are still 
 running. Until that is fixed, this behavior should not be enabled by default.
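 For context, these are the worker-side settings involved (reading them through 
 SparkConf here is only for illustration; on a real cluster they are normally set 
 via SPARK_WORKER_OPTS on each worker):
 {code}
 import org.apache.spark.SparkConf

 val conf = new SparkConf()
 val cleanupEnabled = conf.getBoolean("spark.worker.cleanup.enabled", false)
 // 7 days, the default TTL this issue is concerned with
 val appDataTtlSeconds = conf.getLong("spark.worker.cleanup.appDataTtl", 7 * 24 * 3600L)
 {code}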



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1924) Make local:/ scheme work in more deploy modes

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1924:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Make local:/ scheme work in more deploy modes
 -

 Key: SPARK-1924
 URL: https://issues.apache.org/jira/browse/SPARK-1924
 Project: Spark
  Issue Type: Sub-task
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Priority: Minor
 Fix For: 1.2.0


 A resource marked {{local:/}} is assumed to be available on the local file 
 system of every node. In that case, no data should be copied over the 
 network. I tested different deploy modes in v1.0; right now we only support the 
 {{local:/}} scheme for the app jar and secondary jars in the following modes:
 1) local (jars are copied to the working directory)
 2) standalone client
 3) yarn client
 It doesn't work for --files or Python apps (--py-files and app.py). For the 
 next release, we could support more deploy modes.
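 For example, in the modes that do work today, a dependency already present on 
 every node can be referenced without shipping it (the path below is hypothetical; 
 assumes an existing SparkContext {{sc}}):
 {code}
 // No copying over the network: Spark resolves this path on each node's local
 // file system instead of distributing the jar from the driver.
 sc.addJar("local:/opt/libs/my-dependency.jar")
 {code}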



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1379) Calling .cache() on a SchemaRDD should do something more efficient than caching the individual row objects.

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1379:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Calling .cache() on a SchemaRDD should do something more efficient than 
 caching the individual row objects.
 ---

 Key: SPARK-1379
 URL: https://issues.apache.org/jira/browse/SPARK-1379
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
 Fix For: 1.2.0


 Since rows aren't black boxes, we could use {{InMemoryColumnarTableScan}}. This 
 would significantly reduce GC pressure on the workers.
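 A sketch of the difference from the user's point of view (assumes a Spark 1.x 
 SQLContext bound to {{sqlContext}}; the input path is hypothetical):
 {code}
 val people = sqlContext.jsonFile("hdfs:///data/people.json")
 people.registerTempTable("people")

 // Columnar, compressed in-memory caching via the SQL layer:
 sqlContext.cacheTable("people")

 // The path this issue targets: caching the SchemaRDD like a plain RDD keeps
 // individual row objects on the heap and puts more pressure on GC.
 // people.cache()
 {code}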



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1684) Merge script should standardize SPARK-XXX prefix

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1684:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Merge script should standardize SPARK-XXX prefix
 

 Key: SPARK-1684
 URL: https://issues.apache.org/jira/browse/SPARK-1684
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Minor
 Fix For: 1.2.0


 If users write {{[SPARK-XXX] Issue}}, {{SPARK-XXX. Issue}}, or {{SPARK XXX: 
 Issue}}, we should convert it to {{SPARK-XXX: Issue}}.
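 A minimal sketch of the normalization (the regex and helper are illustrative, not 
 the actual merge script, which is a Python script under dev/):
 {code}
 def standardizeJiraPrefix(title: String): String = {
   val jiraRef = """^\[?SPARK[- ](\d+)\]?[:.]?\s*(.*)""".r
   title.trim match {
     case jiraRef(num, rest) => s"SPARK-$num: $rest"
     case other              => other
   }
 }

 // standardizeJiraPrefix("[SPARK-1684] Issue")  == "SPARK-1684: Issue"
 // standardizeJiraPrefix("SPARK 1684: Issue")   == "SPARK-1684: Issue"
 {code}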



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1853:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Show Streaming application code context (file, line number) in Spark Stages UI
 --

 Key: SPARK-1853
 URL: https://issues.apache.org/jira/browse/SPARK-1853
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Tathagata Das
Assignee: Mubarak Seyed
 Fix For: 1.2.0

 Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png


 Right now, the code context (file and line number) shown for streaming jobs 
 in the Stages UI is meaningless, as it refers to an internal DStream file and a 
 random line rather than the user application file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1201) Do not materialize partitions whenever possible in BlockManager

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1201:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Do not materialize partitions whenever possible in BlockManager
 ---

 Key: SPARK-1201
 URL: https://issues.apache.org/jira/browse/SPARK-1201
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager, Spark Core
Reporter: Patrick Wendell
Assignee: Andrew Or
 Fix For: 1.2.0


 This is a slightly more complex version of SPARK-942, where we try to avoid 
 unrolling iterators in other situations where it is possible. SPARK-942 
 focused on the case where the DISK_ONLY storage level was used. There are 
 other cases though, such as when data is stored serialized in memory but 
 there is not enough memory left to store the RDD.
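 The general idea, sketched outside of Spark's actual BlockManager code: unroll an 
 iterator into memory only while an estimated size budget holds, and otherwise hand 
 the iterator through without materializing the whole partition.
 {code}
 def maybeUnroll[T](values: Iterator[T], maxBytes: Long, sizeOf: T => Long): Either[Iterator[T], Vector[T]] = {
   val buffered = Vector.newBuilder[T]
   var usedBytes = 0L
   var overBudget = false
   while (values.hasNext && !overBudget) {
     val v = values.next()
     usedBytes += sizeOf(v)
     buffered += v
     if (usedBytes > maxBytes) overBudget = true
   }
   if (overBudget) Left(buffered.result().iterator ++ values)  // hand back buffered items plus the rest
   else Right(buffered.result())                               // everything fit within the budget
 }
 {code}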



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2159) Spark shell exit() does not stop SparkContext

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2159:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Spark shell exit() does not stop SparkContext
 -

 Key: SPARK-2159
 URL: https://issues.apache.org/jira/browse/SPARK-2159
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Or
Priority: Minor
 Fix For: 1.2.0


 If you type {{exit()}} in the Spark shell, it is equivalent to Ctrl+C and does 
 not stop the SparkContext. This is a very common way to exit a shell, and it 
 would be good if it were equivalent to Ctrl+D instead, which does stop the 
 SparkContext.
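 Until that is the case, a workaround is to stop the context explicitly before 
 leaving (a sketch; assumes the shell-provided SparkContext is bound to {{sc}}):
 {code}
 sc.stop()   // shut down the SparkContext cleanly
 exit()      // then leave the shell; Ctrl+D already does both steps today
 {code}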



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


