[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

2016-09-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14359


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

2016-09-20 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14359#discussion_r79679122
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -161,31 +161,42 @@ private[spark] object RandomForest extends Logging {
   None
 }
 
-// FIFO queue of nodes to train: (treeIndex, node)
-val nodeQueue = new mutable.Queue[(Int, LearningNode)]()
+/*
+  Stack of nodes to train: (treeIndex, node)
+  The reason this is FILO is that we train many trees at once, but we 
want to focus on
--- End diff --

"The reason this is FILO" -> "The reason we use a stack"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

2016-08-18 Thread jodersky
Github user jodersky commented on a diff in the pull request:

https://github.com/apache/spark/pull/14359#discussion_r75361458
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -161,31 +161,43 @@ private[spark] object RandomForest extends Logging {
   None
 }
 
-// FIFO queue of nodes to train: (treeIndex, node)
-val nodeQueue = new mutable.Queue[(Int, LearningNode)]()
+/*
+  FILO queue of nodes to train: (treeIndex, node)
+  We make this FILO by always inserting nodes by appending (+=) and 
removing with dropRight.
--- End diff --

the methods described here still refer to the original queue data structure


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

2016-07-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/14359#discussion_r72338956
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1110,4 +1124,40 @@ private[spark] object RandomForest extends Logging {
 }
   }
 
+  /**
+   * Class for queueing nodes to split on each iteration.
+   * This is a FILO queue.
+   */
+  private[impl] class NodeQueue {
+private var q: mutable.MutableList[(Int, LearningNode)] =
+  mutable.MutableList.empty[(Int, LearningNode)]
+
+/** Put (treeIndex, node) into queue. Constant time. */
+def put(treeIndex: Int, node: LearningNode): Unit = {
+  q += ((treeIndex, node))
+}
+
+/** Remove and return last inserted element.  Linear time (unclear in 
Scala docs though) */
--- End diff --

I also somehow didn't find Stack via Google...  Does anyone know of 
documentation for the performance of Stack?  I don't see it in the Scala docs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

2016-07-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/14359#discussion_r72338537
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1110,4 +1124,40 @@ private[spark] object RandomForest extends Logging {
 }
   }
 
+  /**
+   * Class for queueing nodes to split on each iteration.
+   * This is a FILO queue.
+   */
+  private[impl] class NodeQueue {
+private var q: mutable.MutableList[(Int, LearningNode)] =
+  mutable.MutableList.empty[(Int, LearningNode)]
+
+/** Put (treeIndex, node) into queue. Constant time. */
+def put(treeIndex: Int, node: LearningNode): Unit = {
+  q += ((treeIndex, node))
+}
+
+/** Remove and return last inserted element.  Linear time (unclear in 
Scala docs though) */
--- End diff --

Whoops, I missed that List.tail is O(1) time.  I switched to a List.  
Thanks all!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

2016-07-26 Thread jodersky
Github user jodersky commented on a diff in the pull request:

https://github.com/apache/spark/pull/14359#discussion_r72322067
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1110,4 +1124,40 @@ private[spark] object RandomForest extends Logging {
 }
   }
 
+  /**
+   * Class for queueing nodes to split on each iteration.
+   * This is a FILO queue.
+   */
+  private[impl] class NodeQueue {
+private var q: mutable.MutableList[(Int, LearningNode)] =
+  mutable.MutableList.empty[(Int, LearningNode)]
+
+/** Put (treeIndex, node) into queue. Constant time. */
+def put(treeIndex: Int, node: LearningNode): Unit = {
+  q += ((treeIndex, node))
+}
+
+/** Remove and return last inserted element.  Linear time (unclear in 
Scala docs though) */
--- End diff --

How critical is this for performance? I know that `append` is O(1) for 
scala mutable lists but `dropRight(1)` is actually implemented in a parent 
class [1] of `MutableList` and therefores does not take advantage of the list's 
reference to its last element. All in all, from it's implementation it seems 
that `dropRight` is O(n)

[1] 
https://github.com/scala/scala/blob/v2.11.8/src/library/scala/collection/LinearSeqOptimized.scala#L194-L204


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

2016-07-26 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14359#discussion_r72317443
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1110,4 +1124,40 @@ private[spark] object RandomForest extends Logging {
 }
   }
 
+  /**
+   * Class for queueing nodes to split on each iteration.
+   * This is a FILO queue.
+   */
+  private[impl] class NodeQueue {
+private var q: mutable.MutableList[(Int, LearningNode)] =
+  mutable.MutableList.empty[(Int, LearningNode)]
+
+/** Put (treeIndex, node) into queue. Constant time. */
+def put(treeIndex: Int, node: LearningNode): Unit = {
+  q += ((treeIndex, node))
+}
+
+/** Remove and return last inserted element.  Linear time (unclear in 
Scala docs though) */
--- End diff --

Is there some reason not to use `scala.collection.mutable.Stack[(Int, 
LearningNode)]` here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

2016-07-26 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/14359#discussion_r72197712
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1110,4 +1124,40 @@ private[spark] object RandomForest extends Logging {
 }
   }
 
+  /**
+   * Class for queueing nodes to split on each iteration.
+   * This is a FILO queue.
+   */
+  private[impl] class NodeQueue {
+private var q: mutable.MutableList[(Int, LearningNode)] =
+  mutable.MutableList.empty[(Int, LearningNode)]
+
+/** Put (treeIndex, node) into queue. Constant time. */
+def put(treeIndex: Int, node: LearningNode): Unit = {
+  q += ((treeIndex, node))
+}
+
+/** Remove and return last inserted element.  Linear time (unclear in 
Scala docs though) */
--- End diff --

"a FILO queue" is an unconventional saying. I like it though.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

2016-07-25 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/14359#discussion_r72179527
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1110,4 +1124,40 @@ private[spark] object RandomForest extends Logging {
 }
   }
 
+  /**
+   * Class for queueing nodes to split on each iteration.
+   * This is a FILO queue.
+   */
+  private[impl] class NodeQueue {
+private var q: mutable.MutableList[(Int, LearningNode)] =
+  mutable.MutableList.empty[(Int, LearningNode)]
+
+/** Put (treeIndex, node) into queue. Constant time. */
+def put(treeIndex: Int, node: LearningNode): Unit = {
+  q += ((treeIndex, node))
+}
+
+/** Remove and return last inserted element.  Linear time (unclear in 
Scala docs though) */
--- End diff --

Note: This is not ideal, but it should be insignificant compared to the 
cost of communicating the trees.  I am not aware of an existing solution for a 
stack with constant-time push/pop in Scala.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

2016-07-25 Thread jkbradley
GitHub user jkbradley opened a pull request:

https://github.com/apache/spark/pull/14359

[SPARK-16719][ML] Random Forests should communicate fewer trees on each 
iteration

## What changes were proposed in this pull request?

RandomForest currently sends the entire forest to each worker on each 
iteration. This is because (a) the node queue is FIFO and (b) the closure 
references the entire array of trees (topNodes). (a) causes RFs to handle 
splits in many trees, especially early on in learning. (b) sends all trees 
explicitly.

This PR:
(a) Change the RF node queue to be FILO, so that RFs tend to focus on 1 or 
a few trees before focusing on others.
(b) Change topNodes to pass only the trees required on that iteration.

## How was this patch tested?

Unit tests:
* Existing tests for correctness of tree learning
* New test: NodeQueue should be FILO
* Manually modifying code and running tests to verify that a small number 
of trees are communicated on each iteration
  * This last item is hard to test via unit tests given the current APIs.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jkbradley/spark rfs-fewer-trees

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14359.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14359


commit c76ec826cac1f2c43d10bf18ca5a914a572fa7b7
Author: Joseph K. Bradley 
Date:   2016-04-21T18:01:28Z

changed RandomForest node queue to be FILO and changed topNodes to pass 
only trees in each group

commit 4547a44796420ea8504e9840789893b279eae0e5
Author: Joseph K. Bradley 
Date:   2016-04-27T03:42:16Z

Added test for FILO nodeQueue

commit 6fcfb4b0e158ba86371ad4d0728490f3a8e7caeb
Author: Joseph K. Bradley 
Date:   2016-07-26T02:21:14Z

minor cleanups




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org