[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2017-01-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15018





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2017-01-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r95914526
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -312,90 +313,120 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
   }
 
   /**
-   * Performs a pool adjacent violators algorithm (PAV).
-   * Uses approach with single processing of data where violators
-   * in previously processed data created by pooling are fixed immediately.
-   * Uses optimization of discovering monotonicity violating sequences (blocks).
+   * Performs a pool adjacent violators algorithm (PAV). Implements the algorithm originally
+   * described in [1], using the formulation from [2, 3]. Uses an array to keep track of start
+   * and end indices of blocks.
    *
-   * @param input Input data of tuples (label, feature, weight).
+   * [1] Grotzinger, S. J., and C. Witzgall. "Projections onto order simplexes." Applied
+   * mathematics and Optimization 12.1 (1984): 247-270.
+   *
+   * [2] Best, Michael J., and Nilotpal Chakravarti. "Active set algorithms for isotonic
+   * regression; a unifying framework." Mathematical Programming 47.1-3 (1990): 425-439.
+   *
+   * [3] Best, Michael J., Nilotpal Chakravarti, and Vasant A. Ubhaya. "Minimizing separable convex
+   * functions subject to simple chain constraints." SIAM Journal on Optimization 10.3 (2000):
+   * 658-672.
+   *
+   * @param input Input data of tuples (label, feature, weight). Weights must
+   *              be non-negative.
    * @return Result tuples (label, feature, weight) where labels were updated
    *         to form a monotone sequence as per isotonic regression definition.
    */
   private def poolAdjacentViolators(
       input: Array[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
 
-    if (input.isEmpty) {
-      return Array.empty
+    val cleanInput = input.flatMap{ case (y, x, weight) =>
+      require(weight >= 0.0, "weights must be non-negative")
+      if (weight == 0.0) {
+        logWarning(s"Dropping zero-weight point ($y, $x, $weight)")
+        Array[(Double, Double, Double)]()
+      } else {
+        Array((y, x, weight))
+      }
     }
 
-    // Pools sub array within given bounds assigning weighted average value to all elements.
-    def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
-      val poolSubArray = input.slice(start, end + 1)
+    if (cleanInput.isEmpty) {
+      return Array.empty
+    }
 
-      val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
-      val weight = poolSubArray.map(_._3).sum
+    // Keeps track of the start and end indices of the blocks. if [i, j] is a valid block from
+    // cleanInput(i) to cleanInput(j) (inclusive), then blockBounds(i) = j and blockBounds(j) = i
+    // Initially, each data point is its own block.
+    val blockBounds = Array.range(0, cleanInput.length)
 
-      var i = start
-      while (i <= end) {
-        input(i) = (weightedSum / weight, input(i)._2, input(i)._3)
-        i = i + 1
-      }
+    // Keep track of the sum of weights and sum of weight * y for each block. weights(start)
+    // gives the values for the block. Entries that are not at the start of a block
+    // are meaningless.
+    val weights: Array[(Double, Double)] = cleanInput.map { case (y, _, weight) =>
+      (weight, weight * y)
     }
 
-    var i = 0
-    val len = input.length
-    while (i < len) {
-      var j = i
+    // a few convenience functions to make the code more readable
 
-      // Find monotonicity violating sequence, if any.
-      while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
-        j = j + 1
-      }
+    // blockStart and blockEnd have identical implementations. We create two different
+    // functions to make the code more expressive
+    def blockEnd(start: Int): Int = blockBounds(start)
+    def blockStart(end: Int): Int = blockBounds(end)
 
-      // If monotonicity was not violated, move to next data point.
-      if (i == j) {
-        i = i + 1
-      } else {
-        // Otherwise pool the violating sequence
-        // and check if pooling caused monotonicity violation in previously processed points.
-        while (i >= 0 && input(i)._1 > input(i + 1)._1) {
-          pool(input, i, j)
-          i = i - 1
-        }
+    // the next block starts at the index after the end of this block
+    def nextBlock(start: Int): Int = blockEnd(start) + 1
 
-        i = j
-      }
+    // the

[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2017-01-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r95914244
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -312,90 +313,120 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
   }
 
   /**
-   * Performs a pool adjacent violators algorithm (PAV).
-   * Uses approach with single processing of data where violators
-   * in previously processed data created by pooling are fixed immediately.
-   * Uses optimization of discovering monotonicity violating sequences (blocks).
+   * Performs a pool adjacent violators algorithm (PAV). Implements the algorithm originally
+   * described in [1], using the formulation from [2, 3]. Uses an array to keep track of start
+   * and end indices of blocks.
    *
-   * @param input Input data of tuples (label, feature, weight).
+   * [1] Grotzinger, S. J., and C. Witzgall. "Projections onto order simplexes." Applied
+   * mathematics and Optimization 12.1 (1984): 247-270.
+   *
+   * [2] Best, Michael J., and Nilotpal Chakravarti. "Active set algorithms for isotonic
+   * regression; a unifying framework." Mathematical Programming 47.1-3 (1990): 425-439.
+   *
+   * [3] Best, Michael J., Nilotpal Chakravarti, and Vasant A. Ubhaya. "Minimizing separable convex
+   * functions subject to simple chain constraints." SIAM Journal on Optimization 10.3 (2000):
+   * 658-672.
+   *
+   * @param input Input data of tuples (label, feature, weight). Weights must
+   *              be non-negative.
    * @return Result tuples (label, feature, weight) where labels were updated
   *         to form a monotone sequence as per isotonic regression definition.
    */
   private def poolAdjacentViolators(
       input: Array[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
 
-    if (input.isEmpty) {
-      return Array.empty
+    val cleanInput = input.flatMap{ case (y, x, weight) =>
--- End diff --

filter would be simpler than flatMap
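For illustration, a `filter`-based version of the zero-weight cleanup might look like this (a sketch of the reviewer's suggestion, not the committed code):

```scala
// filter keeps a tuple when the predicate is true, so no per-element
// Array wrapping is needed; the require still rejects negative weights.
val cleanInput = input.filter { case (y, x, weight) =>
  require(weight >= 0.0, s"Weights must be non-negative, got ($y, $x, $weight)")
  weight != 0.0
}
```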





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2017-01-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r95914074
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -312,90 +313,120 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
   }
 
   /**
-   * Performs a pool adjacent violators algorithm (PAV).
-   * Uses approach with single processing of data where violators
-   * in previously processed data created by pooling are fixed immediately.
-   * Uses optimization of discovering monotonicity violating sequences (blocks).
+   * Performs a pool adjacent violators algorithm (PAV). Implements the algorithm originally
+   * described in [1], using the formulation from [2, 3]. Uses an array to keep track of start
+   * and end indices of blocks.
    *
-   * @param input Input data of tuples (label, feature, weight).
+   * [1] Grotzinger, S. J., and C. Witzgall. "Projections onto order simplexes." Applied
+   * mathematics and Optimization 12.1 (1984): 247-270.
+   *
+   * [2] Best, Michael J., and Nilotpal Chakravarti. "Active set algorithms for isotonic
+   * regression; a unifying framework." Mathematical Programming 47.1-3 (1990): 425-439.
+   *
+   * [3] Best, Michael J., Nilotpal Chakravarti, and Vasant A. Ubhaya. "Minimizing separable convex
+   * functions subject to simple chain constraints." SIAM Journal on Optimization 10.3 (2000):
+   * 658-672.
+   *
+   * @param input Input data of tuples (label, feature, weight). Weights must
+   *              be non-negative.
   * @return Result tuples (label, feature, weight) where labels were updated
   *         to form a monotone sequence as per isotonic regression definition.
    */
   private def poolAdjacentViolators(
       input: Array[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
 
-    if (input.isEmpty) {
-      return Array.empty
+    val cleanInput = input.flatMap{ case (y, x, weight) =>
+      require(weight >= 0.0, "weights must be non-negative")
+      if (weight == 0.0) {
+        logWarning(s"Dropping zero-weight point ($y, $x, $weight)")
--- End diff --

I disagree about logging for zero-weight points.  If an instance has zero 
weight, then it should be ignored, and that's not a problem.

If we really want to log this, then let's do it once per dataset, not once 
per row, which could make logs explode.  I'd also say we should lower the 
logging level to logInfo.
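A sketch of that log-once approach (hypothetical shape, assuming the surrounding class mixes in Spark's `Logging` trait as the PR's `logWarning` call implies):

```scala
// Drop zero-weight points silently per row, then emit one summary log line.
val cleanInput = input.filter(_._3 != 0.0)
val numDropped = input.length - cleanInput.length
if (numDropped > 0) {
  logInfo(s"Dropped $numDropped zero-weight point(s) from isotonic regression input.")
}
```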





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2017-01-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r95914179
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -312,90 +313,120 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
   }
 
   /**
-   * Performs a pool adjacent violators algorithm (PAV).
-   * Uses approach with single processing of data where violators
-   * in previously processed data created by pooling are fixed immediately.
-   * Uses optimization of discovering monotonicity violating sequences (blocks).
+   * Performs a pool adjacent violators algorithm (PAV). Implements the algorithm originally
+   * described in [1], using the formulation from [2, 3]. Uses an array to keep track of start
+   * and end indices of blocks.
    *
-   * @param input Input data of tuples (label, feature, weight).
+   * [1] Grotzinger, S. J., and C. Witzgall. "Projections onto order simplexes." Applied
+   * mathematics and Optimization 12.1 (1984): 247-270.
+   *
+   * [2] Best, Michael J., and Nilotpal Chakravarti. "Active set algorithms for isotonic
+   * regression; a unifying framework." Mathematical Programming 47.1-3 (1990): 425-439.
+   *
+   * [3] Best, Michael J., Nilotpal Chakravarti, and Vasant A. Ubhaya. "Minimizing separable convex
+   * functions subject to simple chain constraints." SIAM Journal on Optimization 10.3 (2000):
+   * 658-672.
+   *
+   * @param input Input data of tuples (label, feature, weight). Weights must
+   *              be non-negative.
   * @return Result tuples (label, feature, weight) where labels were updated
   *         to form a monotone sequence as per isotonic regression definition.
    */
   private def poolAdjacentViolators(
       input: Array[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
 
-    if (input.isEmpty) {
-      return Array.empty
+    val cleanInput = input.flatMap{ case (y, x, weight) =>
+      require(weight >= 0.0, "weights must be non-negative")
--- End diff --

Please state the invalid y, x, weight values in the error message.





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2017-01-12 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r95775162
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -312,10 +313,19 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
   }
 
   /**
-   * Performs a pool adjacent violators algorithm (PAV).
-   * Uses approach with single processing of data where violators
-   * in previously processed data created by pooling are fixed immediately.
-   * Uses optimization of discovering monotonicity violating sequences (blocks).
+   * Performs a pool adjacent violators algorithm (PAV). Implements the algorithm originally
+   * described in [1], using the formulation from [2, 3]. Uses an array to keep track of start
+   * and end indices of blocks.
+   *
+   * [1] Grotzinger, S. J., and C. Witzgall. "Projections onto order simplexes." Applied
+   * mathematics and Optimization 12.1 (1984): 247-270.
+   *
+   * [2] Best, Michael J., and Nilotpal Chakravarti. "Active set algorithms for isotonic
+   * regression; a unifying framework." Mathematical Programming 47.1-3 (1990): 425-439.
+   *
+   * [3] Best, Michael J., Nilotpal Chakravarti, and Vasant A. Ubhaya. "Minimizing separable convex
+   * functions subject to simple chain constraints." SIAM Journal on Optimization 10.3 (2000):
+   * 658-672.
    *
    * @param input Input data of tuples (label, feature, weight).
--- End diff --

Can you update the param doc to indicate that now negative weights are 
prohibited? 





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2017-01-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r95094551
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -328,74 +336,80 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
       return Array.empty
     }
 
-    // Pools sub array within given bounds assigning weighted average value to all elements.
-    def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
-      val poolSubArray = input.slice(start, end + 1)
 
-      val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
-      val weight = poolSubArray.map(_._3).sum
+    // Keeps track of the start and end indices of the blocks. if [i, j] is a valid block from
+    // input(i) to input(j) (inclusive), then blockBounds(i) = j and blockBounds(j) = i
+    val blockBounds = Array.range(0, input.length) // Initially, each data point is its own block
 
-      var i = start
-      while (i <= end) {
-        input(i) = (weightedSum / weight, input(i)._2, input(i)._3)
-        i = i + 1
-      }
+    // Keep track of the sum of weights and sum of weight * y for each block. weights(start)
+    // gives the values for the block. Entries that are not at the start of a block
+    // are meaningless.
+    val weights: Array[(Double, Double)] = input.map { case (y, _, weight) =>
+      require(weight != 0.0)
--- End diff --

Please add an informative error message.





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2017-01-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r95094550
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -328,74 +336,80 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
       return Array.empty
     }
 
-    // Pools sub array within given bounds assigning weighted average value to all elements.
-    def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
-      val poolSubArray = input.slice(start, end + 1)
 
-      val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
-      val weight = poolSubArray.map(_._3).sum
+    // Keeps track of the start and end indices of the blocks. if [i, j] is a valid block from
+    // input(i) to input(j) (inclusive), then blockBounds(i) = j and blockBounds(j) = i
+    val blockBounds = Array.range(0, input.length) // Initially, each data point is its own block
 
-      var i = start
-      while (i <= end) {
-        input(i) = (weightedSum / weight, input(i)._2, input(i)._3)
-        i = i + 1
-      }
+    // Keep track of the sum of weights and sum of weight * y for each block. weights(start)
+    // gives the values for the block. Entries that are not at the start of a block
+    // are meaningless.
+    val weights: Array[(Double, Double)] = input.map { case (y, _, weight) =>
+      require(weight != 0.0)
+      (weight, weight * y)
     }
 
-    var i = 0
-    val len = input.length
-    while (i < len) {
-      var j = i
+    // a few convenience functions to make the code more readable
+
+    // blockStart and blockEnd have identical implementations. We create two different
+    // functions to make the code more expressive
+    def blockEnd(start: Int): Int = blockBounds(start)
+    def blockStart(end: Int): Int = blockBounds(end)
+
+    // the next block starts at the index after the end of this block
+    def nextBlock(start: Int): Int = blockEnd(start) + 1
+
+    // the previous block ends at the index before the start of this block
+    // we then use blockStart to find the start
+    def prevBlock(start: Int): Int = blockStart(start - 1)
+
+    // Merge two adjacent blocks, updating blockBounds and weights to reflect the merge
+    // Return the start index of the merged block
+    def merge(block1: Int, block2: Int): Int = {
+      assert(blockEnd(block1) + 1 == block2, "attempting to merge non-consecutive blocks")
--- End diff --

Please make the error message more informative:
* Make it clear that this indicates an internal bug within 
IsotonicRegression
* State the invalid values
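For example, something along these lines (wording illustrative, not the committed code):

```scala
// Make clear the failure is an internal bug and include the offending indices.
assert(blockEnd(block1) + 1 == block2,
  s"Internal error in IsotonicRegression: attempted to merge non-consecutive blocks " +
  s"starting at $block1 (ends at ${blockEnd(block1)}) and $block2.")
```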





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2017-01-04 Thread neggert
Github user neggert commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r94641912
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -328,74 +336,80 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
       return Array.empty
     }
 
-    // Pools sub array within given bounds assigning weighted average value to all elements.
-    def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
-      val poolSubArray = input.slice(start, end + 1)
 
-      val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
-      val weight = poolSubArray.map(_._3).sum
+    // Keeps track of the start and end indices of the blocks. if [i, j] is a valid block from
+    // input(i) to input(j) (inclusive), then blockBounds(i) = j and blockBounds(j) = i
+    val blockBounds = Array.range(0, input.length) // Initially, each data point is its own block
 
-      var i = start
-      while (i <= end) {
-        input(i) = (weightedSum / weight, input(i)._2, input(i)._3)
-        i = i + 1
-      }
+    // Keep track of the sum of weights and sum of weight * y for each block. weights(start)
+    // gives the values for the block. Entries that are not at the start of a block
+    // are meaningless.
+    val weights: Array[(Double, Double)] = input.map { case (y, _, weight) =>
+      require(weight != 0.0)
--- End diff --

It's a weird thing to do, but I don't think there's any reason we can't 
support it. The problem still has a unique solution.





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2017-01-04 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r94640811
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -328,74 +336,80 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
       return Array.empty
     }
 
-    // Pools sub array within given bounds assigning weighted average value to all elements.
-    def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
-      val poolSubArray = input.slice(start, end + 1)
 
-      val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
-      val weight = poolSubArray.map(_._3).sum
+    // Keeps track of the start and end indices of the blocks. if [i, j] is a valid block from
+    // input(i) to input(j) (inclusive), then blockBounds(i) = j and blockBounds(j) = i
+    val blockBounds = Array.range(0, input.length) // Initially, each data point is its own block
 
-      var i = start
-      while (i <= end) {
-        input(i) = (weightedSum / weight, input(i)._2, input(i)._3)
-        i = i + 1
-      }
+    // Keep track of the sum of weights and sum of weight * y for each block. weights(start)
+    // gives the values for the block. Entries that are not at the start of a block
+    // are meaningless.
+    val weights: Array[(Double, Double)] = input.map { case (y, _, weight) =>
+      require(weight != 0.0)
--- End diff --

Are negative weights OK? (you don't need a type on `weight`, but hardly 
matters)
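For background (my note, not part of the thread): isotonic regression minimizes the weighted least-squares objective below, and non-negative weights are what keep that objective convex, so PAV's pooled weighted averages remain optimal:

```latex
\min_{\hat{y}_1 \le \hat{y}_2 \le \cdots \le \hat{y}_n} \sum_{i=1}^{n} w_i \, (y_i - \hat{y}_i)^2,
\qquad w_i \ge 0
```

A negative w_i flips the sign of one term, making the objective non-convex, which is one reason to reject negative weights outright.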





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2017-01-04 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r94571308
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -328,74 +336,81 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
       return Array.empty
     }
 
-    // Pools sub array within given bounds assigning weighted average value to all elements.
-    def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
-      val poolSubArray = input.slice(start, end + 1)
 
-      val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
-      val weight = poolSubArray.map(_._3).sum
+    // Keeps track of the start and end indices of the blocks. if [i, j] is a valid block from
+    // input(i) to input(j) (inclusive), then blockBounds(i) = j and blockBounds(j) = i
+    val blockBounds = Array.range(0, input.length) // Initially, each data point is its own block
 
-      var i = start
-      while (i <= end) {
-        input(i) = (weightedSum / weight, input(i)._2, input(i)._3)
-        i = i + 1
-      }
+    // Keep track of the sum of weights and sum of weight * y for each block. weights(start)
+    // gives the values for the block. Entries that are not at the start of a block
+    // are meaningless.
+    val weights: Array[(Double, Double)] = input.map {
+      case (_, _, weight) if weight == 0d =>
--- End diff --

I'd always write 0.0 instead of 0d for clarity. I also think we want an 
`IllegalArgumentException`, so maybe:

```
val weights = input.map { case (y, _, weight) =>
  require(weight > 0.0)
  (weight, weight * y)
}
```





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2016-12-20 Thread neggert
Github user neggert commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r9328
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -328,74 +336,69 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
       return Array.empty
     }
 
-    // Pools sub array within given bounds assigning weighted average value to all elements.
-    def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
-      val poolSubArray = input.slice(start, end + 1)
 
-      val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
-      val weight = poolSubArray.map(_._3).sum
+    // Keeps track of the start and end indices of the blocks. blockBounds(start) gives the
+    // index of the end of the block and blockBounds(end) gives the index of the start of the
+    // block. Entries that are not the start or end of the block are meaningless. The idea is that
--- End diff --

It relies on knowing ahead of time whether `x` is the start or end index [1]. If it's a start index, `blockBounds(x)` gives the ending index of that block. If it's an end index, `blockBounds(x)` gives the starting index of the block. So yes, the implementations of `blockStart` and `blockEnd` are identical. I just have two different functions because it makes the code more readable.

Maybe the comment from scikit-learn (where I borrowed this idea from) explains it better? (their `target` = my `blockBounds`)

> target describes a list of blocks.  At any time, if [i..j] (inclusive) is
> an active block, then target[i] := j and target[j] := i.

The trick is just in maintaining the array so that the above property is always true. At initialization, it's trivially true because all blocks have only one element, and `blockBounds(x)` = `x`. After initialization, `blockBounds` is only modified by the merge function, which is set up to modify `blockBounds` so that this property is preserved, then return the starting index of the newly-merged block.

This is admittedly a bit tricky, but it's a lot faster than the implementation I did where I created a doubly-linked list of `Block` case classes. I'm open to suggestions on how to explain it better.

[1] You could actually figure out whether you have a start or an end index by comparing `blockBounds(x)` to `x`. The lesser value will be the start index.
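To make the invariant concrete, here is a toy illustration (hypothetical values, not code from the PR):

```scala
// Six points grouped into blocks [0..2], [3..3], [4..5]:
val blockBounds = Array(2, 1, 0, 3, 5, 4)
// blockBounds(0) == 2 and blockBounds(2) == 0 encode the block [0..2];
// the interior entry blockBounds(1) is never read.
// Merging [0..2] with [3..3] preserves the invariant with just two writes:
blockBounds(0) = 3 // the start of the merged block now points at its end
blockBounds(3) = 0 // the end of the merged block now points at its start
```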





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2016-12-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r93229282
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -328,74 +336,69 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
       return Array.empty
     }
 
-    // Pools sub array within given bounds assigning weighted average value to all elements.
-    def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
-      val poolSubArray = input.slice(start, end + 1)
 
-      val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
-      val weight = poolSubArray.map(_._3).sum
+    // Keeps track of the start and end indices of the blocks. blockBounds(start) gives the
+    // index of the end of the block and blockBounds(end) gives the index of the start of the
+    // block. Entries that are not the start or end of the block are meaningless. The idea is that
--- End diff --

I'm still not sure about this comment -- how can `blockBounds(x)` be both 
the start and end of a block? the implementation below is identical.





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2016-12-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r92920572
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -328,74 +336,68 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
       return Array.empty
     }
 
-    // Pools sub array within given bounds assigning weighted average value to all elements.
-    def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
-      val poolSubArray = input.slice(start, end + 1)
-
-      val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
-      val weight = poolSubArray.map(_._3).sum
-
-      var i = start
-      while (i <= end) {
-        input(i) = (weightedSum / weight, input(i)._2, input(i)._3)
-        i = i + 1
-      }
+    /*
+    Keeps track of the start and end indices of the blocks. blockBounds(start) gives the
+    index of the end of the block and blockBounds(end) gives the index of the start of the
+    block. Entries that are not the start or end of the block are meaningless.
+    */
+    val blockBounds = Array.range(0, input.length) // Initially, each data point is its own block
+
+    /*
+    Keep track of the sum of weights and sum of weight * y for each block. weights(start)
+    gives the values for the block. Entries that are not at the start of a block
+    are meaningless.
+    */
+    val weights: Array[(Double, Double)] = input.map(x => (x._3, x._3 * x._1)) // (weight, weight * y)
--- End diff --

Up to you; might be clearer as `input.map { case (y, _, weight) => (weight, 
weight * y) }`





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2016-12-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r92920619
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -328,74 +336,68 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
       return Array.empty
     }
 
-    // Pools sub array within given bounds assigning weighted average value to all elements.
-    def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
-      val poolSubArray = input.slice(start, end + 1)
-
-      val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
-      val weight = poolSubArray.map(_._3).sum
-
-      var i = start
-      while (i <= end) {
-        input(i) = (weightedSum / weight, input(i)._2, input(i)._3)
-        i = i + 1
-      }
+    /*
+    Keeps track of the start and end indices of the blocks. blockBounds(start) gives the
+    index of the end of the block and blockBounds(end) gives the index of the start of the
--- End diff --

blockBounds(start) is the _end_ of a block? this is reflected in your 
helper functions below but it seems reversed. Can you double check and 
elaborate a bit?





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2016-12-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r92920879
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -328,74 +336,68 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
       return Array.empty
     }
 
-    // Pools sub array within given bounds assigning weighted average value to all elements.
-    def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
-      val poolSubArray = input.slice(start, end + 1)
-
-      val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
-      val weight = poolSubArray.map(_._3).sum
-
-      var i = start
-      while (i <= end) {
-        input(i) = (weightedSum / weight, input(i)._2, input(i)._3)
-        i = i + 1
-      }
+    /*
+    Keeps track of the start and end indices of the blocks. blockBounds(start) gives the
+    index of the end of the block and blockBounds(end) gives the index of the start of the
+    block. Entries that are not the start or end of the block are meaningless.
+    */
+    val blockBounds = Array.range(0, input.length) // Initially, each data point is its own block
+
+    /*
+    Keep track of the sum of weights and sum of weight * y for each block. weights(start)
+    gives the values for the block. Entries that are not at the start of a block
+    are meaningless.
+    */
+    val weights: Array[(Double, Double)] = input.map(x => (x._3, x._3 * x._1)) // (weight, weight * y)
+
+    // a few convenience functions to make the code more readable
+    def blockEnd(start: Int) = blockBounds(start)
+    def blockStart(end: Int) = blockBounds(end)
+    def nextBlock(start: Int) = blockEnd(start) + 1
+    def prevBlock(start: Int) = blockStart(start - 1)
+    def merge(block1: Int, block2: Int): Int = {
+      assert(blockEnd(block1) + 1 == block2, "attempting to merge non-consecutive blocks")
+      blockBounds(block1) = blockEnd(block2)
+      blockBounds(blockEnd(block2)) = block1
+      val w1 = weights(block1)
+      val w2 = weights(block2)
+      weights(block1) = (w1._1 + w2._1, w1._2 + w2._2)
+      block1
     }
+    def average(start: Int) = weights(start)._2 / weights(start)._1
 
+    /*
+    Implement Algorithm PAV from [3].
+    Merge on >= instead of > because it elimnate adjacent blocks with the same average, and we want
+    to compress our output as much as possible. Both give correct results.
+    */
     var i = 0
-    val len = input.length
-    while (i < len) {
-      var j = i
-
-      // Find monotonicity violating sequence, if any.
-      while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
-        j = j + 1
-      }
-
-      // If monotonicity was not violated, move to next data point.
-      if (i == j) {
-        i = i + 1
-      } else {
-        // Otherwise pool the violating sequence
-        // and check if pooling caused monotonicity violation in previously processed points.
-        while (i >= 0 && input(i)._1 > input(i + 1)._1) {
-          pool(input, i, j)
-          i = i - 1
+    while (nextBlock(i) < input.length) {
+      if (average(i) >= average(nextBlock(i))) {
+        merge(i, nextBlock(i))
+        while((i > 0) && (average(prevBlock(i)) >= average(i))) {
--- End diff --

Super nit: space before `while` here and below. You can probably nix the 
extra parens around each of the two terms, but no big deal





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2016-12-14 Thread neggert
Github user neggert commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r92432636
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -344,27 +344,30 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
     }
 
     var i = 0
-    val len = input.length
-    while (i < len) {
-      var j = i
-
-      // Find monotonicity violating sequence, if any.
-      while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
-        j = j + 1
-      }
+    val n = input.length - 1
+    var notFinished = true
+
+    while (notFinished) {
+      i = 0
+      notFinished = false
+
+      // Iterate through the data, fix any monotonicity violations we find
+      // We may need to do this multiple times, as pooling can introduce violations
+      // at locations that were previously fine.
+      while (i < n) {
+        var j = i
+
+        // Find next monotonicity violating sequence, if any.
+        while (j < n && input(j)._1 >= input(j + 1)._1) {
--- End diff --

I hadn't actually seen that paper. I just worked from the scikit-learn 
implementation (which, incidentally, has changed since I submitted this PR).

I believe that either way will get you correct results. Doing `>=` should 
be faster in some cases, because it's greedier about pooling, which should lead 
to fewer iterations of the outer loop.
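A minimal illustration of the difference (my example, not from the PR):

```scala
// With `>` two adjacent blocks of equal average are left unpooled for now;
// with `>=` they are pooled immediately, shrinking the number of blocks.
val averages = Array(2.0, 2.0, 1.0)
val strict = averages(0) > averages(1)  // false: no pooling on this pass
val greedy = averages(0) >= averages(1) // true: pool the equal pair right away
```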





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2016-12-14 Thread neggert
Github user neggert commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r92428442
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -344,27 +344,30 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
     }
 
     var i = 0
-    val len = input.length
-    while (i < len) {
-      var j = i
-
-      // Find monotonicity violating sequence, if any.
-      while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
-        j = j + 1
-      }
+    val n = input.length - 1
+    var notFinished = true
+
+    while (notFinished) {
+      i = 0
+      notFinished = false
+
+      // Iterate through the data, fix any monotonicity violations we find
+      // We may need to do this multiple times, as pooling can introduce violations
+      // at locations that were previously fine.
+      while (i < n) {
+        var j = i
+
+        // Find next monotonicity violating sequence, if any.
+        while (j < n && input(j)._1 >= input(j + 1)._1) {
+          j = j + 1
+        }
 
-      // If monotonicity was not violated, move to next data point.
-      if (i == j) {
-        i = i + 1
-      } else {
-        // Otherwise pool the violating sequence
-        // and check if pooling caused monotonicity violation in previously processed points.
-        while (i >= 0 && input(i)._1 > input(i + 1)._1) {
+        // Pool the violating sequence with the data point preceding it
+        if (input(i)._1 != input(j)._1) {
--- End diff --

Correct. Any violations that are introduced will be fixed in the next 
iteration.





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2016-12-13 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r92331409
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -344,27 +344,30 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
     }
 
     var i = 0
-    val len = input.length
-    while (i < len) {
-      var j = i
-
-      // Find monotonicity violating sequence, if any.
-      while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
-        j = j + 1
-      }
+    val n = input.length - 1
+    var notFinished = true
+
+    while (notFinished) {
+      i = 0
+      notFinished = false
+
+      // Iterate through the data, fix any monotonicity violations we find
+      // We may need to do this multiple times, as pooling can introduce violations
+      // at locations that were previously fine.
+      while (i < n) {
+        var j = i
+
+        // Find next monotonicity violating sequence, if any.
+        while (j < n && input(j)._1 >= input(j + 1)._1) {
--- End diff --

I think the original one, i.e. `input(j)._1 > input(j + 1)._1`, is correct. Here it is going to select out-of-order blocks.

Quoted from the paper:
> We refer to two blocks [p, q] and [q + 1, r] as consecutive. We refer to two consecutive blocks [p, q] and [q + 1, r] as in-order if theta_{p,q} <= theta_{q+1,r} and out-of-order otherwise.

Lemma 1 points out how a merged block is also a single-valued block.





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2016-12-13 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r92322282
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala 
---
@@ -344,27 +344,30 @@ class IsotonicRegression private (private var 
isotonic: Boolean) extends Seriali
 }
 
 var i = 0
-val len = input.length
-while (i < len) {
-  var j = i
-
-  // Find monotonicity violating sequence, if any.
-  while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
-j = j + 1
-  }
+val n = input.length - 1
+var notFinished = true
+
+while (notFinished) {
+  i = 0
+  notFinished = false
+
+  // Iterate through the data, fix any monotonicity violations we find
+  // We may need to do this multiple times, as pooling can introduce 
violations
+  // at locations that were previously fine.
+  while (i < n) {
+var j = i
+
+// Find next monotonicity violating sequence, if any.
+while (j < n && input(j)._1 >= input(j + 1)._1) {
+  j = j + 1
+}
 
-  // If monotonicity was not violated, move to next data point.
-  if (i == j) {
-i = i + 1
-  } else {
-// Otherwise pool the violating sequence
-// and check if pooling caused monotonicity violation in 
previously processed points.
-while (i >= 0 && input(i)._1 > input(i + 1)._1) {
+// Pool the violating sequence with the data point preceding it
--- End diff --

Is this comment correct? It looks like you pool the violating sequence [i, j] only. The preceding data points are checked for pooling in the next pass of the outer loop.





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2016-12-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r92298473
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -344,27 +344,30 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
     }
 
     var i = 0
-    val len = input.length
-    while (i < len) {
-      var j = i
-
-      // Find monotonicity violating sequence, if any.
-      while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
-        j = j + 1
-      }
+    val n = input.length - 1
+    var notFinished = true
+
+    while (notFinished) {
+      i = 0
+      notFinished = false
+
+      // Iterate through the data, fix any monotonicity violations we find
+      // We may need to do this multiple times, as pooling can introduce violations
+      // at locations that were previously fine.
+      while (i < n) {
+        var j = i
+
+        // Find next monotonicity violating sequence, if any.
+        while (j < n && input(j)._1 >= input(j + 1)._1) {
+          j = j + 1
+        }
 
-      // If monotonicity was not violated, move to next data point.
-      if (i == j) {
-        i = i + 1
-      } else {
-        // Otherwise pool the violating sequence
-        // and check if pooling caused monotonicity violation in previously processed points.
-        while (i >= 0 && input(i)._1 > input(i + 1)._1) {
+        // Pool the violating sequence with the data point preceding it
+        if (input(i)._1 != input(j)._1) {
--- End diff --

I believe the logic is:
* If equality holds here, then either (a) i = j or (b) the block [i,j] has 
equal elements.
* A violation could be introduced but will be caught on the next iteration 
of the outer loop.





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2016-12-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r92298204
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala 
---
@@ -344,27 +344,30 @@ class IsotonicRegression private (private var 
isotonic: Boolean) extends Seriali
 }
 
 var i = 0
-val len = input.length
-while (i < len) {
-  var j = i
-
-  // Find monotonicity violating sequence, if any.
-  while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
-j = j + 1
-  }
+val n = input.length - 1
+var notFinished = true
+
+while (notFinished) {
+  i = 0
+  notFinished = false
+
+  // Iterate through the data, fix any monotonicity violations we find
+  // We may need to do this multiple times, as pooling can introduce 
violations
+  // at locations that were previously fine.
+  while (i < n) {
+var j = i
+
+// Find next monotonicity violating sequence, if any.
+while (j < n && input(j)._1 >= input(j + 1)._1) {
--- End diff --

That should be correct.  Lemma 1 from "A Fast Scaling Algorithm for 
Minimizing Separable Convex Functions Subject to Chain Constraints"





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2016-12-13 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r92232757
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -344,27 +344,30 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
     }
 
     var i = 0
-    val len = input.length
-    while (i < len) {
-      var j = i
-
-      // Find monotonicity violating sequence, if any.
-      while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
-        j = j + 1
-      }
+    val n = input.length - 1
+    var notFinished = true
+
+    while (notFinished) {
+      i = 0
+      notFinished = false
+
+      // Iterate through the data, fix any monotonicity violations we find
+      // We may need to do this multiple times, as pooling can introduce violations
+      // at locations that were previously fine.
+      while (i < n) {
+        var j = i
+
+        // Find next monotonicity violating sequence, if any.
+        while (j < n && input(j)._1 >= input(j + 1)._1) {
+          j = j + 1
+        }
 
-      // If monotonicity was not violated, move to next data point.
-      if (i == j) {
-        i = i + 1
-      } else {
-        // Otherwise pool the violating sequence
-        // and check if pooling caused monotonicity violation in previously processed points.
-        while (i >= 0 && input(i)._1 > input(i + 1)._1) {
+        // Pool the violating sequence with the data point preceding it
+        if (input(i)._1 != input(j)._1) {
--- End diff --

This condition is "if there was any decrease, not just == from i to j", right? And `pool` makes everything in [i, j] non-violating.

Comments from the previous code note that this could introduce a violation -- is this still possible? It wasn't obvious why it can't happen now.
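A small worked example (hypothetical labels, unit weights) of why a pooled run can still create a violation upstream, which the repeated outer loop then catches:

```scala
val labels = Array(3.0, 4.0, 1.0)
// The scan finds the violating run at indices [1, 2] and pools it to its
// average (4.0 + 1.0) / 2 = 2.5, yielding 3.0, 2.5, 2.5: index 0 now
// violates monotonicity. The next pass pools everything to
// (3.0 + 4.0 + 1.0) / 3 (about 2.67), and the sequence is monotone.
```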





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2016-12-13 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r92230452
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -344,27 +344,30 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
     }
 
     var i = 0
-    val len = input.length
-    while (i < len) {
-      var j = i
-
-      // Find monotonicity violating sequence, if any.
-      while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
-        j = j + 1
-      }
+    val n = input.length - 1
+    var notFinished = true
+
+    while (notFinished) {
+      i = 0
+      notFinished = false
+
+      // Iterate through the data, fix any monotonicity violations we find
+      // We may need to do this multiple times, as pooling can introduce violations
+      // at locations that were previously fine.
+      while (i < n) {
+        var j = i
+
+        // Find next monotonicity violating sequence, if any.
+        while (j < n && input(j)._1 >= input(j + 1)._1) {
+          j = j + 1

Nit: `j += 1` and likewise elsewhere.





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2016-12-13 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r92231712
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -344,27 +344,30 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
     }
 
     var i = 0
-    val len = input.length
-    while (i < len) {
-      var j = i
-
-      // Find monotonicity violating sequence, if any.
-      while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
-        j = j + 1
-      }
+    val n = input.length - 1
+    var notFinished = true
+
+    while (notFinished) {
+      i = 0
+      notFinished = false
+
+      // Iterate through the data, fix any monotonicity violations we find
+      // We may need to do this multiple times, as pooling can introduce violations
+      // at locations that were previously fine.
+      while (i < n) {
+        var j = i
+
+        // Find next monotonicity violating sequence, if any.
+        while (j < n && input(j)._1 >= input(j + 1)._1) {
--- End diff --

The condition changed to >= here on purpose right? you're now looking for a 
region that's non-increasing, not just strictly decreasing. Just checking.





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2016-09-08 Thread neggert
GitHub user neggert opened a pull request:

https://github.com/apache/spark/pull/15018

[SPARK-17455][MLlib] Improve PAVA implementation in IsotonicRegression

## What changes were proposed in this pull request?

New implementation of the Pool Adjacent Violators Algorithm (PAVA) in 
mllib.IsotonicRegression. The previous implementation could have factorial 
complexity in the worst case. This implementation, which closely follows those 
in scikit-learn and the R `iso` package, runs in quadratic time in the worst 
case.

## How was this patch tested?

Existing unit tests in both `mllib` and `ml` passed before and after this 
patch. Scaling properties were tested by running the `poolAdjacentViolators` 
method in 
[scala-benchmarking-template](https://github.com/sirthias/scala-benchmarking-template)
 with the input generated by

```scala
val x = (1 to length).toArray.map(_.toDouble)
val y = x.reverse.zipWithIndex.map { case (yi, i) => if (i % 2 == 1) yi - 1.5 else yi }
val w = Array.fill(length)(1d)

val input: Array[(Double, Double, Double)] = (y zip x zip w) map { case ((y, x), w) => (y, x, w) }
```

Before this patch:

| Input Length | Time (us)  |
|:-------------|-----------:|
| 100          |       1.35 |
| 200          |       3.14 |
| 400          |     116.10 |
| 800          | 2134225.90 |

After this patch:

| Input Length | Time (us) |
|:-------------|----------:|
| 100          |      1.25 |
| 200          |      2.53 |
| 400          |      5.86 |
| 800          |     10.55 |

Benchmarking was also performed with randomly-generated y values, with 
similar results.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/neggert/spark SPARK-17455-isoreg-algo

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15018.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15018


commit 5608ae1f4c9089dd3a299eef14af59bed6c96192
Author: z001qdp 
Date:   2016-09-08T21:43:24Z

New PAVA implementation



