from:"mengxr"

[GitHub] spark pull request: [SPARK-3160] [SPARK-3494] [mllib] DecisionTree...

2014-09-12 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2341#discussion_r17465271
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -87,17 +87,11 @@ class DecisionTree (private val strategy: Strategy) 
extends Serializable with Lo
 val maxDepth = strategy.maxDepth
 require(maxDepth = 30,
   sDecisionTree currently only supports maxDepth = 30, but was given 
maxDepth = $maxDepth.)
-// Number of nodes to allocate: max number of nodes possible given the 
depth of the tree, plus 1
-val maxNumNodesPlus1 = Node.startIndexInLevel(maxDepth + 1)
-// Initialize an array to hold parent impurity calculations for each 
node.
-val parentImpurities = new Array[Double](maxNumNodesPlus1)
-// dummy value for top node (updated during first split calculation)
-val nodes = new Array[Node](maxNumNodesPlus1)
 
 // Calculate level for single group construction
 
 // Max memory usage for aggregates
-val maxMemoryUsage = strategy.maxMemoryInMB * 1024 * 1024
+val maxMemoryUsage = strategy.maxMemoryInMB * 1024L * 1024L
--- End diff --

It is also useful to set an upper bound here (e.g., 1GB) to avoid memory/GC 
problems on the driver.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [SPARK-3494] [mllib] DecisionTree...

2014-09-12 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2341#discussion_r17466105
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -120,81 +114,35 @@ class DecisionTree (private val strategy: Strategy) 
extends Serializable with Lo
  * beforehand and is not used in later levels.
  */
 
+var topNode: Node = null // set on first iteration
 var level = 0
 var break = false
 while (level = maxDepth  !break) {
-
   logDebug(#)
   logDebug(level =  + level)
   logDebug(#)
 
   // Find best split for all nodes at a level.
   timer.start(findBestSplits)
-  val splitsStatsForLevel: Array[(Split, InformationGainStats, 
Predict)] =
-DecisionTree.findBestSplits(treeInput, parentImpurities,
-  metadata, level, nodes, splits, bins, maxLevelForSingleGroup, 
timer)
+  val (tmpTopNode: Node, doneTraining: Boolean) = 
DecisionTree.findBestSplits(treeInput,
+metadata, level, topNode, splits, bins, maxLevelForSingleGroup, 
timer)
   timer.stop(findBestSplits)
 
-  val levelNodeIndexOffset = Node.startIndexInLevel(level)
-  for ((nodeSplitStats, index) - 
splitsStatsForLevel.view.zipWithIndex) {
-val nodeIndex = levelNodeIndexOffset + index
-
-// Extract info for this node (index) at the current level.
-timer.start(extractNodeInfo)
-val split = nodeSplitStats._1
-val stats = nodeSplitStats._2
-val predict = nodeSplitStats._3.predict
-val isLeaf = (stats.gain = 0) || (level == strategy.maxDepth)
-val node = new Node(nodeIndex, predict, isLeaf, Some(split), None, 
None, Some(stats))
-logDebug(Node =  + node)
-nodes(nodeIndex) = node
-timer.stop(extractNodeInfo)
-
-if (level != 0) {
-  // Set parent.
-  val parentNodeIndex = Node.parentIndex(nodeIndex)
-  if (Node.isLeftChild(nodeIndex)) {
-nodes(parentNodeIndex).leftNode = Some(nodes(nodeIndex))
-  } else {
-nodes(parentNodeIndex).rightNode = Some(nodes(nodeIndex))
-  }
-}
-// Extract info for nodes at the next lower level.
-timer.start(extractInfoForLowerLevels)
-if (level  maxDepth) {
-  val leftChildIndex = Node.leftChildIndex(nodeIndex)
-  val leftImpurity = stats.leftImpurity
-  logDebug(leftChildIndex =  + leftChildIndex + , impurity =  
+ leftImpurity)
-  parentImpurities(leftChildIndex) = leftImpurity
-
-  val rightChildIndex = Node.rightChildIndex(nodeIndex)
-  val rightImpurity = stats.rightImpurity
-  logDebug(rightChildIndex =  + rightChildIndex + , impurity = 
 + rightImpurity)
-  parentImpurities(rightChildIndex) = rightImpurity
-}
-timer.stop(extractInfoForLowerLevels)
-logDebug(final best split =  + split)
+  if (level == 0) {
+topNode = tmpTopNode
   }
-  require(Node.maxNodesInLevel(level) == splitsStatsForLevel.length)
-  // Check whether all the nodes at the current level at leaves.
-  val allLeaf = splitsStatsForLevel.forall(_._2.gain = 0)
-  logDebug(all leaf =  + allLeaf)
-  if (allLeaf) {
-break = true // no more tree construction
-  } else {
-level += 1
+  if (doneTraining) {
+break = true
--- End diff --

Shall we remove `break` and only use `doneTraining`? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [SPARK-3494] [mllib] DecisionTree...

2014-09-12 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2341#discussion_r17466108
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -435,18 +385,18 @@ object DecisionTree extends Serializable with Logging 
{
   // numGroups is equal to 2 at level 11 and 4 at level 12, 
respectively.
   val numGroups = 1  level - maxLevelForSingleGroup
   logDebug(numGroups =  + numGroups)
-  var bestSplits = new Array[(Split, InformationGainStats, Predict)](0)
   // Iterate over each group of nodes at a level.
   var groupIndex = 0
+  var doneTraining = true
   while (groupIndex  numGroups) {
-val bestSplitsForGroup = findBestSplitsPerGroup(input, 
parentImpurities, metadata, level,
-  nodes, splits, bins, timer, numGroups, groupIndex)
-bestSplits = Array.concat(bestSplits, bestSplitsForGroup)
+val (tmpRoot, doneTrainingGroup) = findBestSplitsPerGroup(input, 
metadata, level,
--- End diff --

`tmpRoot` - `_` (since it is not used after)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...

2014-09-12 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2347#issuecomment-55374976
  
@davies It is hard to tell whether we already have fast access to the input 
RDD. Force caching may cause problems, e.g.,

1. kicking out some cached RDDs,
2. using too much memory if the input data is large but it could be 
generated from a small RDD.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-12 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2356#issuecomment-55375085
  
@davies Could you take a look at this PR and see whether there is an easier 
way for SerDe? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [SPARK-3494] [mllib] DecisionTree...

2014-09-12 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2341#issuecomment-55375636
  
LGTM except minor inline comments. I'm merging this in and could you make 
the changes with your next update? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3396][MLLIB] Use SquaredL2Updater in Lo...

2014-09-12 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2231#issuecomment-55376180
  
test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB]Collapsed Gibbs sampli...

2014-09-12 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1983#issuecomment-55381531
  
@witgo @allwefantasy 

English | èªå¨ç¿»è¯çä¸æ
|
Let's try to keep the comments in English as much as possible. | 
è®©æä»¬å°½éä¿ææè§çè±æå°½å¯è½ã
If you feel hard or impossible to express something in English, maybe we 
can try putting Chinese and auto-translated English side-by-side. | å¦æä½ 
è§å¾å¾é¾ææ 
æ³è¡¨è¾¾çä¸è¥¿ç¨è±æï¼ä¹è®¸æä»¬å¯ä»¥å°è¯æä¸å½åèªå¨ç¿»è¯è±æå¹¶æä¾§ã
This is definitely better than writing nothing and also help non-Chinese 
developers understand what is going on at a high level. |  
è¿ç»å¯¹æ¯æ¯åä»ä¹å¥½ï¼ä¹æå©äºéä¸å½å¼ååæç½ä»ä¹æ¯å¨ä¸ä¸ªè¾é«çæ°´å¹³æä¹åäºã
I'm trying to demo this by putting Google-translated Chinese on the right 
side. | æè¯å¾éè¿å°è°·æç¿»è¯çä¸å½å³ä¾§æ¥æ¼ç¤ºè¿ä¸ç¹ã
It is definitely not high-quality but better than nothing. | 
è¿ç»å¯¹ä¸æ¯é«åè´¨çï¼ä½æ»æ¯æ²¡æå¥½ã


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB]Collapsed Gibbs sampli...

2014-09-12 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1983#issuecomment-55381621
  
@witgo @allwefantasy We had an offline discussion about LDA's 
implementation. Please check the JIRA page for the notes.

--

æä»¬æå¤§çº¦LDAçå®ç°è±æºè®¨è®ºãè¯·æ£æ¥JIRAé¡µçæ³¨éã


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-12 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2294#issuecomment-55383567
  
test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3396][MLLIB] Use SquaredL2Updater in Lo...

2014-09-12 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2231#issuecomment-55428631
  
@BigCrunsh Just saw that the target is `branch-1.0`. Could you change the 
target to `master`? Usually we first apply the patch to master and then 
backport it to old branches.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [mllib] DecisionTree: Add minInstancesPerNode,...

2014-09-12 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2349#issuecomment-55429412
  
@jkbradley This contains API changes to python. Could you create a JIRA for 
it? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17574390
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -17,16 +17,18 @@
 
 package org.apache.spark.mllib.api.python
 
-import java.nio.{ByteBuffer, ByteOrder}
+import java.io.OutputStream
 
 import scala.collection.JavaConverters._
 
+import net.razorvine.pickle.{Pickler, Unpickler, IObjectConstructor, 
IObjectPickler, PickleException, Opcodes}
--- End diff --

use `_`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17574399
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -472,214 +452,140 @@ class PythonMLLibAPI extends Serializable {
   numRows: Long,
   numCols: Int,
   numPartitions: java.lang.Integer,
-  seed: java.lang.Long): JavaRDD[Array[Byte]] = {
+  seed: java.lang.Long): JavaRDD[Vector] = {
 val parts = getNumPartitionsOrDefault(numPartitions, jsc)
 val s = getSeedOrDefault(seed)
-RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, 
s).map(SerDe.serializeDoubleVector)
+RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s)
   }
 
 }
 
 /**
- * :: DeveloperApi ::
- * MultivariateStatisticalSummary with Vector fields serialized.
+ * SerDe utility functions for PythonMLLibAPI.
  */
-@DeveloperApi
-class MultivariateStatisticalSummarySerialized(val summary: 
MultivariateStatisticalSummary)
-  extends Serializable {
+private[spark] object SerDe extends Serializable {
 
-  def mean: Array[Byte] = SerDe.serializeDoubleVector(summary.mean)
+  private[python] def reduce_object(out: OutputStream, pickler: Pickler,
+module: String, name: String, objects: 
Object*) = {
+out.write(Opcodes.GLOBAL)
+out.write((module + \n + name + \n).getBytes)
--- End diff --

Does it increase the storage cost by a lot for small objects?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17574385
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala 
---
@@ -778,8 +778,8 @@ private[spark] object PythonRDD extends Logging {
   def javaToPython(jRDD: JavaRDD[Any]): JavaRDD[Array[Byte]] = {
 jRDD.rdd.mapPartitions { iter =
   val pickle = new Pickler
-  iter.map { row =
-pickle.dumps(row)
+  iter.grouped(1024).map { rows =
--- End diff --

Shall we divide groups based on the serialized size?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17574396
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -472,214 +452,140 @@ class PythonMLLibAPI extends Serializable {
   numRows: Long,
   numCols: Int,
   numPartitions: java.lang.Integer,
-  seed: java.lang.Long): JavaRDD[Array[Byte]] = {
+  seed: java.lang.Long): JavaRDD[Vector] = {
 val parts = getNumPartitionsOrDefault(numPartitions, jsc)
 val s = getSeedOrDefault(seed)
-RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, 
s).map(SerDe.serializeDoubleVector)
+RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s)
   }
 
 }
 
 /**
- * :: DeveloperApi ::
- * MultivariateStatisticalSummary with Vector fields serialized.
+ * SerDe utility functions for PythonMLLibAPI.
  */
-@DeveloperApi
-class MultivariateStatisticalSummarySerialized(val summary: 
MultivariateStatisticalSummary)
-  extends Serializable {
+private[spark] object SerDe extends Serializable {
 
-  def mean: Array[Byte] = SerDe.serializeDoubleVector(summary.mean)
+  private[python] def reduce_object(out: OutputStream, pickler: Pickler,
--- End diff --

use camelCase for method names


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17574404
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -472,214 +452,140 @@ class PythonMLLibAPI extends Serializable {
   numRows: Long,
   numCols: Int,
   numPartitions: java.lang.Integer,
-  seed: java.lang.Long): JavaRDD[Array[Byte]] = {
+  seed: java.lang.Long): JavaRDD[Vector] = {
 val parts = getNumPartitionsOrDefault(numPartitions, jsc)
 val s = getSeedOrDefault(seed)
-RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, 
s).map(SerDe.serializeDoubleVector)
+RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s)
   }
 
 }
 
 /**
- * :: DeveloperApi ::
- * MultivariateStatisticalSummary with Vector fields serialized.
+ * SerDe utility functions for PythonMLLibAPI.
  */
-@DeveloperApi
-class MultivariateStatisticalSummarySerialized(val summary: 
MultivariateStatisticalSummary)
-  extends Serializable {
+private[spark] object SerDe extends Serializable {
 
-  def mean: Array[Byte] = SerDe.serializeDoubleVector(summary.mean)
+  private[python] def reduce_object(out: OutputStream, pickler: Pickler,
+module: String, name: String, objects: 
Object*) = {
+out.write(Opcodes.GLOBAL)
+out.write((module + \n + name + \n).getBytes)
+out.write(Opcodes.MARK)
+objects.foreach(pickler.save(_))
+out.write(Opcodes.TUPLE)
+out.write(Opcodes.REDUCE)
+  }
 
-  def variance: Array[Byte] = SerDe.serializeDoubleVector(summary.variance)
+  private[python] class DenseVectorPickler extends IObjectPickler {
+def pickle(obj: Object, out: OutputStream, pickler: Pickler) = {
+  val vector: DenseVector = obj.asInstanceOf[DenseVector]
+  reduce_object(out, pickler, pyspark.mllib.linalg, DenseVector, 
vector.toArray)
--- End diff --

ditto: what is the cost of using class names?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17574578
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -60,18 +60,18 @@ class PythonMLLibAPI extends Serializable {
   def loadLabeledPoints(
   jsc: JavaSparkContext,
   path: String,
-  minPartitions: Int): JavaRDD[Array[Byte]] =
-MLUtils.loadLabeledPoints(jsc.sc, path, 
minPartitions).map(SerDe.serializeLabeledPoint)
+  minPartitions: Int): JavaRDD[LabeledPoint] =
+MLUtils.loadLabeledPoints(jsc.sc, path, minPartitions)
 
   private def trainRegressionModel(
   trainFunc: (RDD[LabeledPoint], Vector) = GeneralizedLinearModel,
-  dataBytesJRDD: JavaRDD[Array[Byte]],
+  dataJRDD: JavaRDD[Any],
   initialWeightsBA: Array[Byte]): 
java.util.LinkedList[java.lang.Object] = {
-val data = dataBytesJRDD.rdd.map(SerDe.deserializeLabeledPoint)
-val initialWeights = SerDe.deserializeDoubleVector(initialWeightsBA)
+val data = dataJRDD.rdd.map(_.asInstanceOf[LabeledPoint])
--- End diff --

maybe we can try `dataJRDD.rdd.asInstanceOf[RDD[LabeledPoint]]` instead of 
`map`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17574784
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -472,214 +452,140 @@ class PythonMLLibAPI extends Serializable {
   numRows: Long,
   numCols: Int,
   numPartitions: java.lang.Integer,
-  seed: java.lang.Long): JavaRDD[Array[Byte]] = {
+  seed: java.lang.Long): JavaRDD[Vector] = {
 val parts = getNumPartitionsOrDefault(numPartitions, jsc)
 val s = getSeedOrDefault(seed)
-RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, 
s).map(SerDe.serializeDoubleVector)
+RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s)
   }
 
 }
 
 /**
- * :: DeveloperApi ::
- * MultivariateStatisticalSummary with Vector fields serialized.
+ * SerDe utility functions for PythonMLLibAPI.
  */
-@DeveloperApi
-class MultivariateStatisticalSummarySerialized(val summary: 
MultivariateStatisticalSummary)
-  extends Serializable {
+private[spark] object SerDe extends Serializable {
 
-  def mean: Array[Byte] = SerDe.serializeDoubleVector(summary.mean)
+  private[python] def reduce_object(out: OutputStream, pickler: Pickler,
+module: String, name: String, objects: 
Object*) = {
+out.write(Opcodes.GLOBAL)
+out.write((module + \n + name + \n).getBytes)
+out.write(Opcodes.MARK)
+objects.foreach(pickler.save(_))
+out.write(Opcodes.TUPLE)
+out.write(Opcodes.REDUCE)
+  }
 
-  def variance: Array[Byte] = SerDe.serializeDoubleVector(summary.variance)
+  private[python] class DenseVectorPickler extends IObjectPickler {
+def pickle(obj: Object, out: OutputStream, pickler: Pickler) = {
+  val vector: DenseVector = obj.asInstanceOf[DenseVector]
+  reduce_object(out, pickler, pyspark.mllib.linalg, DenseVector, 
vector.toArray)
+}
+  }
 
-  def count: Long = summary.count
+  private[python] class DenseVectorConstructor extends IObjectConstructor {
+def construct(args: Array[Object]) :Object = {
+  require(args.length == 1)
+  new DenseVector(args(0).asInstanceOf[Array[Double]])
+}
+  }
+
+  private[python] class DenseMatrixPickler extends IObjectPickler {
+def pickle(obj: Object, out: OutputStream, pickler: Pickler) = {
+  val m: DenseMatrix = obj.asInstanceOf[DenseMatrix]
+  reduce_object(out, pickler, pyspark.mllib.linalg, DenseMatrix,
+m.numRows.asInstanceOf[Object], m.numCols.asInstanceOf[Object], 
m.values)
+}
+  }
 
-  def numNonzeros: Array[Byte] = 
SerDe.serializeDoubleVector(summary.numNonzeros)
+  private[python] class DenseMatrixConstructor extends IObjectConstructor {
+def construct(args: Array[Object]) :Object = {
+  require(args.length == 3)
+  new DenseMatrix(args(0).asInstanceOf[Int], args(1).asInstanceOf[Int],
+args(2).asInstanceOf[Array[Double]])
+}
+  }
 
-  def max: Array[Byte] = SerDe.serializeDoubleVector(summary.max)
+  private[python] class SparseVectorPickler extends IObjectPickler {
+def pickle(obj: Object, out: OutputStream, pickler: Pickler) = {
+  val v: SparseVector = obj.asInstanceOf[SparseVector]
+  reduce_object(out, pickler, pyspark.mllib.linalg, SparseVector,
+v.size.asInstanceOf[Object], v.indices, v.values)
+}
+  }
 
-  def min: Array[Byte] = SerDe.serializeDoubleVector(summary.min)
-}
+  private[python] class SparseVectorConstructor extends IObjectConstructor 
{
+def construct(args: Array[Object]) :Object = {
+  require(args.length == 3)
+  new SparseVector(args(0).asInstanceOf[Int], 
args(1).asInstanceOf[Array[Int]],
+args(2).asInstanceOf[Array[Double]])
+}
+  }
 
-/**
- * SerDe utility functions for PythonMLLibAPI.
- */
-private[spark] object SerDe extends Serializable {
-  private val DENSE_VECTOR_MAGIC: Byte = 1
-  private val SPARSE_VECTOR_MAGIC: Byte = 2
-  private val DENSE_MATRIX_MAGIC: Byte = 3
-  private val LABELED_POINT_MAGIC: Byte = 4
-
-  private[python] def deserializeDoubleVector(bytes: Array[Byte], offset: 
Int = 0): Vector = {
-require(bytes.length - offset = 5, Byte array too short)
-val magic = bytes(offset)
-if (magic == DENSE_VECTOR_MAGIC) {
-  deserializeDenseVector(bytes, offset)
-} else if (magic == SPARSE_VECTOR_MAGIC) {
-  deserializeSparseVector(bytes, offset)
-} else {
-  throw new IllegalArgumentException(Magic  + magic +  is wrong.)
+  private[python] class LabeledPointPickler extends IObjectPickler {
+def pickle

[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17574827
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -472,214 +452,140 @@ class PythonMLLibAPI extends Serializable {
   numRows: Long,
   numCols: Int,
   numPartitions: java.lang.Integer,
-  seed: java.lang.Long): JavaRDD[Array[Byte]] = {
+  seed: java.lang.Long): JavaRDD[Vector] = {
 val parts = getNumPartitionsOrDefault(numPartitions, jsc)
 val s = getSeedOrDefault(seed)
-RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, 
s).map(SerDe.serializeDoubleVector)
+RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s)
   }
 
 }
 
 /**
- * :: DeveloperApi ::
- * MultivariateStatisticalSummary with Vector fields serialized.
+ * SerDe utility functions for PythonMLLibAPI.
  */
-@DeveloperApi
-class MultivariateStatisticalSummarySerialized(val summary: 
MultivariateStatisticalSummary)
-  extends Serializable {
+private[spark] object SerDe extends Serializable {
 
-  def mean: Array[Byte] = SerDe.serializeDoubleVector(summary.mean)
+  private[python] def reduce_object(out: OutputStream, pickler: Pickler,
+module: String, name: String, objects: 
Object*) = {
+out.write(Opcodes.GLOBAL)
+out.write((module + \n + name + \n).getBytes)
+out.write(Opcodes.MARK)
+objects.foreach(pickler.save(_))
+out.write(Opcodes.TUPLE)
+out.write(Opcodes.REDUCE)
+  }
 
-  def variance: Array[Byte] = SerDe.serializeDoubleVector(summary.variance)
+  private[python] class DenseVectorPickler extends IObjectPickler {
+def pickle(obj: Object, out: OutputStream, pickler: Pickler) = {
+  val vector: DenseVector = obj.asInstanceOf[DenseVector]
+  reduce_object(out, pickler, pyspark.mllib.linalg, DenseVector, 
vector.toArray)
+}
+  }
 
-  def count: Long = summary.count
+  private[python] class DenseVectorConstructor extends IObjectConstructor {
+def construct(args: Array[Object]) :Object = {
+  require(args.length == 1)
+  new DenseVector(args(0).asInstanceOf[Array[Double]])
+}
+  }
+
+  private[python] class DenseMatrixPickler extends IObjectPickler {
+def pickle(obj: Object, out: OutputStream, pickler: Pickler) = {
+  val m: DenseMatrix = obj.asInstanceOf[DenseMatrix]
+  reduce_object(out, pickler, pyspark.mllib.linalg, DenseMatrix,
+m.numRows.asInstanceOf[Object], m.numCols.asInstanceOf[Object], 
m.values)
+}
+  }
 
-  def numNonzeros: Array[Byte] = 
SerDe.serializeDoubleVector(summary.numNonzeros)
+  private[python] class DenseMatrixConstructor extends IObjectConstructor {
+def construct(args: Array[Object]) :Object = {
+  require(args.length == 3)
+  new DenseMatrix(args(0).asInstanceOf[Int], args(1).asInstanceOf[Int],
+args(2).asInstanceOf[Array[Double]])
+}
+  }
 
-  def max: Array[Byte] = SerDe.serializeDoubleVector(summary.max)
+  private[python] class SparseVectorPickler extends IObjectPickler {
+def pickle(obj: Object, out: OutputStream, pickler: Pickler) = {
+  val v: SparseVector = obj.asInstanceOf[SparseVector]
+  reduce_object(out, pickler, pyspark.mllib.linalg, SparseVector,
+v.size.asInstanceOf[Object], v.indices, v.values)
+}
+  }
 
-  def min: Array[Byte] = SerDe.serializeDoubleVector(summary.min)
-}
+  private[python] class SparseVectorConstructor extends IObjectConstructor 
{
+def construct(args: Array[Object]) :Object = {
+  require(args.length == 3)
+  new SparseVector(args(0).asInstanceOf[Int], 
args(1).asInstanceOf[Array[Int]],
+args(2).asInstanceOf[Array[Double]])
+}
+  }
 
-/**
- * SerDe utility functions for PythonMLLibAPI.
- */
-private[spark] object SerDe extends Serializable {
-  private val DENSE_VECTOR_MAGIC: Byte = 1
-  private val SPARSE_VECTOR_MAGIC: Byte = 2
-  private val DENSE_MATRIX_MAGIC: Byte = 3
-  private val LABELED_POINT_MAGIC: Byte = 4
-
-  private[python] def deserializeDoubleVector(bytes: Array[Byte], offset: 
Int = 0): Vector = {
-require(bytes.length - offset = 5, Byte array too short)
-val magic = bytes(offset)
-if (magic == DENSE_VECTOR_MAGIC) {
-  deserializeDenseVector(bytes, offset)
-} else if (magic == SPARSE_VECTOR_MAGIC) {
-  deserializeSparseVector(bytes, offset)
-} else {
-  throw new IllegalArgumentException(Magic  + magic +  is wrong.)
+  private[python] class LabeledPointPickler extends IObjectPickler {
+def pickle

[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

2014-09-15 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2362#issuecomment-55680211
  
@staple How many iterations did you run? Did you generate data or load from 
disk/hdfs? Did you cache the Python RDD? When the dataset is not fully cached, 
I still expect similar performance. But your result shows a big gap. Maybe it 
is rotating cached blocks.

 10m records:
 master: 2130.3178509871
 w/ patch: 3232.856136322


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3396][MLLIB] Use SquaredL2Updater in Lo...

2014-09-15 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2398#issuecomment-55680373
  
LGTM. Merged into master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [MLlib] Update SVD documentation in IndexedRow...

2014-09-15 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2389#issuecomment-55680470
  
LGTM. Merged into master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...

2014-09-15 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1778#issuecomment-55680636
  
test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3516] [mllib] DecisionTree: Add minInst...

2014-09-15 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2349#issuecomment-55680612
  
LGTM. Merged into master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-2329 Add multi-label evaluation ...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1270#discussion_r17578309
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/MultilabelMetrics.scala 
---
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.SparkContext._
+
+/**
+ * Evaluator for multilabel classification.
+ * @param predictionAndLabels an RDD of (predictions, labels) pairs, both 
are non-null sets.
+ */
+class MultilabelMetrics(predictionAndLabels: RDD[(Set[Double], 
Set[Double])]) {
--- End diff --

@avulanov Let's think what is more natural for the input data to a 
multi-label classifier and the output from the model it produces. They should 
match the input type here, so we can chain them easily. If we use either Java 
or Scala `Set`, we are going to have compatibility issues on the other. Also, 
set stores small objects, which increase GC pressure. There are the reasons I 
recommend using `Array[Double]`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1778#discussion_r17579647
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala 
---
@@ -18,6 +18,7 @@
 package org.apache.spark.mllib.linalg.distributed
 
 import java.util.Arrays
+import scala.collection.mutable.ListBuffer
--- End diff --

separate scala imports from java ones


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1778#discussion_r17579667
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala 
---
@@ -390,6 +393,113 @@ class RowMatrix(
 new RowMatrix(AB, nRows, B.numCols)
   }
 
+  /**
+   * Compute all cosine similarities between columns of this matrix using 
the brute-force
+   * approach of computing normalized dot products.
+   *
+   * @return An n x n sparse upper-triangular matrix of cosine 
similarities between
+   * columns of this matrix.
+   */
+  def columnSimilarities(): CoordinateMatrix = {
+similarColumns(0.0)
+  }
+
+  /**
+   * Compute all similarities between columns of this matrix using a 
sampling approach.
+   *
+   * The threshold parameter is a trade-off knob between estimate quality 
and computational cost.
+   *
+   * Setting a threshold of 0 guarantees deterministic correct results, 
but comes at exactly
+   * the same cost as the brute-force approach. Setting the threshold to 
positive values
+   * incurs strictly less computational cost than the brute-force 
approach, however the
+   * similarities computed will be estimates.
+   *
+   * The sampling guarantees relative-error correctness for those pairs of 
columns that have
+   * similarity greater than the given similarity threshold.
+   *
+   * To describe the guarantee, we set some notation:
+   * Let A be the smallest in magnitude non-zero element of this matrix.
+   * Let B be the largest  in magnitude non-zero element of this matrix.
+   * Let L be the number of non-zeros per row.
--- End diff --

Is it average or max number of nonzeros?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1778#discussion_r17579671
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala 
---
@@ -390,6 +393,113 @@ class RowMatrix(
 new RowMatrix(AB, nRows, B.numCols)
   }
 
+  /**
+   * Compute all cosine similarities between columns of this matrix using 
the brute-force
+   * approach of computing normalized dot products.
+   *
+   * @return An n x n sparse upper-triangular matrix of cosine 
similarities between
+   * columns of this matrix.
+   */
+  def columnSimilarities(): CoordinateMatrix = {
+similarColumns(0.0)
+  }
+
+  /**
+   * Compute all similarities between columns of this matrix using a 
sampling approach.
+   *
+   * The threshold parameter is a trade-off knob between estimate quality 
and computational cost.
+   *
+   * Setting a threshold of 0 guarantees deterministic correct results, 
but comes at exactly
+   * the same cost as the brute-force approach. Setting the threshold to 
positive values
+   * incurs strictly less computational cost than the brute-force 
approach, however the
+   * similarities computed will be estimates.
+   *
+   * The sampling guarantees relative-error correctness for those pairs of 
columns that have
+   * similarity greater than the given similarity threshold.
+   *
+   * To describe the guarantee, we set some notation:
+   * Let A be the smallest in magnitude non-zero element of this matrix.
+   * Let B be the largest  in magnitude non-zero element of this matrix.
+   * Let L be the number of non-zeros per row.
+   *
+   * For example, for {0,1} matrices: A=B=1.
+   * Another example, for the Netflix matrix: A=1, B=5
+   *
+   * For those column pairs that are above the threshold,
+   * the computed similarity is correct to within 20% relative error with 
probability
+   * at least 1 - (0.981)^(100/B)
+   *
+   * The shuffle size is bounded by the *smaller* of the following two 
expressions:
+   *
+   * O(n log(n) L / (threshold * A))
+   * O(m L^2)
+   *
+   * The latter is the cost of the brute-force approach, so for non-zero 
thresholds,
+   * the cost is always cheaper than the brute-force approach.
+   *
+   * @param threshold Similarities above this threshold are probably 
computed correctly.
+   *  Set to 0 for deterministic guaranteed correctness.
+   * @return An n x n sparse upper-triangular matrix of cosine similarities
+   * between columns of this matrix.
+   */
+  def similarColumns(threshold: Double): CoordinateMatrix = {
+require(threshold = 0  threshold = 1, sThreshold not in [0,1]: 
$threshold)
+
+val gamma = if (math.abs(threshold)  1e-6) {
--- End diff --

remove `math.abs`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1778#discussion_r17579677
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala 
---
@@ -390,6 +393,113 @@ class RowMatrix(
 new RowMatrix(AB, nRows, B.numCols)
   }
 
+  /**
+   * Compute all cosine similarities between columns of this matrix using 
the brute-force
+   * approach of computing normalized dot products.
+   *
+   * @return An n x n sparse upper-triangular matrix of cosine 
similarities between
+   * columns of this matrix.
+   */
+  def columnSimilarities(): CoordinateMatrix = {
+similarColumns(0.0)
+  }
+
+  /**
+   * Compute all similarities between columns of this matrix using a 
sampling approach.
+   *
+   * The threshold parameter is a trade-off knob between estimate quality 
and computational cost.
+   *
+   * Setting a threshold of 0 guarantees deterministic correct results, 
but comes at exactly
+   * the same cost as the brute-force approach. Setting the threshold to 
positive values
+   * incurs strictly less computational cost than the brute-force 
approach, however the
+   * similarities computed will be estimates.
+   *
+   * The sampling guarantees relative-error correctness for those pairs of 
columns that have
+   * similarity greater than the given similarity threshold.
+   *
+   * To describe the guarantee, we set some notation:
+   * Let A be the smallest in magnitude non-zero element of this matrix.
+   * Let B be the largest  in magnitude non-zero element of this matrix.
+   * Let L be the number of non-zeros per row.
+   *
+   * For example, for {0,1} matrices: A=B=1.
+   * Another example, for the Netflix matrix: A=1, B=5
+   *
+   * For those column pairs that are above the threshold,
+   * the computed similarity is correct to within 20% relative error with 
probability
+   * at least 1 - (0.981)^(100/B)
+   *
+   * The shuffle size is bounded by the *smaller* of the following two 
expressions:
+   *
+   * O(n log(n) L / (threshold * A))
+   * O(m L^2)
+   *
+   * The latter is the cost of the brute-force approach, so for non-zero 
thresholds,
+   * the cost is always cheaper than the brute-force approach.
+   *
+   * @param threshold Similarities above this threshold are probably 
computed correctly.
+   *  Set to 0 for deterministic guaranteed correctness.
+   * @return An n x n sparse upper-triangular matrix of cosine similarities
+   * between columns of this matrix.
+   */
+  def similarColumns(threshold: Double): CoordinateMatrix = {
+require(threshold = 0  threshold = 1, sThreshold not in [0,1]: 
$threshold)
+
+val gamma = if (math.abs(threshold)  1e-6) {
+  Double.PositiveInfinity
+} else {
+  100 * math.log(numCols()) / threshold
+}
+
+similarColumnsDIMSUM(computeColumnSummaryStatistics().normL2.toArray, 
gamma)
+  }
+
+  /**
+   * Find all similar columns using the DIMSUM sampling algorithm, 
described in two papers
+   *
+   * http://arxiv.org/abs/1206.2082
+   * http://arxiv.org/abs/1304.1467
+   *
+   * @param colMags A vector of column magnitudes
+   * @param gamma The oversampling parameter. For provable results, set to 
100 * log(n) / s,
+   *  where s is the smallest similarity score to be estimated,
+   *  and n is the number of columns
+   * @return An n x n sparse upper-triangular matrix of cosine similarities
+   * between columns of this matrix.
+   */
+  private[mllib] def similarColumnsDIMSUM(colMags: Array[Double],
+  gamma: Double): CoordinateMatrix 
= {
+require(gamma  1.0, sOversampling should be greater than 1: $gamma)
+require(colMags.size == this.numCols(), Number of magnitudes didn't 
match column dimension)
+
+val sg = math.sqrt(gamma) // sqrt(gamma) used many times
+
+val sims = rows.flatMap { row =
+  val buf = new ListBuffer[((Int, Int), Double)]()
+  row.toBreeze.activeIterator.foreach {
+case (_, 0.0) = // Skip explicit zero elements.
+case (i, iVal) =
+  val rand = new scala.util.Random(iVal.toLong)
--- End diff --

It won't give you a pseudo random sequence. Think about the case when all 
values are the same. The seed should be set on a partition level. Could you 
update this block with `mapPartitionsWithIndex`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled

[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1778#discussion_r17579676
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala 
---
@@ -390,6 +393,113 @@ class RowMatrix(
 new RowMatrix(AB, nRows, B.numCols)
   }
 
+  /**
+   * Compute all cosine similarities between columns of this matrix using 
the brute-force
+   * approach of computing normalized dot products.
+   *
+   * @return An n x n sparse upper-triangular matrix of cosine 
similarities between
+   * columns of this matrix.
+   */
+  def columnSimilarities(): CoordinateMatrix = {
+similarColumns(0.0)
+  }
+
+  /**
+   * Compute all similarities between columns of this matrix using a 
sampling approach.
+   *
+   * The threshold parameter is a trade-off knob between estimate quality 
and computational cost.
+   *
+   * Setting a threshold of 0 guarantees deterministic correct results, 
but comes at exactly
+   * the same cost as the brute-force approach. Setting the threshold to 
positive values
+   * incurs strictly less computational cost than the brute-force 
approach, however the
+   * similarities computed will be estimates.
+   *
+   * The sampling guarantees relative-error correctness for those pairs of 
columns that have
+   * similarity greater than the given similarity threshold.
+   *
+   * To describe the guarantee, we set some notation:
+   * Let A be the smallest in magnitude non-zero element of this matrix.
+   * Let B be the largest  in magnitude non-zero element of this matrix.
+   * Let L be the number of non-zeros per row.
+   *
+   * For example, for {0,1} matrices: A=B=1.
+   * Another example, for the Netflix matrix: A=1, B=5
+   *
+   * For those column pairs that are above the threshold,
+   * the computed similarity is correct to within 20% relative error with 
probability
+   * at least 1 - (0.981)^(100/B)
+   *
+   * The shuffle size is bounded by the *smaller* of the following two 
expressions:
+   *
+   * O(n log(n) L / (threshold * A))
+   * O(m L^2)
+   *
+   * The latter is the cost of the brute-force approach, so for non-zero 
thresholds,
+   * the cost is always cheaper than the brute-force approach.
+   *
+   * @param threshold Similarities above this threshold are probably 
computed correctly.
+   *  Set to 0 for deterministic guaranteed correctness.
+   * @return An n x n sparse upper-triangular matrix of cosine similarities
+   * between columns of this matrix.
+   */
+  def similarColumns(threshold: Double): CoordinateMatrix = {
+require(threshold = 0  threshold = 1, sThreshold not in [0,1]: 
$threshold)
+
+val gamma = if (math.abs(threshold)  1e-6) {
+  Double.PositiveInfinity
+} else {
+  100 * math.log(numCols()) / threshold
+}
+
+similarColumnsDIMSUM(computeColumnSummaryStatistics().normL2.toArray, 
gamma)
+  }
+
+  /**
+   * Find all similar columns using the DIMSUM sampling algorithm, 
described in two papers
+   *
+   * http://arxiv.org/abs/1206.2082
+   * http://arxiv.org/abs/1304.1467
+   *
+   * @param colMags A vector of column magnitudes
+   * @param gamma The oversampling parameter. For provable results, set to 
100 * log(n) / s,
+   *  where s is the smallest similarity score to be estimated,
+   *  and n is the number of columns
+   * @return An n x n sparse upper-triangular matrix of cosine similarities
+   * between columns of this matrix.
+   */
+  private[mllib] def similarColumnsDIMSUM(colMags: Array[Double],
+  gamma: Double): CoordinateMatrix 
= {
--- End diff --

4-space indentation


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1778#discussion_r17579669
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala 
---
@@ -390,6 +393,113 @@ class RowMatrix(
 new RowMatrix(AB, nRows, B.numCols)
   }
 
+  /**
+   * Compute all cosine similarities between columns of this matrix using 
the brute-force
+   * approach of computing normalized dot products.
+   *
+   * @return An n x n sparse upper-triangular matrix of cosine 
similarities between
+   * columns of this matrix.
+   */
+  def columnSimilarities(): CoordinateMatrix = {
+similarColumns(0.0)
+  }
+
+  /**
+   * Compute all similarities between columns of this matrix using a 
sampling approach.
+   *
+   * The threshold parameter is a trade-off knob between estimate quality 
and computational cost.
+   *
+   * Setting a threshold of 0 guarantees deterministic correct results, 
but comes at exactly
+   * the same cost as the brute-force approach. Setting the threshold to 
positive values
+   * incurs strictly less computational cost than the brute-force 
approach, however the
+   * similarities computed will be estimates.
+   *
+   * The sampling guarantees relative-error correctness for those pairs of 
columns that have
+   * similarity greater than the given similarity threshold.
+   *
+   * To describe the guarantee, we set some notation:
+   * Let A be the smallest in magnitude non-zero element of this matrix.
+   * Let B be the largest  in magnitude non-zero element of this matrix.
+   * Let L be the number of non-zeros per row.
+   *
+   * For example, for {0,1} matrices: A=B=1.
+   * Another example, for the Netflix matrix: A=1, B=5
+   *
+   * For those column pairs that are above the threshold,
+   * the computed similarity is correct to within 20% relative error with 
probability
+   * at least 1 - (0.981)^(100/B)
+   *
+   * The shuffle size is bounded by the *smaller* of the following two 
expressions:
+   *
+   * O(n log(n) L / (threshold * A))
+   * O(m L^2)
+   *
+   * The latter is the cost of the brute-force approach, so for non-zero 
thresholds,
+   * the cost is always cheaper than the brute-force approach.
+   *
+   * @param threshold Similarities above this threshold are probably 
computed correctly.
--- End diff --

Elaborate more on `probably computed correctly`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1778#discussion_r17579662
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala 
---
@@ -390,6 +393,113 @@ class RowMatrix(
 new RowMatrix(AB, nRows, B.numCols)
   }
 
+  /**
+   * Compute all cosine similarities between columns of this matrix using 
the brute-force
+   * approach of computing normalized dot products.
+   *
+   * @return An n x n sparse upper-triangular matrix of cosine 
similarities between
+   * columns of this matrix.
+   */
+  def columnSimilarities(): CoordinateMatrix = {
+similarColumns(0.0)
+  }
+
+  /**
+   * Compute all similarities between columns of this matrix using a 
sampling approach.
--- End diff --

`all`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1778#discussion_r17579648
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala 
---
@@ -27,10 +28,12 @@ import com.github.fommil.netlib.BLAS.{getInstance = 
blas}
 import org.apache.spark.annotation.Experimental
 import org.apache.spark.mllib.linalg._
 import org.apache.spark.rdd.RDD
+import org.apache.spark.SparkContext._
 import org.apache.spark.Logging
 import org.apache.spark.mllib.rdd.RDDFunctions._
 import org.apache.spark.mllib.stat.{MultivariateOnlineSummarizer, 
MultivariateStatisticalSummary}
 
+
--- End diff --

remove extra empty line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1778#discussion_r17579690
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/RowMatrixSuite.scala
 ---
@@ -95,6 +95,33 @@ class RowMatrixSuite extends FunSuite with 
LocalSparkContext {
 }
   }
 
+  test(similar columns) {
+val colMags = Vectors.dense(Math.sqrt(126), Math.sqrt(66), 
Math.sqrt(94))
+val expected = BDM(
+  (126.0, 54.0, 72.0),
+  (54.0, 66.0, 78.0),
+  (72.0, 78.0, 94.0))
+
+for(i - 0 until n) for(j - 0 until n) {
+  expected(i, j) /= (colMags(i) * colMags(j))
+}
+
+for (mat - Seq(denseMat, sparseMat)) {
+  val G = mat.similarColumns(150.0)
+  assert(closeToZero(G.toBreeze() - expected))
+}
+
+for (mat - Seq(denseMat, sparseMat)) {
+  val G = mat.similarColumns()
+  assert(closeToZero(G.toBreeze() - expected))
+}
+
+for (mat - Seq(denseMat, sparseMat)) {
+  val G = mat.similarColumnsDIMSUM(colMags.toArray, 150.0)
+  assert(closeToZero(G.toBreeze() - expected))
+}
+  }
+
--- End diff --

Does the test output only a subset of column pairs?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1778#discussion_r17579687
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.scala
 ---
@@ -53,4 +53,14 @@ trait MultivariateStatisticalSummary {
* Minimum value of each column.
*/
   def min: Vector
+
+  /**
+   * Euclidean magnitude of each column
+   */
+  def normL2: Vector
+
+  /**
+   * L1 norm of each column
+   */
+  def normL1: Vector
--- End diff --

`mean`, `stddev`, and `variance` are summary statistics of the random 
variable. The 1-norm of a random variable should be `E[|X|]` instead of `\sum_i 
|x_i|`. See http://www.math.uah.edu/stat/expect/Spaces.html . I suggest 
changing the method names to `norm1` and `norm2` and outputs the average 
instead. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...

2014-09-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1778#discussion_r17579689
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/RowMatrixSuite.scala
 ---
@@ -95,6 +95,40 @@ class RowMatrixSuite extends FunSuite with 
LocalSparkContext {
 }
   }
 
+  test(similar columns) {
+val colMags = Vectors.dense(Math.sqrt(126), Math.sqrt(66), 
Math.sqrt(94))
+val expected = BDM(
+  (0.0, 54.0, 72.0),
+  (0.0, 0.0, 78.0),
+  (0.0, 0.0, 0.0))
+
+for(i - 0 until n) for(j - 0 until n) {
+  expected(i, j) /= (colMags(i) * colMags(j))
+}
+
+for (mat - Seq(denseMat, sparseMat)) {
+  val G = mat.similarColumns(0.1).toBreeze()
+  for(i - 0 until n) for(j - 0 until n) {
--- End diff --

`for (i - 0 until n; j - 0 until n) {` (space after for and merge two for 
statements)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-15 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2294#issuecomment-55684623
  
test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-15 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2356#issuecomment-55684719
  
@davies Thanks for working on MLlib's SerDe! It definitely simplifies 
future Python API implementations. We will wait #2378 .


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...

2014-09-15 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2347#issuecomment-55685703
  
test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...

2014-09-16 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2347#issuecomment-55701824
  
this is ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3550][MLLIB] Disable automatic rdd cach...

2014-09-16 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2412#issuecomment-55776119
  
this is ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3550][MLLIB] Disable automatic rdd cach...

2014-09-16 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2412#issuecomment-55776094
  
add to whitelist


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55795901
  
@davies Couple Python tests failed with this change. Could you fix them?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55795929
  
test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17630886
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala 
---
@@ -775,17 +775,38 @@ private[spark] object PythonRDD extends Logging {
 }.toJavaRDD()
   }
 
+  private class AutoBatchedPickler(iter: Iterator[Any]) extends 
Iterator[Array[Byte]] {
+private val pickle = new Pickler()
+private var batch = 1
+private val buffer = new mutable.ArrayBuffer[Any]
+
+override def hasNext(): Boolean = iter.hasNext
+
+override def next(): Array[Byte] = {
+  while (iter.hasNext  buffer.length  batch) {
+buffer += iter.next()
+  }
+  val bytes = pickle.dumps(buffer.toArray)
+  val size = bytes.length
+  // let  1M  size  10M
+  if (size  1024 * 100) {
+batch = (1024 * 100) / size  // fast grow
--- End diff --

If the first record is small, e.g., a SparseVector with a single nonzero, 
and the records followed are large vectors, line 789 may cause memory problems. 
Does it give significant performance gain? under what circumstances?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17632544
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala 
---
@@ -775,17 +775,38 @@ private[spark] object PythonRDD extends Logging {
 }.toJavaRDD()
   }
 
+  private class AutoBatchedPickler(iter: Iterator[Any]) extends 
Iterator[Array[Byte]] {
+private val pickle = new Pickler()
+private var batch = 1
+private val buffer = new mutable.ArrayBuffer[Any]
+
+override def hasNext(): Boolean = iter.hasNext
+
+override def next(): Array[Byte] = {
+  while (iter.hasNext  buffer.length  batch) {
+buffer += iter.next()
+  }
+  val bytes = pickle.dumps(buffer.toArray)
+  val size = bytes.length
+  // let  1M  size  10M
+  if (size  1024 * 100) {
+batch = (1024 * 100) / size  // fast grow
+  } else if (size  1024 * 1024) {
+batch *= 2
+  } else if (size  1024 * 1024 * 10) {
+batch /= 2
--- End diff --

If the first record is very large, `batch` will be 0.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-17 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-55977238
  
add to whitelist


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-17 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-55977249
  
this is ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2294#issuecomment-55977479
  
Adding new methods to a trait is a break change. We can mark `Vector` and 
`Matrix` as sealed, so no one can extend them. From Jenkins log:

~~~
[error]  * method 
transposeTimes(org.apache.spark.mllib.linalg.DenseVector)org.apache.spark.mllib.linalg.DenseVector
 in trait org.apache.spark.mllib.linalg.Matrix does not have a correspondent in 
old version
[error]filter with: 
ProblemFilters.exclude[MissingMethodProblem](org.apache.spark.mllib.linalg.Matrix.transposeTimes)
[error]  * method 
transposeTimes(org.apache.spark.mllib.linalg.DenseMatrix)org.apache.spark.mllib.linalg.DenseMatrix
 in trait org.apache.spark.mllib.linalg.Matrix does not have a correspondent in 
old version
[error]filter with: 
ProblemFilters.exclude[MissingMethodProblem](org.apache.spark.mllib.linalg.Matrix.transposeTimes)
[error]  * method 
times(org.apache.spark.mllib.linalg.DenseVector)org.apache.spark.mllib.linalg.DenseVector
 in trait org.apache.spark.mllib.linalg.Matrix does not have a correspondent in 
old version
[error]filter with: 
ProblemFilters.exclude[MissingMethodProblem](org.apache.spark.mllib.linalg.Matrix.times)
[error]  * method 
times(org.apache.spark.mllib.linalg.DenseMatrix)org.apache.spark.mllib.linalg.DenseMatrix
 in trait org.apache.spark.mllib.linalg.Matrix does not have a correspondent in 
old version
[error]filter with: 
ProblemFilters.exclude[MissingMethodProblem](org.apache.spark.mllib.linalg.Matrix.times)
[error]  * method copy()org.apache.spark.mllib.linalg.Matrix in trait 
org.apache.spark.mllib.linalg.Matrix does not have a correspondent in old 
version
[error]filter with: 
ProblemFilters.exclude[MissingMethodProblem](org.apache.spark.mllib.linalg.Matrix.copy)
~~~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r1770
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(sscal doesn't support vector 
type ${x.getClass}.)
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA specify whether to use matrix A, or the transpose of 
matrix A. Should be N or
--- End diff --

Doc needs update. It is a boolean now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17704453
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(sscal doesn't support vector 
type ${x.getClass}.)
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA specify whether to use matrix A, or the transpose of 
matrix A. Should be N or
+   *   n to use A, and T or t to use the transpose of A.
+   * @param transB specify whether to use matrix B, or the transpose of 
matrix B. Should be N or
+   *   n to use B, and T or t to use the transpose of B.
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logWarning(gemm: alpha is equal to 0. Returning C.)
+} else {
+  A match {
+case sparse: SparseMatrix =
+  gemm(transA, transB, alpha, sparse, B, beta, C)
+case dense: DenseMatrix =
+  gemm(transA, transB, alpha, dense, B, beta, C)
+case _ =
+  throw new IllegalArgumentException(sgemm doesn't support matrix 
type ${A.getClass}.)
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) N else T
+val tBstr = if (!transB) N else T
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+val Avals = A.values
+val Arows = if (!transA) A.rowIndices else A.colPtrs
+val Acols = if (!transA) A.colPtrs else A.rowIndices
+
+// Slicing is easy in this case. This is the optimal

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17704450
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(sscal doesn't support vector 
type ${x.getClass}.)
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA specify whether to use matrix A, or the transpose of 
matrix A. Should be N or
+   *   n to use A, and T or t to use the transpose of A.
+   * @param transB specify whether to use matrix B, or the transpose of 
matrix B. Should be N or
+   *   n to use B, and T or t to use the transpose of B.
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logWarning(gemm: alpha is equal to 0. Returning C.)
+} else {
+  A match {
+case sparse: SparseMatrix =
+  gemm(transA, transB, alpha, sparse, B, beta, C)
+case dense: DenseMatrix =
+  gemm(transA, transB, alpha, dense, B, beta, C)
+case _ =
+  throw new IllegalArgumentException(sgemm doesn't support matrix 
type ${A.getClass}.)
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) N else T
+val tBstr = if (!transB) N else T
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+val Avals = A.values
+val Arows = if (!transA) A.rowIndices else A.colPtrs
+val Acols = if (!transA) A.colPtrs else A.rowIndices
+
+// Slicing is easy in this case. This is the optimal

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17704455
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(sscal doesn't support vector 
type ${x.getClass}.)
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA specify whether to use matrix A, or the transpose of 
matrix A. Should be N or
+   *   n to use A, and T or t to use the transpose of A.
+   * @param transB specify whether to use matrix B, or the transpose of 
matrix B. Should be N or
+   *   n to use B, and T or t to use the transpose of B.
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logWarning(gemm: alpha is equal to 0. Returning C.)
+} else {
+  A match {
+case sparse: SparseMatrix =
+  gemm(transA, transB, alpha, sparse, B, beta, C)
+case dense: DenseMatrix =
+  gemm(transA, transB, alpha, dense, B, beta, C)
+case _ =
+  throw new IllegalArgumentException(sgemm doesn't support matrix 
type ${A.getClass}.)
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) N else T
+val tBstr = if (!transB) N else T
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+val Avals = A.values
+val Arows = if (!transA) A.rowIndices else A.colPtrs
+val Acols = if (!transA) A.colPtrs else A.rowIndices
+
+// Slicing is easy in this case. This is the optimal

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17704445
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(sscal doesn't support vector 
type ${x.getClass}.)
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA specify whether to use matrix A, or the transpose of 
matrix A. Should be N or
+   *   n to use A, and T or t to use the transpose of A.
+   * @param transB specify whether to use matrix B, or the transpose of 
matrix B. Should be N or
+   *   n to use B, and T or t to use the transpose of B.
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logWarning(gemm: alpha is equal to 0. Returning C.)
--- End diff --

`gemm` maybe called many times and there are cases when `alpha` is indeed 
zero. Maybe we can change it to `logDebug` to avoid outputting many log 
messages at a normal level.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17704447
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(sscal doesn't support vector 
type ${x.getClass}.)
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA specify whether to use matrix A, or the transpose of 
matrix A. Should be N or
+   *   n to use A, and T or t to use the transpose of A.
+   * @param transB specify whether to use matrix B, or the transpose of 
matrix B. Should be N or
+   *   n to use B, and T or t to use the transpose of B.
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logWarning(gemm: alpha is equal to 0. Returning C.)
+} else {
+  A match {
+case sparse: SparseMatrix =
+  gemm(transA, transB, alpha, sparse, B, beta, C)
+case dense: DenseMatrix =
+  gemm(transA, transB, alpha, dense, B, beta, C)
+case _ =
+  throw new IllegalArgumentException(sgemm doesn't support matrix 
type ${A.getClass}.)
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) N else T
+val tBstr = if (!transB) N else T
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+val Avals = A.values
+val Arows = if (!transA) A.rowIndices else A.colPtrs
+val Acols = if (!transA) A.colPtrs else A.rowIndices
+
+// Slicing is easy in this case. This is the optimal

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17704457
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(sscal doesn't support vector 
type ${x.getClass}.)
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA specify whether to use matrix A, or the transpose of 
matrix A. Should be N or
+   *   n to use A, and T or t to use the transpose of A.
+   * @param transB specify whether to use matrix B, or the transpose of 
matrix B. Should be N or
+   *   n to use B, and T or t to use the transpose of B.
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logWarning(gemm: alpha is equal to 0. Returning C.)
+} else {
+  A match {
+case sparse: SparseMatrix =
+  gemm(transA, transB, alpha, sparse, B, beta, C)
+case dense: DenseMatrix =
+  gemm(transA, transB, alpha, dense, B, beta, C)
+case _ =
+  throw new IllegalArgumentException(sgemm doesn't support matrix 
type ${A.getClass}.)
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) N else T
+val tBstr = if (!transB) N else T
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+val Avals = A.values
+val Arows = if (!transA) A.rowIndices else A.colPtrs
+val Acols = if (!transA) A.colPtrs else A.rowIndices
+
+// Slicing is easy in this case. This is the optimal

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17704449
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(sscal doesn't support vector 
type ${x.getClass}.)
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA specify whether to use matrix A, or the transpose of 
matrix A. Should be N or
+   *   n to use A, and T or t to use the transpose of A.
+   * @param transB specify whether to use matrix B, or the transpose of 
matrix B. Should be N or
+   *   n to use B, and T or t to use the transpose of B.
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logWarning(gemm: alpha is equal to 0. Returning C.)
+} else {
+  A match {
+case sparse: SparseMatrix =
+  gemm(transA, transB, alpha, sparse, B, beta, C)
+case dense: DenseMatrix =
+  gemm(transA, transB, alpha, dense, B, beta, C)
+case _ =
+  throw new IllegalArgumentException(sgemm doesn't support matrix 
type ${A.getClass}.)
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) N else T
+val tBstr = if (!transB) N else T
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+val Avals = A.values
+val Arows = if (!transA) A.rowIndices else A.colPtrs
+val Acols = if (!transA) A.colPtrs else A.rowIndices
+
+// Slicing is easy in this case. This is the optimal

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17704452
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(sscal doesn't support vector 
type ${x.getClass}.)
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA specify whether to use matrix A, or the transpose of 
matrix A. Should be N or
+   *   n to use A, and T or t to use the transpose of A.
+   * @param transB specify whether to use matrix B, or the transpose of 
matrix B. Should be N or
+   *   n to use B, and T or t to use the transpose of B.
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logWarning(gemm: alpha is equal to 0. Returning C.)
+} else {
+  A match {
+case sparse: SparseMatrix =
+  gemm(transA, transB, alpha, sparse, B, beta, C)
+case dense: DenseMatrix =
+  gemm(transA, transB, alpha, dense, B, beta, C)
+case _ =
+  throw new IllegalArgumentException(sgemm doesn't support matrix 
type ${A.getClass}.)
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) N else T
+val tBstr = if (!transB) N else T
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+val Avals = A.values
+val Arows = if (!transA) A.rowIndices else A.colPtrs
+val Acols = if (!transA) A.colPtrs else A.rowIndices
+
+// Slicing is easy in this case. This is the optimal

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17704454
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(sscal doesn't support vector 
type ${x.getClass}.)
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA specify whether to use matrix A, or the transpose of 
matrix A. Should be N or
+   *   n to use A, and T or t to use the transpose of A.
+   * @param transB specify whether to use matrix B, or the transpose of 
matrix B. Should be N or
+   *   n to use B, and T or t to use the transpose of B.
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logWarning(gemm: alpha is equal to 0. Returning C.)
+} else {
+  A match {
+case sparse: SparseMatrix =
+  gemm(transA, transB, alpha, sparse, B, beta, C)
+case dense: DenseMatrix =
+  gemm(transA, transB, alpha, dense, B, beta, C)
+case _ =
+  throw new IllegalArgumentException(sgemm doesn't support matrix 
type ${A.getClass}.)
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) N else T
+val tBstr = if (!transB) N else T
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+val Avals = A.values
+val Arows = if (!transA) A.rowIndices else A.colPtrs
+val Acols = if (!transA) A.colPtrs else A.rowIndices
+
+// Slicing is easy in this case. This is the optimal

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17704446
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(sscal doesn't support vector 
type ${x.getClass}.)
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA specify whether to use matrix A, or the transpose of 
matrix A. Should be N or
+   *   n to use A, and T or t to use the transpose of A.
+   * @param transB specify whether to use matrix B, or the transpose of 
matrix B. Should be N or
+   *   n to use B, and T or t to use the transpose of B.
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logWarning(gemm: alpha is equal to 0. Returning C.)
+} else {
+  A match {
+case sparse: SparseMatrix =
+  gemm(transA, transB, alpha, sparse, B, beta, C)
+case dense: DenseMatrix =
+  gemm(transA, transB, alpha, dense, B, beta, C)
+case _ =
+  throw new IllegalArgumentException(sgemm doesn't support matrix 
type ${A.getClass}.)
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) N else T
+val tBstr = if (!transB) N else T
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+val Avals = A.values
+val Arows = if (!transA) A.rowIndices else A.colPtrs
+val Acols = if (!transA) A.colPtrs else A.rowIndices
+
+// Slicing is easy in this case. This is the optimal

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709067
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(sscal doesn't support vector 
type ${x.getClass}.)
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA specify whether to use matrix A, or the transpose of 
matrix A. Should be N or
+   *   n to use A, and T or t to use the transpose of A.
+   * @param transB specify whether to use matrix B, or the transpose of 
matrix B. Should be N or
+   *   n to use B, and T or t to use the transpose of B.
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logWarning(gemm: alpha is equal to 0. Returning C.)
+} else {
+  A match {
+case sparse: SparseMatrix =
+  gemm(transA, transB, alpha, sparse, B, beta, C)
+case dense: DenseMatrix =
+  gemm(transA, transB, alpha, dense, B, beta, C)
+case _ =
+  throw new IllegalArgumentException(sgemm doesn't support matrix 
type ${A.getClass}.)
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) N else T
+val tBstr = if (!transB) N else T
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+val Avals = A.values
+val Arows = if (!transA) A.rowIndices else A.colPtrs
+val Acols = if (!transA) A.colPtrs else A.rowIndices
+
+// Slicing is easy in this case. This is the optimal

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709059
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(sscal doesn't support vector 
type ${x.getClass}.)
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA specify whether to use matrix A, or the transpose of 
matrix A. Should be N or
+   *   n to use A, and T or t to use the transpose of A.
+   * @param transB specify whether to use matrix B, or the transpose of 
matrix B. Should be N or
+   *   n to use B, and T or t to use the transpose of B.
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logWarning(gemm: alpha is equal to 0. Returning C.)
+} else {
+  A match {
+case sparse: SparseMatrix =
+  gemm(transA, transB, alpha, sparse, B, beta, C)
+case dense: DenseMatrix =
+  gemm(transA, transB, alpha, dense, B, beta, C)
+case _ =
+  throw new IllegalArgumentException(sgemm doesn't support matrix 
type ${A.getClass}.)
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) N else T
+val tBstr = if (!transB) N else T
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+val Avals = A.values
+val Arows = if (!transA) A.rowIndices else A.colPtrs
+val Acols = if (!transA) A.colPtrs else A.rowIndices
+
+// Slicing is easy in this case. This is the optimal

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709070
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(sscal doesn't support vector 
type ${x.getClass}.)
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA specify whether to use matrix A, or the transpose of 
matrix A. Should be N or
+   *   n to use A, and T or t to use the transpose of A.
+   * @param transB specify whether to use matrix B, or the transpose of 
matrix B. Should be N or
+   *   n to use B, and T or t to use the transpose of B.
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logWarning(gemm: alpha is equal to 0. Returning C.)
+} else {
+  A match {
+case sparse: SparseMatrix =
+  gemm(transA, transB, alpha, sparse, B, beta, C)
+case dense: DenseMatrix =
+  gemm(transA, transB, alpha, dense, B, beta, C)
+case _ =
+  throw new IllegalArgumentException(sgemm doesn't support matrix 
type ${A.getClass}.)
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) N else T
+val tBstr = if (!transB) N else T
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+val Avals = A.values
+val Arows = if (!transA) A.rowIndices else A.colPtrs
+val Acols = if (!transA) A.colPtrs else A.rowIndices
+
+// Slicing is easy in this case. This is the optimal

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709063
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(sscal doesn't support vector 
type ${x.getClass}.)
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA specify whether to use matrix A, or the transpose of 
matrix A. Should be N or
+   *   n to use A, and T or t to use the transpose of A.
+   * @param transB specify whether to use matrix B, or the transpose of 
matrix B. Should be N or
+   *   n to use B, and T or t to use the transpose of B.
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logWarning(gemm: alpha is equal to 0. Returning C.)
+} else {
+  A match {
+case sparse: SparseMatrix =
+  gemm(transA, transB, alpha, sparse, B, beta, C)
+case dense: DenseMatrix =
+  gemm(transA, transB, alpha, dense, B, beta, C)
+case _ =
+  throw new IllegalArgumentException(sgemm doesn't support matrix 
type ${A.getClass}.)
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) N else T
+val tBstr = if (!transB) N else T
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+val Avals = A.values
+val Arows = if (!transA) A.rowIndices else A.colPtrs
+val Acols = if (!transA) A.colPtrs else A.rowIndices
+
+// Slicing is easy in this case. This is the optimal

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709076
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -36,9 +37,42 @@ trait Matrix extends Serializable {
   /** Converts to a breeze matrix. */
   private[mllib] def toBreeze: BM[Double]
 
+  /** Gets the i-th element in the array backing the matrix. */
+  private[mllib] def apply(i: Int): Double
+
   /** Gets the (i, j)-th element. */
-  private[mllib] def apply(i: Int, j: Int): Double = toBreeze(i, j)
+  private[mllib] def apply(i: Int, j: Int): Double
+
+  /** Return the index for the (i, j)-th element in the backing array. */
+  private[mllib] def index(i: Int, j: Int): Int
+
+  /** Update element at (i, j) */
+  private[mllib] def update(i: Int, j: Int, v: Double)
+
+  /** Get a deep copy of the matrix. */
+  def copy: Matrix
+
+  /** Convenience method for `Matrix`-`DenseMatrix` multiplication. */
+  def times(y: DenseMatrix): DenseMatrix = {
--- End diff --

We use `multiply` in distributed matrix and we should be consistent.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709072
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -36,9 +37,42 @@ trait Matrix extends Serializable {
   /** Converts to a breeze matrix. */
   private[mllib] def toBreeze: BM[Double]
 
+  /** Gets the i-th element in the array backing the matrix. */
+  private[mllib] def apply(i: Int): Double
--- End diff --

Maybe it is good to remove this method. Because it may cause confusion 
between DenseMatrix and SparseMatrix.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709058
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(sscal doesn't support vector 
type ${x.getClass}.)
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA specify whether to use matrix A, or the transpose of 
matrix A. Should be N or
+   *   n to use A, and T or t to use the transpose of A.
+   * @param transB specify whether to use matrix B, or the transpose of 
matrix B. Should be N or
+   *   n to use B, and T or t to use the transpose of B.
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logWarning(gemm: alpha is equal to 0. Returning C.)
+} else {
+  A match {
+case sparse: SparseMatrix =
+  gemm(transA, transB, alpha, sparse, B, beta, C)
+case dense: DenseMatrix =
+  gemm(transA, transB, alpha, dense, B, beta, C)
+case _ =
+  throw new IllegalArgumentException(sgemm doesn't support matrix 
type ${A.getClass}.)
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) N else T
+val tBstr = if (!transB) N else T
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+val Avals = A.values
+val Arows = if (!transA) A.rowIndices else A.colPtrs
+val Acols = if (!transA) A.colPtrs else A.rowIndices
+
+// Slicing is easy in this case. This is the optimal

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709060
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(sscal doesn't support vector 
type ${x.getClass}.)
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA specify whether to use matrix A, or the transpose of 
matrix A. Should be N or
+   *   n to use A, and T or t to use the transpose of A.
+   * @param transB specify whether to use matrix B, or the transpose of 
matrix B. Should be N or
+   *   n to use B, and T or t to use the transpose of B.
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logWarning(gemm: alpha is equal to 0. Returning C.)
+} else {
+  A match {
+case sparse: SparseMatrix =
+  gemm(transA, transB, alpha, sparse, B, beta, C)
+case dense: DenseMatrix =
+  gemm(transA, transB, alpha, dense, B, beta, C)
+case _ =
+  throw new IllegalArgumentException(sgemm doesn't support matrix 
type ${A.getClass}.)
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) N else T
+val tBstr = if (!transB) N else T
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+val Avals = A.values
+val Arows = if (!transA) A.rowIndices else A.colPtrs
+val Acols = if (!transA) A.colPtrs else A.rowIndices
+
+// Slicing is easy in this case. This is the optimal

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709065
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(sscal doesn't support vector 
type ${x.getClass}.)
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA specify whether to use matrix A, or the transpose of 
matrix A. Should be N or
+   *   n to use A, and T or t to use the transpose of A.
+   * @param transB specify whether to use matrix B, or the transpose of 
matrix B. Should be N or
+   *   n to use B, and T or t to use the transpose of B.
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logWarning(gemm: alpha is equal to 0. Returning C.)
+} else {
+  A match {
+case sparse: SparseMatrix =
+  gemm(transA, transB, alpha, sparse, B, beta, C)
+case dense: DenseMatrix =
+  gemm(transA, transB, alpha, dense, B, beta, C)
+case _ =
+  throw new IllegalArgumentException(sgemm doesn't support matrix 
type ${A.getClass}.)
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) N else T
+val tBstr = if (!transB) N else T
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+val Avals = A.values
+val Arows = if (!transA) A.rowIndices else A.colPtrs
+val Acols = if (!transA) A.colPtrs else A.rowIndices
+
+// Slicing is easy in this case. This is the optimal

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709081
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -59,11 +93,113 @@ trait Matrix extends Serializable {
  */
 class DenseMatrix(val numRows: Int, val numCols: Int, val values: 
Array[Double]) extends Matrix {
 
-  require(values.length == numRows * numCols)
+  require(values.length == numRows * numCols, The number of values 
supplied doesn't match the  +
+ssize of the matrix! values.length: ${values.length}, numRows * 
numCols: ${numRows * numCols})
 
   override def toArray: Array[Double] = values
 
-  private[mllib] override def toBreeze: BM[Double] = new 
BDM[Double](numRows, numCols, values)
+  private[mllib] def toBreeze: BM[Double] = new BDM[Double](numRows, 
numCols, values)
+
+  private[mllib] def apply(i: Int): Double = values(i)
+
+  private[mllib] def apply(i: Int, j: Int): Double = values(index(i, j))
+
+  private[mllib] def index(i: Int, j: Int): Int = i + numRows * j
+
+  private[mllib] def update(i: Int, j: Int, v: Double){
+values(index(i, j)) = v
+  }
+
+  override def copy = new DenseMatrix(numRows, numCols, values.clone())
+}
+
+/**
+ * Column-majored sparse matrix.
+ * The entry values are stored in Compressed Sparse Column (CSC) format.
+ * For example, the following matrix
+ * {{{
+ *   1.0 0.0 4.0
+ *   0.0 3.0 5.0
+ *   2.0 0.0 6.0
+ * }}}
+ * is stored as `values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]`,
+ * `rowIndices=[0, 2, 1, 0, 1, 2]`, `colPointers=[0, 2, 3, 6]`.
+ *
+ * @param numRows number of rows
+ * @param numCols number of columns
+ * @param colPtrs the index corresponding to the start of a new column
+ * @param rowIndices the row index of the entry
--- End diff --

Let's write down the contract. `rowIndices` should be strictly increasing 
for each column.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709077
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -36,9 +37,42 @@ trait Matrix extends Serializable {
   /** Converts to a breeze matrix. */
   private[mllib] def toBreeze: BM[Double]
 
+  /** Gets the i-th element in the array backing the matrix. */
+  private[mllib] def apply(i: Int): Double
+
   /** Gets the (i, j)-th element. */
-  private[mllib] def apply(i: Int, j: Int): Double = toBreeze(i, j)
+  private[mllib] def apply(i: Int, j: Int): Double
+
+  /** Return the index for the (i, j)-th element in the backing array. */
+  private[mllib] def index(i: Int, j: Int): Int
+
+  /** Update element at (i, j) */
+  private[mllib] def update(i: Int, j: Int, v: Double)
+
+  /** Get a deep copy of the matrix. */
+  def copy: Matrix
+
+  /** Convenience method for `Matrix`-`DenseMatrix` multiplication. */
+  def times(y: DenseMatrix): DenseMatrix = {
+val C: DenseMatrix = Matrices.zeros(numRows, 
y.numCols).asInstanceOf[DenseMatrix]
+BLAS.gemm(false, false, 1.0, this, y, 0.0, C)
+C
+  }
+
+  /** Convenience method for `Matrix`-`DenseVector` multiplication. */
+  def times(y: DenseVector): DenseVector = BLAS.gemv(1.0, this, y)
+
+  /** Convenience method for `Matrix`^T^-`DenseMatrix` multiplication. */
+  def transposeTimes(y: DenseMatrix): DenseMatrix = {
--- End diff --

another approach is to have a boolean flag in matrix indicating whether it 
is transposed or not.

~~~
A.trans().times(y)
~~~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709069
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(sscal doesn't support vector 
type ${x.getClass}.)
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA specify whether to use matrix A, or the transpose of 
matrix A. Should be N or
+   *   n to use A, and T or t to use the transpose of A.
+   * @param transB specify whether to use matrix B, or the transpose of 
matrix B. Should be N or
+   *   n to use B, and T or t to use the transpose of B.
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logWarning(gemm: alpha is equal to 0. Returning C.)
+} else {
+  A match {
+case sparse: SparseMatrix =
+  gemm(transA, transB, alpha, sparse, B, beta, C)
+case dense: DenseMatrix =
+  gemm(transA, transB, alpha, dense, B, beta, C)
+case _ =
+  throw new IllegalArgumentException(sgemm doesn't support matrix 
type ${A.getClass}.)
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) N else T
+val tBstr = if (!transB) N else T
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+require(kA == kB, sThe columns of A don't match the rows of B. A: 
$kA, B: $kB)
+require(mA == C.numRows, sThe rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA)
+require(nB == C.numCols,
+  sThe columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB)
+
+val Avals = A.values
+val Arows = if (!transA) A.rowIndices else A.colPtrs
+val Acols = if (!transA) A.colPtrs else A.rowIndices
+
+// Slicing is easy in this case. This is the optimal

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709101
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala ---
@@ -36,4 +36,79 @@ class MatricesSuite extends FunSuite {
   Matrices.dense(3, 2, Array(0.0, 1.0, 2.0))
 }
   }
+
+  test(sparse matrix construction) {
+val m = 3
+val n = 2
+val values = Array(1.0, 2.0, 4.0, 5.0)
+val colIndices = Array(0, 2, 4)
+val rowIndices = Array(1, 2, 1, 2)
+val mat = Matrices.sparse(m, n, colIndices, rowIndices, 
values).asInstanceOf[SparseMatrix]
+assert(mat.numRows === m)
+assert(mat.numCols === n)
+assert(mat.values.eq(values), should not copy data)
+  }
+
+  test(sparse matrix construction with wrong number of elements) {
+intercept[RuntimeException] {
--- End diff --

It should be `IllegalArgumentException` (thrown by `require`)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709089
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -83,6 +219,24 @@ object Matrices {
   }
 
   /**
+   * Creates a column-majored sparse matrix in Compressed Sparse Column 
(CSC) format.
+   *
+   * @param numRows number of rows
+   * @param numCols number of columns
+   * @param colPointers the index corresponding to the start of a new 
column
+   * @param rowIndices the row index of the entry
+   * @param values non-zero matrix entries in column major
+   */
+  def sparse(
+ numRows: Int,
+ numCols: Int,
+ colPointers: Array[Int],
--- End diff --

`colPtrs`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709102
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala ---
@@ -36,4 +36,79 @@ class MatricesSuite extends FunSuite {
   Matrices.dense(3, 2, Array(0.0, 1.0, 2.0))
 }
   }
+
+  test(sparse matrix construction) {
+val m = 3
+val n = 2
+val values = Array(1.0, 2.0, 4.0, 5.0)
+val colIndices = Array(0, 2, 4)
+val rowIndices = Array(1, 2, 1, 2)
+val mat = Matrices.sparse(m, n, colIndices, rowIndices, 
values).asInstanceOf[SparseMatrix]
+assert(mat.numRows === m)
+assert(mat.numCols === n)
+assert(mat.values.eq(values), should not copy data)
+  }
+
+  test(sparse matrix construction with wrong number of elements) {
+intercept[RuntimeException] {
+  Matrices.sparse(3, 2, Array(0, 1), Array(1, 2, 1), Array(0.0, 1.0, 
2.0))
+}
+
+intercept[RuntimeException] {
+  Matrices.sparse(3, 2, Array(0, 1, 2), Array(1, 2), Array(0.0, 1.0, 
2.0))
+}
+  }
+
+  test(matrix copies are deep copies) {
+val m = 3
+val n = 2
+
+val denseMat = Matrices.dense(m, n, Array(0.0, 1.0, 2.0, 3.0, 4.0, 
5.0))
+val denseCopy = denseMat.copy
+
+assert(!denseMat.toArray.eq(denseCopy.toArray))
+
+val values = Array(1.0, 2.0, 4.0, 5.0)
+val colIndices = Array(0, 2, 4)
--- End diff --

`colPtrs`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709086
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -59,11 +93,113 @@ trait Matrix extends Serializable {
  */
 class DenseMatrix(val numRows: Int, val numCols: Int, val values: 
Array[Double]) extends Matrix {
 
-  require(values.length == numRows * numCols)
+  require(values.length == numRows * numCols, The number of values 
supplied doesn't match the  +
+ssize of the matrix! values.length: ${values.length}, numRows * 
numCols: ${numRows * numCols})
 
   override def toArray: Array[Double] = values
 
-  private[mllib] override def toBreeze: BM[Double] = new 
BDM[Double](numRows, numCols, values)
+  private[mllib] def toBreeze: BM[Double] = new BDM[Double](numRows, 
numCols, values)
+
+  private[mllib] def apply(i: Int): Double = values(i)
+
+  private[mllib] def apply(i: Int, j: Int): Double = values(index(i, j))
+
+  private[mllib] def index(i: Int, j: Int): Int = i + numRows * j
+
+  private[mllib] def update(i: Int, j: Int, v: Double){
+values(index(i, j)) = v
+  }
+
+  override def copy = new DenseMatrix(numRows, numCols, values.clone())
+}
+
+/**
+ * Column-majored sparse matrix.
+ * The entry values are stored in Compressed Sparse Column (CSC) format.
+ * For example, the following matrix
+ * {{{
+ *   1.0 0.0 4.0
+ *   0.0 3.0 5.0
+ *   2.0 0.0 6.0
+ * }}}
+ * is stored as `values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]`,
+ * `rowIndices=[0, 2, 1, 0, 1, 2]`, `colPointers=[0, 2, 3, 6]`.
+ *
+ * @param numRows number of rows
+ * @param numCols number of columns
+ * @param colPtrs the index corresponding to the start of a new column
+ * @param rowIndices the row index of the entry
+ * @param values non-zero matrix entries in column major
+ */
+class SparseMatrix(
+val numRows: Int,
+val numCols: Int,
+val colPtrs: Array[Int],
+val rowIndices: Array[Int],
+val values: Array[Double]) extends Matrix {
+
+  require(values.length == rowIndices.length, The number of row indices 
and values don't match!  +
+svalues.length: ${values.length}, rowIndices.length: 
${rowIndices.length})
+  require(colPtrs.length == numCols + 1, The length of the column indices 
should be the  +
+snumber of columns + 1. Currently, colPointers.length: 
${colPtrs.length},  +
+snumCols: $numCols)
+
+  override def toArray: Array[Double] = {
+val arr = Array.fill(numRows * numCols)(0.0)
+var j = 0
+while (j  numCols) {
+  var i = colPtrs(j)
+  val indEnd = colPtrs(j + 1)
+  val offset = j * numRows
+  while (i  indEnd) {
+val rowIndex = rowIndices(i)
+arr(offset + rowIndex) = values(i)
+i += 1
+  }
+  j += 1
+}
+arr
+  }
+
+  private[mllib] def toBreeze: BM[Double] =
+new BSM[Double](values, numRows, numCols, colPtrs, rowIndices)
+
+  private[mllib] def apply(i: Int): Double = values(i)
+
+  private[mllib] def apply(i: Int, j: Int): Double = {
+val ind = index(i, j)
+if (ind == -1) 0.0 else values(ind)
+  }
+
+  private[mllib] def index(i: Int, j: Int): Int = {
+var regionStart = colPtrs(j)
--- End diff --

use `Arrays.binarySearch`: 
http://docs.oracle.com/javase/7/docs/api/java/util/Arrays.html#binarySearch(int[],%20int,%20int,%20int)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709099
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala ---
@@ -36,4 +36,79 @@ class MatricesSuite extends FunSuite {
   Matrices.dense(3, 2, Array(0.0, 1.0, 2.0))
 }
   }
+
+  test(sparse matrix construction) {
+val m = 3
+val n = 2
+val values = Array(1.0, 2.0, 4.0, 5.0)
+val colIndices = Array(0, 2, 4)
+val rowIndices = Array(1, 2, 1, 2)
+val mat = Matrices.sparse(m, n, colIndices, rowIndices, 
values).asInstanceOf[SparseMatrix]
+assert(mat.numRows === m)
+assert(mat.numCols === n)
+assert(mat.values.eq(values), should not copy data)
--- End diff --

also check `colPtrs` and `rowIndices`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709094
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/BreezeMatrixConversionSuite.scala
 ---
@@ -37,4 +37,26 @@ class BreezeMatrixConversionSuite extends FunSuite {
 assert(mat.numCols === breeze.cols)
 assert(mat.values.eq(breeze.data), should not copy data)
   }
+
+  test(sparse matrix to breeze) {
+val values = Array(1.0, 2.0, 4.0, 5.0)
+val colPointers = Array(0, 2, 4)
--- End diff --

`colPtrs`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709082
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -59,11 +93,113 @@ trait Matrix extends Serializable {
  */
 class DenseMatrix(val numRows: Int, val numCols: Int, val values: 
Array[Double]) extends Matrix {
 
-  require(values.length == numRows * numCols)
+  require(values.length == numRows * numCols, The number of values 
supplied doesn't match the  +
+ssize of the matrix! values.length: ${values.length}, numRows * 
numCols: ${numRows * numCols})
 
   override def toArray: Array[Double] = values
 
-  private[mllib] override def toBreeze: BM[Double] = new 
BDM[Double](numRows, numCols, values)
+  private[mllib] def toBreeze: BM[Double] = new BDM[Double](numRows, 
numCols, values)
+
+  private[mllib] def apply(i: Int): Double = values(i)
+
+  private[mllib] def apply(i: Int, j: Int): Double = values(index(i, j))
+
+  private[mllib] def index(i: Int, j: Int): Int = i + numRows * j
+
+  private[mllib] def update(i: Int, j: Int, v: Double){
+values(index(i, j)) = v
+  }
+
+  override def copy = new DenseMatrix(numRows, numCols, values.clone())
+}
+
+/**
+ * Column-majored sparse matrix.
+ * The entry values are stored in Compressed Sparse Column (CSC) format.
+ * For example, the following matrix
+ * {{{
+ *   1.0 0.0 4.0
+ *   0.0 3.0 5.0
+ *   2.0 0.0 6.0
+ * }}}
+ * is stored as `values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]`,
+ * `rowIndices=[0, 2, 1, 0, 1, 2]`, `colPointers=[0, 2, 3, 6]`.
+ *
+ * @param numRows number of rows
+ * @param numCols number of columns
+ * @param colPtrs the index corresponding to the start of a new column
+ * @param rowIndices the row index of the entry
+ * @param values non-zero matrix entries in column major
+ */
+class SparseMatrix(
+val numRows: Int,
+val numCols: Int,
+val colPtrs: Array[Int],
+val rowIndices: Array[Int],
+val values: Array[Double]) extends Matrix {
+
+  require(values.length == rowIndices.length, The number of row indices 
and values don't match!  +
+svalues.length: ${values.length}, rowIndices.length: 
${rowIndices.length})
+  require(colPtrs.length == numCols + 1, The length of the column indices 
should be the  +
+snumber of columns + 1. Currently, colPointers.length: 
${colPtrs.length},  +
+snumCols: $numCols)
+
+  override def toArray: Array[Double] = {
+val arr = Array.fill(numRows * numCols)(0.0)
--- End diff --

`new Array[Double](numRows * numCols)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709106
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala ---
@@ -36,4 +36,79 @@ class MatricesSuite extends FunSuite {
   Matrices.dense(3, 2, Array(0.0, 1.0, 2.0))
 }
   }
+
+  test(sparse matrix construction) {
+val m = 3
+val n = 2
+val values = Array(1.0, 2.0, 4.0, 5.0)
+val colIndices = Array(0, 2, 4)
+val rowIndices = Array(1, 2, 1, 2)
+val mat = Matrices.sparse(m, n, colIndices, rowIndices, 
values).asInstanceOf[SparseMatrix]
+assert(mat.numRows === m)
+assert(mat.numCols === n)
+assert(mat.values.eq(values), should not copy data)
+  }
+
+  test(sparse matrix construction with wrong number of elements) {
+intercept[RuntimeException] {
+  Matrices.sparse(3, 2, Array(0, 1), Array(1, 2, 1), Array(0.0, 1.0, 
2.0))
+}
+
+intercept[RuntimeException] {
+  Matrices.sparse(3, 2, Array(0, 1, 2), Array(1, 2), Array(0.0, 1.0, 
2.0))
+}
+  }
+
+  test(matrix copies are deep copies) {
+val m = 3
+val n = 2
+
+val denseMat = Matrices.dense(m, n, Array(0.0, 1.0, 2.0, 3.0, 4.0, 
5.0))
+val denseCopy = denseMat.copy
+
+assert(!denseMat.toArray.eq(denseCopy.toArray))
+
+val values = Array(1.0, 2.0, 4.0, 5.0)
+val colIndices = Array(0, 2, 4)
+val rowIndices = Array(1, 2, 1, 2)
+val sparseMat = Matrices.sparse(m, n, colIndices, rowIndices, values)
+val sparseCopy = sparseMat.copy
+
+assert(!sparseMat.toArray.eq(sparseCopy.toArray))
+  }
+
+  test(matrix indexing and updating) {
+val m = 3
+val n = 2
+val allValues = Array(0.0, 1.0, 2.0, 3.0, 4.0, 0.0)
+
+val denseMat = new DenseMatrix(m, n, allValues)
+
+assert(denseMat(0, 1) == 3.0)
+assert(denseMat(0, 1) == denseMat.values(3))
+assert(denseMat(0, 1) == denseMat(3))
+assert(denseMat(0, 0) == 0.0)
+
+denseMat.update(0, 0, 10.0)
+assert(denseMat(0, 0) == 10.0)
+assert(denseMat.values(0) == 10.0)
+
+val sparseValues = Array(1.0, 2.0, 3.0, 4.0)
+val colIndices = Array(0, 2, 4)
+val rowIndices = Array(1, 2, 0, 1)
+val sparseMat = new SparseMatrix(m, n, colIndices, rowIndices, 
sparseValues)
+
+assert(sparseMat(0, 1) == 3.0)
+assert(sparseMat(0, 1) == sparseMat.values(2))
+assert(sparseMat(0, 1) == sparseMat(2))
+assert(sparseMat(0, 0) == 0.0)
+
+intercept[IllegalArgumentException] {
+  sparseMat.update(0, 0, 10.0)
+}
+
+sparseMat.update(0, 1, 10.0)
+assert(sparseMat(0, 1) == 10.0)
--- End diff --

`==` - `===` (please update others as well)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709079
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -59,11 +93,113 @@ trait Matrix extends Serializable {
  */
 class DenseMatrix(val numRows: Int, val numCols: Int, val values: 
Array[Double]) extends Matrix {
 
-  require(values.length == numRows * numCols)
+  require(values.length == numRows * numCols, The number of values 
supplied doesn't match the  +
+ssize of the matrix! values.length: ${values.length}, numRows * 
numCols: ${numRows * numCols})
 
   override def toArray: Array[Double] = values
 
-  private[mllib] override def toBreeze: BM[Double] = new 
BDM[Double](numRows, numCols, values)
+  private[mllib] def toBreeze: BM[Double] = new BDM[Double](numRows, 
numCols, values)
+
+  private[mllib] def apply(i: Int): Double = values(i)
+
+  private[mllib] def apply(i: Int, j: Int): Double = values(index(i, j))
+
+  private[mllib] def index(i: Int, j: Int): Int = i + numRows * j
+
+  private[mllib] def update(i: Int, j: Int, v: Double){
--- End diff --

`: Unit = {`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709096
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/BreezeMatrixConversionSuite.scala
 ---
@@ -37,4 +37,26 @@ class BreezeMatrixConversionSuite extends FunSuite {
 assert(mat.numCols === breeze.cols)
 assert(mat.values.eq(breeze.data), should not copy data)
   }
+
+  test(sparse matrix to breeze) {
+val values = Array(1.0, 2.0, 4.0, 5.0)
+val colPointers = Array(0, 2, 4)
+val rowIndices = Array(1, 2, 1, 2)
+val mat = Matrices.sparse(3, 2, colPointers, rowIndices, values)
+val breeze = mat.toBreeze.asInstanceOf[BSM[Double]]
+assert(breeze.rows === mat.numRows)
+assert(breeze.cols === mat.numCols)
+assert(breeze.data.eq(mat.asInstanceOf[SparseMatrix].values), should 
not copy data)
+  }
+
+  test(sparse breeze matrix to sparse matrix) {
+val values = Array(1.0, 2.0, 4.0, 5.0)
+val colPointers = Array(0, 2, 4)
--- End diff --

`colPtrs`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709085
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -59,11 +93,113 @@ trait Matrix extends Serializable {
  */
 class DenseMatrix(val numRows: Int, val numCols: Int, val values: 
Array[Double]) extends Matrix {
 
-  require(values.length == numRows * numCols)
+  require(values.length == numRows * numCols, The number of values 
supplied doesn't match the  +
+ssize of the matrix! values.length: ${values.length}, numRows * 
numCols: ${numRows * numCols})
 
   override def toArray: Array[Double] = values
 
-  private[mllib] override def toBreeze: BM[Double] = new 
BDM[Double](numRows, numCols, values)
+  private[mllib] def toBreeze: BM[Double] = new BDM[Double](numRows, 
numCols, values)
+
+  private[mllib] def apply(i: Int): Double = values(i)
+
+  private[mllib] def apply(i: Int, j: Int): Double = values(index(i, j))
+
+  private[mllib] def index(i: Int, j: Int): Int = i + numRows * j
+
+  private[mllib] def update(i: Int, j: Int, v: Double){
+values(index(i, j)) = v
+  }
+
+  override def copy = new DenseMatrix(numRows, numCols, values.clone())
+}
+
+/**
+ * Column-majored sparse matrix.
+ * The entry values are stored in Compressed Sparse Column (CSC) format.
+ * For example, the following matrix
+ * {{{
+ *   1.0 0.0 4.0
+ *   0.0 3.0 5.0
+ *   2.0 0.0 6.0
+ * }}}
+ * is stored as `values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]`,
+ * `rowIndices=[0, 2, 1, 0, 1, 2]`, `colPointers=[0, 2, 3, 6]`.
+ *
+ * @param numRows number of rows
+ * @param numCols number of columns
+ * @param colPtrs the index corresponding to the start of a new column
+ * @param rowIndices the row index of the entry
+ * @param values non-zero matrix entries in column major
+ */
+class SparseMatrix(
+val numRows: Int,
+val numCols: Int,
+val colPtrs: Array[Int],
+val rowIndices: Array[Int],
+val values: Array[Double]) extends Matrix {
+
+  require(values.length == rowIndices.length, The number of row indices 
and values don't match!  +
+svalues.length: ${values.length}, rowIndices.length: 
${rowIndices.length})
+  require(colPtrs.length == numCols + 1, The length of the column indices 
should be the  +
+snumber of columns + 1. Currently, colPointers.length: 
${colPtrs.length},  +
+snumCols: $numCols)
+
+  override def toArray: Array[Double] = {
+val arr = Array.fill(numRows * numCols)(0.0)
+var j = 0
+while (j  numCols) {
+  var i = colPtrs(j)
+  val indEnd = colPtrs(j + 1)
+  val offset = j * numRows
+  while (i  indEnd) {
+val rowIndex = rowIndices(i)
+arr(offset + rowIndex) = values(i)
+i += 1
+  }
+  j += 1
+}
+arr
+  }
+
+  private[mllib] def toBreeze: BM[Double] =
+new BSM[Double](values, numRows, numCols, colPtrs, rowIndices)
+
+  private[mllib] def apply(i: Int): Double = values(i)
--- End diff --

ditto: remove this method and use `.values(i)` directly


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709088
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -59,11 +93,113 @@ trait Matrix extends Serializable {
  */
 class DenseMatrix(val numRows: Int, val numCols: Int, val values: 
Array[Double]) extends Matrix {
 
-  require(values.length == numRows * numCols)
+  require(values.length == numRows * numCols, The number of values 
supplied doesn't match the  +
+ssize of the matrix! values.length: ${values.length}, numRows * 
numCols: ${numRows * numCols})
 
   override def toArray: Array[Double] = values
 
-  private[mllib] override def toBreeze: BM[Double] = new 
BDM[Double](numRows, numCols, values)
+  private[mllib] def toBreeze: BM[Double] = new BDM[Double](numRows, 
numCols, values)
+
+  private[mllib] def apply(i: Int): Double = values(i)
+
+  private[mllib] def apply(i: Int, j: Int): Double = values(index(i, j))
+
+  private[mllib] def index(i: Int, j: Int): Int = i + numRows * j
+
+  private[mllib] def update(i: Int, j: Int, v: Double){
+values(index(i, j)) = v
+  }
+
+  override def copy = new DenseMatrix(numRows, numCols, values.clone())
+}
+
+/**
+ * Column-majored sparse matrix.
+ * The entry values are stored in Compressed Sparse Column (CSC) format.
+ * For example, the following matrix
+ * {{{
+ *   1.0 0.0 4.0
+ *   0.0 3.0 5.0
+ *   2.0 0.0 6.0
+ * }}}
+ * is stored as `values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]`,
+ * `rowIndices=[0, 2, 1, 0, 1, 2]`, `colPointers=[0, 2, 3, 6]`.
+ *
+ * @param numRows number of rows
+ * @param numCols number of columns
+ * @param colPtrs the index corresponding to the start of a new column
+ * @param rowIndices the row index of the entry
+ * @param values non-zero matrix entries in column major
+ */
+class SparseMatrix(
+val numRows: Int,
+val numCols: Int,
+val colPtrs: Array[Int],
+val rowIndices: Array[Int],
+val values: Array[Double]) extends Matrix {
+
+  require(values.length == rowIndices.length, The number of row indices 
and values don't match!  +
+svalues.length: ${values.length}, rowIndices.length: 
${rowIndices.length})
+  require(colPtrs.length == numCols + 1, The length of the column indices 
should be the  +
+snumber of columns + 1. Currently, colPointers.length: 
${colPtrs.length},  +
+snumCols: $numCols)
+
+  override def toArray: Array[Double] = {
+val arr = Array.fill(numRows * numCols)(0.0)
+var j = 0
+while (j  numCols) {
+  var i = colPtrs(j)
+  val indEnd = colPtrs(j + 1)
+  val offset = j * numRows
+  while (i  indEnd) {
+val rowIndex = rowIndices(i)
+arr(offset + rowIndex) = values(i)
+i += 1
+  }
+  j += 1
+}
+arr
+  }
+
+  private[mllib] def toBreeze: BM[Double] =
+new BSM[Double](values, numRows, numCols, colPtrs, rowIndices)
+
+  private[mllib] def apply(i: Int): Double = values(i)
+
+  private[mllib] def apply(i: Int, j: Int): Double = {
+val ind = index(i, j)
+if (ind == -1) 0.0 else values(ind)
+  }
+
+  private[mllib] def index(i: Int, j: Int): Int = {
+var regionStart = colPtrs(j)
+var regionEnd = colPtrs(j + 1)
+while (regionStart = regionEnd) {
+  val mid = (regionStart + regionEnd) / 2
+  if (rowIndices(mid) == i){
+return mid
+  } else if (regionStart == regionEnd) {
+return -1
+  } else if (rowIndices(mid)  i) {
+regionEnd = mid
+  } else {
+regionStart = mid
+  }
+}
+-1
+  }
+
+  private[mllib] def update(i: Int, j: Int, v: Double){
--- End diff --

`: Unit = {`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709092
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -93,9 +247,84 @@ object Matrices {
 require(dm.majorStride == dm.rows,
   Do not support stride size different from the number of rows.)
 new DenseMatrix(dm.rows, dm.cols, dm.data)
+  case sm: BSM[Double] =
+new SparseMatrix(sm.rows, sm.cols, sm.colPtrs, sm.rowIndices, 
sm.data)
   case _ =
 throw new UnsupportedOperationException(
   sDo not support conversion from type 
${breeze.getClass.getName}.)
 }
   }
+
+  /**
+   * Generate a `DenseMatrix` consisting of zeros.
+   * @param numRows number of rows of the matrix
+   * @param numCols number of columns of the matrix
+   * @return `DenseMatrix` with size `numRows` x `numCols` and values of 
zeros
+   */
+  def zeros(numRows: Int, numCols: Int): Matrix =
+new DenseMatrix(numRows, numCols, Array.fill(numRows * numCols)(0.0))
--- End diff --

`new Array` instead of `Array.fill`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709103
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala ---
@@ -36,4 +36,79 @@ class MatricesSuite extends FunSuite {
   Matrices.dense(3, 2, Array(0.0, 1.0, 2.0))
 }
   }
+
+  test(sparse matrix construction) {
+val m = 3
+val n = 2
+val values = Array(1.0, 2.0, 4.0, 5.0)
+val colIndices = Array(0, 2, 4)
+val rowIndices = Array(1, 2, 1, 2)
+val mat = Matrices.sparse(m, n, colIndices, rowIndices, 
values).asInstanceOf[SparseMatrix]
+assert(mat.numRows === m)
+assert(mat.numCols === n)
+assert(mat.values.eq(values), should not copy data)
+  }
+
+  test(sparse matrix construction with wrong number of elements) {
+intercept[RuntimeException] {
+  Matrices.sparse(3, 2, Array(0, 1), Array(1, 2, 1), Array(0.0, 1.0, 
2.0))
+}
+
+intercept[RuntimeException] {
+  Matrices.sparse(3, 2, Array(0, 1, 2), Array(1, 2), Array(0.0, 1.0, 
2.0))
+}
+  }
+
+  test(matrix copies are deep copies) {
+val m = 3
+val n = 2
+
+val denseMat = Matrices.dense(m, n, Array(0.0, 1.0, 2.0, 3.0, 4.0, 
5.0))
+val denseCopy = denseMat.copy
+
+assert(!denseMat.toArray.eq(denseCopy.toArray))
+
+val values = Array(1.0, 2.0, 4.0, 5.0)
+val colIndices = Array(0, 2, 4)
+val rowIndices = Array(1, 2, 1, 2)
+val sparseMat = Matrices.sparse(m, n, colIndices, rowIndices, values)
+val sparseCopy = sparseMat.copy
+
+assert(!sparseMat.toArray.eq(sparseCopy.toArray))
+  }
+
+  test(matrix indexing and updating) {
+val m = 3
+val n = 2
+val allValues = Array(0.0, 1.0, 2.0, 3.0, 4.0, 0.0)
+
+val denseMat = new DenseMatrix(m, n, allValues)
+
+assert(denseMat(0, 1) == 3.0)
+assert(denseMat(0, 1) == denseMat.values(3))
+assert(denseMat(0, 1) == denseMat(3))
+assert(denseMat(0, 0) == 0.0)
+
+denseMat.update(0, 0, 10.0)
+assert(denseMat(0, 0) == 10.0)
+assert(denseMat.values(0) == 10.0)
+
+val sparseValues = Array(1.0, 2.0, 3.0, 4.0)
+val colIndices = Array(0, 2, 4)
--- End diff --

`colPtrs`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709097
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala ---
@@ -36,4 +36,79 @@ class MatricesSuite extends FunSuite {
   Matrices.dense(3, 2, Array(0.0, 1.0, 2.0))
 }
   }
+
+  test(sparse matrix construction) {
+val m = 3
+val n = 2
+val values = Array(1.0, 2.0, 4.0, 5.0)
+val colIndices = Array(0, 2, 4)
--- End diff --

`colPtrs`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709104
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala ---
@@ -36,4 +36,79 @@ class MatricesSuite extends FunSuite {
   Matrices.dense(3, 2, Array(0.0, 1.0, 2.0))
 }
   }
+
+  test(sparse matrix construction) {
+val m = 3
+val n = 2
+val values = Array(1.0, 2.0, 4.0, 5.0)
+val colIndices = Array(0, 2, 4)
+val rowIndices = Array(1, 2, 1, 2)
+val mat = Matrices.sparse(m, n, colIndices, rowIndices, 
values).asInstanceOf[SparseMatrix]
+assert(mat.numRows === m)
+assert(mat.numCols === n)
+assert(mat.values.eq(values), should not copy data)
+  }
+
+  test(sparse matrix construction with wrong number of elements) {
+intercept[RuntimeException] {
+  Matrices.sparse(3, 2, Array(0, 1), Array(1, 2, 1), Array(0.0, 1.0, 
2.0))
+}
+
+intercept[RuntimeException] {
+  Matrices.sparse(3, 2, Array(0, 1, 2), Array(1, 2), Array(0.0, 1.0, 
2.0))
+}
+  }
+
+  test(matrix copies are deep copies) {
+val m = 3
+val n = 2
+
+val denseMat = Matrices.dense(m, n, Array(0.0, 1.0, 2.0, 3.0, 4.0, 
5.0))
+val denseCopy = denseMat.copy
+
+assert(!denseMat.toArray.eq(denseCopy.toArray))
+
+val values = Array(1.0, 2.0, 4.0, 5.0)
+val colIndices = Array(0, 2, 4)
+val rowIndices = Array(1, 2, 1, 2)
+val sparseMat = Matrices.sparse(m, n, colIndices, rowIndices, values)
+val sparseCopy = sparseMat.copy
+
+assert(!sparseMat.toArray.eq(sparseCopy.toArray))
+  }
+
+  test(matrix indexing and updating) {
+val m = 3
+val n = 2
+val allValues = Array(0.0, 1.0, 2.0, 3.0, 4.0, 0.0)
+
+val denseMat = new DenseMatrix(m, n, allValues)
+
+assert(denseMat(0, 1) == 3.0)
+assert(denseMat(0, 1) == denseMat.values(3))
+assert(denseMat(0, 1) == denseMat(3))
+assert(denseMat(0, 0) == 0.0)
+
+denseMat.update(0, 0, 10.0)
+assert(denseMat(0, 0) == 10.0)
+assert(denseMat.values(0) == 10.0)
+
+val sparseValues = Array(1.0, 2.0, 3.0, 4.0)
+val colIndices = Array(0, 2, 4)
+val rowIndices = Array(1, 2, 0, 1)
+val sparseMat = new SparseMatrix(m, n, colIndices, rowIndices, 
sparseValues)
+
+assert(sparseMat(0, 1) == 3.0)
+assert(sparseMat(0, 1) == sparseMat.values(2))
+assert(sparseMat(0, 1) == sparseMat(2))
+assert(sparseMat(0, 0) == 0.0)
+
+intercept[IllegalArgumentException] {
--- End diff --

Should throw `NoSuchElementException`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-17 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2294#discussion_r17709093
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/BLASSuite.scala ---
@@ -126,4 +126,116 @@ class BLASSuite extends FunSuite {
   }
 }
   }
+
+  test(gemm) {
+
+val dA =
+  new DenseMatrix(4, 3, Array(0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 
0.0, 0.0, 0.0, 3.0))
+val sA = new SparseMatrix(4, 3, Array(0, 1, 3, 4), Array(1, 0, 2, 3), 
Array(1.0, 2.0, 1.0, 3.0))
+
+val B = new DenseMatrix(3, 2, Array(1.0, 0.0, 0.0, 0.0, 2.0, 1.0))
+val expected = new DenseMatrix(4, 2, Array(0.0, 1.0, 0.0, 0.0, 4.0, 
0.0, 2.0, 3.0))
+
+assert(dA times B ~== expected absTol 1e-15)
+assert(sA times B ~== expected absTol 1e-15)
+
+val C1 = new DenseMatrix(4, 2, Array(1.0, 0.0, 2.0, 1.0, 0.0, 0.0, 
1.0, 0.0))
+val C2 = C1.copy
+val C3 = C1.copy
+val C4 = C1.copy
+val C5 = C1.copy
+val C6 = C1.copy
+val C7 = C1.copy
+val C8 = C1.copy
+val expected2 = new DenseMatrix(4, 2, Array(2.0, 1.0, 4.0, 2.0, 4.0, 
0.0, 4.0, 3.0))
+val expected3 = new DenseMatrix(4, 2, Array(2.0, 2.0, 4.0, 2.0, 8.0, 
0.0, 6.0, 6.0))
+
+gemm(1.0, dA, B, 2.0, C1)
+gemm(1.0, sA, B, 2.0, C2)
+gemm(2.0, dA, B, 2.0, C3)
+gemm(2.0, sA, B, 2.0, C4)
+assert(C1 ~== expected2 absTol 1e-15)
+assert(C2 ~== expected2 absTol 1e-15)
+assert(C3 ~== expected3 absTol 1e-15)
+assert(C4 ~== expected3 absTol 1e-15)
+
+withClue(columns of A don't match the rows of B) {
+  intercept[Exception] {
+gemm(true, false, 1.0, dA, B, 2.0, C1)
+  }
+}
+
+val dAT =
+  new DenseMatrix(3, 4, Array(0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 
0.0, 0.0, 0.0, 3.0))
+val sAT =
+  new SparseMatrix(3, 4, Array(0, 1, 2, 3, 4), Array(1, 0, 1, 2), 
Array(2.0, 1.0, 1.0, 3.0))
+
+assert(dAT transposeTimes B ~== expected absTol 1e-15)
+assert(sAT transposeTimes B ~== expected absTol 1e-15)
+
+gemm(true, false, 1.0, dAT, B, 2.0, C5)
+gemm(true, false, 1.0, sAT, B, 2.0, C6)
+gemm(true, false, 2.0, dAT, B, 2.0, C7)
+gemm(true, false, 2.0, sAT, B, 2.0, C8)
+assert(C5 ~== expected2 absTol 1e-15)
+assert(C6 ~== expected2 absTol 1e-15)
+assert(C7 ~== expected3 absTol 1e-15)
+assert(C8 ~== expected3 absTol 1e-15)
+
--- End diff --

remove extra empty line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-927] detect numpy at time of use

2014-09-18 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2313#issuecomment-56135525
  
@JoshRosen PySpark/MLlib requires NumPy to run, and I don't think we 
claimed that we support different versions of NumPy.

`sample()` in core is different. Maybe we can test the overhead of using 
Python's own random. The overhead may not be huge, due to 
serialization/deserialization.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3250] Implement Gap Sampling optimizati...

2014-09-18 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2455#discussion_r17769391
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/random/RandomSampler.scala ---
@@ -43,66 +46,218 @@ trait RandomSampler[T, U] extends Pseudorandom with 
Cloneable with Serializable
 throw new NotImplementedError(clone() is not implemented.)
 }
 
+
+object RandomSampler {
+  // Default random number generator used by random samplers
+  def rngDefault: Random = new XORShiftRandom
+
+  // Default gap sampling maximum
+  // For sampling fractions = this value, the gap sampling optimization 
will be applied.
+  // Above this value, it is assumed that tradtional bernoulli sampling 
is faster.  The 
+  // optimal value for this will depend on the RNG.  More expensive RNGs 
will tend to make
+  // the optimal value higher.  The most reliable way to determine this 
value for a given RNG
+  // is to experiment.  I would expect a value of 0.5 to be close in most 
cases.
+  def gsmDefault: Double = 0.4
+
+  // Default gap sampling epsilon
+  // When sampling random floating point values the gap sampling logic 
requires value  0.  An
+  // optimal value for this parameter is at or near the minimum positive 
floating point value 
+  // returned by nextDouble() for the RNG being used.
+  def epsDefault: Double = 5e-11
+}
+
+
 /**
  * :: DeveloperApi ::
  * A sampler based on Bernoulli trials.
  *
- * @param lb lower bound of the acceptance range
- * @param ub upper bound of the acceptance range
- * @param complement whether to use the complement of the range specified, 
default to false
+ * @param fraction the sampling fraction, aka Bernoulli sampling 
probability
  * @tparam T item type
  */
 @DeveloperApi
-class BernoulliSampler[T](lb: Double, ub: Double, complement: Boolean = 
false)
-  extends RandomSampler[T, T] {
+class BernoulliSampler[T: ClassTag](fraction: Double) extends 
RandomSampler[T, T] {
 
-  private[random] var rng: Random = new XORShiftRandom
+  require(fraction = 0.0fraction = 1.0, Sampling fraction must be 
on interval [0, 1])
 
-  def this(ratio: Double) = this(0.0d, ratio)
+  def this(lb: Double, ub: Double, complement: Boolean = false) =
+this(if (complement) (1.0 - (ub - lb)) else (ub - lb))
+
+  private val rng: Random = RandomSampler.rngDefault
 
   override def setSeed(seed: Long) = rng.setSeed(seed)
 
   override def sample(items: Iterator[T]): Iterator[T] = {
-items.filter { item =
-  val x = rng.nextDouble()
-  (x = lb  x  ub) ^ complement
+fraction match {
+  case f if (f = 0.0) = Iterator.empty
+  case f if (f = 1.0) = items
+  case f if (f = RandomSampler.gsmDefault) =
+new GapSamplingIterator(items, f, rng, RandomSampler.epsDefault)
+  case _ = items.filter(_ = (rng.nextDouble() = fraction))
--- End diff --

Did you test whether `rdd.randomSplit()` will produce non-overlapping 
subsets with this change?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-18 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2294#issuecomment-56136224
  
LGTM. I'm merging this into master. (We might need to make slight changes 
to some methods before the 1.2 release, but let's not block the multi-model 
training PR for now.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-56136476
  
test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [MLLIB] fix a unresolved reference variable 'n...

2014-09-18 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2423#issuecomment-56136584
  
@OdinLin Thanks for catching the bug! As @davies mentioned, #2378 will 
completely replace the current SerDe. Could you close this PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-18 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-56136714
  
@derrickburns I cannot see the Jenkins log. Let's call Jenkins again.

test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3250] Implement Gap Sampling optimizati...

2014-09-19 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2455#issuecomment-56144570
  
add to whitelist


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3250] Implement Gap Sampling optimizati...

2014-09-19 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2455#issuecomment-56144582
  
this is ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-19 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-56147622
  
@davies Does `PickleSerializer` compress data? If not, maybe we should 
cache the deserialized RDD instead of the one from `_.reserialize`. They have 
the same storage. I understand that batch-serialization can help GC. But 
algorithms like linear methods should only allocate short-lived objects. Is 
batch-serialization worth the tradeoff?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-19 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-56235934
  
test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 8762 matches

Mail list logo