[GitHub] spark pull request: [SPARK-3160] [SPARK-3494] [mllib] DecisionTree...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2341#discussion_r17465271 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -87,17 +87,11 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo val maxDepth = strategy.maxDepth require(maxDepth = 30, sDecisionTree currently only supports maxDepth = 30, but was given maxDepth = $maxDepth.) -// Number of nodes to allocate: max number of nodes possible given the depth of the tree, plus 1 -val maxNumNodesPlus1 = Node.startIndexInLevel(maxDepth + 1) -// Initialize an array to hold parent impurity calculations for each node. -val parentImpurities = new Array[Double](maxNumNodesPlus1) -// dummy value for top node (updated during first split calculation) -val nodes = new Array[Node](maxNumNodesPlus1) // Calculate level for single group construction // Max memory usage for aggregates -val maxMemoryUsage = strategy.maxMemoryInMB * 1024 * 1024 +val maxMemoryUsage = strategy.maxMemoryInMB * 1024L * 1024L --- End diff -- It is also useful to set an upper bound here (e.g., 1GB) to avoid memory/GC problems on the driver. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3160] [SPARK-3494] [mllib] DecisionTree...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2341#discussion_r17466105 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -120,81 +114,35 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo * beforehand and is not used in later levels. */ +var topNode: Node = null // set on first iteration var level = 0 var break = false while (level = maxDepth !break) { - logDebug(#) logDebug(level = + level) logDebug(#) // Find best split for all nodes at a level. timer.start(findBestSplits) - val splitsStatsForLevel: Array[(Split, InformationGainStats, Predict)] = -DecisionTree.findBestSplits(treeInput, parentImpurities, - metadata, level, nodes, splits, bins, maxLevelForSingleGroup, timer) + val (tmpTopNode: Node, doneTraining: Boolean) = DecisionTree.findBestSplits(treeInput, +metadata, level, topNode, splits, bins, maxLevelForSingleGroup, timer) timer.stop(findBestSplits) - val levelNodeIndexOffset = Node.startIndexInLevel(level) - for ((nodeSplitStats, index) - splitsStatsForLevel.view.zipWithIndex) { -val nodeIndex = levelNodeIndexOffset + index - -// Extract info for this node (index) at the current level. -timer.start(extractNodeInfo) -val split = nodeSplitStats._1 -val stats = nodeSplitStats._2 -val predict = nodeSplitStats._3.predict -val isLeaf = (stats.gain = 0) || (level == strategy.maxDepth) -val node = new Node(nodeIndex, predict, isLeaf, Some(split), None, None, Some(stats)) -logDebug(Node = + node) -nodes(nodeIndex) = node -timer.stop(extractNodeInfo) - -if (level != 0) { - // Set parent. - val parentNodeIndex = Node.parentIndex(nodeIndex) - if (Node.isLeftChild(nodeIndex)) { -nodes(parentNodeIndex).leftNode = Some(nodes(nodeIndex)) - } else { -nodes(parentNodeIndex).rightNode = Some(nodes(nodeIndex)) - } -} -// Extract info for nodes at the next lower level. -timer.start(extractInfoForLowerLevels) -if (level maxDepth) { - val leftChildIndex = Node.leftChildIndex(nodeIndex) - val leftImpurity = stats.leftImpurity - logDebug(leftChildIndex = + leftChildIndex + , impurity = + leftImpurity) - parentImpurities(leftChildIndex) = leftImpurity - - val rightChildIndex = Node.rightChildIndex(nodeIndex) - val rightImpurity = stats.rightImpurity - logDebug(rightChildIndex = + rightChildIndex + , impurity = + rightImpurity) - parentImpurities(rightChildIndex) = rightImpurity -} -timer.stop(extractInfoForLowerLevels) -logDebug(final best split = + split) + if (level == 0) { +topNode = tmpTopNode } - require(Node.maxNodesInLevel(level) == splitsStatsForLevel.length) - // Check whether all the nodes at the current level at leaves. - val allLeaf = splitsStatsForLevel.forall(_._2.gain = 0) - logDebug(all leaf = + allLeaf) - if (allLeaf) { -break = true // no more tree construction - } else { -level += 1 + if (doneTraining) { +break = true --- End diff -- Shall we remove `break` and only use `doneTraining`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3160] [SPARK-3494] [mllib] DecisionTree...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2341#discussion_r17466108 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -435,18 +385,18 @@ object DecisionTree extends Serializable with Logging { // numGroups is equal to 2 at level 11 and 4 at level 12, respectively. val numGroups = 1 level - maxLevelForSingleGroup logDebug(numGroups = + numGroups) - var bestSplits = new Array[(Split, InformationGainStats, Predict)](0) // Iterate over each group of nodes at a level. var groupIndex = 0 + var doneTraining = true while (groupIndex numGroups) { -val bestSplitsForGroup = findBestSplitsPerGroup(input, parentImpurities, metadata, level, - nodes, splits, bins, timer, numGroups, groupIndex) -bestSplits = Array.concat(bestSplits, bestSplitsForGroup) +val (tmpRoot, doneTrainingGroup) = findBestSplitsPerGroup(input, metadata, level, --- End diff -- `tmpRoot` - `_` (since it is not used after) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2347#issuecomment-55374976 @davies It is hard to tell whether we already have fast access to the input RDD. Force caching may cause problems, e.g., 1. kicking out some cached RDDs, 2. using too much memory if the input data is large but it could be generated from a small RDD. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2356#issuecomment-55375085 @davies Could you take a look at this PR and see whether there is an easier way for SerDe? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3160] [SPARK-3494] [mllib] DecisionTree...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2341#issuecomment-55375636 LGTM except minor inline comments. I'm merging this in and could you make the changes with your next update? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3396][MLLIB] Use SquaredL2Updater in Lo...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2231#issuecomment-55376180 test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB]Collapsed Gibbs sampli...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1983#issuecomment-55381531 @witgo @allwefantasy English | èªå¨ç¿»è¯çä¸æ | Let's try to keep the comments in English as much as possible. | 让æ们尽éä¿ææè§çè±æå°½å¯è½ã If you feel hard or impossible to express something in English, maybe we can try putting Chinese and auto-translated English side-by-side. | å¦æä½ è§å¾å¾é¾ææ æ³è¡¨è¾¾çä¸è¥¿ç¨è±æï¼ä¹è®¸æ们å¯ä»¥å°è¯æä¸å½åèªå¨ç¿»è¯è±æ并æ侧ã This is definitely better than writing nothing and also help non-Chinese developers understand what is going on at a high level. | è¿ç»å¯¹æ¯æ¯åä»ä¹å¥½ï¼ä¹æå©äºéä¸å½å¼ååæç½ä»ä¹æ¯å¨ä¸ä¸ªè¾é«çæ°´å¹³æä¹åäºã I'm trying to demo this by putting Google-translated Chinese on the right side. | æè¯å¾éè¿å°è°·æç¿»è¯çä¸å½å³ä¾§æ¥æ¼ç¤ºè¿ä¸ç¹ã It is definitely not high-quality but better than nothing. | è¿ç»å¯¹ä¸æ¯é«åè´¨çï¼ä½æ»æ¯æ²¡æ好ã --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB]Collapsed Gibbs sampli...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1983#issuecomment-55381621 @witgo @allwefantasy We had an offline discussion about LDA's implementation. Please check the JIRA page for the notes. -- æ们æ大约LDAçå®ç°è±æºè®¨è®ºã请æ£æ¥JIRA页ç注éã --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2294#issuecomment-55383567 test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3396][MLLIB] Use SquaredL2Updater in Lo...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2231#issuecomment-55428631 @BigCrunsh Just saw that the target is `branch-1.0`. Could you change the target to `master`? Usually we first apply the patch to master and then backport it to old branches. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [mllib] DecisionTree: Add minInstancesPerNode,...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2349#issuecomment-55429412 @jkbradley This contains API changes to python. Could you create a JIRA for it? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17574390 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -17,16 +17,18 @@ package org.apache.spark.mllib.api.python -import java.nio.{ByteBuffer, ByteOrder} +import java.io.OutputStream import scala.collection.JavaConverters._ +import net.razorvine.pickle.{Pickler, Unpickler, IObjectConstructor, IObjectPickler, PickleException, Opcodes} --- End diff -- use `_` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17574399 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -472,214 +452,140 @@ class PythonMLLibAPI extends Serializable { numRows: Long, numCols: Int, numPartitions: java.lang.Integer, - seed: java.lang.Long): JavaRDD[Array[Byte]] = { + seed: java.lang.Long): JavaRDD[Vector] = { val parts = getNumPartitionsOrDefault(numPartitions, jsc) val s = getSeedOrDefault(seed) -RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s).map(SerDe.serializeDoubleVector) +RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s) } } /** - * :: DeveloperApi :: - * MultivariateStatisticalSummary with Vector fields serialized. + * SerDe utility functions for PythonMLLibAPI. */ -@DeveloperApi -class MultivariateStatisticalSummarySerialized(val summary: MultivariateStatisticalSummary) - extends Serializable { +private[spark] object SerDe extends Serializable { - def mean: Array[Byte] = SerDe.serializeDoubleVector(summary.mean) + private[python] def reduce_object(out: OutputStream, pickler: Pickler, +module: String, name: String, objects: Object*) = { +out.write(Opcodes.GLOBAL) +out.write((module + \n + name + \n).getBytes) --- End diff -- Does it increase the storage cost by a lot for small objects? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17574385 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala --- @@ -778,8 +778,8 @@ private[spark] object PythonRDD extends Logging { def javaToPython(jRDD: JavaRDD[Any]): JavaRDD[Array[Byte]] = { jRDD.rdd.mapPartitions { iter = val pickle = new Pickler - iter.map { row = -pickle.dumps(row) + iter.grouped(1024).map { rows = --- End diff -- Shall we divide groups based on the serialized size? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17574396 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -472,214 +452,140 @@ class PythonMLLibAPI extends Serializable { numRows: Long, numCols: Int, numPartitions: java.lang.Integer, - seed: java.lang.Long): JavaRDD[Array[Byte]] = { + seed: java.lang.Long): JavaRDD[Vector] = { val parts = getNumPartitionsOrDefault(numPartitions, jsc) val s = getSeedOrDefault(seed) -RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s).map(SerDe.serializeDoubleVector) +RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s) } } /** - * :: DeveloperApi :: - * MultivariateStatisticalSummary with Vector fields serialized. + * SerDe utility functions for PythonMLLibAPI. */ -@DeveloperApi -class MultivariateStatisticalSummarySerialized(val summary: MultivariateStatisticalSummary) - extends Serializable { +private[spark] object SerDe extends Serializable { - def mean: Array[Byte] = SerDe.serializeDoubleVector(summary.mean) + private[python] def reduce_object(out: OutputStream, pickler: Pickler, --- End diff -- use camelCase for method names --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17574404 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -472,214 +452,140 @@ class PythonMLLibAPI extends Serializable { numRows: Long, numCols: Int, numPartitions: java.lang.Integer, - seed: java.lang.Long): JavaRDD[Array[Byte]] = { + seed: java.lang.Long): JavaRDD[Vector] = { val parts = getNumPartitionsOrDefault(numPartitions, jsc) val s = getSeedOrDefault(seed) -RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s).map(SerDe.serializeDoubleVector) +RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s) } } /** - * :: DeveloperApi :: - * MultivariateStatisticalSummary with Vector fields serialized. + * SerDe utility functions for PythonMLLibAPI. */ -@DeveloperApi -class MultivariateStatisticalSummarySerialized(val summary: MultivariateStatisticalSummary) - extends Serializable { +private[spark] object SerDe extends Serializable { - def mean: Array[Byte] = SerDe.serializeDoubleVector(summary.mean) + private[python] def reduce_object(out: OutputStream, pickler: Pickler, +module: String, name: String, objects: Object*) = { +out.write(Opcodes.GLOBAL) +out.write((module + \n + name + \n).getBytes) +out.write(Opcodes.MARK) +objects.foreach(pickler.save(_)) +out.write(Opcodes.TUPLE) +out.write(Opcodes.REDUCE) + } - def variance: Array[Byte] = SerDe.serializeDoubleVector(summary.variance) + private[python] class DenseVectorPickler extends IObjectPickler { +def pickle(obj: Object, out: OutputStream, pickler: Pickler) = { + val vector: DenseVector = obj.asInstanceOf[DenseVector] + reduce_object(out, pickler, pyspark.mllib.linalg, DenseVector, vector.toArray) --- End diff -- ditto: what is the cost of using class names? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17574578 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -60,18 +60,18 @@ class PythonMLLibAPI extends Serializable { def loadLabeledPoints( jsc: JavaSparkContext, path: String, - minPartitions: Int): JavaRDD[Array[Byte]] = -MLUtils.loadLabeledPoints(jsc.sc, path, minPartitions).map(SerDe.serializeLabeledPoint) + minPartitions: Int): JavaRDD[LabeledPoint] = +MLUtils.loadLabeledPoints(jsc.sc, path, minPartitions) private def trainRegressionModel( trainFunc: (RDD[LabeledPoint], Vector) = GeneralizedLinearModel, - dataBytesJRDD: JavaRDD[Array[Byte]], + dataJRDD: JavaRDD[Any], initialWeightsBA: Array[Byte]): java.util.LinkedList[java.lang.Object] = { -val data = dataBytesJRDD.rdd.map(SerDe.deserializeLabeledPoint) -val initialWeights = SerDe.deserializeDoubleVector(initialWeightsBA) +val data = dataJRDD.rdd.map(_.asInstanceOf[LabeledPoint]) --- End diff -- maybe we can try `dataJRDD.rdd.asInstanceOf[RDD[LabeledPoint]]` instead of `map` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17574784 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -472,214 +452,140 @@ class PythonMLLibAPI extends Serializable { numRows: Long, numCols: Int, numPartitions: java.lang.Integer, - seed: java.lang.Long): JavaRDD[Array[Byte]] = { + seed: java.lang.Long): JavaRDD[Vector] = { val parts = getNumPartitionsOrDefault(numPartitions, jsc) val s = getSeedOrDefault(seed) -RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s).map(SerDe.serializeDoubleVector) +RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s) } } /** - * :: DeveloperApi :: - * MultivariateStatisticalSummary with Vector fields serialized. + * SerDe utility functions for PythonMLLibAPI. */ -@DeveloperApi -class MultivariateStatisticalSummarySerialized(val summary: MultivariateStatisticalSummary) - extends Serializable { +private[spark] object SerDe extends Serializable { - def mean: Array[Byte] = SerDe.serializeDoubleVector(summary.mean) + private[python] def reduce_object(out: OutputStream, pickler: Pickler, +module: String, name: String, objects: Object*) = { +out.write(Opcodes.GLOBAL) +out.write((module + \n + name + \n).getBytes) +out.write(Opcodes.MARK) +objects.foreach(pickler.save(_)) +out.write(Opcodes.TUPLE) +out.write(Opcodes.REDUCE) + } - def variance: Array[Byte] = SerDe.serializeDoubleVector(summary.variance) + private[python] class DenseVectorPickler extends IObjectPickler { +def pickle(obj: Object, out: OutputStream, pickler: Pickler) = { + val vector: DenseVector = obj.asInstanceOf[DenseVector] + reduce_object(out, pickler, pyspark.mllib.linalg, DenseVector, vector.toArray) +} + } - def count: Long = summary.count + private[python] class DenseVectorConstructor extends IObjectConstructor { +def construct(args: Array[Object]) :Object = { + require(args.length == 1) + new DenseVector(args(0).asInstanceOf[Array[Double]]) +} + } + + private[python] class DenseMatrixPickler extends IObjectPickler { +def pickle(obj: Object, out: OutputStream, pickler: Pickler) = { + val m: DenseMatrix = obj.asInstanceOf[DenseMatrix] + reduce_object(out, pickler, pyspark.mllib.linalg, DenseMatrix, +m.numRows.asInstanceOf[Object], m.numCols.asInstanceOf[Object], m.values) +} + } - def numNonzeros: Array[Byte] = SerDe.serializeDoubleVector(summary.numNonzeros) + private[python] class DenseMatrixConstructor extends IObjectConstructor { +def construct(args: Array[Object]) :Object = { + require(args.length == 3) + new DenseMatrix(args(0).asInstanceOf[Int], args(1).asInstanceOf[Int], +args(2).asInstanceOf[Array[Double]]) +} + } - def max: Array[Byte] = SerDe.serializeDoubleVector(summary.max) + private[python] class SparseVectorPickler extends IObjectPickler { +def pickle(obj: Object, out: OutputStream, pickler: Pickler) = { + val v: SparseVector = obj.asInstanceOf[SparseVector] + reduce_object(out, pickler, pyspark.mllib.linalg, SparseVector, +v.size.asInstanceOf[Object], v.indices, v.values) +} + } - def min: Array[Byte] = SerDe.serializeDoubleVector(summary.min) -} + private[python] class SparseVectorConstructor extends IObjectConstructor { +def construct(args: Array[Object]) :Object = { + require(args.length == 3) + new SparseVector(args(0).asInstanceOf[Int], args(1).asInstanceOf[Array[Int]], +args(2).asInstanceOf[Array[Double]]) +} + } -/** - * SerDe utility functions for PythonMLLibAPI. - */ -private[spark] object SerDe extends Serializable { - private val DENSE_VECTOR_MAGIC: Byte = 1 - private val SPARSE_VECTOR_MAGIC: Byte = 2 - private val DENSE_MATRIX_MAGIC: Byte = 3 - private val LABELED_POINT_MAGIC: Byte = 4 - - private[python] def deserializeDoubleVector(bytes: Array[Byte], offset: Int = 0): Vector = { -require(bytes.length - offset = 5, Byte array too short) -val magic = bytes(offset) -if (magic == DENSE_VECTOR_MAGIC) { - deserializeDenseVector(bytes, offset) -} else if (magic == SPARSE_VECTOR_MAGIC) { - deserializeSparseVector(bytes, offset) -} else { - throw new IllegalArgumentException(Magic + magic + is wrong.) + private[python] class LabeledPointPickler extends IObjectPickler { +def pickle
[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17574827 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -472,214 +452,140 @@ class PythonMLLibAPI extends Serializable { numRows: Long, numCols: Int, numPartitions: java.lang.Integer, - seed: java.lang.Long): JavaRDD[Array[Byte]] = { + seed: java.lang.Long): JavaRDD[Vector] = { val parts = getNumPartitionsOrDefault(numPartitions, jsc) val s = getSeedOrDefault(seed) -RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s).map(SerDe.serializeDoubleVector) +RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s) } } /** - * :: DeveloperApi :: - * MultivariateStatisticalSummary with Vector fields serialized. + * SerDe utility functions for PythonMLLibAPI. */ -@DeveloperApi -class MultivariateStatisticalSummarySerialized(val summary: MultivariateStatisticalSummary) - extends Serializable { +private[spark] object SerDe extends Serializable { - def mean: Array[Byte] = SerDe.serializeDoubleVector(summary.mean) + private[python] def reduce_object(out: OutputStream, pickler: Pickler, +module: String, name: String, objects: Object*) = { +out.write(Opcodes.GLOBAL) +out.write((module + \n + name + \n).getBytes) +out.write(Opcodes.MARK) +objects.foreach(pickler.save(_)) +out.write(Opcodes.TUPLE) +out.write(Opcodes.REDUCE) + } - def variance: Array[Byte] = SerDe.serializeDoubleVector(summary.variance) + private[python] class DenseVectorPickler extends IObjectPickler { +def pickle(obj: Object, out: OutputStream, pickler: Pickler) = { + val vector: DenseVector = obj.asInstanceOf[DenseVector] + reduce_object(out, pickler, pyspark.mllib.linalg, DenseVector, vector.toArray) +} + } - def count: Long = summary.count + private[python] class DenseVectorConstructor extends IObjectConstructor { +def construct(args: Array[Object]) :Object = { + require(args.length == 1) + new DenseVector(args(0).asInstanceOf[Array[Double]]) +} + } + + private[python] class DenseMatrixPickler extends IObjectPickler { +def pickle(obj: Object, out: OutputStream, pickler: Pickler) = { + val m: DenseMatrix = obj.asInstanceOf[DenseMatrix] + reduce_object(out, pickler, pyspark.mllib.linalg, DenseMatrix, +m.numRows.asInstanceOf[Object], m.numCols.asInstanceOf[Object], m.values) +} + } - def numNonzeros: Array[Byte] = SerDe.serializeDoubleVector(summary.numNonzeros) + private[python] class DenseMatrixConstructor extends IObjectConstructor { +def construct(args: Array[Object]) :Object = { + require(args.length == 3) + new DenseMatrix(args(0).asInstanceOf[Int], args(1).asInstanceOf[Int], +args(2).asInstanceOf[Array[Double]]) +} + } - def max: Array[Byte] = SerDe.serializeDoubleVector(summary.max) + private[python] class SparseVectorPickler extends IObjectPickler { +def pickle(obj: Object, out: OutputStream, pickler: Pickler) = { + val v: SparseVector = obj.asInstanceOf[SparseVector] + reduce_object(out, pickler, pyspark.mllib.linalg, SparseVector, +v.size.asInstanceOf[Object], v.indices, v.values) +} + } - def min: Array[Byte] = SerDe.serializeDoubleVector(summary.min) -} + private[python] class SparseVectorConstructor extends IObjectConstructor { +def construct(args: Array[Object]) :Object = { + require(args.length == 3) + new SparseVector(args(0).asInstanceOf[Int], args(1).asInstanceOf[Array[Int]], +args(2).asInstanceOf[Array[Double]]) +} + } -/** - * SerDe utility functions for PythonMLLibAPI. - */ -private[spark] object SerDe extends Serializable { - private val DENSE_VECTOR_MAGIC: Byte = 1 - private val SPARSE_VECTOR_MAGIC: Byte = 2 - private val DENSE_MATRIX_MAGIC: Byte = 3 - private val LABELED_POINT_MAGIC: Byte = 4 - - private[python] def deserializeDoubleVector(bytes: Array[Byte], offset: Int = 0): Vector = { -require(bytes.length - offset = 5, Byte array too short) -val magic = bytes(offset) -if (magic == DENSE_VECTOR_MAGIC) { - deserializeDenseVector(bytes, offset) -} else if (magic == SPARSE_VECTOR_MAGIC) { - deserializeSparseVector(bytes, offset) -} else { - throw new IllegalArgumentException(Magic + magic + is wrong.) + private[python] class LabeledPointPickler extends IObjectPickler { +def pickle
[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2362#issuecomment-55680211 @staple How many iterations did you run? Did you generate data or load from disk/hdfs? Did you cache the Python RDD? When the dataset is not fully cached, I still expect similar performance. But your result shows a big gap. Maybe it is rotating cached blocks. 10m records: master: 2130.3178509871 w/ patch: 3232.856136322 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3396][MLLIB] Use SquaredL2Updater in Lo...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2398#issuecomment-55680373 LGTM. Merged into master. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MLlib] Update SVD documentation in IndexedRow...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2389#issuecomment-55680470 LGTM. Merged into master. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1778#issuecomment-55680636 test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3516] [mllib] DecisionTree: Add minInst...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2349#issuecomment-55680612 LGTM. Merged into master. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MLLIB] SPARK-2329 Add multi-label evaluation ...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1270#discussion_r17578309 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/MultilabelMetrics.scala --- @@ -0,0 +1,156 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.evaluation + +import org.apache.spark.rdd.RDD +import org.apache.spark.SparkContext._ + +/** + * Evaluator for multilabel classification. + * @param predictionAndLabels an RDD of (predictions, labels) pairs, both are non-null sets. + */ +class MultilabelMetrics(predictionAndLabels: RDD[(Set[Double], Set[Double])]) { --- End diff -- @avulanov Let's think what is more natural for the input data to a multi-label classifier and the output from the model it produces. They should match the input type here, so we can chain them easily. If we use either Java or Scala `Set`, we are going to have compatibility issues on the other. Also, set stores small objects, which increase GC pressure. There are the reasons I recommend using `Array[Double]`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1778#discussion_r17579647 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala --- @@ -18,6 +18,7 @@ package org.apache.spark.mllib.linalg.distributed import java.util.Arrays +import scala.collection.mutable.ListBuffer --- End diff -- separate scala imports from java ones --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1778#discussion_r17579667 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala --- @@ -390,6 +393,113 @@ class RowMatrix( new RowMatrix(AB, nRows, B.numCols) } + /** + * Compute all cosine similarities between columns of this matrix using the brute-force + * approach of computing normalized dot products. + * + * @return An n x n sparse upper-triangular matrix of cosine similarities between + * columns of this matrix. + */ + def columnSimilarities(): CoordinateMatrix = { +similarColumns(0.0) + } + + /** + * Compute all similarities between columns of this matrix using a sampling approach. + * + * The threshold parameter is a trade-off knob between estimate quality and computational cost. + * + * Setting a threshold of 0 guarantees deterministic correct results, but comes at exactly + * the same cost as the brute-force approach. Setting the threshold to positive values + * incurs strictly less computational cost than the brute-force approach, however the + * similarities computed will be estimates. + * + * The sampling guarantees relative-error correctness for those pairs of columns that have + * similarity greater than the given similarity threshold. + * + * To describe the guarantee, we set some notation: + * Let A be the smallest in magnitude non-zero element of this matrix. + * Let B be the largest in magnitude non-zero element of this matrix. + * Let L be the number of non-zeros per row. --- End diff -- Is it average or max number of nonzeros? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1778#discussion_r17579671 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala --- @@ -390,6 +393,113 @@ class RowMatrix( new RowMatrix(AB, nRows, B.numCols) } + /** + * Compute all cosine similarities between columns of this matrix using the brute-force + * approach of computing normalized dot products. + * + * @return An n x n sparse upper-triangular matrix of cosine similarities between + * columns of this matrix. + */ + def columnSimilarities(): CoordinateMatrix = { +similarColumns(0.0) + } + + /** + * Compute all similarities between columns of this matrix using a sampling approach. + * + * The threshold parameter is a trade-off knob between estimate quality and computational cost. + * + * Setting a threshold of 0 guarantees deterministic correct results, but comes at exactly + * the same cost as the brute-force approach. Setting the threshold to positive values + * incurs strictly less computational cost than the brute-force approach, however the + * similarities computed will be estimates. + * + * The sampling guarantees relative-error correctness for those pairs of columns that have + * similarity greater than the given similarity threshold. + * + * To describe the guarantee, we set some notation: + * Let A be the smallest in magnitude non-zero element of this matrix. + * Let B be the largest in magnitude non-zero element of this matrix. + * Let L be the number of non-zeros per row. + * + * For example, for {0,1} matrices: A=B=1. + * Another example, for the Netflix matrix: A=1, B=5 + * + * For those column pairs that are above the threshold, + * the computed similarity is correct to within 20% relative error with probability + * at least 1 - (0.981)^(100/B) + * + * The shuffle size is bounded by the *smaller* of the following two expressions: + * + * O(n log(n) L / (threshold * A)) + * O(m L^2) + * + * The latter is the cost of the brute-force approach, so for non-zero thresholds, + * the cost is always cheaper than the brute-force approach. + * + * @param threshold Similarities above this threshold are probably computed correctly. + * Set to 0 for deterministic guaranteed correctness. + * @return An n x n sparse upper-triangular matrix of cosine similarities + * between columns of this matrix. + */ + def similarColumns(threshold: Double): CoordinateMatrix = { +require(threshold = 0 threshold = 1, sThreshold not in [0,1]: $threshold) + +val gamma = if (math.abs(threshold) 1e-6) { --- End diff -- remove `math.abs` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1778#discussion_r17579677 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala --- @@ -390,6 +393,113 @@ class RowMatrix( new RowMatrix(AB, nRows, B.numCols) } + /** + * Compute all cosine similarities between columns of this matrix using the brute-force + * approach of computing normalized dot products. + * + * @return An n x n sparse upper-triangular matrix of cosine similarities between + * columns of this matrix. + */ + def columnSimilarities(): CoordinateMatrix = { +similarColumns(0.0) + } + + /** + * Compute all similarities between columns of this matrix using a sampling approach. + * + * The threshold parameter is a trade-off knob between estimate quality and computational cost. + * + * Setting a threshold of 0 guarantees deterministic correct results, but comes at exactly + * the same cost as the brute-force approach. Setting the threshold to positive values + * incurs strictly less computational cost than the brute-force approach, however the + * similarities computed will be estimates. + * + * The sampling guarantees relative-error correctness for those pairs of columns that have + * similarity greater than the given similarity threshold. + * + * To describe the guarantee, we set some notation: + * Let A be the smallest in magnitude non-zero element of this matrix. + * Let B be the largest in magnitude non-zero element of this matrix. + * Let L be the number of non-zeros per row. + * + * For example, for {0,1} matrices: A=B=1. + * Another example, for the Netflix matrix: A=1, B=5 + * + * For those column pairs that are above the threshold, + * the computed similarity is correct to within 20% relative error with probability + * at least 1 - (0.981)^(100/B) + * + * The shuffle size is bounded by the *smaller* of the following two expressions: + * + * O(n log(n) L / (threshold * A)) + * O(m L^2) + * + * The latter is the cost of the brute-force approach, so for non-zero thresholds, + * the cost is always cheaper than the brute-force approach. + * + * @param threshold Similarities above this threshold are probably computed correctly. + * Set to 0 for deterministic guaranteed correctness. + * @return An n x n sparse upper-triangular matrix of cosine similarities + * between columns of this matrix. + */ + def similarColumns(threshold: Double): CoordinateMatrix = { +require(threshold = 0 threshold = 1, sThreshold not in [0,1]: $threshold) + +val gamma = if (math.abs(threshold) 1e-6) { + Double.PositiveInfinity +} else { + 100 * math.log(numCols()) / threshold +} + +similarColumnsDIMSUM(computeColumnSummaryStatistics().normL2.toArray, gamma) + } + + /** + * Find all similar columns using the DIMSUM sampling algorithm, described in two papers + * + * http://arxiv.org/abs/1206.2082 + * http://arxiv.org/abs/1304.1467 + * + * @param colMags A vector of column magnitudes + * @param gamma The oversampling parameter. For provable results, set to 100 * log(n) / s, + * where s is the smallest similarity score to be estimated, + * and n is the number of columns + * @return An n x n sparse upper-triangular matrix of cosine similarities + * between columns of this matrix. + */ + private[mllib] def similarColumnsDIMSUM(colMags: Array[Double], + gamma: Double): CoordinateMatrix = { +require(gamma 1.0, sOversampling should be greater than 1: $gamma) +require(colMags.size == this.numCols(), Number of magnitudes didn't match column dimension) + +val sg = math.sqrt(gamma) // sqrt(gamma) used many times + +val sims = rows.flatMap { row = + val buf = new ListBuffer[((Int, Int), Double)]() + row.toBreeze.activeIterator.foreach { +case (_, 0.0) = // Skip explicit zero elements. +case (i, iVal) = + val rand = new scala.util.Random(iVal.toLong) --- End diff -- It won't give you a pseudo random sequence. Think about the case when all values are the same. The seed should be set on a partition level. Could you update this block with `mapPartitionsWithIndex`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled
[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1778#discussion_r17579676 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala --- @@ -390,6 +393,113 @@ class RowMatrix( new RowMatrix(AB, nRows, B.numCols) } + /** + * Compute all cosine similarities between columns of this matrix using the brute-force + * approach of computing normalized dot products. + * + * @return An n x n sparse upper-triangular matrix of cosine similarities between + * columns of this matrix. + */ + def columnSimilarities(): CoordinateMatrix = { +similarColumns(0.0) + } + + /** + * Compute all similarities between columns of this matrix using a sampling approach. + * + * The threshold parameter is a trade-off knob between estimate quality and computational cost. + * + * Setting a threshold of 0 guarantees deterministic correct results, but comes at exactly + * the same cost as the brute-force approach. Setting the threshold to positive values + * incurs strictly less computational cost than the brute-force approach, however the + * similarities computed will be estimates. + * + * The sampling guarantees relative-error correctness for those pairs of columns that have + * similarity greater than the given similarity threshold. + * + * To describe the guarantee, we set some notation: + * Let A be the smallest in magnitude non-zero element of this matrix. + * Let B be the largest in magnitude non-zero element of this matrix. + * Let L be the number of non-zeros per row. + * + * For example, for {0,1} matrices: A=B=1. + * Another example, for the Netflix matrix: A=1, B=5 + * + * For those column pairs that are above the threshold, + * the computed similarity is correct to within 20% relative error with probability + * at least 1 - (0.981)^(100/B) + * + * The shuffle size is bounded by the *smaller* of the following two expressions: + * + * O(n log(n) L / (threshold * A)) + * O(m L^2) + * + * The latter is the cost of the brute-force approach, so for non-zero thresholds, + * the cost is always cheaper than the brute-force approach. + * + * @param threshold Similarities above this threshold are probably computed correctly. + * Set to 0 for deterministic guaranteed correctness. + * @return An n x n sparse upper-triangular matrix of cosine similarities + * between columns of this matrix. + */ + def similarColumns(threshold: Double): CoordinateMatrix = { +require(threshold = 0 threshold = 1, sThreshold not in [0,1]: $threshold) + +val gamma = if (math.abs(threshold) 1e-6) { + Double.PositiveInfinity +} else { + 100 * math.log(numCols()) / threshold +} + +similarColumnsDIMSUM(computeColumnSummaryStatistics().normL2.toArray, gamma) + } + + /** + * Find all similar columns using the DIMSUM sampling algorithm, described in two papers + * + * http://arxiv.org/abs/1206.2082 + * http://arxiv.org/abs/1304.1467 + * + * @param colMags A vector of column magnitudes + * @param gamma The oversampling parameter. For provable results, set to 100 * log(n) / s, + * where s is the smallest similarity score to be estimated, + * and n is the number of columns + * @return An n x n sparse upper-triangular matrix of cosine similarities + * between columns of this matrix. + */ + private[mllib] def similarColumnsDIMSUM(colMags: Array[Double], + gamma: Double): CoordinateMatrix = { --- End diff -- 4-space indentation --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1778#discussion_r17579669 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala --- @@ -390,6 +393,113 @@ class RowMatrix( new RowMatrix(AB, nRows, B.numCols) } + /** + * Compute all cosine similarities between columns of this matrix using the brute-force + * approach of computing normalized dot products. + * + * @return An n x n sparse upper-triangular matrix of cosine similarities between + * columns of this matrix. + */ + def columnSimilarities(): CoordinateMatrix = { +similarColumns(0.0) + } + + /** + * Compute all similarities between columns of this matrix using a sampling approach. + * + * The threshold parameter is a trade-off knob between estimate quality and computational cost. + * + * Setting a threshold of 0 guarantees deterministic correct results, but comes at exactly + * the same cost as the brute-force approach. Setting the threshold to positive values + * incurs strictly less computational cost than the brute-force approach, however the + * similarities computed will be estimates. + * + * The sampling guarantees relative-error correctness for those pairs of columns that have + * similarity greater than the given similarity threshold. + * + * To describe the guarantee, we set some notation: + * Let A be the smallest in magnitude non-zero element of this matrix. + * Let B be the largest in magnitude non-zero element of this matrix. + * Let L be the number of non-zeros per row. + * + * For example, for {0,1} matrices: A=B=1. + * Another example, for the Netflix matrix: A=1, B=5 + * + * For those column pairs that are above the threshold, + * the computed similarity is correct to within 20% relative error with probability + * at least 1 - (0.981)^(100/B) + * + * The shuffle size is bounded by the *smaller* of the following two expressions: + * + * O(n log(n) L / (threshold * A)) + * O(m L^2) + * + * The latter is the cost of the brute-force approach, so for non-zero thresholds, + * the cost is always cheaper than the brute-force approach. + * + * @param threshold Similarities above this threshold are probably computed correctly. --- End diff -- Elaborate more on `probably computed correctly`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1778#discussion_r17579662 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala --- @@ -390,6 +393,113 @@ class RowMatrix( new RowMatrix(AB, nRows, B.numCols) } + /** + * Compute all cosine similarities between columns of this matrix using the brute-force + * approach of computing normalized dot products. + * + * @return An n x n sparse upper-triangular matrix of cosine similarities between + * columns of this matrix. + */ + def columnSimilarities(): CoordinateMatrix = { +similarColumns(0.0) + } + + /** + * Compute all similarities between columns of this matrix using a sampling approach. --- End diff -- `all`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1778#discussion_r17579648 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala --- @@ -27,10 +28,12 @@ import com.github.fommil.netlib.BLAS.{getInstance = blas} import org.apache.spark.annotation.Experimental import org.apache.spark.mllib.linalg._ import org.apache.spark.rdd.RDD +import org.apache.spark.SparkContext._ import org.apache.spark.Logging import org.apache.spark.mllib.rdd.RDDFunctions._ import org.apache.spark.mllib.stat.{MultivariateOnlineSummarizer, MultivariateStatisticalSummary} + --- End diff -- remove extra empty line --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1778#discussion_r17579690 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/RowMatrixSuite.scala --- @@ -95,6 +95,33 @@ class RowMatrixSuite extends FunSuite with LocalSparkContext { } } + test(similar columns) { +val colMags = Vectors.dense(Math.sqrt(126), Math.sqrt(66), Math.sqrt(94)) +val expected = BDM( + (126.0, 54.0, 72.0), + (54.0, 66.0, 78.0), + (72.0, 78.0, 94.0)) + +for(i - 0 until n) for(j - 0 until n) { + expected(i, j) /= (colMags(i) * colMags(j)) +} + +for (mat - Seq(denseMat, sparseMat)) { + val G = mat.similarColumns(150.0) + assert(closeToZero(G.toBreeze() - expected)) +} + +for (mat - Seq(denseMat, sparseMat)) { + val G = mat.similarColumns() + assert(closeToZero(G.toBreeze() - expected)) +} + +for (mat - Seq(denseMat, sparseMat)) { + val G = mat.similarColumnsDIMSUM(colMags.toArray, 150.0) + assert(closeToZero(G.toBreeze() - expected)) +} + } + --- End diff -- Does the test output only a subset of column pairs? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1778#discussion_r17579687 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.scala --- @@ -53,4 +53,14 @@ trait MultivariateStatisticalSummary { * Minimum value of each column. */ def min: Vector + + /** + * Euclidean magnitude of each column + */ + def normL2: Vector + + /** + * L1 norm of each column + */ + def normL1: Vector --- End diff -- `mean`, `stddev`, and `variance` are summary statistics of the random variable. The 1-norm of a random variable should be `E[|X|]` instead of `\sum_i |x_i|`. See http://www.math.uah.edu/stat/expect/Spaces.html . I suggest changing the method names to `norm1` and `norm2` and outputs the average instead. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1778#discussion_r17579689 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/RowMatrixSuite.scala --- @@ -95,6 +95,40 @@ class RowMatrixSuite extends FunSuite with LocalSparkContext { } } + test(similar columns) { +val colMags = Vectors.dense(Math.sqrt(126), Math.sqrt(66), Math.sqrt(94)) +val expected = BDM( + (0.0, 54.0, 72.0), + (0.0, 0.0, 78.0), + (0.0, 0.0, 0.0)) + +for(i - 0 until n) for(j - 0 until n) { + expected(i, j) /= (colMags(i) * colMags(j)) +} + +for (mat - Seq(denseMat, sparseMat)) { + val G = mat.similarColumns(0.1).toBreeze() + for(i - 0 until n) for(j - 0 until n) { --- End diff -- `for (i - 0 until n; j - 0 until n) {` (space after for and merge two for statements) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2294#issuecomment-55684623 test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2356#issuecomment-55684719 @davies Thanks for working on MLlib's SerDe! It definitely simplifies future Python API implementations. We will wait #2378 . --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2347#issuecomment-55685703 test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2347#issuecomment-55701824 this is ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3550][MLLIB] Disable automatic rdd cach...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2412#issuecomment-55776119 this is ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3550][MLLIB] Disable automatic rdd cach...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2412#issuecomment-55776094 add to whitelist --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55795901 @davies Couple Python tests failed with this change. Could you fix them? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55795929 test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17630886 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala --- @@ -775,17 +775,38 @@ private[spark] object PythonRDD extends Logging { }.toJavaRDD() } + private class AutoBatchedPickler(iter: Iterator[Any]) extends Iterator[Array[Byte]] { +private val pickle = new Pickler() +private var batch = 1 +private val buffer = new mutable.ArrayBuffer[Any] + +override def hasNext(): Boolean = iter.hasNext + +override def next(): Array[Byte] = { + while (iter.hasNext buffer.length batch) { +buffer += iter.next() + } + val bytes = pickle.dumps(buffer.toArray) + val size = bytes.length + // let 1M size 10M + if (size 1024 * 100) { +batch = (1024 * 100) / size // fast grow --- End diff -- If the first record is small, e.g., a SparseVector with a single nonzero, and the records followed are large vectors, line 789 may cause memory problems. Does it give significant performance gain? under what circumstances? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17632544 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala --- @@ -775,17 +775,38 @@ private[spark] object PythonRDD extends Logging { }.toJavaRDD() } + private class AutoBatchedPickler(iter: Iterator[Any]) extends Iterator[Array[Byte]] { +private val pickle = new Pickler() +private var batch = 1 +private val buffer = new mutable.ArrayBuffer[Any] + +override def hasNext(): Boolean = iter.hasNext + +override def next(): Array[Byte] = { + while (iter.hasNext buffer.length batch) { +buffer += iter.next() + } + val bytes = pickle.dumps(buffer.toArray) + val size = bytes.length + // let 1M size 10M + if (size 1024 * 100) { +batch = (1024 * 100) / size // fast grow + } else if (size 1024 * 1024) { +batch *= 2 + } else if (size 1024 * 1024 * 10) { +batch /= 2 --- End diff -- If the first record is very large, `batch` will be 0. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-55977238 add to whitelist --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-55977249 this is ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2294#issuecomment-55977479 Adding new methods to a trait is a break change. We can mark `Vector` and `Matrix` as sealed, so no one can extend them. From Jenkins log: ~~~ [error] * method transposeTimes(org.apache.spark.mllib.linalg.DenseVector)org.apache.spark.mllib.linalg.DenseVector in trait org.apache.spark.mllib.linalg.Matrix does not have a correspondent in old version [error]filter with: ProblemFilters.exclude[MissingMethodProblem](org.apache.spark.mllib.linalg.Matrix.transposeTimes) [error] * method transposeTimes(org.apache.spark.mllib.linalg.DenseMatrix)org.apache.spark.mllib.linalg.DenseMatrix in trait org.apache.spark.mllib.linalg.Matrix does not have a correspondent in old version [error]filter with: ProblemFilters.exclude[MissingMethodProblem](org.apache.spark.mllib.linalg.Matrix.transposeTimes) [error] * method times(org.apache.spark.mllib.linalg.DenseVector)org.apache.spark.mllib.linalg.DenseVector in trait org.apache.spark.mllib.linalg.Matrix does not have a correspondent in old version [error]filter with: ProblemFilters.exclude[MissingMethodProblem](org.apache.spark.mllib.linalg.Matrix.times) [error] * method times(org.apache.spark.mllib.linalg.DenseMatrix)org.apache.spark.mllib.linalg.DenseMatrix in trait org.apache.spark.mllib.linalg.Matrix does not have a correspondent in old version [error]filter with: ProblemFilters.exclude[MissingMethodProblem](org.apache.spark.mllib.linalg.Matrix.times) [error] * method copy()org.apache.spark.mllib.linalg.Matrix in trait org.apache.spark.mllib.linalg.Matrix does not have a correspondent in old version [error]filter with: ProblemFilters.exclude[MissingMethodProblem](org.apache.spark.mllib.linalg.Matrix.copy) ~~~ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r1770 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(sscal doesn't support vector type ${x.getClass}.) } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA specify whether to use matrix A, or the transpose of matrix A. Should be N or --- End diff -- Doc needs update. It is a boolean now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17704453 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(sscal doesn't support vector type ${x.getClass}.) } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA specify whether to use matrix A, or the transpose of matrix A. Should be N or + * n to use A, and T or t to use the transpose of A. + * @param transB specify whether to use matrix B, or the transpose of matrix B. Should be N or + * n to use B, and T or t to use the transpose of B. + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logWarning(gemm: alpha is equal to 0. Returning C.) +} else { + A match { +case sparse: SparseMatrix = + gemm(transA, transB, alpha, sparse, B, beta, C) +case dense: DenseMatrix = + gemm(transA, transB, alpha, dense, B, beta, C) +case _ = + throw new IllegalArgumentException(sgemm doesn't support matrix type ${A.getClass}.) + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) N else T +val tBstr = if (!transB) N else T + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +val Avals = A.values +val Arows = if (!transA) A.rowIndices else A.colPtrs +val Acols = if (!transA) A.colPtrs else A.rowIndices + +// Slicing is easy in this case. This is the optimal
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17704450 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(sscal doesn't support vector type ${x.getClass}.) } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA specify whether to use matrix A, or the transpose of matrix A. Should be N or + * n to use A, and T or t to use the transpose of A. + * @param transB specify whether to use matrix B, or the transpose of matrix B. Should be N or + * n to use B, and T or t to use the transpose of B. + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logWarning(gemm: alpha is equal to 0. Returning C.) +} else { + A match { +case sparse: SparseMatrix = + gemm(transA, transB, alpha, sparse, B, beta, C) +case dense: DenseMatrix = + gemm(transA, transB, alpha, dense, B, beta, C) +case _ = + throw new IllegalArgumentException(sgemm doesn't support matrix type ${A.getClass}.) + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) N else T +val tBstr = if (!transB) N else T + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +val Avals = A.values +val Arows = if (!transA) A.rowIndices else A.colPtrs +val Acols = if (!transA) A.colPtrs else A.rowIndices + +// Slicing is easy in this case. This is the optimal
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17704455 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(sscal doesn't support vector type ${x.getClass}.) } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA specify whether to use matrix A, or the transpose of matrix A. Should be N or + * n to use A, and T or t to use the transpose of A. + * @param transB specify whether to use matrix B, or the transpose of matrix B. Should be N or + * n to use B, and T or t to use the transpose of B. + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logWarning(gemm: alpha is equal to 0. Returning C.) +} else { + A match { +case sparse: SparseMatrix = + gemm(transA, transB, alpha, sparse, B, beta, C) +case dense: DenseMatrix = + gemm(transA, transB, alpha, dense, B, beta, C) +case _ = + throw new IllegalArgumentException(sgemm doesn't support matrix type ${A.getClass}.) + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) N else T +val tBstr = if (!transB) N else T + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +val Avals = A.values +val Arows = if (!transA) A.rowIndices else A.colPtrs +val Acols = if (!transA) A.colPtrs else A.rowIndices + +// Slicing is easy in this case. This is the optimal
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17704445 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(sscal doesn't support vector type ${x.getClass}.) } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA specify whether to use matrix A, or the transpose of matrix A. Should be N or + * n to use A, and T or t to use the transpose of A. + * @param transB specify whether to use matrix B, or the transpose of matrix B. Should be N or + * n to use B, and T or t to use the transpose of B. + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logWarning(gemm: alpha is equal to 0. Returning C.) --- End diff -- `gemm` maybe called many times and there are cases when `alpha` is indeed zero. Maybe we can change it to `logDebug` to avoid outputting many log messages at a normal level. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17704447 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(sscal doesn't support vector type ${x.getClass}.) } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA specify whether to use matrix A, or the transpose of matrix A. Should be N or + * n to use A, and T or t to use the transpose of A. + * @param transB specify whether to use matrix B, or the transpose of matrix B. Should be N or + * n to use B, and T or t to use the transpose of B. + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logWarning(gemm: alpha is equal to 0. Returning C.) +} else { + A match { +case sparse: SparseMatrix = + gemm(transA, transB, alpha, sparse, B, beta, C) +case dense: DenseMatrix = + gemm(transA, transB, alpha, dense, B, beta, C) +case _ = + throw new IllegalArgumentException(sgemm doesn't support matrix type ${A.getClass}.) + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) N else T +val tBstr = if (!transB) N else T + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +val Avals = A.values +val Arows = if (!transA) A.rowIndices else A.colPtrs +val Acols = if (!transA) A.colPtrs else A.rowIndices + +// Slicing is easy in this case. This is the optimal
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17704457 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(sscal doesn't support vector type ${x.getClass}.) } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA specify whether to use matrix A, or the transpose of matrix A. Should be N or + * n to use A, and T or t to use the transpose of A. + * @param transB specify whether to use matrix B, or the transpose of matrix B. Should be N or + * n to use B, and T or t to use the transpose of B. + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logWarning(gemm: alpha is equal to 0. Returning C.) +} else { + A match { +case sparse: SparseMatrix = + gemm(transA, transB, alpha, sparse, B, beta, C) +case dense: DenseMatrix = + gemm(transA, transB, alpha, dense, B, beta, C) +case _ = + throw new IllegalArgumentException(sgemm doesn't support matrix type ${A.getClass}.) + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) N else T +val tBstr = if (!transB) N else T + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +val Avals = A.values +val Arows = if (!transA) A.rowIndices else A.colPtrs +val Acols = if (!transA) A.colPtrs else A.rowIndices + +// Slicing is easy in this case. This is the optimal
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17704449 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(sscal doesn't support vector type ${x.getClass}.) } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA specify whether to use matrix A, or the transpose of matrix A. Should be N or + * n to use A, and T or t to use the transpose of A. + * @param transB specify whether to use matrix B, or the transpose of matrix B. Should be N or + * n to use B, and T or t to use the transpose of B. + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logWarning(gemm: alpha is equal to 0. Returning C.) +} else { + A match { +case sparse: SparseMatrix = + gemm(transA, transB, alpha, sparse, B, beta, C) +case dense: DenseMatrix = + gemm(transA, transB, alpha, dense, B, beta, C) +case _ = + throw new IllegalArgumentException(sgemm doesn't support matrix type ${A.getClass}.) + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) N else T +val tBstr = if (!transB) N else T + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +val Avals = A.values +val Arows = if (!transA) A.rowIndices else A.colPtrs +val Acols = if (!transA) A.colPtrs else A.rowIndices + +// Slicing is easy in this case. This is the optimal
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17704452 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(sscal doesn't support vector type ${x.getClass}.) } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA specify whether to use matrix A, or the transpose of matrix A. Should be N or + * n to use A, and T or t to use the transpose of A. + * @param transB specify whether to use matrix B, or the transpose of matrix B. Should be N or + * n to use B, and T or t to use the transpose of B. + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logWarning(gemm: alpha is equal to 0. Returning C.) +} else { + A match { +case sparse: SparseMatrix = + gemm(transA, transB, alpha, sparse, B, beta, C) +case dense: DenseMatrix = + gemm(transA, transB, alpha, dense, B, beta, C) +case _ = + throw new IllegalArgumentException(sgemm doesn't support matrix type ${A.getClass}.) + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) N else T +val tBstr = if (!transB) N else T + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +val Avals = A.values +val Arows = if (!transA) A.rowIndices else A.colPtrs +val Acols = if (!transA) A.colPtrs else A.rowIndices + +// Slicing is easy in this case. This is the optimal
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17704454 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(sscal doesn't support vector type ${x.getClass}.) } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA specify whether to use matrix A, or the transpose of matrix A. Should be N or + * n to use A, and T or t to use the transpose of A. + * @param transB specify whether to use matrix B, or the transpose of matrix B. Should be N or + * n to use B, and T or t to use the transpose of B. + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logWarning(gemm: alpha is equal to 0. Returning C.) +} else { + A match { +case sparse: SparseMatrix = + gemm(transA, transB, alpha, sparse, B, beta, C) +case dense: DenseMatrix = + gemm(transA, transB, alpha, dense, B, beta, C) +case _ = + throw new IllegalArgumentException(sgemm doesn't support matrix type ${A.getClass}.) + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) N else T +val tBstr = if (!transB) N else T + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +val Avals = A.values +val Arows = if (!transA) A.rowIndices else A.colPtrs +val Acols = if (!transA) A.colPtrs else A.rowIndices + +// Slicing is easy in this case. This is the optimal
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17704446 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(sscal doesn't support vector type ${x.getClass}.) } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA specify whether to use matrix A, or the transpose of matrix A. Should be N or + * n to use A, and T or t to use the transpose of A. + * @param transB specify whether to use matrix B, or the transpose of matrix B. Should be N or + * n to use B, and T or t to use the transpose of B. + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logWarning(gemm: alpha is equal to 0. Returning C.) +} else { + A match { +case sparse: SparseMatrix = + gemm(transA, transB, alpha, sparse, B, beta, C) +case dense: DenseMatrix = + gemm(transA, transB, alpha, dense, B, beta, C) +case _ = + throw new IllegalArgumentException(sgemm doesn't support matrix type ${A.getClass}.) + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) N else T +val tBstr = if (!transB) N else T + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +val Avals = A.values +val Arows = if (!transA) A.rowIndices else A.colPtrs +val Acols = if (!transA) A.colPtrs else A.rowIndices + +// Slicing is easy in this case. This is the optimal
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709067 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(sscal doesn't support vector type ${x.getClass}.) } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA specify whether to use matrix A, or the transpose of matrix A. Should be N or + * n to use A, and T or t to use the transpose of A. + * @param transB specify whether to use matrix B, or the transpose of matrix B. Should be N or + * n to use B, and T or t to use the transpose of B. + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logWarning(gemm: alpha is equal to 0. Returning C.) +} else { + A match { +case sparse: SparseMatrix = + gemm(transA, transB, alpha, sparse, B, beta, C) +case dense: DenseMatrix = + gemm(transA, transB, alpha, dense, B, beta, C) +case _ = + throw new IllegalArgumentException(sgemm doesn't support matrix type ${A.getClass}.) + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) N else T +val tBstr = if (!transB) N else T + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +val Avals = A.values +val Arows = if (!transA) A.rowIndices else A.colPtrs +val Acols = if (!transA) A.colPtrs else A.rowIndices + +// Slicing is easy in this case. This is the optimal
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709059 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(sscal doesn't support vector type ${x.getClass}.) } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA specify whether to use matrix A, or the transpose of matrix A. Should be N or + * n to use A, and T or t to use the transpose of A. + * @param transB specify whether to use matrix B, or the transpose of matrix B. Should be N or + * n to use B, and T or t to use the transpose of B. + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logWarning(gemm: alpha is equal to 0. Returning C.) +} else { + A match { +case sparse: SparseMatrix = + gemm(transA, transB, alpha, sparse, B, beta, C) +case dense: DenseMatrix = + gemm(transA, transB, alpha, dense, B, beta, C) +case _ = + throw new IllegalArgumentException(sgemm doesn't support matrix type ${A.getClass}.) + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) N else T +val tBstr = if (!transB) N else T + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +val Avals = A.values +val Arows = if (!transA) A.rowIndices else A.colPtrs +val Acols = if (!transA) A.colPtrs else A.rowIndices + +// Slicing is easy in this case. This is the optimal
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709070 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(sscal doesn't support vector type ${x.getClass}.) } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA specify whether to use matrix A, or the transpose of matrix A. Should be N or + * n to use A, and T or t to use the transpose of A. + * @param transB specify whether to use matrix B, or the transpose of matrix B. Should be N or + * n to use B, and T or t to use the transpose of B. + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logWarning(gemm: alpha is equal to 0. Returning C.) +} else { + A match { +case sparse: SparseMatrix = + gemm(transA, transB, alpha, sparse, B, beta, C) +case dense: DenseMatrix = + gemm(transA, transB, alpha, dense, B, beta, C) +case _ = + throw new IllegalArgumentException(sgemm doesn't support matrix type ${A.getClass}.) + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) N else T +val tBstr = if (!transB) N else T + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +val Avals = A.values +val Arows = if (!transA) A.rowIndices else A.colPtrs +val Acols = if (!transA) A.colPtrs else A.rowIndices + +// Slicing is easy in this case. This is the optimal
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709063 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(sscal doesn't support vector type ${x.getClass}.) } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA specify whether to use matrix A, or the transpose of matrix A. Should be N or + * n to use A, and T or t to use the transpose of A. + * @param transB specify whether to use matrix B, or the transpose of matrix B. Should be N or + * n to use B, and T or t to use the transpose of B. + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logWarning(gemm: alpha is equal to 0. Returning C.) +} else { + A match { +case sparse: SparseMatrix = + gemm(transA, transB, alpha, sparse, B, beta, C) +case dense: DenseMatrix = + gemm(transA, transB, alpha, dense, B, beta, C) +case _ = + throw new IllegalArgumentException(sgemm doesn't support matrix type ${A.getClass}.) + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) N else T +val tBstr = if (!transB) N else T + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +val Avals = A.values +val Arows = if (!transA) A.rowIndices else A.colPtrs +val Acols = if (!transA) A.colPtrs else A.rowIndices + +// Slicing is easy in this case. This is the optimal
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709076 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala --- @@ -36,9 +37,42 @@ trait Matrix extends Serializable { /** Converts to a breeze matrix. */ private[mllib] def toBreeze: BM[Double] + /** Gets the i-th element in the array backing the matrix. */ + private[mllib] def apply(i: Int): Double + /** Gets the (i, j)-th element. */ - private[mllib] def apply(i: Int, j: Int): Double = toBreeze(i, j) + private[mllib] def apply(i: Int, j: Int): Double + + /** Return the index for the (i, j)-th element in the backing array. */ + private[mllib] def index(i: Int, j: Int): Int + + /** Update element at (i, j) */ + private[mllib] def update(i: Int, j: Int, v: Double) + + /** Get a deep copy of the matrix. */ + def copy: Matrix + + /** Convenience method for `Matrix`-`DenseMatrix` multiplication. */ + def times(y: DenseMatrix): DenseMatrix = { --- End diff -- We use `multiply` in distributed matrix and we should be consistent. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709072 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala --- @@ -36,9 +37,42 @@ trait Matrix extends Serializable { /** Converts to a breeze matrix. */ private[mllib] def toBreeze: BM[Double] + /** Gets the i-th element in the array backing the matrix. */ + private[mllib] def apply(i: Int): Double --- End diff -- Maybe it is good to remove this method. Because it may cause confusion between DenseMatrix and SparseMatrix. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709058 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(sscal doesn't support vector type ${x.getClass}.) } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA specify whether to use matrix A, or the transpose of matrix A. Should be N or + * n to use A, and T or t to use the transpose of A. + * @param transB specify whether to use matrix B, or the transpose of matrix B. Should be N or + * n to use B, and T or t to use the transpose of B. + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logWarning(gemm: alpha is equal to 0. Returning C.) +} else { + A match { +case sparse: SparseMatrix = + gemm(transA, transB, alpha, sparse, B, beta, C) +case dense: DenseMatrix = + gemm(transA, transB, alpha, dense, B, beta, C) +case _ = + throw new IllegalArgumentException(sgemm doesn't support matrix type ${A.getClass}.) + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) N else T +val tBstr = if (!transB) N else T + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +val Avals = A.values +val Arows = if (!transA) A.rowIndices else A.colPtrs +val Acols = if (!transA) A.colPtrs else A.rowIndices + +// Slicing is easy in this case. This is the optimal
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709060 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(sscal doesn't support vector type ${x.getClass}.) } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA specify whether to use matrix A, or the transpose of matrix A. Should be N or + * n to use A, and T or t to use the transpose of A. + * @param transB specify whether to use matrix B, or the transpose of matrix B. Should be N or + * n to use B, and T or t to use the transpose of B. + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logWarning(gemm: alpha is equal to 0. Returning C.) +} else { + A match { +case sparse: SparseMatrix = + gemm(transA, transB, alpha, sparse, B, beta, C) +case dense: DenseMatrix = + gemm(transA, transB, alpha, dense, B, beta, C) +case _ = + throw new IllegalArgumentException(sgemm doesn't support matrix type ${A.getClass}.) + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) N else T +val tBstr = if (!transB) N else T + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +val Avals = A.values +val Arows = if (!transA) A.rowIndices else A.colPtrs +val Acols = if (!transA) A.colPtrs else A.rowIndices + +// Slicing is easy in this case. This is the optimal
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709065 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(sscal doesn't support vector type ${x.getClass}.) } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA specify whether to use matrix A, or the transpose of matrix A. Should be N or + * n to use A, and T or t to use the transpose of A. + * @param transB specify whether to use matrix B, or the transpose of matrix B. Should be N or + * n to use B, and T or t to use the transpose of B. + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logWarning(gemm: alpha is equal to 0. Returning C.) +} else { + A match { +case sparse: SparseMatrix = + gemm(transA, transB, alpha, sparse, B, beta, C) +case dense: DenseMatrix = + gemm(transA, transB, alpha, dense, B, beta, C) +case _ = + throw new IllegalArgumentException(sgemm doesn't support matrix type ${A.getClass}.) + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) N else T +val tBstr = if (!transB) N else T + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +val Avals = A.values +val Arows = if (!transA) A.rowIndices else A.colPtrs +val Acols = if (!transA) A.colPtrs else A.rowIndices + +// Slicing is easy in this case. This is the optimal
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709081 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala --- @@ -59,11 +93,113 @@ trait Matrix extends Serializable { */ class DenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Matrix { - require(values.length == numRows * numCols) + require(values.length == numRows * numCols, The number of values supplied doesn't match the + +ssize of the matrix! values.length: ${values.length}, numRows * numCols: ${numRows * numCols}) override def toArray: Array[Double] = values - private[mllib] override def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values) + private[mllib] def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values) + + private[mllib] def apply(i: Int): Double = values(i) + + private[mllib] def apply(i: Int, j: Int): Double = values(index(i, j)) + + private[mllib] def index(i: Int, j: Int): Int = i + numRows * j + + private[mllib] def update(i: Int, j: Int, v: Double){ +values(index(i, j)) = v + } + + override def copy = new DenseMatrix(numRows, numCols, values.clone()) +} + +/** + * Column-majored sparse matrix. + * The entry values are stored in Compressed Sparse Column (CSC) format. + * For example, the following matrix + * {{{ + * 1.0 0.0 4.0 + * 0.0 3.0 5.0 + * 2.0 0.0 6.0 + * }}} + * is stored as `values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]`, + * `rowIndices=[0, 2, 1, 0, 1, 2]`, `colPointers=[0, 2, 3, 6]`. + * + * @param numRows number of rows + * @param numCols number of columns + * @param colPtrs the index corresponding to the start of a new column + * @param rowIndices the row index of the entry --- End diff -- Let's write down the contract. `rowIndices` should be strictly increasing for each column. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709077 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala --- @@ -36,9 +37,42 @@ trait Matrix extends Serializable { /** Converts to a breeze matrix. */ private[mllib] def toBreeze: BM[Double] + /** Gets the i-th element in the array backing the matrix. */ + private[mllib] def apply(i: Int): Double + /** Gets the (i, j)-th element. */ - private[mllib] def apply(i: Int, j: Int): Double = toBreeze(i, j) + private[mllib] def apply(i: Int, j: Int): Double + + /** Return the index for the (i, j)-th element in the backing array. */ + private[mllib] def index(i: Int, j: Int): Int + + /** Update element at (i, j) */ + private[mllib] def update(i: Int, j: Int, v: Double) + + /** Get a deep copy of the matrix. */ + def copy: Matrix + + /** Convenience method for `Matrix`-`DenseMatrix` multiplication. */ + def times(y: DenseMatrix): DenseMatrix = { +val C: DenseMatrix = Matrices.zeros(numRows, y.numCols).asInstanceOf[DenseMatrix] +BLAS.gemm(false, false, 1.0, this, y, 0.0, C) +C + } + + /** Convenience method for `Matrix`-`DenseVector` multiplication. */ + def times(y: DenseVector): DenseVector = BLAS.gemv(1.0, this, y) + + /** Convenience method for `Matrix`^T^-`DenseMatrix` multiplication. */ + def transposeTimes(y: DenseMatrix): DenseMatrix = { --- End diff -- another approach is to have a boolean flag in matrix indicating whether it is transposed or not. ~~~ A.trans().times(y) ~~~ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709069 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,368 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(sscal doesn't support vector type ${x.getClass}.) } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA specify whether to use matrix A, or the transpose of matrix A. Should be N or + * n to use A, and T or t to use the transpose of A. + * @param transB specify whether to use matrix B, or the transpose of matrix B. Should be N or + * n to use B, and T or t to use the transpose of B. + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logWarning(gemm: alpha is equal to 0. Returning C.) +} else { + A match { +case sparse: SparseMatrix = + gemm(transA, transB, alpha, sparse, B, beta, C) +case dense: DenseMatrix = + gemm(transA, transB, alpha, dense, B, beta, C) +case _ = + throw new IllegalArgumentException(sgemm doesn't support matrix type ${A.getClass}.) + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) N else T +val tBstr = if (!transB) N else T + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +require(kA == kB, sThe columns of A don't match the rows of B. A: $kA, B: $kB) +require(mA == C.numRows, sThe rows of C don't match the rows of A. C: ${C.numRows}, A: $mA) +require(nB == C.numCols, + sThe columns of C don't match the columns of B. C: ${C.numCols}, A: $nB) + +val Avals = A.values +val Arows = if (!transA) A.rowIndices else A.colPtrs +val Acols = if (!transA) A.colPtrs else A.rowIndices + +// Slicing is easy in this case. This is the optimal
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709101 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala --- @@ -36,4 +36,79 @@ class MatricesSuite extends FunSuite { Matrices.dense(3, 2, Array(0.0, 1.0, 2.0)) } } + + test(sparse matrix construction) { +val m = 3 +val n = 2 +val values = Array(1.0, 2.0, 4.0, 5.0) +val colIndices = Array(0, 2, 4) +val rowIndices = Array(1, 2, 1, 2) +val mat = Matrices.sparse(m, n, colIndices, rowIndices, values).asInstanceOf[SparseMatrix] +assert(mat.numRows === m) +assert(mat.numCols === n) +assert(mat.values.eq(values), should not copy data) + } + + test(sparse matrix construction with wrong number of elements) { +intercept[RuntimeException] { --- End diff -- It should be `IllegalArgumentException` (thrown by `require`) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709089 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala --- @@ -83,6 +219,24 @@ object Matrices { } /** + * Creates a column-majored sparse matrix in Compressed Sparse Column (CSC) format. + * + * @param numRows number of rows + * @param numCols number of columns + * @param colPointers the index corresponding to the start of a new column + * @param rowIndices the row index of the entry + * @param values non-zero matrix entries in column major + */ + def sparse( + numRows: Int, + numCols: Int, + colPointers: Array[Int], --- End diff -- `colPtrs` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709102 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala --- @@ -36,4 +36,79 @@ class MatricesSuite extends FunSuite { Matrices.dense(3, 2, Array(0.0, 1.0, 2.0)) } } + + test(sparse matrix construction) { +val m = 3 +val n = 2 +val values = Array(1.0, 2.0, 4.0, 5.0) +val colIndices = Array(0, 2, 4) +val rowIndices = Array(1, 2, 1, 2) +val mat = Matrices.sparse(m, n, colIndices, rowIndices, values).asInstanceOf[SparseMatrix] +assert(mat.numRows === m) +assert(mat.numCols === n) +assert(mat.values.eq(values), should not copy data) + } + + test(sparse matrix construction with wrong number of elements) { +intercept[RuntimeException] { + Matrices.sparse(3, 2, Array(0, 1), Array(1, 2, 1), Array(0.0, 1.0, 2.0)) +} + +intercept[RuntimeException] { + Matrices.sparse(3, 2, Array(0, 1, 2), Array(1, 2), Array(0.0, 1.0, 2.0)) +} + } + + test(matrix copies are deep copies) { +val m = 3 +val n = 2 + +val denseMat = Matrices.dense(m, n, Array(0.0, 1.0, 2.0, 3.0, 4.0, 5.0)) +val denseCopy = denseMat.copy + +assert(!denseMat.toArray.eq(denseCopy.toArray)) + +val values = Array(1.0, 2.0, 4.0, 5.0) +val colIndices = Array(0, 2, 4) --- End diff -- `colPtrs` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709086 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala --- @@ -59,11 +93,113 @@ trait Matrix extends Serializable { */ class DenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Matrix { - require(values.length == numRows * numCols) + require(values.length == numRows * numCols, The number of values supplied doesn't match the + +ssize of the matrix! values.length: ${values.length}, numRows * numCols: ${numRows * numCols}) override def toArray: Array[Double] = values - private[mllib] override def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values) + private[mllib] def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values) + + private[mllib] def apply(i: Int): Double = values(i) + + private[mllib] def apply(i: Int, j: Int): Double = values(index(i, j)) + + private[mllib] def index(i: Int, j: Int): Int = i + numRows * j + + private[mllib] def update(i: Int, j: Int, v: Double){ +values(index(i, j)) = v + } + + override def copy = new DenseMatrix(numRows, numCols, values.clone()) +} + +/** + * Column-majored sparse matrix. + * The entry values are stored in Compressed Sparse Column (CSC) format. + * For example, the following matrix + * {{{ + * 1.0 0.0 4.0 + * 0.0 3.0 5.0 + * 2.0 0.0 6.0 + * }}} + * is stored as `values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]`, + * `rowIndices=[0, 2, 1, 0, 1, 2]`, `colPointers=[0, 2, 3, 6]`. + * + * @param numRows number of rows + * @param numCols number of columns + * @param colPtrs the index corresponding to the start of a new column + * @param rowIndices the row index of the entry + * @param values non-zero matrix entries in column major + */ +class SparseMatrix( +val numRows: Int, +val numCols: Int, +val colPtrs: Array[Int], +val rowIndices: Array[Int], +val values: Array[Double]) extends Matrix { + + require(values.length == rowIndices.length, The number of row indices and values don't match! + +svalues.length: ${values.length}, rowIndices.length: ${rowIndices.length}) + require(colPtrs.length == numCols + 1, The length of the column indices should be the + +snumber of columns + 1. Currently, colPointers.length: ${colPtrs.length}, + +snumCols: $numCols) + + override def toArray: Array[Double] = { +val arr = Array.fill(numRows * numCols)(0.0) +var j = 0 +while (j numCols) { + var i = colPtrs(j) + val indEnd = colPtrs(j + 1) + val offset = j * numRows + while (i indEnd) { +val rowIndex = rowIndices(i) +arr(offset + rowIndex) = values(i) +i += 1 + } + j += 1 +} +arr + } + + private[mllib] def toBreeze: BM[Double] = +new BSM[Double](values, numRows, numCols, colPtrs, rowIndices) + + private[mllib] def apply(i: Int): Double = values(i) + + private[mllib] def apply(i: Int, j: Int): Double = { +val ind = index(i, j) +if (ind == -1) 0.0 else values(ind) + } + + private[mllib] def index(i: Int, j: Int): Int = { +var regionStart = colPtrs(j) --- End diff -- use `Arrays.binarySearch`: http://docs.oracle.com/javase/7/docs/api/java/util/Arrays.html#binarySearch(int[],%20int,%20int,%20int) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709099 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala --- @@ -36,4 +36,79 @@ class MatricesSuite extends FunSuite { Matrices.dense(3, 2, Array(0.0, 1.0, 2.0)) } } + + test(sparse matrix construction) { +val m = 3 +val n = 2 +val values = Array(1.0, 2.0, 4.0, 5.0) +val colIndices = Array(0, 2, 4) +val rowIndices = Array(1, 2, 1, 2) +val mat = Matrices.sparse(m, n, colIndices, rowIndices, values).asInstanceOf[SparseMatrix] +assert(mat.numRows === m) +assert(mat.numCols === n) +assert(mat.values.eq(values), should not copy data) --- End diff -- also check `colPtrs` and `rowIndices` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709094 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/BreezeMatrixConversionSuite.scala --- @@ -37,4 +37,26 @@ class BreezeMatrixConversionSuite extends FunSuite { assert(mat.numCols === breeze.cols) assert(mat.values.eq(breeze.data), should not copy data) } + + test(sparse matrix to breeze) { +val values = Array(1.0, 2.0, 4.0, 5.0) +val colPointers = Array(0, 2, 4) --- End diff -- `colPtrs` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709082 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala --- @@ -59,11 +93,113 @@ trait Matrix extends Serializable { */ class DenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Matrix { - require(values.length == numRows * numCols) + require(values.length == numRows * numCols, The number of values supplied doesn't match the + +ssize of the matrix! values.length: ${values.length}, numRows * numCols: ${numRows * numCols}) override def toArray: Array[Double] = values - private[mllib] override def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values) + private[mllib] def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values) + + private[mllib] def apply(i: Int): Double = values(i) + + private[mllib] def apply(i: Int, j: Int): Double = values(index(i, j)) + + private[mllib] def index(i: Int, j: Int): Int = i + numRows * j + + private[mllib] def update(i: Int, j: Int, v: Double){ +values(index(i, j)) = v + } + + override def copy = new DenseMatrix(numRows, numCols, values.clone()) +} + +/** + * Column-majored sparse matrix. + * The entry values are stored in Compressed Sparse Column (CSC) format. + * For example, the following matrix + * {{{ + * 1.0 0.0 4.0 + * 0.0 3.0 5.0 + * 2.0 0.0 6.0 + * }}} + * is stored as `values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]`, + * `rowIndices=[0, 2, 1, 0, 1, 2]`, `colPointers=[0, 2, 3, 6]`. + * + * @param numRows number of rows + * @param numCols number of columns + * @param colPtrs the index corresponding to the start of a new column + * @param rowIndices the row index of the entry + * @param values non-zero matrix entries in column major + */ +class SparseMatrix( +val numRows: Int, +val numCols: Int, +val colPtrs: Array[Int], +val rowIndices: Array[Int], +val values: Array[Double]) extends Matrix { + + require(values.length == rowIndices.length, The number of row indices and values don't match! + +svalues.length: ${values.length}, rowIndices.length: ${rowIndices.length}) + require(colPtrs.length == numCols + 1, The length of the column indices should be the + +snumber of columns + 1. Currently, colPointers.length: ${colPtrs.length}, + +snumCols: $numCols) + + override def toArray: Array[Double] = { +val arr = Array.fill(numRows * numCols)(0.0) --- End diff -- `new Array[Double](numRows * numCols)` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709106 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala --- @@ -36,4 +36,79 @@ class MatricesSuite extends FunSuite { Matrices.dense(3, 2, Array(0.0, 1.0, 2.0)) } } + + test(sparse matrix construction) { +val m = 3 +val n = 2 +val values = Array(1.0, 2.0, 4.0, 5.0) +val colIndices = Array(0, 2, 4) +val rowIndices = Array(1, 2, 1, 2) +val mat = Matrices.sparse(m, n, colIndices, rowIndices, values).asInstanceOf[SparseMatrix] +assert(mat.numRows === m) +assert(mat.numCols === n) +assert(mat.values.eq(values), should not copy data) + } + + test(sparse matrix construction with wrong number of elements) { +intercept[RuntimeException] { + Matrices.sparse(3, 2, Array(0, 1), Array(1, 2, 1), Array(0.0, 1.0, 2.0)) +} + +intercept[RuntimeException] { + Matrices.sparse(3, 2, Array(0, 1, 2), Array(1, 2), Array(0.0, 1.0, 2.0)) +} + } + + test(matrix copies are deep copies) { +val m = 3 +val n = 2 + +val denseMat = Matrices.dense(m, n, Array(0.0, 1.0, 2.0, 3.0, 4.0, 5.0)) +val denseCopy = denseMat.copy + +assert(!denseMat.toArray.eq(denseCopy.toArray)) + +val values = Array(1.0, 2.0, 4.0, 5.0) +val colIndices = Array(0, 2, 4) +val rowIndices = Array(1, 2, 1, 2) +val sparseMat = Matrices.sparse(m, n, colIndices, rowIndices, values) +val sparseCopy = sparseMat.copy + +assert(!sparseMat.toArray.eq(sparseCopy.toArray)) + } + + test(matrix indexing and updating) { +val m = 3 +val n = 2 +val allValues = Array(0.0, 1.0, 2.0, 3.0, 4.0, 0.0) + +val denseMat = new DenseMatrix(m, n, allValues) + +assert(denseMat(0, 1) == 3.0) +assert(denseMat(0, 1) == denseMat.values(3)) +assert(denseMat(0, 1) == denseMat(3)) +assert(denseMat(0, 0) == 0.0) + +denseMat.update(0, 0, 10.0) +assert(denseMat(0, 0) == 10.0) +assert(denseMat.values(0) == 10.0) + +val sparseValues = Array(1.0, 2.0, 3.0, 4.0) +val colIndices = Array(0, 2, 4) +val rowIndices = Array(1, 2, 0, 1) +val sparseMat = new SparseMatrix(m, n, colIndices, rowIndices, sparseValues) + +assert(sparseMat(0, 1) == 3.0) +assert(sparseMat(0, 1) == sparseMat.values(2)) +assert(sparseMat(0, 1) == sparseMat(2)) +assert(sparseMat(0, 0) == 0.0) + +intercept[IllegalArgumentException] { + sparseMat.update(0, 0, 10.0) +} + +sparseMat.update(0, 1, 10.0) +assert(sparseMat(0, 1) == 10.0) --- End diff -- `==` - `===` (please update others as well) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709079 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala --- @@ -59,11 +93,113 @@ trait Matrix extends Serializable { */ class DenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Matrix { - require(values.length == numRows * numCols) + require(values.length == numRows * numCols, The number of values supplied doesn't match the + +ssize of the matrix! values.length: ${values.length}, numRows * numCols: ${numRows * numCols}) override def toArray: Array[Double] = values - private[mllib] override def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values) + private[mllib] def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values) + + private[mllib] def apply(i: Int): Double = values(i) + + private[mllib] def apply(i: Int, j: Int): Double = values(index(i, j)) + + private[mllib] def index(i: Int, j: Int): Int = i + numRows * j + + private[mllib] def update(i: Int, j: Int, v: Double){ --- End diff -- `: Unit = {` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709096 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/BreezeMatrixConversionSuite.scala --- @@ -37,4 +37,26 @@ class BreezeMatrixConversionSuite extends FunSuite { assert(mat.numCols === breeze.cols) assert(mat.values.eq(breeze.data), should not copy data) } + + test(sparse matrix to breeze) { +val values = Array(1.0, 2.0, 4.0, 5.0) +val colPointers = Array(0, 2, 4) +val rowIndices = Array(1, 2, 1, 2) +val mat = Matrices.sparse(3, 2, colPointers, rowIndices, values) +val breeze = mat.toBreeze.asInstanceOf[BSM[Double]] +assert(breeze.rows === mat.numRows) +assert(breeze.cols === mat.numCols) +assert(breeze.data.eq(mat.asInstanceOf[SparseMatrix].values), should not copy data) + } + + test(sparse breeze matrix to sparse matrix) { +val values = Array(1.0, 2.0, 4.0, 5.0) +val colPointers = Array(0, 2, 4) --- End diff -- `colPtrs` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709085 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala --- @@ -59,11 +93,113 @@ trait Matrix extends Serializable { */ class DenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Matrix { - require(values.length == numRows * numCols) + require(values.length == numRows * numCols, The number of values supplied doesn't match the + +ssize of the matrix! values.length: ${values.length}, numRows * numCols: ${numRows * numCols}) override def toArray: Array[Double] = values - private[mllib] override def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values) + private[mllib] def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values) + + private[mllib] def apply(i: Int): Double = values(i) + + private[mllib] def apply(i: Int, j: Int): Double = values(index(i, j)) + + private[mllib] def index(i: Int, j: Int): Int = i + numRows * j + + private[mllib] def update(i: Int, j: Int, v: Double){ +values(index(i, j)) = v + } + + override def copy = new DenseMatrix(numRows, numCols, values.clone()) +} + +/** + * Column-majored sparse matrix. + * The entry values are stored in Compressed Sparse Column (CSC) format. + * For example, the following matrix + * {{{ + * 1.0 0.0 4.0 + * 0.0 3.0 5.0 + * 2.0 0.0 6.0 + * }}} + * is stored as `values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]`, + * `rowIndices=[0, 2, 1, 0, 1, 2]`, `colPointers=[0, 2, 3, 6]`. + * + * @param numRows number of rows + * @param numCols number of columns + * @param colPtrs the index corresponding to the start of a new column + * @param rowIndices the row index of the entry + * @param values non-zero matrix entries in column major + */ +class SparseMatrix( +val numRows: Int, +val numCols: Int, +val colPtrs: Array[Int], +val rowIndices: Array[Int], +val values: Array[Double]) extends Matrix { + + require(values.length == rowIndices.length, The number of row indices and values don't match! + +svalues.length: ${values.length}, rowIndices.length: ${rowIndices.length}) + require(colPtrs.length == numCols + 1, The length of the column indices should be the + +snumber of columns + 1. Currently, colPointers.length: ${colPtrs.length}, + +snumCols: $numCols) + + override def toArray: Array[Double] = { +val arr = Array.fill(numRows * numCols)(0.0) +var j = 0 +while (j numCols) { + var i = colPtrs(j) + val indEnd = colPtrs(j + 1) + val offset = j * numRows + while (i indEnd) { +val rowIndex = rowIndices(i) +arr(offset + rowIndex) = values(i) +i += 1 + } + j += 1 +} +arr + } + + private[mllib] def toBreeze: BM[Double] = +new BSM[Double](values, numRows, numCols, colPtrs, rowIndices) + + private[mllib] def apply(i: Int): Double = values(i) --- End diff -- ditto: remove this method and use `.values(i)` directly --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709088 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala --- @@ -59,11 +93,113 @@ trait Matrix extends Serializable { */ class DenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Matrix { - require(values.length == numRows * numCols) + require(values.length == numRows * numCols, The number of values supplied doesn't match the + +ssize of the matrix! values.length: ${values.length}, numRows * numCols: ${numRows * numCols}) override def toArray: Array[Double] = values - private[mllib] override def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values) + private[mllib] def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values) + + private[mllib] def apply(i: Int): Double = values(i) + + private[mllib] def apply(i: Int, j: Int): Double = values(index(i, j)) + + private[mllib] def index(i: Int, j: Int): Int = i + numRows * j + + private[mllib] def update(i: Int, j: Int, v: Double){ +values(index(i, j)) = v + } + + override def copy = new DenseMatrix(numRows, numCols, values.clone()) +} + +/** + * Column-majored sparse matrix. + * The entry values are stored in Compressed Sparse Column (CSC) format. + * For example, the following matrix + * {{{ + * 1.0 0.0 4.0 + * 0.0 3.0 5.0 + * 2.0 0.0 6.0 + * }}} + * is stored as `values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]`, + * `rowIndices=[0, 2, 1, 0, 1, 2]`, `colPointers=[0, 2, 3, 6]`. + * + * @param numRows number of rows + * @param numCols number of columns + * @param colPtrs the index corresponding to the start of a new column + * @param rowIndices the row index of the entry + * @param values non-zero matrix entries in column major + */ +class SparseMatrix( +val numRows: Int, +val numCols: Int, +val colPtrs: Array[Int], +val rowIndices: Array[Int], +val values: Array[Double]) extends Matrix { + + require(values.length == rowIndices.length, The number of row indices and values don't match! + +svalues.length: ${values.length}, rowIndices.length: ${rowIndices.length}) + require(colPtrs.length == numCols + 1, The length of the column indices should be the + +snumber of columns + 1. Currently, colPointers.length: ${colPtrs.length}, + +snumCols: $numCols) + + override def toArray: Array[Double] = { +val arr = Array.fill(numRows * numCols)(0.0) +var j = 0 +while (j numCols) { + var i = colPtrs(j) + val indEnd = colPtrs(j + 1) + val offset = j * numRows + while (i indEnd) { +val rowIndex = rowIndices(i) +arr(offset + rowIndex) = values(i) +i += 1 + } + j += 1 +} +arr + } + + private[mllib] def toBreeze: BM[Double] = +new BSM[Double](values, numRows, numCols, colPtrs, rowIndices) + + private[mllib] def apply(i: Int): Double = values(i) + + private[mllib] def apply(i: Int, j: Int): Double = { +val ind = index(i, j) +if (ind == -1) 0.0 else values(ind) + } + + private[mllib] def index(i: Int, j: Int): Int = { +var regionStart = colPtrs(j) +var regionEnd = colPtrs(j + 1) +while (regionStart = regionEnd) { + val mid = (regionStart + regionEnd) / 2 + if (rowIndices(mid) == i){ +return mid + } else if (regionStart == regionEnd) { +return -1 + } else if (rowIndices(mid) i) { +regionEnd = mid + } else { +regionStart = mid + } +} +-1 + } + + private[mllib] def update(i: Int, j: Int, v: Double){ --- End diff -- `: Unit = {` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709092 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala --- @@ -93,9 +247,84 @@ object Matrices { require(dm.majorStride == dm.rows, Do not support stride size different from the number of rows.) new DenseMatrix(dm.rows, dm.cols, dm.data) + case sm: BSM[Double] = +new SparseMatrix(sm.rows, sm.cols, sm.colPtrs, sm.rowIndices, sm.data) case _ = throw new UnsupportedOperationException( sDo not support conversion from type ${breeze.getClass.getName}.) } } + + /** + * Generate a `DenseMatrix` consisting of zeros. + * @param numRows number of rows of the matrix + * @param numCols number of columns of the matrix + * @return `DenseMatrix` with size `numRows` x `numCols` and values of zeros + */ + def zeros(numRows: Int, numCols: Int): Matrix = +new DenseMatrix(numRows, numCols, Array.fill(numRows * numCols)(0.0)) --- End diff -- `new Array` instead of `Array.fill` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709103 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala --- @@ -36,4 +36,79 @@ class MatricesSuite extends FunSuite { Matrices.dense(3, 2, Array(0.0, 1.0, 2.0)) } } + + test(sparse matrix construction) { +val m = 3 +val n = 2 +val values = Array(1.0, 2.0, 4.0, 5.0) +val colIndices = Array(0, 2, 4) +val rowIndices = Array(1, 2, 1, 2) +val mat = Matrices.sparse(m, n, colIndices, rowIndices, values).asInstanceOf[SparseMatrix] +assert(mat.numRows === m) +assert(mat.numCols === n) +assert(mat.values.eq(values), should not copy data) + } + + test(sparse matrix construction with wrong number of elements) { +intercept[RuntimeException] { + Matrices.sparse(3, 2, Array(0, 1), Array(1, 2, 1), Array(0.0, 1.0, 2.0)) +} + +intercept[RuntimeException] { + Matrices.sparse(3, 2, Array(0, 1, 2), Array(1, 2), Array(0.0, 1.0, 2.0)) +} + } + + test(matrix copies are deep copies) { +val m = 3 +val n = 2 + +val denseMat = Matrices.dense(m, n, Array(0.0, 1.0, 2.0, 3.0, 4.0, 5.0)) +val denseCopy = denseMat.copy + +assert(!denseMat.toArray.eq(denseCopy.toArray)) + +val values = Array(1.0, 2.0, 4.0, 5.0) +val colIndices = Array(0, 2, 4) +val rowIndices = Array(1, 2, 1, 2) +val sparseMat = Matrices.sparse(m, n, colIndices, rowIndices, values) +val sparseCopy = sparseMat.copy + +assert(!sparseMat.toArray.eq(sparseCopy.toArray)) + } + + test(matrix indexing and updating) { +val m = 3 +val n = 2 +val allValues = Array(0.0, 1.0, 2.0, 3.0, 4.0, 0.0) + +val denseMat = new DenseMatrix(m, n, allValues) + +assert(denseMat(0, 1) == 3.0) +assert(denseMat(0, 1) == denseMat.values(3)) +assert(denseMat(0, 1) == denseMat(3)) +assert(denseMat(0, 0) == 0.0) + +denseMat.update(0, 0, 10.0) +assert(denseMat(0, 0) == 10.0) +assert(denseMat.values(0) == 10.0) + +val sparseValues = Array(1.0, 2.0, 3.0, 4.0) +val colIndices = Array(0, 2, 4) --- End diff -- `colPtrs` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709097 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala --- @@ -36,4 +36,79 @@ class MatricesSuite extends FunSuite { Matrices.dense(3, 2, Array(0.0, 1.0, 2.0)) } } + + test(sparse matrix construction) { +val m = 3 +val n = 2 +val values = Array(1.0, 2.0, 4.0, 5.0) +val colIndices = Array(0, 2, 4) --- End diff -- `colPtrs` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709104 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala --- @@ -36,4 +36,79 @@ class MatricesSuite extends FunSuite { Matrices.dense(3, 2, Array(0.0, 1.0, 2.0)) } } + + test(sparse matrix construction) { +val m = 3 +val n = 2 +val values = Array(1.0, 2.0, 4.0, 5.0) +val colIndices = Array(0, 2, 4) +val rowIndices = Array(1, 2, 1, 2) +val mat = Matrices.sparse(m, n, colIndices, rowIndices, values).asInstanceOf[SparseMatrix] +assert(mat.numRows === m) +assert(mat.numCols === n) +assert(mat.values.eq(values), should not copy data) + } + + test(sparse matrix construction with wrong number of elements) { +intercept[RuntimeException] { + Matrices.sparse(3, 2, Array(0, 1), Array(1, 2, 1), Array(0.0, 1.0, 2.0)) +} + +intercept[RuntimeException] { + Matrices.sparse(3, 2, Array(0, 1, 2), Array(1, 2), Array(0.0, 1.0, 2.0)) +} + } + + test(matrix copies are deep copies) { +val m = 3 +val n = 2 + +val denseMat = Matrices.dense(m, n, Array(0.0, 1.0, 2.0, 3.0, 4.0, 5.0)) +val denseCopy = denseMat.copy + +assert(!denseMat.toArray.eq(denseCopy.toArray)) + +val values = Array(1.0, 2.0, 4.0, 5.0) +val colIndices = Array(0, 2, 4) +val rowIndices = Array(1, 2, 1, 2) +val sparseMat = Matrices.sparse(m, n, colIndices, rowIndices, values) +val sparseCopy = sparseMat.copy + +assert(!sparseMat.toArray.eq(sparseCopy.toArray)) + } + + test(matrix indexing and updating) { +val m = 3 +val n = 2 +val allValues = Array(0.0, 1.0, 2.0, 3.0, 4.0, 0.0) + +val denseMat = new DenseMatrix(m, n, allValues) + +assert(denseMat(0, 1) == 3.0) +assert(denseMat(0, 1) == denseMat.values(3)) +assert(denseMat(0, 1) == denseMat(3)) +assert(denseMat(0, 0) == 0.0) + +denseMat.update(0, 0, 10.0) +assert(denseMat(0, 0) == 10.0) +assert(denseMat.values(0) == 10.0) + +val sparseValues = Array(1.0, 2.0, 3.0, 4.0) +val colIndices = Array(0, 2, 4) +val rowIndices = Array(1, 2, 0, 1) +val sparseMat = new SparseMatrix(m, n, colIndices, rowIndices, sparseValues) + +assert(sparseMat(0, 1) == 3.0) +assert(sparseMat(0, 1) == sparseMat.values(2)) +assert(sparseMat(0, 1) == sparseMat(2)) +assert(sparseMat(0, 0) == 0.0) + +intercept[IllegalArgumentException] { --- End diff -- Should throw `NoSuchElementException`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2294#discussion_r17709093 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/BLASSuite.scala --- @@ -126,4 +126,116 @@ class BLASSuite extends FunSuite { } } } + + test(gemm) { + +val dA = + new DenseMatrix(4, 3, Array(0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 3.0)) +val sA = new SparseMatrix(4, 3, Array(0, 1, 3, 4), Array(1, 0, 2, 3), Array(1.0, 2.0, 1.0, 3.0)) + +val B = new DenseMatrix(3, 2, Array(1.0, 0.0, 0.0, 0.0, 2.0, 1.0)) +val expected = new DenseMatrix(4, 2, Array(0.0, 1.0, 0.0, 0.0, 4.0, 0.0, 2.0, 3.0)) + +assert(dA times B ~== expected absTol 1e-15) +assert(sA times B ~== expected absTol 1e-15) + +val C1 = new DenseMatrix(4, 2, Array(1.0, 0.0, 2.0, 1.0, 0.0, 0.0, 1.0, 0.0)) +val C2 = C1.copy +val C3 = C1.copy +val C4 = C1.copy +val C5 = C1.copy +val C6 = C1.copy +val C7 = C1.copy +val C8 = C1.copy +val expected2 = new DenseMatrix(4, 2, Array(2.0, 1.0, 4.0, 2.0, 4.0, 0.0, 4.0, 3.0)) +val expected3 = new DenseMatrix(4, 2, Array(2.0, 2.0, 4.0, 2.0, 8.0, 0.0, 6.0, 6.0)) + +gemm(1.0, dA, B, 2.0, C1) +gemm(1.0, sA, B, 2.0, C2) +gemm(2.0, dA, B, 2.0, C3) +gemm(2.0, sA, B, 2.0, C4) +assert(C1 ~== expected2 absTol 1e-15) +assert(C2 ~== expected2 absTol 1e-15) +assert(C3 ~== expected3 absTol 1e-15) +assert(C4 ~== expected3 absTol 1e-15) + +withClue(columns of A don't match the rows of B) { + intercept[Exception] { +gemm(true, false, 1.0, dA, B, 2.0, C1) + } +} + +val dAT = + new DenseMatrix(3, 4, Array(0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 3.0)) +val sAT = + new SparseMatrix(3, 4, Array(0, 1, 2, 3, 4), Array(1, 0, 1, 2), Array(2.0, 1.0, 1.0, 3.0)) + +assert(dAT transposeTimes B ~== expected absTol 1e-15) +assert(sAT transposeTimes B ~== expected absTol 1e-15) + +gemm(true, false, 1.0, dAT, B, 2.0, C5) +gemm(true, false, 1.0, sAT, B, 2.0, C6) +gemm(true, false, 2.0, dAT, B, 2.0, C7) +gemm(true, false, 2.0, sAT, B, 2.0, C8) +assert(C5 ~== expected2 absTol 1e-15) +assert(C6 ~== expected2 absTol 1e-15) +assert(C7 ~== expected3 absTol 1e-15) +assert(C8 ~== expected3 absTol 1e-15) + --- End diff -- remove extra empty line --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-927] detect numpy at time of use
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2313#issuecomment-56135525 @JoshRosen PySpark/MLlib requires NumPy to run, and I don't think we claimed that we support different versions of NumPy. `sample()` in core is different. Maybe we can test the overhead of using Python's own random. The overhead may not be huge, due to serialization/deserialization. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3250] Implement Gap Sampling optimizati...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2455#discussion_r17769391 --- Diff: core/src/main/scala/org/apache/spark/util/random/RandomSampler.scala --- @@ -43,66 +46,218 @@ trait RandomSampler[T, U] extends Pseudorandom with Cloneable with Serializable throw new NotImplementedError(clone() is not implemented.) } + +object RandomSampler { + // Default random number generator used by random samplers + def rngDefault: Random = new XORShiftRandom + + // Default gap sampling maximum + // For sampling fractions = this value, the gap sampling optimization will be applied. + // Above this value, it is assumed that tradtional bernoulli sampling is faster. The + // optimal value for this will depend on the RNG. More expensive RNGs will tend to make + // the optimal value higher. The most reliable way to determine this value for a given RNG + // is to experiment. I would expect a value of 0.5 to be close in most cases. + def gsmDefault: Double = 0.4 + + // Default gap sampling epsilon + // When sampling random floating point values the gap sampling logic requires value 0. An + // optimal value for this parameter is at or near the minimum positive floating point value + // returned by nextDouble() for the RNG being used. + def epsDefault: Double = 5e-11 +} + + /** * :: DeveloperApi :: * A sampler based on Bernoulli trials. * - * @param lb lower bound of the acceptance range - * @param ub upper bound of the acceptance range - * @param complement whether to use the complement of the range specified, default to false + * @param fraction the sampling fraction, aka Bernoulli sampling probability * @tparam T item type */ @DeveloperApi -class BernoulliSampler[T](lb: Double, ub: Double, complement: Boolean = false) - extends RandomSampler[T, T] { +class BernoulliSampler[T: ClassTag](fraction: Double) extends RandomSampler[T, T] { - private[random] var rng: Random = new XORShiftRandom + require(fraction = 0.0fraction = 1.0, Sampling fraction must be on interval [0, 1]) - def this(ratio: Double) = this(0.0d, ratio) + def this(lb: Double, ub: Double, complement: Boolean = false) = +this(if (complement) (1.0 - (ub - lb)) else (ub - lb)) + + private val rng: Random = RandomSampler.rngDefault override def setSeed(seed: Long) = rng.setSeed(seed) override def sample(items: Iterator[T]): Iterator[T] = { -items.filter { item = - val x = rng.nextDouble() - (x = lb x ub) ^ complement +fraction match { + case f if (f = 0.0) = Iterator.empty + case f if (f = 1.0) = items + case f if (f = RandomSampler.gsmDefault) = +new GapSamplingIterator(items, f, rng, RandomSampler.epsDefault) + case _ = items.filter(_ = (rng.nextDouble() = fraction)) --- End diff -- Did you test whether `rdd.randomSplit()` will produce non-overlapping subsets with this change? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2294#issuecomment-56136224 LGTM. I'm merging this into master. (We might need to make slight changes to some methods before the 1.2 release, but let's not block the multi-model training PR for now.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56136476 test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MLLIB] fix a unresolved reference variable 'n...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2423#issuecomment-56136584 @OdinLin Thanks for catching the bug! As @davies mentioned, #2378 will completely replace the current SerDe. Could you close this PR? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-56136714 @derrickburns I cannot see the Jenkins log. Let's call Jenkins again. test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3250] Implement Gap Sampling optimizati...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2455#issuecomment-56144570 add to whitelist --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3250] Implement Gap Sampling optimizati...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2455#issuecomment-56144582 this is ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56147622 @davies Does `PickleSerializer` compress data? If not, maybe we should cache the deserialized RDD instead of the one from `_.reserialize`. They have the same storage. I understand that batch-serialization can help GC. But algorithms like linear methods should only allocate short-lived objects. Is batch-serialization worth the tradeoff? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-56235934 test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org