[GitHub] spark issue #15398: [SPARK-17647][SQL] Fix backslash escaping in 'LIKE' patt...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/15398

Ping @jodersky, @rxin, @mengxr, @yhuai, @hvanhovell. We also ran into this bug; it would be great to have the fix in by the 2.2 release. Thanks.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17606: [SPARK-20291][SQL] NaNvl(FloatType, NullType) sho...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/17606#discussion_r111216578

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -571,6 +571,7 @@ object TypeCoercion {
         NaNvl(l, Cast(r, DoubleType))
       case NaNvl(l, r) if l.dataType == FloatType && r.dataType == DoubleType =>
         NaNvl(Cast(l, DoubleType), r)
+      case NaNvl(l, r) if r.dataType == NullType => NaNvl(l, Cast(r, l.dataType))
--- End diff --

Since NaNvl evaluates to `right` when `left` is NaN, I think `right` should always be cast to `left`'s type. I wonder what the behavior of other engines is?
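The evaluation rule under discussion can be sketched outside Spark. Below is a minimal Python sketch of NaNvl's semantics (not Spark's implementation), showing why the result type should follow the left operand:

```python
import math

def nanvl(left, right):
    # NaNvl returns `right` only when `left` is NaN; otherwise it returns `left`.
    # Because the result is `left` in the common case, keeping the output type
    # equal to `left`'s type (and casting `right` to it) avoids surprising
    # widenings of the whole expression.
    return right if math.isnan(left) else left

print(nanvl(float("nan"), 1.5))  # the replacement value is used
print(nanvl(2.0, 1.5))           # left is not NaN, so left is returned
```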
[GitHub] spark pull request #17600: [MINOR][SQL] Fix the @since tag when backporting ...
Github user dbtsai closed the pull request at: https://github.com/apache/spark/pull/17600
[GitHub] spark pull request #17618: [SPARK-20291][SQL][BACKPORT] NaNvl(FloatType, Nul...
Github user dbtsai closed the pull request at: https://github.com/apache/spark/pull/17618
[GitHub] spark issue #17618: [SPARK-20291][SQL][BACKPORT] NaNvl(FloatType, NullType) ...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/17618

Thanks.
[GitHub] spark issue #17618: [SPARK-20291][SQL][BACKPORT] NaNvl(FloatType, NullType) ...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/17618

@viirya Added. Thanks.
[GitHub] spark pull request #17618: [SPARK-20291][SQL] NaNvl(FloatType, NullType) sho...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/17618

[SPARK-20291][SQL] NaNvl(FloatType, NullType) should not be cast to NaNvl(DoubleType, DoubleType)

## What changes were proposed in this pull request?

`NaNvl(float value, null)` will be converted into `NaNvl(float value, Cast(null, DoubleType))` and finally `NaNvl(Cast(float value, DoubleType), Cast(null, DoubleType))`. This causes a mismatch in the output type when the input type is float. Adding an extra rule in TypeCoercion resolves this issue.

## How was this patch tested?

Unit tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dbtsai/spark branch-2.0

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17618.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17618

commit d3aa8bddac7ac738cb3de2cee0eadf1a9740eaa6
Author: DB Tsai <dbt...@dbtsai.com>
Date: 2017-04-11T00:29:33Z

    fix since tag version

commit f060172004a9336d33f091b0d4e8fe795c3cafad
Author: DB Tsai <d...@netflix.com>
Date: 2017-04-12T06:36:24Z

    backport
[GitHub] spark pull request #17601: [MINOR][SQL] Fix the @since tag when backporting ...
Github user dbtsai closed the pull request at: https://github.com/apache/spark/pull/17601
[GitHub] spark pull request #17606: [SPARK-20291][SQL] NaNvl(FloatType, NullType) sho...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/17606#discussion_r110991443

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala ---
@@ -656,14 +656,20 @@ class TypeCoercionSuite extends PlanTest {
   test("nanvl casts") {
     ruleTest(TypeCoercion.FunctionArgumentConversion,
-      NaNvl(Literal.create(1.0, FloatType), Literal.create(1.0, DoubleType)),
-      NaNvl(Cast(Literal.create(1.0, FloatType), DoubleType), Literal.create(1.0, DoubleType)))
+      NaNvl(Literal.create(1.0f, FloatType), Literal.create(1.0, DoubleType)),
+      NaNvl(Cast(Literal.create(1.0f, FloatType), DoubleType), Literal.create(1.0, DoubleType)))
     ruleTest(TypeCoercion.FunctionArgumentConversion,
-      NaNvl(Literal.create(1.0, DoubleType), Literal.create(1.0, FloatType)),
-      NaNvl(Literal.create(1.0, DoubleType), Cast(Literal.create(1.0, FloatType), DoubleType)))
+      NaNvl(Literal.create(1.0, DoubleType), Literal.create(1.0f, FloatType)),
+      NaNvl(Literal.create(1.0, DoubleType), Cast(Literal.create(1.0f, FloatType), DoubleType)))
     ruleTest(TypeCoercion.FunctionArgumentConversion,
       NaNvl(Literal.create(1.0, DoubleType), Literal.create(1.0, DoubleType)),
       NaNvl(Literal.create(1.0, DoubleType), Literal.create(1.0, DoubleType)))
+    ruleTest(TypeCoercion.FunctionArgumentConversion,
+      NaNvl(Literal.create(1.0f, FloatType), Literal.create(null, NullType)),
+      NaNvl(Literal.create(1.0f, FloatType), Literal.create(null, FloatType)))
--- End diff --

Thanks. The test is fixed. :)
[GitHub] spark issue #17606: [SPARK-20291][SQL] NaNvl(FloatType, NullType) should not...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/17606

+cc @cloud-fan @gatorsmile @rxin
[GitHub] spark pull request #17606: [SPARK-20291][SQL] NaNvl(FloatType, NullType) sho...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/17606#discussion_r110979361

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -571,6 +571,7 @@ object TypeCoercion {
         NaNvl(l, Cast(r, DoubleType))
       case NaNvl(l, r) if l.dataType == FloatType && r.dataType == DoubleType =>
         NaNvl(Cast(l, DoubleType), r)
+      case NaNvl(l, r) if r.dataType == NullType => NaNvl(l, Cast(r, l.dataType))
--- End diff --

Yeah, this PR prevents casting from `NaNvl(FloatType, NullType)` to `NaNvl(DoubleType, DoubleType)`, since we want to minimize casting as much as possible. Also, if we want to replace `NaN` by `null`, we want to keep the output type the same as the input type. Whether `NaNvl(FloatType, DoubleType)` should be cast into `NaNvl(DoubleType, DoubleType)` is another story. I agree with you that we should downcast the replacement from `DoubleType` to `FloatType`. In my opinion, this kind of implicit casting is error-prone, and users should do the casting explicitly instead. @gatorsmile, maybe you can chime in and give feedback on whether we should cast `NaNvl(FloatType, DoubleType)` to `NaNvl(DoubleType, DoubleType)`.
[GitHub] spark pull request #17606: [SPARK-20291][SQL] NaNvl(FloatType, NullType) sho...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/17606

[SPARK-20291][SQL] NaNvl(FloatType, NullType) should not be cast to NaNvl(DoubleType, DoubleType)

## What changes were proposed in this pull request?

`NaNvl(float value, null)` will be converted into `NaNvl(float value, Cast(null, DoubleType))` and finally `NaNvl(Cast(float value, DoubleType), Cast(null, DoubleType))`. This causes a mismatch in the output type when the input type is float. Adding an extra rule in TypeCoercion resolves this issue.

## How was this patch tested?

Unit tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dbtsai/spark fixNaNvl

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17606.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17606

commit fa5e1aff1319a75e89da8baf48f06b223b17eb8c
Author: DB Tsai <d...@netflix.com>
Date: 2017-04-11T08:50:24Z

    Added new NaNvl type coercion rule
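For intuition, the coercion logic described in the PR can be sketched as a small type-level function. This is a hypothetical Python rendering of the rule, not Spark's API; the string type names are stand-ins for Catalyst's DataType objects:

```python
def coerce_nanvl(left_type: str, right_type: str) -> tuple:
    """Hypothetical sketch of the NaNvl branch of TypeCoercion.

    With the rule added in this PR, a NULL-typed replacement is cast to the
    left operand's type instead of widening both operands to double.
    """
    if right_type == "null":
        return left_type, left_type       # new rule: cast null to left's type
    if {left_type, right_type} == {"float", "double"}:
        return "double", "double"         # pre-existing widening rules
    return left_type, right_type

# NaNvl(FloatType, NullType) now stays float on both sides,
# so the expression's output type matches its input type:
print(coerce_nanvl("float", "null"))   # ('float', 'float')
```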
[GitHub] spark issue #17600: [MINOR][SQL] Fix the @since tag when backporting SPARK-1...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/17600

Merged into master.
[GitHub] spark issue #17601: [MINOR][SQL] Fix the @since tag when backporting SPARK-1...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/17601

Merged into master.
[GitHub] spark pull request #17601: [MINOR][SQL] Fix the @since tag when backporting ...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/17601

[MINOR][SQL] Fix the @since tag when backporting critical bugs from 2.2 branch into 2.0 branch

## What changes were proposed in this pull request?

Fix the @since tag when backporting critical bugs from the 2.2 branch into the 2.0 branch.

## How was this patch tested?

N/A

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dbtsai/spark branch-2.0

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17601.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17601

commit d3aa8bddac7ac738cb3de2cee0eadf1a9740eaa6
Author: DB Tsai <dbt...@dbtsai.com>
Date: 2017-04-11T00:29:33Z

    fix since tag version
[GitHub] spark pull request #17600: [MINOR][SQL] Fix the @since tag when backporting ...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/17600

[MINOR][SQL] Fix the @since tag when backporting critical bugs from 2.2 branch into 2.1 branch

## What changes were proposed in this pull request?

Fix the @since tag when backporting critical bugs from the 2.2 branch into the 2.1 branch.

## How was this patch tested?

N/A

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dbtsai/spark branch-2.1

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17600.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17600

commit 61e2497f36a9e38b06b512ae9552a24f26448a9e
Author: DB Tsai <dbt...@dbtsai.com>
Date: 2017-04-11T00:21:23Z

    update spark version
[GitHub] spark issue #17577: [SPARK-20270][SQL] na.fill should not change the values ...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/17577

Merged into master. @cloud-fan #15994 is still needed when a user wants to fill NaN with an extremely large default long value. Thanks.
[GitHub] spark issue #17577: [SPARK-20270][SQL] na.fill should not change the values ...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/17577

Jenkins, retest this please.
[GitHub] spark pull request #17577: [SPARK-20270][SQL] na.fill should not change the ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/17577#discussion_r110549401

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala ---
@@ -407,10 +407,11 @@ final class DataFrameNaFunctions private[sql](df: DataFrame) {
         val quotedColName = "`" + col.name + "`"
         val colValue = col.dataType match {
           case DoubleType | FloatType =>
-            nanvl(df.col(quotedColName), lit(null)) // nanvl only supports these types
+            // nanvl only supports these types
+            nanvl(df.col(quotedColName), lit(null).cast(col.dataType))
--- End diff --

The float case is covered by another test in `fill with map`, but it is still nice to have these tests explicitly, to avoid the behavior being changed unexpectedly.
[GitHub] spark pull request #17577: [SPARK-20270][SQL] na.fill should not change the ...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/17577

[SPARK-20270][SQL] na.fill should not change the values in long or integer when the default value is in double

## What changes were proposed in this pull request?

This bug was partially addressed in SPARK-18555 (https://github.com/apache/spark/pull/15994), but the root cause was not completely solved. The bug is pretty critical: in our application, it changes a member id stored as Long whenever the id is too big to be represented losslessly by a Double.

Here is an example of how this happens. With

```
Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), (9123146099426677101L, null), (9123146560113991650L, 1.6), (null, null)).toDF("a", "b").na.fill(0.2)
```

the logical plan will be

```
== Analyzed Logical Plan ==
a: bigint, b: double
Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as double) AS b#241]
+- Project [_1#229L AS a#232L, _2#230 AS b#233]
   +- LocalRelation [_1#229L, _2#230]
```

Note that even when the value is not null, Spark first casts the Long to Double; if the result is not null, Spark casts it back to Long, which loses precision. The correct behavior is that a non-null value should not be changed, but Spark changes it, which is wrong.

With this PR, the logical plan will be

```
== Analyzed Logical Plan ==
a: bigint, b: double
Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241]
+- Project [_1#229L AS a#232L, _2#230 AS b#233]
   +- LocalRelation [_1#229L, _2#230]
```

which behaves correctly without changing the original Long values, and also avoids the extra cost of unnecessary casting.

## How was this patch tested?

Unit test added.

+cc @srowen @rxin @cloud-fan @gatorsmile Thanks.
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dbtsai/spark fixnafill

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17577.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17577

commit cb5440787d676d2c74983ddcd1df31b38d009d71
Author: DB Tsai <d...@netflix.com>
Date: 2017-04-09T07:57:45Z

    na.fill will change the values in long or integer when the default value is in double
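The precision loss described in the PR is easy to reproduce outside Spark. A small Python illustration of why routing a bigint through double corrupts large ids (an IEEE 754 double has a 53-bit significand, so not every integer above 2**53 is representable):

```python
member_id = 9123146099426677101        # one of the large ids from the example above

# Long -> double -> long, which is what the old coalesce/cast plan effectively did:
round_tripped = int(float(member_id))

# The id is odd and far above 2**53, so it falls between representable
# doubles and gets rounded to a nearby value on the way through.
print(member_id == round_tripped)      # False

# Integers up to 2**53 survive the round trip exactly:
print(int(float(2**53)) == 2**53)      # True
```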
[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/17419

I will be really interested to see the performance benchmark during the QA period, so users can know when to use the DataFrame APIs versus the existing RDD APIs. Thanks.
[GitHub] spark pull request #17426: [SPARK-17137][ML][WIP] Compress logistic regressi...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/17426#discussion_r108040945

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -617,26 +612,13 @@ class LogisticRegression @Since("1.2.0") (
       denseCoefficientMatrix.update(_ - coefficientMean)
     }
--- End diff --

BTW, we could compress the `coefficients` sent from the driver to the executors to save communication cost. This can be done in a separate PR.
[GitHub] spark issue #15628: [SPARK-17471][ML] Add compressed method to ML matrices
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/15628

Thanks @sethah and Jenkins! Merged into master.
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107972466

--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -1079,4 +1267,15 @@ object Matrices {
       SparseMatrix.fromCOO(numRows, numCols, entries)
     }
   }
+
+  private[ml] def getSparseSize(numActives: Long, numPtrs: Long): Long = {
+    // 8 * values.length + 4 * rowIndices.length + 4 * colPtrs.length + 12 + 12 + 12 + 1
+    12L * numActives + 4L * numPtrs + 37L
--- End diff --

It would be nice if we could derive `37L` using Java APIs to ensure portability.
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107963752

--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -1079,4 +1267,15 @@ object Matrices {
       SparseMatrix.fromCOO(numRows, numCols, entries)
     }
   }
+
+  private[ml] def getSparseSize(numActives: Long, numPtrs: Long): Long = {
+    // 8 * values.length + 4 * rowIndices.length + 4 * colPtrs.length + 12 + 12 + 12 + 1
+    12L * numActives + 4L * numPtrs + 37L
+  }
+
+  private[ml] def getDenseSize(numCols: Long, numRows: Long): Long = {
+    // 8 * values.length + 12 + 1
--- End diff --

Can you document what the magic number `12 + 1` is? Also, can we make it `java.lang.Double.BYTES * numCols * numRows + 13L`, since the size of a primitive type can depend on the JVM implementation?
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107961128

--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -161,6 +168,116 @@ sealed trait Matrix extends Serializable {
    */
   @Since("2.0.0")
   def numActives: Int
+
+  /**
+   * Converts this matrix to a sparse matrix.
+   *
+   * @param colMajor Whether the values of the resulting sparse matrix should be in column major
+   *                 or row major order. If `false`, resulting matrix will be row major.
+   */
+  private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix
+
+  /**
+   * Converts this matrix to a sparse matrix in column major order.
+   */
+  @Since("2.2.0")
+  def toSparseColMajor: SparseMatrix = toSparseMatrix(colMajor = true)
+
+  /**
+   * Converts this matrix to a sparse matrix in row major order.
+   */
+  @Since("2.2.0")
+  def toSparseRowMajor: SparseMatrix = toSparseMatrix(colMajor = false)
+
+  /**
+   * Converts this matrix to a sparse matrix while maintaining the layout of the current matrix.
+   */
+  @Since("2.2.0")
+  def toSparse: SparseMatrix = toSparseMatrix(colMajor = !isTransposed)
--- End diff --

`def toSparse: SparseMatrix = toSparseMatrix(colMajor = isColMajor)`
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107966467

--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -1079,4 +1267,15 @@ object Matrices {
       SparseMatrix.fromCOO(numRows, numCols, entries)
     }
   }
+
+  private[ml] def getSparseSize(numActives: Long, numPtrs: Long): Long = {
+    // 8 * values.length + 4 * rowIndices.length + 4 * colPtrs.length + 12 + 12 + 12 + 1
+    12L * numActives + 4L * numPtrs + 37L
--- End diff --

`(java.lang.Double.BYTES + java.lang.Integer.BYTES) * numActives + java.lang.Integer.BYTES * numPtrs + 37L`
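The two size estimates under review amount to simple arithmetic. A Python sketch of the formulas, with the byte widths spelled out as named constants; the 37- and 13-byte overhead constants are taken as given from the comments in the diff (they bundle JVM array-header costs):

```python
DOUBLE_BYTES = 8   # width of java.lang.Double.BYTES
INT_BYTES = 4      # width of java.lang.Integer.BYTES

def sparse_size(num_actives: int, num_ptrs: int) -> int:
    # 8 bytes per stored value + 4 per row index + 4 per column pointer,
    # plus the 37-byte overhead quoted in the diff comment
    return (DOUBLE_BYTES + INT_BYTES) * num_actives + INT_BYTES * num_ptrs + 37

def dense_size(num_rows: int, num_cols: int) -> int:
    # 8 bytes per value, plus the 13-byte overhead quoted in the diff comment
    return DOUBLE_BYTES * num_rows * num_cols + 13

# The `compressed` method can compare the two estimates: the sparse layout
# only wins when the index arrays cost less than the zeros a dense layout stores.
print(sparse_size(4, 3))   # 4 nonzeros, 3 column pointers
print(dense_size(3, 2))    # a full 3x2 matrix
```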
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107961223 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -161,6 +168,116 @@ sealed trait Matrix extends Serializable { */ @Since("2.0.0") def numActives: Int + + /** + * Converts this matrix to a sparse matrix. + * + * @param colMajor Whether the values of the resulting sparse matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toSparseColMajor: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a sparse matrix in row major order. + */ + @Since("2.2.0") + def toSparseRowMajor: SparseMatrix = toSparseMatrix(colMajor = false) + + /** + * Converts this matrix to a sparse matrix while maintaining the layout of the current matrix. + */ + @Since("2.2.0") + def toSparse: SparseMatrix = toSparseMatrix(colMajor = !isTransposed) + + /** + * Converts this matrix to a dense matrix. + * + * @param colMajor Whether the values of the resulting dense matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toDenseMatrix(colMajor: Boolean): DenseMatrix + + /** + * Converts this matrix to a dense matrix while maintaining the layout of the current matrix. + */ + @Since("2.2.0") + def toDense: DenseMatrix = toDenseMatrix(colMajor = !isTransposed) --- End diff -- `def toDense: DenseMatrix = toDenseMatrix(colMajor = isColMajor)` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
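The suggested rewrite reads more directly because a matrix is column major exactly when it is not transposed; a minimal sketch, assuming `isColMajor` is defined as the negation of `isTransposed` (as it is in the `Matrix` trait):

```scala
trait MatrixLayout {
  def isTransposed: Boolean
  // Assumed definition: column major iff not transposed, so
  // `toDenseMatrix(colMajor = isColMajor)` is equivalent to
  // `toDenseMatrix(colMajor = !isTransposed)` but states the intent directly.
  final def isColMajor: Boolean = !isTransposed
}
```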
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107844935 --- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala --- @@ -160,22 +160,395 @@ class MatricesSuite extends SparkMLFunSuite { assert(sparseMat.values(2) === 10.0) } - test("toSparse, toDense") { -val m = 3 -val n = 2 -val values = Array(1.0, 2.0, 4.0, 5.0) -val allValues = Array(1.0, 2.0, 0.0, 0.0, 4.0, 5.0) -val colPtrs = Array(0, 2, 4) -val rowIndices = Array(0, 1, 1, 2) + test("dense to dense") { +/* + dm1 = 4.0 2.0 -8.0 +-1.0 7.0 4.0 + + dm2 = 5.0 -9.0 4.0 +1.0 -3.0 -8.0 + */ +val dm1 = new DenseMatrix(2, 3, Array(4.0, -1.0, 2.0, 7.0, -8.0, 4.0)) +val dm2 = new DenseMatrix(2, 3, Array(5.0, -9.0, 4.0, 1.0, -3.0, -8.0), isTransposed = true) + +val dm3 = dm1.toDense +assert(dm3 === dm1) +assert(!dm3.isTransposed) +assert(dm3.values.equals(dm1.values)) + +val dm4 = dm1.toDenseRowMajor +assert(dm4 === dm1) +assert(dm4.isTransposed) +assert(dm4.values === Array(4.0, 2.0, -8.0, -1.0, 7.0, 4.0)) + +val dm5 = dm2.toDenseColMajor +assert(dm5 === dm2) +assert(!dm5.isTransposed) +assert(dm5.values === Array(5.0, 1.0, -9.0, -3.0, 4.0, -8.0)) + +val dm6 = dm2.toDenseRowMajor +assert(dm6 === dm2) +assert(dm6.isTransposed) +assert(dm6.values.equals(dm2.values)) + +val dm7 = dm1.toDenseRowMajor +assert(dm7 === dm1) +assert(dm7.isTransposed) +assert(dm7.values === Array(4.0, 2.0, -8.0, -1.0, 7.0, 4.0)) + +val dm8 = dm1.toDenseColMajor +assert(dm8 === dm1) +assert(!dm8.isTransposed) +assert(dm8.values.equals(dm1.values)) + +val dm9 = dm2.toDense +assert(dm9 === dm2) +assert(dm9.isTransposed) +assert(dm9.values.equals(dm2.values)) + } -val spMat1 = new SparseMatrix(m, n, colPtrs, rowIndices, values) -val deMat1 = new DenseMatrix(m, n, allValues) + test("dense to sparse") { +/* + dm1 = 0.0 4.0 5.0 +0.0 2.0 0.0 + + dm2 = 0.0 4.0 5.0 +0.0 2.0 0.0 -val spMat2 = deMat1.toSparse -val deMat2 = spMat1.toDense + dm3 = 0.0 0.0 0.0 +0.0 
0.0 0.0 + */ +val dm1 = new DenseMatrix(2, 3, Array(0.0, 0.0, 4.0, 2.0, 5.0, 0.0)) +val dm2 = new DenseMatrix(2, 3, Array(0.0, 4.0, 5.0, 0.0, 2.0, 0.0), isTransposed = true) +val dm3 = new DenseMatrix(2, 3, Array(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)) + +val sm1 = dm1.toSparseColMajor +assert(sm1 === dm1) +assert(!sm1.isTransposed) +assert(sm1.values === Array(4.0, 2.0, 5.0)) + +val sm2 = dm1.toSparseRowMajor +assert(sm2 === dm1) +assert(sm2.isTransposed) +assert(sm2.values === Array(4.0, 5.0, 2.0)) + +val sm3 = dm2.toSparseColMajor +assert(sm3 === dm2) +assert(!sm3.isTransposed) +assert(sm3.values === Array(4.0, 2.0, 5.0)) + +val sm4 = dm2.toSparseRowMajor +assert(sm4 === dm2) +assert(sm4.isTransposed) +assert(sm4.values === Array(4.0, 5.0, 2.0)) + +val sm5 = dm3.toSparseColMajor +assert(sm5 === dm3) +assert(sm5.values === Array.empty[Double]) +assert(!sm5.isTransposed) + +val sm6 = dm3.toSparseRowMajor +assert(sm6 === dm3) +assert(sm6.values === Array.empty[Double]) +assert(sm6.isTransposed) + +val sm7 = dm1.toSparse +assert(sm7 === dm1) +assert(sm7.values === Array(4.0, 2.0, 5.0)) +assert(!sm7.isTransposed) + +val sm8 = dm1.toSparseColMajor +assert(sm8 === dm1) +assert(sm8.values === Array(4.0, 2.0, 5.0)) +assert(!sm8.isTransposed) + +val sm9 = dm2.toSparseRowMajor +assert(sm9 === dm2) +assert(sm9.values === Array(4.0, 5.0, 2.0)) +assert(sm9.isTransposed) + +val sm10 = dm2.toSparse +assert(sm10 === dm2) +assert(sm10.values === Array(4.0, 5.0, 2.0)) +assert(sm10.isTransposed) + } + + test("sparse to sparse") { +/* + sm1 = sm2 = sm3 = sm4 = 0.0 4.0 5.0 + 0.0 2.0 0.0 + smZeros = 0.0 0.0 0.0 +0.0 0.0 0.0 + */ +val sm1 = new SparseMatrix(2, 3, Array(0, 0, 2, 3), Array(0, 1, 0), Array(4.0, 2.0, 5.0)) +val sm2 = new SparseMatrix(2, 3, Array(0, 2, 3), Array(1, 2, 1), Array(4.0, 5.0, 2.0), + isTransposed = true) +val sm3 = new Spars
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107831775 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -291,31 +404,49 @@ class DenseMatrix @Since("2.0.0") ( override def numActives: Int = values.length /** - * Generate a `SparseMatrix` from the given `DenseMatrix`. The new matrix will have isTransposed - * set to false. + * Generate a `SparseMatrix` from the given `DenseMatrix`. + * + * @param colMajor Whether the resulting `SparseMatrix` values will be in column major order. */ - @Since("2.0.0") - def toSparse: SparseMatrix = { -val spVals: MArrayBuilder[Double] = new MArrayBuilder.ofDouble -val colPtrs: Array[Int] = new Array[Int](numCols + 1) -val rowIndices: MArrayBuilder[Int] = new MArrayBuilder.ofInt -var nnz = 0 -var j = 0 -while (j < numCols) { - var i = 0 - while (i < numRows) { -val v = values(index(i, j)) -if (v != 0.0) { - rowIndices += i - spVals += v - nnz += 1 + private[ml] override def toSparseMatrix(colMajor: Boolean): SparseMatrix = { +if (!colMajor) this.transpose.toSparseMatrix(colMajor = true).transpose +else { + val spVals: MArrayBuilder[Double] = new MArrayBuilder.ofDouble + val colPtrs: Array[Int] = new Array[Int](numCols + 1) + val rowIndices: MArrayBuilder[Int] = new MArrayBuilder.ofInt + var nnz = 0 + var j = 0 + while (j < numCols) { +var i = 0 +while (i < numRows) { + val v = values(index(i, j)) + if (v != 0.0) { +rowIndices += i +spVals += v +nnz += 1 + } + i += 1 } -i += 1 +j += 1 +colPtrs(j) = nnz } - j += 1 - colPtrs(j) = nnz + new SparseMatrix(numRows, numCols, colPtrs, rowIndices.result(), spVals.result()) +} + } + + /** + * Generate a `DenseMatrix` from this `DenseMatrix`. + * + * @param colMajor Whether the resulting `DenseMatrix` values will be in column major order. 
+ */ + private[ml] override def toDenseMatrix(colMajor: Boolean): DenseMatrix = { +if (isTransposed && colMajor) { + new DenseMatrix(numRows, numCols, toArray, isTransposed = false) +} else if (!isTransposed && !colMajor) { + new DenseMatrix(numRows, numCols, transpose.toArray, isTransposed = true) +} else { + this } -new SparseMatrix(numRows, numCols, colPtrs, rowIndices.result(), spVals.result()) } --- End diff -- Sounds good. Let's do it in another PR. Thanks.
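The column-major branch of `toSparseMatrix` above can be read as a standalone helper (a hypothetical `denseToCsc`, not the Spark method itself): walk the column-major values, skip zeros, and record the running nonzero count as the column pointers.

```scala
import scala.collection.mutable.ArrayBuilder

object DenseToCsc {
  // Build CSC arrays (colPtrs, rowIndices, values) from a column-major dense
  // array, dropping zero entries. colPtrs(j + 1) holds the running nonzero
  // count after finishing column j, mirroring the loop quoted above.
  def denseToCsc(numRows: Int, numCols: Int, dense: Array[Double])
      : (Array[Int], Array[Int], Array[Double]) = {
    val spVals = new ArrayBuilder.ofDouble
    val rowIndices = new ArrayBuilder.ofInt
    val colPtrs = new Array[Int](numCols + 1)
    var nnz = 0
    var j = 0
    while (j < numCols) {
      var i = 0
      while (i < numRows) {
        val v = dense(j * numRows + i) // column-major index(i, j)
        if (v != 0.0) {
          rowIndices += i
          spVals += v
          nnz += 1
        }
        i += 1
      }
      j += 1
      colPtrs(j) = nnz
    }
    (colPtrs, rowIndices.result(), spVals.result())
  }
}
```

With `dm1` from the test suite quoted in this thread (the 2x3 column-major array `Array(0.0, 0.0, 4.0, 2.0, 5.0, 0.0)`), this yields `colPtrs = [0, 0, 2, 3]`, `rowIndices = [0, 1, 0]`, `values = [4.0, 2.0, 5.0]`, matching `sm1` in the suite.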
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107832496 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -161,6 +162,118 @@ sealed trait Matrix extends Serializable { */ @Since("2.0.0") def numActives: Int + + /** + * Converts this matrix to a sparse matrix. + * + * @param colMajor Whether the values of the resulting sparse matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toSparseColMajor: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a sparse matrix in row major order. + */ + @Since("2.2.0") + def toSparseRowMajor: SparseMatrix = toSparseMatrix(colMajor = false) + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toSparse: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a dense matrix. + * + * @param colMajor Whether the values of the resulting dense matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toDenseMatrix(colMajor: Boolean): DenseMatrix + + /** + * Converts this matrix to a dense matrix in column major order. + */ + @Since("2.2.0") + def toDense: DenseMatrix = toDenseMatrix(colMajor = true) + + /** + * Converts this matrix to a dense matrix in row major order. + */ + @Since("2.2.0") + def toDenseRowMajor: DenseMatrix = toDenseMatrix(colMajor = false) + + /** + * Converts this matrix to a dense matrix in column major order. + */ + @Since("2.2.0") + def toDenseColMajor: DenseMatrix = toDenseMatrix(colMajor = true) + + /** + * Returns a matrix in dense or sparse column major format, whichever uses less storage. 
+ */ + @Since("2.2.0") + def compressedColMajor: Matrix = { +if (getDenseSizeInBytes < getSparseSizeInBytes(colMajor = true)) { + toDenseColMajor +} else { + toSparseColMajor +} + } + + /** + * Returns a matrix in dense or sparse row major format, whichever uses less storage. + */ + @Since("2.2.0") + def compressedRowMajor: Matrix = { +if (getDenseSizeInBytes < getSparseSizeInBytes(colMajor = false)) { + toDenseRowMajor +} else { + toSparseRowMajor +} + } + + /** + * Returns a matrix in dense column major, dense row major, sparse row major, or sparse column + * major format, whichever uses less storage. When dense representation is optimal, it maintains + * the current layout order. + */ + @Since("2.2.0") + def compressed: Matrix = { +val cscSize = getSparseSizeInBytes(colMajor = true) +val csrSize = getSparseSizeInBytes(colMajor = false) +if (getDenseSizeInBytes < math.min(cscSize, csrSize)) { + // dense matrix size is the same for column major and row major, so maintain current layout + toDenseMatrix(!isTransposed) +} else { + if (cscSize <= csrSize) { +toSparseMatrix(colMajor = true) + } else { +toSparseMatrix(colMajor = false) --- End diff -- ditto
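The selection logic of `compressed` above can be sketched standalone (hypothetical helpers; the byte-size formulas, including the 12-byte dense array overhead and the 37-byte sparse overhead, are illustrative assumptions, not the exact Spark constants):

```scala
object CompressedFormat {
  // 8 bytes per Double entry plus an assumed array overhead.
  def denseSizeInBytes(numRows: Long, numCols: Long): Long =
    8L * numRows * numCols + 12L

  // Per the estimate discussed earlier in this thread.
  def sparseSizeInBytes(numActives: Long, numPtrs: Long): Long =
    12L * numActives + 4L * numPtrs + 37L

  // Pick the cheapest of dense, CSC, and CSR by estimated on-heap size,
  // mirroring the decision order in `compressed`.
  def pick(numRows: Long, numCols: Long, numNonzeros: Long): String = {
    val denseSize = denseSizeInBytes(numRows, numCols)
    val cscSize = sparseSizeInBytes(numNonzeros, numCols + 1)
    val csrSize = sparseSizeInBytes(numNonzeros, numRows + 1)
    if (denseSize < math.min(cscSize, csrSize)) "dense" // keep current layout
    else if (cscSize <= csrSize) "sparse column major"
    else "sparse row major"
  }
}
```

A fully populated 2x3 matrix stays dense, while a 100x100 matrix with 5 nonzeros compresses to sparse; the `cscSize <= csrSize` tie-break prefers column major, as in the diff above.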
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107836197 --- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala --- [quoted diff omitted; duplicate of the diff quoted above, with the comment text truncated in this archive]
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107836216 --- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala --- [quoted diff omitted; duplicate of the diff quoted above, with the comment text truncated in this archive]
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107837259 --- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala --- [quoted diff omitted; duplicate of the diff quoted above, with the comment text truncated in this archive]
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107837491 --- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala --- [quoted diff omitted; duplicate of the diff quoted above, with the comment text truncated in this archive]
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107834704 --- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala --- @@ -160,22 +160,395 @@ class MatricesSuite extends SparkMLFunSuite { ["dense to dense" diff omitted; duplicate of the diff quoted above] +val dm3 = dm1.toDense +assert(dm3 === dm1) +assert(!dm3.isTransposed) +assert(dm3.values.equals(dm1.values)) --- End diff -- Can you group the tests either by `dm1` and `dm2` or by the same methods?
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107832751 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -587,18 +720,67 @@ class SparseMatrix @Since("2.0.0") ( } } + override def numNonzeros: Int = values.count(_ != 0) + + override def numActives: Int = values.length + /** - * Generate a `DenseMatrix` from the given `SparseMatrix`. The new matrix will have isTransposed - * set to false. + * Generate a `SparseMatrix` from this `SparseMatrix`, removing explicit zero values if they + * exist. + * + * @param colMajor Whether or not the resulting `SparseMatrix` values are in column major + *order. */ - @Since("2.0.0") - def toDense: DenseMatrix = { -new DenseMatrix(numRows, numCols, toArray) + private[ml] override def toSparseMatrix(colMajor: Boolean): SparseMatrix = { +if (!isTransposed && !colMajor) { + // it is row major and we want col major, use breeze to remove explicit zeros + val breezeTransposed = asBreeze.asInstanceOf[BSM[Double]].t + Matrices.fromBreeze(breezeTransposed).transpose.asInstanceOf[SparseMatrix] +} else if (isTransposed && colMajor) { + // it is col major and we want row major, use breeze to remove explicit zeros + val breezeTransposed = asBreeze.asInstanceOf[BSM[Double]] + Matrices.fromBreeze(breezeTransposed).asInstanceOf[SparseMatrix] +} else { + val nnz = numNonzeros + if (nnz != numActives) { +// remove explicit zeros +val rr = new Array[Int](nnz) +val vv = new Array[Double](nnz) +val numPtrs = if (isTransposed) numRows else numCols +val cc = new Array[Int](numPtrs + 1) +var nzIdx = 0 +var j = 0 +while (j < numPtrs) { + var idx = colPtrs(j) + val idxEnd = colPtrs(j + 1) + cc(j) = nzIdx + while (idx < idxEnd) { +if (values(idx) != 0.0) { + vv(nzIdx) = values(idx) + rr(nzIdx) = rowIndices(idx) + nzIdx += 1 +} +idx += 1 + } + j += 1 +} +cc(j) = nnz +new SparseMatrix(numRows, numCols, cc, rr, vv, isTransposed = isTransposed) + } else { +this 
+ } +} } - override def numNonzeros: Int = values.count(_ != 0) - - override def numActives: Int = values.length + /** + * Generate a `DenseMatrix` from the given `SparseMatrix`. + * + * @param colMajor Whether the resulting `DenseMatrix` values are in column major order. + */ + private[ml] override def toDenseMatrix(colMajor: Boolean): DenseMatrix = { +if (colMajor) new DenseMatrix(numRows, numCols, toArray) --- End diff -- `new DenseMatrix(numRows, numCols, this.toArray, isTransposed = false)` to make the style consistent.
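The explicit-zero pruning loop quoted above can be extracted as a standalone sketch over raw CSC (or, symmetrically, CSR) arrays; `dropExplicitZeros` is a hypothetical helper, not the Spark method:

```scala
object PruneZeros {
  // Remove explicit zeros from CSC arrays, mirroring the loop in
  // `SparseMatrix.toSparseMatrix` above: copy only nonzero entries and
  // rebuild the pointer array from the running nonzero index.
  def dropExplicitZeros(
      colPtrs: Array[Int],
      rowIndices: Array[Int],
      values: Array[Double]): (Array[Int], Array[Int], Array[Double]) = {
    val nnz = values.count(_ != 0.0)
    val cc = new Array[Int](colPtrs.length)
    val rr = new Array[Int](nnz)
    val vv = new Array[Double](nnz)
    var nzIdx = 0
    var j = 0
    while (j < colPtrs.length - 1) {
      cc(j) = nzIdx
      var idx = colPtrs(j)
      while (idx < colPtrs(j + 1)) {
        if (values(idx) != 0.0) {
          rr(nzIdx) = rowIndices(idx)
          vv(nzIdx) = values(idx)
          nzIdx += 1
        }
        idx += 1
      }
      j += 1
    }
    cc(colPtrs.length - 1) = nnz
    (cc, rr, vv)
  }
}
```

As in the diff, when `numNonzeros == numActives` there is nothing to prune and the original arrays can be returned unchanged.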
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107836205
--- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala ---
@@ -160,22 +160,395 @@ class MatricesSuite extends SparkMLFunSuite {
     assert(sparseMat.values(2) === 10.0)
   }

-  test("toSparse, toDense") {
-    val m = 3
-    val n = 2
-    val values = Array(1.0, 2.0, 4.0, 5.0)
-    val allValues = Array(1.0, 2.0, 0.0, 0.0, 4.0, 5.0)
-    val colPtrs = Array(0, 2, 4)
-    val rowIndices = Array(0, 1, 1, 2)
+  test("dense to dense") {
+    /*
+      dm1 =  4.0 2.0 -8.0
+            -1.0 7.0  4.0
+
+      dm2 =  5.0 -9.0  4.0
+             1.0 -3.0 -8.0
+     */
+    val dm1 = new DenseMatrix(2, 3, Array(4.0, -1.0, 2.0, 7.0, -8.0, 4.0))
+    val dm2 = new DenseMatrix(2, 3, Array(5.0, -9.0, 4.0, 1.0, -3.0, -8.0), isTransposed = true)
+
+    val dm3 = dm1.toDense
+    assert(dm3 === dm1)
+    assert(!dm3.isTransposed)
+    assert(dm3.values.equals(dm1.values))
+
+    val dm4 = dm1.toDenseRowMajor
+    assert(dm4 === dm1)
+    assert(dm4.isTransposed)
+    assert(dm4.values === Array(4.0, 2.0, -8.0, -1.0, 7.0, 4.0))
+
+    val dm5 = dm2.toDenseColMajor
+    assert(dm5 === dm2)
+    assert(!dm5.isTransposed)
+    assert(dm5.values === Array(5.0, 1.0, -9.0, -3.0, 4.0, -8.0))
+
+    val dm6 = dm2.toDenseRowMajor
+    assert(dm6 === dm2)
+    assert(dm6.isTransposed)
+    assert(dm6.values.equals(dm2.values))
+
+    val dm7 = dm1.toDenseRowMajor
+    assert(dm7 === dm1)
+    assert(dm7.isTransposed)
+    assert(dm7.values === Array(4.0, 2.0, -8.0, -1.0, 7.0, 4.0))
+
+    val dm8 = dm1.toDenseColMajor
+    assert(dm8 === dm1)
+    assert(!dm8.isTransposed)
+    assert(dm8.values.equals(dm1.values))
+
+    val dm9 = dm2.toDense
+    assert(dm9 === dm2)
+    assert(dm9.isTransposed)
+    assert(dm9.values.equals(dm2.values))
+  }

-    val spMat1 = new SparseMatrix(m, n, colPtrs, rowIndices, values)
-    val deMat1 = new DenseMatrix(m, n, allValues)
+  test("dense to sparse") {
+    /*
+      dm1 = 0.0 4.0 5.0
+            0.0 2.0 0.0
+
+      dm2 = 0.0 4.0 5.0
+            0.0 2.0 0.0

-    val spMat2 = deMat1.toSparse
-    val deMat2 = spMat1.toDense
+      dm3 = 0.0 0.0 0.0
+            0.0 0.0 0.0
+     */
+    val dm1 = new DenseMatrix(2, 3, Array(0.0, 0.0, 4.0, 2.0, 5.0, 0.0))
+    val dm2 = new DenseMatrix(2, 3, Array(0.0, 4.0, 5.0, 0.0, 2.0, 0.0), isTransposed = true)
+    val dm3 = new DenseMatrix(2, 3, Array(0.0, 0.0, 0.0, 0.0, 0.0, 0.0))
+
+    val sm1 = dm1.toSparseColMajor
+    assert(sm1 === dm1)
+    assert(!sm1.isTransposed)
+    assert(sm1.values === Array(4.0, 2.0, 5.0))
+
+    val sm2 = dm1.toSparseRowMajor
+    assert(sm2 === dm1)
+    assert(sm2.isTransposed)
+    assert(sm2.values === Array(4.0, 5.0, 2.0))
+
+    val sm3 = dm2.toSparseColMajor
+    assert(sm3 === dm2)
+    assert(!sm3.isTransposed)
+    assert(sm3.values === Array(4.0, 2.0, 5.0))
+
+    val sm4 = dm2.toSparseRowMajor
+    assert(sm4 === dm2)
+    assert(sm4.isTransposed)
+    assert(sm4.values === Array(4.0, 5.0, 2.0))
+
+    val sm5 = dm3.toSparseColMajor
+    assert(sm5 === dm3)
+    assert(sm5.values === Array.empty[Double])
+    assert(!sm5.isTransposed)
+
+    val sm6 = dm3.toSparseRowMajor
+    assert(sm6 === dm3)
+    assert(sm6.values === Array.empty[Double])
+    assert(sm6.isTransposed)
+
+    val sm7 = dm1.toSparse
+    assert(sm7 === dm1)
+    assert(sm7.values === Array(4.0, 2.0, 5.0))
+    assert(!sm7.isTransposed)
+
+    val sm8 = dm1.toSparseColMajor
+    assert(sm8 === dm1)
+    assert(sm8.values === Array(4.0, 2.0, 5.0))
+    assert(!sm8.isTransposed)
+
+    val sm9 = dm2.toSparseRowMajor
+    assert(sm9 === dm2)
+    assert(sm9.values === Array(4.0, 5.0, 2.0))
+    assert(sm9.isTransposed)
+
+    val sm10 = dm2.toSparse
+    assert(sm10 === dm2)
+    assert(sm10.values === Array(4.0, 5.0, 2.0))
+    assert(sm10.isTransposed)
+  }
+
+  test("sparse to sparse") {
+    /*
+      sm1 = sm2 = sm3 = sm4 = 0.0 4.0 5.0
+                              0.0 2.0 0.0
+      smZeros = 0.0 0.0 0.0
+                0.0 0.0 0.0
+     */
+    val sm1 = new SparseMatrix(2, 3, Array(0, 0, 2, 3), Array(0, 1, 0), Array(4.0, 2.0, 5.0))
+    val sm2 = new SparseMatrix(2, 3, Array(0, 2, 3), Array(1, 2, 1), Array(4.0, 5.0, 2.0),
+      isTransposed = true)
+    val sm3 = new Spars
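The `dense to sparse` assertions above check the CSC (compressed sparse column) arrays that `toSparseColMajor` produces. As a minimal standalone sketch, assuming only column-major dense storage (the `CscSketch.denseToCsc` helper is hypothetical, not Spark code), the same arrays can be derived by scanning the dense values one column at a time:

```scala
// Hypothetical sketch of a dense -> CSC conversion: walk the column-major
// dense values, record nonzeros per column, and close each column pointer.
object CscSketch {
  def denseToCsc(numRows: Int, numCols: Int, dense: Array[Double])
      : (Array[Int], Array[Int], Array[Double]) = {
    val colPtrs = new Array[Int](numCols + 1)
    val rowIndices = scala.collection.mutable.ArrayBuffer.empty[Int]
    val values = scala.collection.mutable.ArrayBuffer.empty[Double]
    for (j <- 0 until numCols) {
      for (i <- 0 until numRows) {
        val v = dense(j * numRows + i) // column-major layout
        if (v != 0.0) { rowIndices += i; values += v }
      }
      colPtrs(j + 1) = values.length // pointer past the end of column j
    }
    (colPtrs, rowIndices.toArray, values.toArray)
  }

  def main(args: Array[String]): Unit = {
    // dm1 from the test: 0.0 4.0 5.0 / 0.0 2.0 0.0
    val (ptrs, rows, vals) = denseToCsc(2, 3, Array(0.0, 0.0, 4.0, 2.0, 5.0, 0.0))
    println(ptrs.mkString(","))  // 0,0,2,3
    println(rows.mkString(","))  // 0,1,0
    println(vals.mkString(","))  // 4.0,2.0,5.0
  }
}
```

These are exactly the `colPtrs`, `rowIndices`, and `values` arrays the test passes to the `SparseMatrix` constructor (`Array(0, 0, 2, 3)`, `Array(0, 1, 0)`, `Array(4.0, 2.0, 5.0)`).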
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107835539 --- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala ---
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107835746 --- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala ---
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107837113 --- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala ---
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107836138 --- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala ---
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107832900
--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -161,6 +162,118 @@ sealed trait Matrix extends Serializable {
    */
   @Since("2.0.0")
   def numActives: Int
+
+  /**
+   * Converts this matrix to a sparse matrix.
+   *
+   * @param colMajor Whether the values of the resulting sparse matrix should be in column major
+   *                 or row major order. If `false`, resulting matrix will be row major.
+   */
+  private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix
+
+  /**
+   * Converts this matrix to a sparse matrix in column major order.
+   */
+  @Since("2.2.0")
+  def toSparseColMajor: SparseMatrix = toSparseMatrix(colMajor = true)
+
+  /**
+   * Converts this matrix to a sparse matrix in row major order.
+   */
+  @Since("2.2.0")
+  def toSparseRowMajor: SparseMatrix = toSparseMatrix(colMajor = false)
+
+  /**
+   * Converts this matrix to a sparse matrix in column major order.
+   */
+  @Since("2.2.0")
+  def toSparse: SparseMatrix = toSparseMatrix(colMajor = !isTransposed)
+
+  /**
+   * Converts this matrix to a dense matrix.
+   *
+   * @param colMajor Whether the values of the resulting dense matrix should be in column major
+   *                 or row major order. If `false`, resulting matrix will be row major.
+   */
+  private[ml] def toDenseMatrix(colMajor: Boolean): DenseMatrix
+
+  /**
+   * Converts this matrix to a dense matrix in column major order.
+   */
+  @Since("2.2.0")
+  def toDense: DenseMatrix = toDenseMatrix(colMajor = !isTransposed)
+
+  /**
+   * Converts this matrix to a dense matrix in row major order.
+   */
+  @Since("2.2.0")
+  def toDenseRowMajor: DenseMatrix = toDenseMatrix(colMajor = false)
+
+  /**
+   * Converts this matrix to a dense matrix in column major order.
+   */
+  @Since("2.2.0")
+  def toDenseColMajor: DenseMatrix = toDenseMatrix(colMajor = true)
+
+  /**
+   * Returns a matrix in dense or sparse column major format, whichever uses less storage.
+   */
+  @Since("2.2.0")
+  def compressedColMajor: Matrix = {
+    if (getDenseSizeInBytes < getSparseSizeInBytes(colMajor = true)) {
+      toDenseColMajor
--- End diff --

nit, with `this`

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
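The `compressedColMajor` method in the diff above picks dense or sparse column-major storage by comparing estimated sizes. A minimal standalone sketch of that choice follows; the byte-count formulas here are illustrative assumptions (8 bytes per double, 4 bytes per int index), not Spark's actual `getDenseSizeInBytes`/`getSparseSizeInBytes` accounting:

```scala
// Hypothetical sketch of the storage-size heuristic behind compressedColMajor:
// prefer the representation with the smaller estimated footprint.
object CompressedSketch {
  // Dense column-major: one 8-byte double per entry.
  def denseSizeInBytes(numRows: Int, numCols: Int): Long =
    8L * numRows * numCols

  // Sparse CSC: 8 bytes per stored value, 4 bytes per row index,
  // plus 4 bytes per column pointer (numCols + 1 of them).
  def sparseSizeInBytes(numCols: Int, nnz: Int): Long =
    12L * nnz + 4L * (numCols + 1)

  def preferSparse(numRows: Int, numCols: Int, nnz: Int): Boolean =
    sparseSizeInBytes(numCols, nnz) < denseSizeInBytes(numRows, numCols)

  def main(args: Array[String]): Unit = {
    // A 100x100 matrix with 10 nonzeros is far cheaper sparse (524 vs 80000 bytes)...
    println(preferSparse(100, 100, 10))  // true
    // ...while a fully dense 2x2 is cheaper dense (32 vs 60 bytes).
    println(preferSparse(2, 2, 4))       // false
  }
}
```

Under these assumed formulas sparse wins roughly when fewer than two thirds of the entries are nonzero, which is why the method exists at all: callers get the compact format without checking sparsity themselves.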
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107837448 --- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala ---
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107836155 --- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala ---
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107835989 --- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala ---
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15628#discussion_r107835326

--- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala ---

@@ -160,22 +160,395 @@ class MatricesSuite extends SparkMLFunSuite {
[…]
+    val dm3 = dm1.toDense
+    assert(dm3 === dm1)
+    assert(!dm3.isTransposed)
+    assert(dm3.values.equals(dm1.values))

--- End diff --

`val dm4 = dm1.toDenseRowMajor` and `val dm7 = dm1.toDenseRowMajor` are the same.

---
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15628#discussion_r107832481

--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---

@@ -161,6 +162,118 @@ sealed trait Matrix extends Serializable {
    */
   @Since("2.0.0")
   def numActives: Int
+
+  /**
+   * Converts this matrix to a sparse matrix.
+   *
+   * @param colMajor Whether the values of the resulting sparse matrix should be in column major
+   *                 or row major order. If `false`, resulting matrix will be row major.
+   */
+  private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix
+
+  /**
+   * Converts this matrix to a sparse matrix in column major order.
+   */
+  @Since("2.2.0")
+  def toSparseColMajor: SparseMatrix = toSparseMatrix(colMajor = true)
+
+  /**
+   * Converts this matrix to a sparse matrix in row major order.
+   */
+  @Since("2.2.0")
+  def toSparseRowMajor: SparseMatrix = toSparseMatrix(colMajor = false)
+
+  /**
+   * Converts this matrix to a sparse matrix in column major order.
+   */
+  @Since("2.2.0")
+  def toSparse: SparseMatrix = toSparseMatrix(colMajor = true)
+
+  /**
+   * Converts this matrix to a dense matrix.
+   *
+   * @param colMajor Whether the values of the resulting dense matrix should be in column major
+   *                 or row major order. If `false`, resulting matrix will be row major.
+   */
+  private[ml] def toDenseMatrix(colMajor: Boolean): DenseMatrix
+
+  /**
+   * Converts this matrix to a dense matrix in column major order.
+   */
+  @Since("2.2.0")
+  def toDense: DenseMatrix = toDenseMatrix(colMajor = true)
+
+  /**
+   * Converts this matrix to a dense matrix in row major order.
+   */
+  @Since("2.2.0")
+  def toDenseRowMajor: DenseMatrix = toDenseMatrix(colMajor = false)
+
+  /**
+   * Converts this matrix to a dense matrix in column major order.
+   */
+  @Since("2.2.0")
+  def toDenseColMajor: DenseMatrix = toDenseMatrix(colMajor = true)
+
+  /**
+   * Returns a matrix in dense or sparse column major format, whichever uses less storage.
+   */
+  @Since("2.2.0")
+  def compressedColMajor: Matrix = {
+    if (getDenseSizeInBytes < getSparseSizeInBytes(colMajor = true)) {
+      toDenseColMajor
+    } else {
+      toSparseColMajor
+    }
+  }
+
+  /**
+   * Returns a matrix in dense or sparse row major format, whichever uses less storage.
+   */
+  @Since("2.2.0")
+  def compressedRowMajor: Matrix = {
+    if (getDenseSizeInBytes < getSparseSizeInBytes(colMajor = false)) {
+      toDenseRowMajor
+    } else {
+      toSparseRowMajor
+    }
+  }
+
+  /**
+   * Returns a matrix in dense column major, dense row major, sparse row major, or sparse column
+   * major format, whichever uses less storage. When dense representation is optimal, it maintains
+   * the current layout order.
+   */
+  @Since("2.2.0")
+  def compressed: Matrix = {
+    val cscSize = getSparseSizeInBytes(colMajor = true)
+    val csrSize = getSparseSizeInBytes(colMajor = false)
+    if (getDenseSizeInBytes < math.min(cscSize, csrSize)) {
+      // dense matrix size is the same for column major and row major, so maintain current layout
+      toDenseMatrix(!isTransposed)
+    } else {
+      if (cscSize <= csrSize) {
+        toSparseMatrix(colMajor = true)

--- End diff --

ditto.

---
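The `compressedColMajor`/`compressedRowMajor` methods quoted above are just byte-size comparisons between the dense and sparse representations. A self-contained sketch of that selection logic; the cost model (8 bytes per `Double` value, 4 bytes per `Int` index or pointer) and the helper names are assumptions for illustration, not Spark's actual `getDenseSizeInBytes`/`getSparseSizeInBytes` arithmetic:

```scala
// Hypothetical sketch of the storage-size comparison behind compressedColMajor.
object CompressedSketch {
  // Dense storage: one Double per entry.
  def denseSizeInBytes(numRows: Int, numCols: Int): Long =
    8L * numRows * numCols

  // CSC/CSR storage: one Double value + one Int index per nonzero,
  // plus a pointer array of length (numCols + 1) for CSC or (numRows + 1) for CSR.
  def sparseSizeInBytes(numRows: Int, numCols: Int, nnz: Int, colMajor: Boolean): Long = {
    val ptrLen = if (colMajor) numCols + 1 else numRows + 1
    12L * nnz + 4L * ptrLen
  }

  // Mirrors the if/else shape of compressedColMajor: pick whichever is smaller.
  def chooseColMajor(numRows: Int, numCols: Int, nnz: Int): String =
    if (denseSizeInBytes(numRows, numCols) <
        sparseSizeInBytes(numRows, numCols, nnz, colMajor = true)) "dense"
    else "sparse"
}
```

Under this model a 100x100 matrix with 10 nonzeros selects the sparse form, while the same shape with 10000 nonzeros selects dense; the CSC and CSR sizes differ only in the pointer-array length, which is why `compressed` compares both before deciding.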
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15628#discussion_r107837410

--- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala ---

[…]
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15628#discussion_r107835213

--- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala ---

@@ -160,22 +160,395 @@ class MatricesSuite extends SparkMLFunSuite {
[…]
+    val sm1 = dm1.toSparseColMajor
+    assert(sm1 === dm1)
+    assert(!sm1.isTransposed)
+    assert(sm1.values === Array(4.0, 2.0, 5.0))
+

--- End diff --

You tested `dm1.toSparseColMajor` twice. It would be nice to group them, like:

```scala
val sm1 = dm1.toSparseColMajor
val sm2 = dm2.toSparseColMajor
val sm3 = dm3.toSparseColMajor

val sm4 = dm1.toSparseRowMajor
val sm5 = dm2.toSparseRowMajor
val sm6 = dm3.toSparseRowMajor

val sm7 = dm1.toSparse
val sm8 = dm2.toSparse
val sm9 = dm3.toSparse
```

---
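For reference, the column-major scan that `toSparseColMajor` performs on a dense matrix like `dm1` above can be sketched stand-alone; `toCsc` is a hypothetical helper (plain Scala, no Spark dependency), not the actual `DenseMatrix` API:

```scala
// Sketch of a dense (column-major) -> CSC conversion, mirroring the nested
// while-loop scan in DenseMatrix.toSparseMatrix: walk each column, collect
// nonzeros, and record the running nonzero count in colPtrs.
object DenseToCscSketch {
  def toCsc(numRows: Int, numCols: Int, values: Array[Double])
      : (Array[Int], Array[Int], Array[Double]) = {
    val colPtrs = new Array[Int](numCols + 1)
    val rowIndices = scala.collection.mutable.ArrayBuffer.empty[Int]
    val spVals = scala.collection.mutable.ArrayBuffer.empty[Double]
    var nnz = 0
    var j = 0
    while (j < numCols) {
      var i = 0
      while (i < numRows) {
        val v = values(j * numRows + i) // column-major index of entry (i, j)
        if (v != 0.0) {
          rowIndices += i
          spVals += v
          nnz += 1
        }
        i += 1
      }
      j += 1
      colPtrs(j) = nnz // colPtrs(j+1) = nonzeros seen through column j
    }
    (colPtrs, rowIndices.toArray, spVals.toArray)
  }
}
```

Running it on the `dm1` values from the test, `toCsc(2, 3, Array(0.0, 0.0, 4.0, 2.0, 5.0, 0.0))`, yields `colPtrs = (0, 0, 2, 3)`, `rowIndices = (0, 1, 0)`, and `values = (4.0, 2.0, 5.0)`, matching the expected `sm1` above.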
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15628#discussion_r107832469

--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---

@@ -161,6 +162,118 @@ sealed trait Matrix extends Serializable {
[…]
+  @Since("2.2.0")
+  def compressed: Matrix = {
+    val cscSize = getSparseSizeInBytes(colMajor = true)
+    val csrSize = getSparseSizeInBytes(colMajor = false)
+    if (getDenseSizeInBytes < math.min(cscSize, csrSize)) {
+      // dense matrix size is the same for column major and row major, so maintain current layout
+      toDenseMatrix(!isTransposed)

--- End diff --

I don't see the change here.

---
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15628#discussion_r107835655

--- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala ---

[…]
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15628#discussion_r107832867

--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---

@@ -161,6 +162,118 @@ sealed trait Matrix extends Serializable {
[…]
+    if (getDenseSizeInBytes < math.min(cscSize, csrSize)) {
+      // dense matrix size is the same for column major and row major, so maintain current layout
+      toDenseMatrix(!isTransposed)

--- End diff --

`this.toDense`

---
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15628#discussion_r107832826

--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---

@@ -291,31 +404,49 @@ class DenseMatrix @Since("2.0.0") (
   override def numActives: Int = values.length

   /**
-   * Generate a `SparseMatrix` from the given `DenseMatrix`. The new matrix will have isTransposed
-   * set to false.
+   * Generate a `SparseMatrix` from the given `DenseMatrix`.
+   *
+   * @param colMajor Whether the resulting `SparseMatrix` values will be in column major order.
    */
-  @Since("2.0.0")
-  def toSparse: SparseMatrix = {
-    val spVals: MArrayBuilder[Double] = new MArrayBuilder.ofDouble
-    val colPtrs: Array[Int] = new Array[Int](numCols + 1)
-    val rowIndices: MArrayBuilder[Int] = new MArrayBuilder.ofInt
-    var nnz = 0
-    var j = 0
-    while (j < numCols) {
-      var i = 0
-      while (i < numRows) {
-        val v = values(index(i, j))
-        if (v != 0.0) {
-          rowIndices += i
-          spVals += v
-          nnz += 1
+  private[ml] override def toSparseMatrix(colMajor: Boolean): SparseMatrix = {
+    if (!colMajor) this.transpose.toSparseMatrix(colMajor = true).transpose
+    else {
+      val spVals: MArrayBuilder[Double] = new MArrayBuilder.ofDouble
+      val colPtrs: Array[Int] = new Array[Int](numCols + 1)
+      val rowIndices: MArrayBuilder[Int] = new MArrayBuilder.ofInt
+      var nnz = 0
+      var j = 0
+      while (j < numCols) {
+        var i = 0
+        while (i < numRows) {
+          val v = values(index(i, j))
+          if (v != 0.0) {
+            rowIndices += i
+            spVals += v
+            nnz += 1
+          }
+          i += 1
         }
-        i += 1
+        j += 1
+        colPtrs(j) = nnz
       }
-      j += 1
-      colPtrs(j) = nnz
+      new SparseMatrix(numRows, numCols, colPtrs, rowIndices.result(), spVals.result())
+    }
+  }
+
+  /**
+   * Generate a `DenseMatrix` from this `DenseMatrix`.
+   *
+   * @param colMajor Whether the resulting `DenseMatrix` values will be in column major order.
+   */
+  private[ml] override def toDenseMatrix(colMajor: Boolean): DenseMatrix = {
+    if (isTransposed && colMajor) {
+      new DenseMatrix(numRows, numCols, toArray, isTransposed = false)
+    } else if (!isTransposed && !colMajor) {
+      new DenseMatrix(numRows, numCols, transpose.toArray, isTransposed = true)

--- End diff --

I'll call `this.toArray` and `this.transpose.toArray`, as you did in other places, to make it explicit.

---
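The layout flip being discussed (`transpose.toArray` to turn a column-major values array into a row-major one) boils down to re-reading the array through a transposed index. A minimal stand-alone sketch, independent of Spark (`colMajorToRowMajor` is an illustrative helper, not part of the `Matrix` API):

```scala
// Sketch of converting a column-major values array to row-major order.
// Column-major: entry (i, j) lives at values(i + j * numRows).
// Row-major:    entry (i, j) lives at out(i * numCols + j).
object LayoutSketch {
  def colMajorToRowMajor(numRows: Int, numCols: Int, values: Array[Double]): Array[Double] = {
    val out = new Array[Double](values.length)
    var i = 0
    while (i < numRows) {
      var j = 0
      while (j < numCols) {
        out(i * numCols + j) = values(i + j * numRows)
        j += 1
      }
      i += 1
    }
    out
  }
}
```

Applied to the `dm1` values from the earlier test, `colMajorToRowMajor(2, 3, Array(4.0, -1.0, 2.0, 7.0, -8.0, 4.0))` gives `Array(4.0, 2.0, -8.0, -1.0, 7.0, 4.0)`, which is exactly the `dm4.values` expected when `dm1.toDenseRowMajor` is called.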
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15628#discussion_r107784952

--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---

@@ -161,6 +162,118 @@ sealed trait Matrix extends Serializable {
[…]
+    } else {
+      if (cscSize <= csrSize) {
+        toSparseMatrix(colMajor = true)
+      } else {
+        toSparseMatrix(colMajor = false)

--- End diff --

`toSparseRowMajor`

---
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107785054 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -161,6 +162,118 @@ sealed trait Matrix extends Serializable { */ @Since("2.0.0") def numActives: Int + + /** + * Converts this matrix to a sparse matrix. + * + * @param colMajor Whether the values of the resulting sparse matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toSparseColMajor: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a sparse matrix in row major order. + */ + @Since("2.2.0") + def toSparseRowMajor: SparseMatrix = toSparseMatrix(colMajor = false) + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toSparse: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a dense matrix. + * + * @param colMajor Whether the values of the resulting dense matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toDenseMatrix(colMajor: Boolean): DenseMatrix + + /** + * Converts this matrix to a dense matrix in column major order. + */ + @Since("2.2.0") + def toDense: DenseMatrix = toDenseMatrix(colMajor = true) + + /** + * Converts this matrix to a dense matrix in row major order. + */ + @Since("2.2.0") + def toDenseRowMajor: DenseMatrix = toDenseMatrix(colMajor = false) + + /** + * Converts this matrix to a dense matrix in column major order. + */ + @Since("2.2.0") + def toDenseColMajor: DenseMatrix = toDenseMatrix(colMajor = true) + + /** + * Returns a matrix in dense or sparse column major format, whichever uses less storage. 
+ */ + @Since("2.2.0") + def compressedColMajor: Matrix = { +if (getDenseSizeInBytes < getSparseSizeInBytes(colMajor = true)) { + toDenseColMajor +} else { + toSparseColMajor +} + } + + /** + * Returns a matrix in dense or sparse row major format, whichever uses less storage. + */ + @Since("2.2.0") + def compressedRowMajor: Matrix = { +if (getDenseSizeInBytes < getSparseSizeInBytes(colMajor = false)) { + toDenseRowMajor +} else { + toSparseRowMajor +} + } + + /** + * Returns a matrix in dense column major, dense row major, sparse row major, or sparse column + * major format, whichever uses less storage. When dense representation is optimal, it maintains + * the current layout order. + */ + @Since("2.2.0") + def compressed: Matrix = { +val cscSize = getSparseSizeInBytes(colMajor = true) +val csrSize = getSparseSizeInBytes(colMajor = false) +if (getDenseSizeInBytes < math.min(cscSize, csrSize)) { + // dense matrix size is the same for column major and row major, so maintain current layout + toDenseMatrix(!isTransposed) --- End diff -- Call `toDense` if we decide to make `toDense` and `toSparse` outputting the same layout. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
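The `compressedColMajor`/`compressedRowMajor` branches above reduce to a byte-size comparison between the dense and sparse encodings. A minimal standalone sketch of that heuristic follows; the byte counts are illustrative assumptions (8 bytes per double, 4 per int index/pointer), not Spark's exact size accounting:

```scala
// Approximate storage cost of an m x n dense matrix: 8 bytes per double value.
def denseSizeInBytes(numRows: Int, numCols: Int): Long =
  8L * numRows * numCols

// Approximate CSC/CSR cost: 8 bytes per stored value plus 4 per row/col index,
// plus 4 bytes per pointer entry (numMajor + 1 entries in colPtrs/rowPtrs).
def sparseSizeInBytes(numMajor: Int, nnz: Int): Long =
  12L * nnz + 4L * (numMajor + 1)

// Pick the cheaper representation, mirroring compressedColMajor's branch:
// dense wins only when it is strictly smaller than the sparse encoding.
def preferDense(numRows: Int, numCols: Int, nnz: Int): Boolean =
  denseSizeInBytes(numRows, numCols) < sparseSizeInBytes(numCols, nnz)
```

Under these assumed costs, dense storage wins for a 100x100 matrix once roughly two thirds of the entries are nonzero, which matches the intuition behind the `compressed` family of methods.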
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107784897 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -161,6 +162,118 @@ sealed trait Matrix extends Serializable { */ @Since("2.0.0") def numActives: Int + + /** + * Converts this matrix to a sparse matrix. + * + * @param colMajor Whether the values of the resulting sparse matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toSparseColMajor: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a sparse matrix in row major order. + */ + @Since("2.2.0") + def toSparseRowMajor: SparseMatrix = toSparseMatrix(colMajor = false) + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toSparse: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a dense matrix. + * + * @param colMajor Whether the values of the resulting dense matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toDenseMatrix(colMajor: Boolean): DenseMatrix + + /** + * Converts this matrix to a dense matrix in column major order. + */ + @Since("2.2.0") + def toDense: DenseMatrix = toDenseMatrix(colMajor = true) + + /** + * Converts this matrix to a dense matrix in row major order. + */ + @Since("2.2.0") + def toDenseRowMajor: DenseMatrix = toDenseMatrix(colMajor = false) + + /** + * Converts this matrix to a dense matrix in column major order. + */ + @Since("2.2.0") + def toDenseColMajor: DenseMatrix = toDenseMatrix(colMajor = true) + + /** + * Returns a matrix in dense or sparse column major format, whichever uses less storage. 
+ */ + @Since("2.2.0") + def compressedColMajor: Matrix = { +if (getDenseSizeInBytes < getSparseSizeInBytes(colMajor = true)) { + toDenseColMajor +} else { + toSparseColMajor +} + } + + /** + * Returns a matrix in dense or sparse row major format, whichever uses less storage. + */ + @Since("2.2.0") + def compressedRowMajor: Matrix = { +if (getDenseSizeInBytes < getSparseSizeInBytes(colMajor = false)) { + toDenseRowMajor +} else { + toSparseRowMajor +} + } + + /** + * Returns a matrix in dense column major, dense row major, sparse row major, or sparse column + * major format, whichever uses less storage. When dense representation is optimal, it maintains + * the current layout order. + */ + @Since("2.2.0") + def compressed: Matrix = { +val cscSize = getSparseSizeInBytes(colMajor = true) +val csrSize = getSparseSizeInBytes(colMajor = false) +if (getDenseSizeInBytes < math.min(cscSize, csrSize)) { + // dense matrix size is the same for column major and row major, so maintain current layout + toDenseMatrix(!isTransposed) +} else { + if (cscSize <= csrSize) { +toSparseMatrix(colMajor = true) --- End diff -- `toSparseColumnMajor` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107794035 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -291,31 +404,49 @@ class DenseMatrix @Since("2.0.0") ( override def numActives: Int = values.length /** - * Generate a `SparseMatrix` from the given `DenseMatrix`. The new matrix will have isTransposed - * set to false. + * Generate a `SparseMatrix` from the given `DenseMatrix`. + * + * @param colMajor Whether the resulting `SparseMatrix` values will be in column major order. */ - @Since("2.0.0") - def toSparse: SparseMatrix = { -val spVals: MArrayBuilder[Double] = new MArrayBuilder.ofDouble -val colPtrs: Array[Int] = new Array[Int](numCols + 1) -val rowIndices: MArrayBuilder[Int] = new MArrayBuilder.ofInt -var nnz = 0 -var j = 0 -while (j < numCols) { - var i = 0 - while (i < numRows) { -val v = values(index(i, j)) -if (v != 0.0) { - rowIndices += i - spVals += v - nnz += 1 + private[ml] override def toSparseMatrix(colMajor: Boolean): SparseMatrix = { +if (!colMajor) this.transpose.toSparseMatrix(colMajor = true).transpose +else { + val spVals: MArrayBuilder[Double] = new MArrayBuilder.ofDouble + val colPtrs: Array[Int] = new Array[Int](numCols + 1) + val rowIndices: MArrayBuilder[Int] = new MArrayBuilder.ofInt + var nnz = 0 + var j = 0 + while (j < numCols) { +var i = 0 +while (i < numRows) { + val v = values(index(i, j)) + if (v != 0.0) { +rowIndices += i +spVals += v +nnz += 1 + } + i += 1 } -i += 1 +j += 1 +colPtrs(j) = nnz } - j += 1 - colPtrs(j) = nnz + new SparseMatrix(numRows, numCols, colPtrs, rowIndices.result(), spVals.result()) +} + } + + /** + * Generate a `DenseMatrix` from this `DenseMatrix`. + * + * @param colMajor Whether the resulting `DenseMatrix` values will be in column major order. 
+ */ + private[ml] override def toDenseMatrix(colMajor: Boolean): DenseMatrix = { +if (isTransposed && colMajor) { + new DenseMatrix(numRows, numCols, toArray, isTransposed = false) +} else if (!isTransposed && !colMajor) { + new DenseMatrix(numRows, numCols, transpose.toArray, isTransposed = true) +} else { + this } -new SparseMatrix(numRows, numCols, colPtrs, rowIndices.result(), spVals.result()) } --- End diff -- Could we override the `toArray` in DenseMatrix so when `this` is column major, we just return `this.values`? Otherwise, it's very expensive to create a new array. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
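The CSC construction in the diff above can be sketched independently of the `Matrix` hierarchy: scan a column-major value array and collect nonzeros into `colPtrs`/`rowIndices`/`values`. This is an illustrative standalone version of the loop, not the PR's code:

```scala
import scala.collection.mutable.ArrayBuilder

// `values` holds a numRows x numCols matrix in column-major order.
// Returns (colPtrs, rowIndices, sparseValues) in CSC form.
def denseToCsc(numRows: Int, numCols: Int, values: Array[Double])
    : (Array[Int], Array[Int], Array[Double]) = {
  val spVals = new ArrayBuilder.ofDouble
  val rowIndices = new ArrayBuilder.ofInt
  val colPtrs = new Array[Int](numCols + 1)
  var nnz = 0
  var j = 0
  while (j < numCols) {
    var i = 0
    while (i < numRows) {
      val v = values(j * numRows + i) // column-major index
      if (v != 0.0) {
        rowIndices += i
        spVals += v
        nnz += 1
      }
      i += 1
    }
    j += 1
    colPtrs(j) = nnz // running count of nonzeros after column j
  }
  (colPtrs, rowIndices.result(), spVals.result())
}
```

Explicit zeros are dropped along the way, which is why the PR can reuse the same path to compact a sparse matrix that carries stored zeros.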
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107791878 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -291,31 +404,49 @@ class DenseMatrix @Since("2.0.0") ( override def numActives: Int = values.length /** - * Generate a `SparseMatrix` from the given `DenseMatrix`. The new matrix will have isTransposed - * set to false. + * Generate a `SparseMatrix` from the given `DenseMatrix`. + * + * @param colMajor Whether the resulting `SparseMatrix` values will be in column major order. */ - @Since("2.0.0") - def toSparse: SparseMatrix = { -val spVals: MArrayBuilder[Double] = new MArrayBuilder.ofDouble -val colPtrs: Array[Int] = new Array[Int](numCols + 1) -val rowIndices: MArrayBuilder[Int] = new MArrayBuilder.ofInt -var nnz = 0 -var j = 0 -while (j < numCols) { - var i = 0 - while (i < numRows) { -val v = values(index(i, j)) -if (v != 0.0) { - rowIndices += i - spVals += v - nnz += 1 + private[ml] override def toSparseMatrix(colMajor: Boolean): SparseMatrix = { +if (!colMajor) this.transpose.toSparseMatrix(colMajor = true).transpose +else { + val spVals: MArrayBuilder[Double] = new MArrayBuilder.ofDouble + val colPtrs: Array[Int] = new Array[Int](numCols + 1) + val rowIndices: MArrayBuilder[Int] = new MArrayBuilder.ofInt + var nnz = 0 + var j = 0 + while (j < numCols) { +var i = 0 +while (i < numRows) { + val v = values(index(i, j)) + if (v != 0.0) { +rowIndices += i +spVals += v +nnz += 1 + } + i += 1 } -i += 1 +j += 1 +colPtrs(j) = nnz } - j += 1 - colPtrs(j) = nnz + new SparseMatrix(numRows, numCols, colPtrs, rowIndices.result(), spVals.result()) +} + } + + /** + * Generate a `DenseMatrix` from this `DenseMatrix`. + * + * @param colMajor Whether the resulting `DenseMatrix` values will be in column major order. 
+ */ --- End diff -- Minor, can we have ```scala protected def isColMajor = !isTransposed protected def isRowMajor = isTransposed ``` So the code can be understood easier? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107781333 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -153,6 +154,97 @@ sealed trait Matrix extends Serializable { */ @Since("2.0.0") def numActives: Int + + /** + * Converts this matrix to a sparse matrix. + * + * @param colMajor Whether the values of the resulting sparse matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toCSCMatrix: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a sparse matrix in row major order. + */ + @Since("2.2.0") + def toCSRMatrix: SparseMatrix = toSparseMatrix(colMajor = false) + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toSparse: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a dense matrix. + * + * @param colMajor Whether the values of the resulting dense matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toDenseMatrix(colMajor: Boolean): DenseMatrix + + /** + * Converts this matrix to a dense matrix in column major order. + */ + @Since("2.2.0") + def toDense: DenseMatrix = toDenseMatrix(colMajor = true) + --- End diff -- ditto. should we consider to maintain the same layout? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107779234 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -161,6 +162,110 @@ sealed trait Matrix extends Serializable { */ @Since("2.0.0") def numActives: Int + + /** + * Converts this matrix to a sparse matrix. + * + * @param colMajor Whether the values of the resulting sparse matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toCSC: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a sparse matrix in row major order. + */ + @Since("2.2.0") + def toCSR: SparseMatrix = toSparseMatrix(colMajor = false) + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toSparse: SparseMatrix = toSparseMatrix(colMajor = true) --- End diff -- But I thought this is a new api being added, so we can make it to maintain the same layout. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107557490 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -291,31 +395,60 @@ class DenseMatrix @Since("2.0.0") ( override def numActives: Int = values.length /** - * Generate a `SparseMatrix` from the given `DenseMatrix`. The new matrix will have isTransposed - * set to false. + * Generate a `SparseMatrix` from the given `DenseMatrix`. + * + * @param colMajor Whether the resulting `SparseMatrix` values will be in column major order. */ - @Since("2.0.0") - def toSparse: SparseMatrix = { -val spVals: MArrayBuilder[Double] = new MArrayBuilder.ofDouble -val colPtrs: Array[Int] = new Array[Int](numCols + 1) -val rowIndices: MArrayBuilder[Int] = new MArrayBuilder.ofInt -var nnz = 0 -var j = 0 -while (j < numCols) { - var i = 0 - while (i < numRows) { -val v = values(index(i, j)) -if (v != 0.0) { - rowIndices += i - spVals += v - nnz += 1 + private[ml] override def toSparseMatrix(colMajor: Boolean): SparseMatrix = { +if (!colMajor) this.transpose.toSparseMatrix(colMajor = true).transpose +else { + val spVals: MArrayBuilder[Double] = new MArrayBuilder.ofDouble + val colPtrs: Array[Int] = new Array[Int](numCols + 1) + val rowIndices: MArrayBuilder[Int] = new MArrayBuilder.ofInt + var nnz = 0 + var j = 0 + while (j < numCols) { +var i = 0 +while (i < numRows) { + val v = values(index(i, j)) + if (v != 0.0) { +rowIndices += i +spVals += v +nnz += 1 + } + i += 1 } -i += 1 +j += 1 +colPtrs(j) = nnz } - j += 1 - colPtrs(j) = nnz + new SparseMatrix(numRows, numCols, colPtrs, rowIndices.result(), spVals.result()) +} + } + + /** + * Generate a `DenseMatrix` from this `DenseMatrix`. + * + * @param colMajor Whether the resulting `DenseMatrix` values will be in column major order. 
+ */ + private[ml] override def toDenseMatrix(colMajor: Boolean): DenseMatrix = { +if (!(isTransposed ^ colMajor)) { + val newValues = new Array[Double](numCols * numRows) --- End diff -- This looks great to me! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107556503 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -587,18 +722,69 @@ class SparseMatrix @Since("2.0.0") ( } } + override def numNonzeros: Int = values.count(_ != 0) + + override def numActives: Int = values.length + /** - * Generate a `DenseMatrix` from the given `SparseMatrix`. The new matrix will have isTransposed - * set to false. + * Generate a `SparseMatrix` from this `SparseMatrix`, removing explicit zero values if they + * exist. + * + * @param colMajor Whether or not the resulting `SparseMatrix` values are in column major + *order. */ - @Since("2.0.0") - def toDense: DenseMatrix = { -new DenseMatrix(numRows, numCols, toArray) + private[ml] override def toSparseMatrix(colMajor: Boolean): SparseMatrix = { +if (!(colMajor ^ isTransposed)) { + // breeze transpose rearranges values in column major and removes explicit zeros --- End diff -- This is not a blocker. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107519619 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -161,6 +162,109 @@ sealed trait Matrix extends Serializable { */ @Since("2.0.0") def numActives: Int + + /** + * Converts this matrix to a sparse matrix. + * + * @param colMajor Whether the values of the resulting sparse matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toCSC: SparseMatrix = toSparseMatrix(colMajor = true) --- End diff -- After thinking about it again, let's have it as `toSparseColumnMajor` to make the apis consistent with the dense ones if you don't mind? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107518109 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -153,6 +153,86 @@ sealed trait Matrix extends Serializable { */ --- End diff -- BTW, it's nice to have an explicit return type on public methods. Can you add `Unit` as the return type?
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107519221 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -291,31 +395,60 @@ class DenseMatrix @Since("2.0.0") ( override def numActives: Int = values.length /** - * Generate a `SparseMatrix` from the given `DenseMatrix`. The new matrix will have isTransposed - * set to false. + * Generate a `SparseMatrix` from the given `DenseMatrix`. + * + * @param colMajor Whether the resulting `SparseMatrix` values will be in column major order. */ - @Since("2.0.0") - def toSparse: SparseMatrix = { -val spVals: MArrayBuilder[Double] = new MArrayBuilder.ofDouble -val colPtrs: Array[Int] = new Array[Int](numCols + 1) -val rowIndices: MArrayBuilder[Int] = new MArrayBuilder.ofInt -var nnz = 0 -var j = 0 -while (j < numCols) { - var i = 0 - while (i < numRows) { -val v = values(index(i, j)) -if (v != 0.0) { - rowIndices += i - spVals += v - nnz += 1 + private[ml] override def toSparseMatrix(colMajor: Boolean): SparseMatrix = { +if (!colMajor) this.transpose.toSparseMatrix(colMajor = true).transpose +else { + val spVals: MArrayBuilder[Double] = new MArrayBuilder.ofDouble + val colPtrs: Array[Int] = new Array[Int](numCols + 1) + val rowIndices: MArrayBuilder[Int] = new MArrayBuilder.ofInt + var nnz = 0 + var j = 0 + while (j < numCols) { +var i = 0 +while (i < numRows) { + val v = values(index(i, j)) + if (v != 0.0) { +rowIndices += i +spVals += v +nnz += 1 + } + i += 1 } -i += 1 +j += 1 +colPtrs(j) = nnz } - j += 1 - colPtrs(j) = nnz + new SparseMatrix(numRows, numCols, colPtrs, rowIndices.result(), spVals.result()) +} + } + + /** + * Generate a `DenseMatrix` from this `DenseMatrix`. + * + * @param colMajor Whether the resulting `DenseMatrix` values will be in column major order. 
+ */ + private[ml] override def toDenseMatrix(colMajor: Boolean): DenseMatrix = { +if (!(isTransposed ^ colMajor)) { + val newValues = new Array[Double](numCols * numRows) --- End diff -- The following could work, and we only need one implementation in trait. Thanks. ```scala trait Matrix { var isTransposed: Boolean = true var numCols: Int = 0 var numRows: Int = 0 def foreachActive(f: (Int, Int, Double) => Unit): Unit def toDenseMatrix(colMajor: Boolean): Matrix = { this match { case _: DenseMatrix if this.isTransposed != colMajor => this case _: SparseMatrix | _: DenseMatrix if this.isTransposed == colMajor => val newValues = new Array[Double](numCols * numRows) this.foreachActive { case (row, col, value) => // filling the newValues } new DenseMatrix(numRows, numCols, newValues, isTransposed = !colMajor) case _ => throw new IllegalArgumentException("") } } } class DenseMatrix extends Matrix { def foreachActive(f: (Int, Int, Double) => Unit): Unit = { } } class SparseMatrix extends Matrix { def foreachActive(f: (Int, Int, Double) => Unit): Unit = { } } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
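The trait sketch in the comment above leaves the fill step as a placeholder ("// filling the newValues"). A runnable version of the same idea, using a hypothetical minimal type rather than Spark's actual `Matrix` trait, shows how `foreachActive` alone is enough to rebuild the values in the other layout:

```scala
// A minimal column-major dense matrix with a foreachActive visitor,
// mirroring the shape of the trait sketched in the review comment.
final case class MiniDense(numRows: Int, numCols: Int, values: Array[Double]) {

  // Visit every entry as (row, col, value), reading column-major storage.
  def foreachActive(f: (Int, Int, Double) => Unit): Unit = {
    var j = 0
    while (j < numCols) {
      var i = 0
      while (i < numRows) {
        f(i, j, values(j * numRows + i))
        i += 1
      }
      j += 1
    }
  }

  // Rebuild the values in row-major order using only foreachActive,
  // i.e. the "filling the newValues" step left abstract above.
  def toRowMajorArray: Array[Double] = {
    val out = new Array[Double](numRows * numCols)
    foreachActive { (i, j, v) => out(i * numCols + j) = v }
    out
  }
}
```

Because the visitor hides the storage layout, the same fill loop would work unchanged for a sparse source, which is the point of hoisting the conversion into the trait.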
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107313343 --- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala --- @@ -160,22 +160,385 @@ class MatricesSuite extends SparkMLFunSuite { assert(sparseMat.values(2) === 10.0) } - test("toSparse, toDense") { -val m = 3 -val n = 2 -val values = Array(1.0, 2.0, 4.0, 5.0) -val allValues = Array(1.0, 2.0, 0.0, 0.0, 4.0, 5.0) -val colPtrs = Array(0, 2, 4) -val rowIndices = Array(0, 1, 1, 2) + test("dense to dense") { +/* + dm1 = 4.0 2.0 -8.0 +-1.0 7.0 4.0 + + dm2 = 5.0 -9.0 4.0 +1.0 -3.0 -8.0 + */ +val dm1 = new DenseMatrix(2, 3, Array(4.0, -1.0, 2.0, 7.0, -8.0, 4.0)) +val dm2 = new DenseMatrix(2, 3, Array(5.0, -9.0, 4.0, 1.0, -3.0, -8.0), isTransposed = true) --- End diff -- Why not just make `dm2` dm1.transposed, but explicitly assign the value? Thus, you don't need to type the value in the array for the comparison. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107312629 --- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/MatricesSuite.scala --- @@ -160,22 +160,385 @@ class MatricesSuite extends SparkMLFunSuite { assert(sparseMat.values(2) === 10.0) } - test("toSparse, toDense") { -val m = 3 -val n = 2 -val values = Array(1.0, 2.0, 4.0, 5.0) -val allValues = Array(1.0, 2.0, 0.0, 0.0, 4.0, 5.0) -val colPtrs = Array(0, 2, 4) -val rowIndices = Array(0, 1, 1, 2) + test("dense to dense") { +/* + dm1 = 4.0 2.0 -8.0 +-1.0 7.0 4.0 + + dm2 = 5.0 -9.0 4.0 +1.0 -3.0 -8.0 + */ +val dm1 = new DenseMatrix(2, 3, Array(4.0, -1.0, 2.0, 7.0, -8.0, 4.0)) +val dm2 = new DenseMatrix(2, 3, Array(5.0, -9.0, 4.0, 1.0, -3.0, -8.0), isTransposed = true) + +val dm3 = dm1.toDense +assert(dm3 === dm1) +assert(!dm3.isTransposed) +assert(dm3.values.equals(dm1.values)) + +val dm4 = dm1.toDenseMatrix(false) --- End diff -- I would like to make `toDenseMatrix` as private, and we test against `toDenseRowMajor` which is more explicit. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107312905 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -161,6 +162,110 @@ sealed trait Matrix extends Serializable { */ @Since("2.0.0") def numActives: Int + + /** + * Converts this matrix to a sparse matrix. + * + * @param colMajor Whether the values of the resulting sparse matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toCSC: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a sparse matrix in row major order. + */ + @Since("2.2.0") + def toCSR: SparseMatrix = toSparseMatrix(colMajor = false) + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toSparse: SparseMatrix = toSparseMatrix(colMajor = true) --- End diff -- I'm debating that should we keep the same ordering of layout when we call `toSparse` or `toDense`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107306663 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -161,6 +162,109 @@ sealed trait Matrix extends Serializable { */ @Since("2.0.0") def numActives: Int + + /** + * Converts this matrix to a sparse matrix. + * + * @param colMajor Whether the values of the resulting sparse matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toCSC: SparseMatrix = toSparseMatrix(colMajor = true) --- End diff -- I'm not good at naming, but since we use `toDenseRowMajor` for dense vector, should we use `toSparseColumnMajor`? Almost many packages are using `toCSC`, but I think we can make them consistent. Just my 2 cents. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107312194 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -587,18 +722,69 @@ class SparseMatrix @Since("2.0.0") ( } } + override def numNonzeros: Int = values.count(_ != 0) + + override def numActives: Int = values.length + /** - * Generate a `DenseMatrix` from the given `SparseMatrix`. The new matrix will have isTransposed - * set to false. + * Generate a `SparseMatrix` from this `SparseMatrix`, removing explicit zero values if they + * exist. + * + * @param colMajor Whether or not the resulting `SparseMatrix` values are in column major + *order. */ - @Since("2.0.0") - def toDense: DenseMatrix = { -new DenseMatrix(numRows, numCols, toArray) + private[ml] override def toSparseMatrix(colMajor: Boolean): SparseMatrix = { +if (!(colMajor ^ isTransposed)) { + // breeze transpose rearranges values in column major and removes explicit zeros + if (!isTransposed) { +// it is row major and we want col major +val breezeTransposed = asBreeze.asInstanceOf[BSM[Double]].t + Matrices.fromBreeze(breezeTransposed).transpose.asInstanceOf[SparseMatrix] + } else { +// it is col major and we want row major +val breezeTransposed = asBreeze.asInstanceOf[BSM[Double]] +Matrices.fromBreeze(breezeTransposed).asInstanceOf[SparseMatrix] + } +} else { --- End diff -- Can we document here that it's when the layout of this and colMajor is different? Easier read than `(colMajor ^ isTranspose)` condition here. Even more readable to use pattern matching with exact boolean on both variables. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
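To illustrate the pattern-matching suggestion in the comment above, a minimal self-contained sketch (the helper name `layoutsDiffer` is hypothetical, not part of Spark):

```scala
// Hypothetical helper, not the Spark implementation: expresses the
// !(colMajor ^ isTransposed) test as an explicit pattern match.
// `colMajor` is the requested layout; `isTransposed` means the matrix
// is currently stored in row major order.
def layoutsDiffer(colMajor: Boolean, isTransposed: Boolean): Boolean =
  (colMajor, isTransposed) match {
    case (true, true)   => true  // want col major, stored row major
    case (false, false) => true  // want row major, stored col major
    case _              => false // requested layout matches current layout
  }
```

This is behaviorally identical to `!(colMajor ^ isTransposed)`, but the reader no longer has to decode the xor.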
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107307120

--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -161,6 +162,109 @@ sealed trait Matrix extends Serializable {
    */
   @Since("2.0.0")
   def numActives: Int
+
+  /**
+   * Converts this matrix to a sparse matrix.
+   *
+   * @param colMajor Whether the values of the resulting sparse matrix should be in column major
+   *                 or row major order. If `false`, resulting matrix will be row major.
+   */
+  private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix
+
+  /**
+   * Converts this matrix to a sparse matrix in column major order.
+   */
+  @Since("2.2.0")
+  def toCSC: SparseMatrix = toSparseMatrix(colMajor = true)
+
+  /**
+   * Converts this matrix to a sparse matrix in row major order.
+   */
+  @Since("2.2.0")
+  def toCSR: SparseMatrix = toSparseMatrix(colMajor = false)
+
+  /**
+   * Converts this matrix to a sparse matrix in column major order.
+   */
+  @Since("2.2.0")
+  def toSparse: SparseMatrix = toSparseMatrix(colMajor = true)
+
+  /**
+   * Converts this matrix to a dense matrix.
+   *
+   * @param colMajor Whether the values of the resulting dense matrix should be in column major
+   *                 or row major order. If `false`, resulting matrix will be row major.
+   */
+  private[ml] def toDenseMatrix(colMajor: Boolean): DenseMatrix
+
+  /**
+   * Converts this matrix to a dense matrix in column major order.
+   */
+  @Since("2.2.0")
+  def toDense: DenseMatrix = toDenseMatrix(colMajor = true)
+
+  /**
+   * Converts this matrix to a dense matrix in row major order.
+   */
+  @Since("2.2.0")
+  def toDenseRowMajor: DenseMatrix = toDenseMatrix(colMajor = false)
+
+  /**
+   * Converts this matrix to a dense matrix in column major order.
+   */
+  @Since("2.2.0")
+  def toDenseColMajor: DenseMatrix = toDenseMatrix(colMajor = true)
+
+  /**
+   * Returns a matrix in either dense or sparse format, whichever uses less storage.
+   *
+   * @param colMajor Whether the values of the resulting matrix should be in column major
+   *                 or row major order. If `false`, resulting matrix will be row major.
+   */
+  @Since("2.2.0")
+  def compressed(colMajor: Boolean): Matrix = {
--- End diff --

Let's make it private and follow the previous style. We should also add `compressedRowMajor` and `compressedColMajor`, since the result can be a dense matrix in certain situations.
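The API shape proposed in the comment above (a private size-based helper with public named variants) might look like the following sketch. The `Shape` class and the byte-size formulas are illustrative stand-ins, not Spark's actual accounting:

```scala
// Hedged sketch: a private `compressed(colMajor)` helper behind public
// `compressedColMajor` / `compressedRowMajor` entry points, as suggested.
// Sizes use common estimates (8 bytes per double, 4 per int); not Spark's exact math.
sealed trait Layout
case object Dense extends Layout
case object Sparse extends Layout

final case class Shape(numRows: Int, numCols: Int, numNonzeros: Int) {
  def denseSizeInBytes: Long = 8L * numRows * numCols

  // CSC/CSR estimate: values + row/col indices + pointer array
  def sparseSizeInBytes(colMajor: Boolean): Long = {
    val ptrLen = if (colMajor) numCols + 1 else numRows + 1
    12L * numNonzeros + 4L * ptrLen
  }

  private def compressed(colMajor: Boolean): Layout =
    if (denseSizeInBytes < sparseSizeInBytes(colMajor)) Dense else Sparse

  def compressedColMajor: Layout = compressed(colMajor = true)
  def compressedRowMajor: Layout = compressed(colMajor = false)
}
```

With the helper private, callers can only reach it through the explicitly named variants, which avoids the parameterless/parameterized overload ambiguity discussed later in this thread.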
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107309786

--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -291,31 +395,60 @@ class DenseMatrix @Since("2.0.0") (
   override def numActives: Int = values.length
 
   /**
-   * Generate a `SparseMatrix` from the given `DenseMatrix`. The new matrix will have isTransposed
-   * set to false.
+   * Generate a `SparseMatrix` from the given `DenseMatrix`.
+   *
+   * @param colMajor Whether the resulting `SparseMatrix` values will be in column major order.
    */
-  @Since("2.0.0")
-  def toSparse: SparseMatrix = {
-    val spVals: MArrayBuilder[Double] = new MArrayBuilder.ofDouble
-    val colPtrs: Array[Int] = new Array[Int](numCols + 1)
-    val rowIndices: MArrayBuilder[Int] = new MArrayBuilder.ofInt
-    var nnz = 0
-    var j = 0
-    while (j < numCols) {
-      var i = 0
-      while (i < numRows) {
-        val v = values(index(i, j))
-        if (v != 0.0) {
-          rowIndices += i
-          spVals += v
-          nnz += 1
+  private[ml] override def toSparseMatrix(colMajor: Boolean): SparseMatrix = {
+    if (!colMajor) this.transpose.toSparseMatrix(colMajor = true).transpose
+    else {
+      val spVals: MArrayBuilder[Double] = new MArrayBuilder.ofDouble
+      val colPtrs: Array[Int] = new Array[Int](numCols + 1)
+      val rowIndices: MArrayBuilder[Int] = new MArrayBuilder.ofInt
+      var nnz = 0
+      var j = 0
+      while (j < numCols) {
+        var i = 0
+        while (i < numRows) {
+          val v = values(index(i, j))
+          if (v != 0.0) {
+            rowIndices += i
+            spVals += v
+            nnz += 1
+          }
+          i += 1
         }
-        i += 1
+        j += 1
+        colPtrs(j) = nnz
       }
-      j += 1
-      colPtrs(j) = nnz
+      new SparseMatrix(numRows, numCols, colPtrs, rowIndices.result(), spVals.result())
+    }
+  }
+
+  /**
+   * Generate a `DenseMatrix` from this `DenseMatrix`.
+   *
+   * @param colMajor Whether the resulting `DenseMatrix` values will be in column major order.
+   */
+  private[ml] override def toDenseMatrix(colMajor: Boolean): DenseMatrix = {
+    if (!(isTransposed ^ colMajor)) {
+      val newValues = new Array[Double](numCols * numRows)
--- End diff --

Would it be simpler to use `foreachActive`? With it, `toDenseMatrix` could share a single implementation for both sparse and dense matrices in the trait.
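The `foreachActive` idea from the comment above can be sketched with simplified stand-in types (these are not the Spark classes; the point is only that one dense-conversion body serves any storage layout):

```scala
// Sketch: with foreachActive(f(i, j, value)) available in the trait,
// a single shared method can build the dense column-major value array
// for both sparse and dense matrices.
trait Mat {
  def numRows: Int
  def numCols: Int
  def foreachActive(f: (Int, Int, Double) => Unit): Unit

  // Shared implementation: independent of how entries are stored.
  def toDenseColMajorValues: Array[Double] = {
    val out = new Array[Double](numRows * numCols)
    foreachActive { (i, j, v) => out(j * numRows + i) = v }
    out
  }
}

// A toy COO-style sparse matrix implementing only foreachActive.
final class SparseLike(val numRows: Int, val numCols: Int,
                       entries: Seq[(Int, Int, Double)]) extends Mat {
  def foreachActive(f: (Int, Int, Double) => Unit): Unit =
    entries.foreach { case (i, j, v) => f(i, j, v) }
}
```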
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107311742

--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -587,18 +722,69 @@ class SparseMatrix @Since("2.0.0") (
     }
   }
 
+  override def numNonzeros: Int = values.count(_ != 0)
+
+  override def numActives: Int = values.length
+
   /**
-   * Generate a `DenseMatrix` from the given `SparseMatrix`. The new matrix will have isTransposed
-   * set to false.
+   * Generate a `SparseMatrix` from this `SparseMatrix`, removing explicit zero values if they
+   * exist.
+   *
+   * @param colMajor Whether or not the resulting `SparseMatrix` values are in column major
+   *                 order.
    */
-  @Since("2.0.0")
-  def toDense: DenseMatrix = {
-    new DenseMatrix(numRows, numCols, toArray)
+  private[ml] override def toSparseMatrix(colMajor: Boolean): SparseMatrix = {
+    if (!(colMajor ^ isTransposed)) {
+      // breeze transpose rearranges values in column major and removes explicit zeros
--- End diff --

I think it's hacky to rely on Breeze's transpose behavior to remove zeros in sparse matrices. Can we have our own implementation, given that we may eventually remove Breeze?
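A Breeze-free zero-removal pass, as the comment above asks for, could be a single walk over the CSC arrays. This sketch uses plain arrays as stand-ins for the `SparseMatrix` fields; the function name is hypothetical:

```scala
// Sketch: rebuild (colPtrs, rowIndices, values) while skipping explicit zeros.
// Input arrays follow the usual CSC convention: column j's entries live in
// positions colPtrs(j) until colPtrs(j + 1).
def dropExplicitZeros(numCols: Int, colPtrs: Array[Int],
                      rowIndices: Array[Int], values: Array[Double])
    : (Array[Int], Array[Int], Array[Double]) = {
  val newColPtrs = new Array[Int](numCols + 1)
  val idxB = Array.newBuilder[Int]
  val valB = Array.newBuilder[Double]
  var nnz = 0
  var j = 0
  while (j < numCols) {
    var k = colPtrs(j)
    while (k < colPtrs(j + 1)) {
      if (values(k) != 0.0) {
        idxB += rowIndices(k)
        valB += values(k)
        nnz += 1
      }
      k += 1
    }
    newColPtrs(j + 1) = nnz
    j += 1
  }
  (newColPtrs, idxB.result(), valB.result())
}
```

This is O(nnz) with no Breeze round trip, and the same loop shape works for CSR by swapping the roles of rows and columns.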
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107306774 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -161,6 +162,109 @@ sealed trait Matrix extends Serializable { */ @Since("2.0.0") def numActives: Int + + /** + * Converts this matrix to a sparse matrix. + * + * @param colMajor Whether the values of the resulting sparse matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toCSC: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a sparse matrix in row major order. + */ + @Since("2.2.0") + def toCSR: SparseMatrix = toSparseMatrix(colMajor = false) --- End diff -- Same question. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107303875 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -153,6 +154,97 @@ sealed trait Matrix extends Serializable { */ @Since("2.0.0") def numActives: Int + + /** + * Converts this matrix to a sparse matrix. + * + * @param colMajor Whether the values of the resulting sparse matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toCSCMatrix: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a sparse matrix in row major order. + */ + @Since("2.2.0") + def toCSRMatrix: SparseMatrix = toSparseMatrix(colMajor = false) + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toSparse: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a dense matrix. + * + * @param colMajor Whether the values of the resulting dense matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toDenseMatrix(colMajor: Boolean): DenseMatrix + + /** + * Converts this matrix to a dense matrix in column major order. + */ + @Since("2.2.0") + def toDense: DenseMatrix = toDenseMatrix(colMajor = true) + + /** + * Returns a matrix in either dense or sparse format, whichever uses less storage. + * + * @param colMajor Whether the values of the resulting matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. 
+   */
+  @Since("2.2.0")
+  def compressed(colMajor: Boolean): Matrix = {
+    if (getDenseSizeInBytes < getSparseSizeInBytes(colMajor)) {
+      toDenseMatrix(colMajor)
+    } else {
+      toSparseMatrix(colMajor)
+    }
+  }
+
+  /**
+   * Returns a matrix in dense column major, dense row major, sparse row major, or sparse column
+   * major format, whichever uses less storage. When dense representation is optimal, it maintains
+   * the current layout order.
+   */
+  @Since("2.2.0")
+  def compressed: Matrix = {
--- End diff --

+1 on the latter one.
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r107303720 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -153,6 +154,97 @@ sealed trait Matrix extends Serializable { */ @Since("2.0.0") def numActives: Int + + /** + * Converts this matrix to a sparse matrix. + * + * @param colMajor Whether the values of the resulting sparse matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix --- End diff -- Fair enough. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r105272926 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -153,6 +153,86 @@ sealed trait Matrix extends Serializable { */ --- End diff -- Should be fine. Small enough change :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r105274636 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -153,6 +154,97 @@ sealed trait Matrix extends Serializable { */ @Since("2.0.0") def numActives: Int + + /** + * Converts this matrix to a sparse matrix. + * + * @param colMajor Whether the values of the resulting sparse matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toCSCMatrix: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a sparse matrix in row major order. + */ + @Since("2.2.0") + def toCSRMatrix: SparseMatrix = toSparseMatrix(colMajor = false) + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toSparse: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a dense matrix. + * + * @param colMajor Whether the values of the resulting dense matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toDenseMatrix(colMajor: Boolean): DenseMatrix --- End diff -- we may `inline` this as well. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r105272811 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -153,6 +154,97 @@ sealed trait Matrix extends Serializable { */ @Since("2.0.0") def numActives: Int + + /** + * Converts this matrix to a sparse matrix. + * + * @param colMajor Whether the values of the resulting sparse matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toCSCMatrix: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a sparse matrix in row major order. + */ + @Since("2.2.0") + def toCSRMatrix: SparseMatrix = toSparseMatrix(colMajor = false) + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toSparse: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a dense matrix. + * + * @param colMajor Whether the values of the resulting dense matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toDenseMatrix(colMajor: Boolean): DenseMatrix + + /** + * Converts this matrix to a dense matrix in column major order. + */ + @Since("2.2.0") + def toDense: DenseMatrix = toDenseMatrix(colMajor = true) + + /** + * Returns a matrix in either dense or sparse format, whichever uses less storage. + * + * @param colMajor Whether the values of the resulting matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. 
+   */
+  @Since("2.2.0")
+  def compressed(colMajor: Boolean): Matrix = {
+    if (getDenseSizeInBytes < getSparseSizeInBytes(colMajor)) {
+      toDenseMatrix(colMajor)
+    } else {
+      toSparseMatrix(colMajor)
+    }
+  }
+
+  /**
+   * Returns a matrix in dense column major, dense row major, sparse row major, or sparse column
+   * major format, whichever uses less storage. When dense representation is optimal, it maintains
+   * the current layout order.
+   */
+  @Since("2.2.0")
+  def compressed: Matrix = {
--- End diff --

Won't having both `compressed(colMajor: Boolean)` and `compressed` cause an overload-ambiguity issue?
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r105273751 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -153,6 +154,97 @@ sealed trait Matrix extends Serializable { */ @Since("2.0.0") def numActives: Int + + /** + * Converts this matrix to a sparse matrix. + * + * @param colMajor Whether the values of the resulting sparse matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toCSCMatrix: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a sparse matrix in row major order. + */ + @Since("2.2.0") + def toCSRMatrix: SparseMatrix = toSparseMatrix(colMajor = false) --- End diff -- `toCSR`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r105273579 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -153,6 +154,97 @@ sealed trait Matrix extends Serializable { */ @Since("2.0.0") def numActives: Int + + /** + * Converts this matrix to a sparse matrix. + * + * @param colMajor Whether the values of the resulting sparse matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toCSCMatrix: SparseMatrix = toSparseMatrix(colMajor = true) --- End diff -- How about we follow [scipy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html), and call it as `toCSC`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r105273397 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -153,6 +154,97 @@ sealed trait Matrix extends Serializable { */ @Since("2.0.0") def numActives: Int + + /** + * Converts this matrix to a sparse matrix. + * + * @param colMajor Whether the values of the resulting sparse matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix --- End diff -- Maybe we can `inline` this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r105274573 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -153,6 +154,97 @@ sealed trait Matrix extends Serializable { */ @Since("2.0.0") def numActives: Int + + /** + * Converts this matrix to a sparse matrix. + * + * @param colMajor Whether the values of the resulting sparse matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toSparseMatrix(colMajor: Boolean): SparseMatrix + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toCSCMatrix: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a sparse matrix in row major order. + */ + @Since("2.2.0") + def toCSRMatrix: SparseMatrix = toSparseMatrix(colMajor = false) + + /** + * Converts this matrix to a sparse matrix in column major order. + */ + @Since("2.2.0") + def toSparse: SparseMatrix = toSparseMatrix(colMajor = true) + + /** + * Converts this matrix to a dense matrix. + * + * @param colMajor Whether the values of the resulting dense matrix should be in column major + *or row major order. If `false`, resulting matrix will be row major. + */ + private[ml] def toDenseMatrix(colMajor: Boolean): DenseMatrix + + /** + * Converts this matrix to a dense matrix in column major order. + */ + @Since("2.2.0") + def toDense: DenseMatrix = toDenseMatrix(colMajor = true) + --- End diff -- Could we add `toColumnMajorDense` and `toRowMajorDense`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r105032243

--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -153,6 +153,86 @@ sealed trait Matrix extends Serializable {
    */
--- End diff --

Can you make `foreachActive(f: (Int, Int, Double) => Unit)` public? It is public for `Vector`. I believe it will be very useful, and I think it's stable enough to make public.
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r105068149

--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -153,6 +153,86 @@ sealed trait Matrix extends Serializable {
    */
   @Since("2.0.0")
   def numActives: Int
+
+  /**
+   * Converts this matrix to a sparse matrix.
+   *
+   * @param columnMajor Whether the values of the resulting sparse matrix should be in column major
+   *                    or row major order. If `false`, resulting matrix will be row major.
+   */
+  private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix
--- End diff --

Yeah, this is very hacky in my opinion too! The problem is that when one overloads a parameterless method, ambiguity arises: invoking the method without parentheses can mean either calling the parameterless overload or treating the parameterized overload as a function value. The following example demonstrates the issue. I would prefer to call it `toSparse(columnMajor: Boolean)` with `toSparse() = toSparse(true)`, but the vector API already uses the parameterless form, so that would make the API design inconsistent. I think exposing the ability to convert to `columnMajor` or `rowMajor` is very useful; as a result, we can expose it as `toCSRMatrix`, `toCSCMatrix`, and `toSparse`, which converts the matrix to whichever representation has the smallest storage.

```scala
scala> trait A {
     |   def foo(b: Boolean): String
     |   def foo: String = foo(true)
     | }
defined trait A

scala> class B extends A {
     |   def foo(b: Boolean): String = b.toString
     | }
defined class B

scala> val b = new B
b: B = B@67b6d4ae

scala> b.foo
:18: error: ambiguous reference to overloaded definition,
both method foo in class B of type (b: Boolean)String
and method foo in trait A of type => String
match expected type ?
       b.foo
         ^

scala> val x: String = b.foo
x: String = true

scala> val y: Boolean => String = b.foo
y: Boolean => String =
```
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r87668326

--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -153,6 +153,86 @@ sealed trait Matrix extends Serializable {
    */
   @Since("2.0.0")
   def numActives: Int
+
+  /**
+   * Converts this matrix to a sparse matrix.
+   *
+   * @param columnMajor Whether the values of the resulting sparse matrix should be in column major
+   *                    or row major order. If `false`, resulting matrix will be row major.
+   */
+  private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix
+
+  /**
+   * Converts this matrix to a sparse matrix in column major order.
+   */
+  @Since("2.1.0")
+  def toSparse: SparseMatrix = toSparseMatrix(columnMajor = true)
+
+  /**
+   * Converts this matrix to a dense matrix.
+   *
+   * @param columnMajor Whether the values of the resulting dense matrix should be in column major
+   *                    or row major order. If `false`, resulting matrix will be row major.
+   */
+  private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix
+
+  /**
+   * Converts this matrix to a dense matrix in column major order.
+   */
+  @Since("2.1.0")
+  def toDense: DenseMatrix = toDenseMatrix(columnMajor = true)
--- End diff --

Nit: since we're already using `numCols`, should we call it `colMajor`? I've seen a couple of packages using `colMajor` as the variable name.
[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r87669463

--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -153,6 +153,86 @@ sealed trait Matrix extends Serializable {
    */
   @Since("2.0.0")
   def numActives: Int
+
+  /**
+   * Converts this matrix to a sparse matrix.
+   *
+   * @param columnMajor Whether the values of the resulting sparse matrix should be in column major
+   *                    or row major order. If `false`, the resulting matrix will be row major.
+   */
+  private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix
+
+  /**
+   * Converts this matrix to a sparse matrix in column major order.
+   */
+  @Since("2.1.0")
+  def toSparse: SparseMatrix = toSparseMatrix(columnMajor = true)
+
+  /**
+   * Converts this matrix to a dense matrix.
+   *
+   * @param columnMajor Whether the values of the resulting dense matrix should be in column major
+   *                    or row major order. If `false`, the resulting matrix will be row major.
+   */
+  private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix
+
+  /**
+   * Converts this matrix to a dense matrix in column major order.
+   */
+  @Since("2.1.0")
+  def toDense: DenseMatrix = toDenseMatrix(columnMajor = true)
+
+  /**
+   * Returns a matrix in either dense or sparse format, whichever uses less storage.
+   *
+   * @param columnMajor Whether the values of the resulting matrix should be in column major
+   *                    or row major order. If `false`, the resulting matrix will be row major.
+   */
+  @Since("2.1.0")
+  def compressed(columnMajor: Boolean): Matrix = {
+    if (getDenseSizeInBytes < getSparseSizeInBytes(columnMajor)) {
+      toDenseMatrix(columnMajor)
+    } else {
+      toSparseMatrix(columnMajor)
+    }
+  }
+
+  /**
+   * Returns a matrix in dense column major, dense row major, sparse row major, or sparse column
+   * major format, whichever uses less storage. When the dense representation is optimal, it
+   * maintains the current layout order.
+   */
+  @Since("2.1.0")
+  def compressed: Matrix = {
+    val cscSize = getSparseSizeInBytes(columnMajor = true)
+    val csrSize = getSparseSizeInBytes(columnMajor = false)
+    val minSparseSize = cscSize.min(csrSize)
+    if (getDenseSizeInBytes < minSparseSize) {
+      // size is the same either way, so maintain current layout

--- End diff --

```scala
if (getDenseSizeInBytes < math.min(cscSize, csrSize)) ...
...
if (cscSize < csrSize)
```

could be easier to read. Also, can you elaborate the comment? Something like:

```scala
// sizes for a dense matrix in row major or column major order are the same, so maintain the current layout
```

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
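To make the storage trade-off concrete, here is a hypothetical standalone sketch of the size comparison discussed above. The byte-size formulas below are assumptions (8-byte double values, 4-byte int indices and pointers), not Spark's actual `getDenseSizeInBytes` / `getSparseSizeInBytes` implementations, and `CompressedChoice` is an illustrative name, not a Spark class.

```scala
object CompressedChoice {
  // Assumed dense size: one 8-byte double per entry.
  def denseSizeInBytes(numRows: Int, numCols: Int): Long =
    8L * numRows * numCols

  // Assumed CSC/CSR size: 8 bytes per stored value, 4 bytes per stored index,
  // plus 4 bytes per pointer along the major dimension (majorDim + 1 pointers).
  def sparseSizeInBytes(numRows: Int, numCols: Int, numActives: Int,
                        columnMajor: Boolean): Long = {
    val majorDim = if (columnMajor) numCols else numRows
    12L * numActives + 4L * (majorDim + 1)
  }

  /** Picks the cheapest of dense, sparse CSC, or sparse CSR by the sizes above. */
  def choose(numRows: Int, numCols: Int, numActives: Int): String = {
    val cscSize = sparseSizeInBytes(numRows, numCols, numActives, columnMajor = true)
    val csrSize = sparseSizeInBytes(numRows, numCols, numActives, columnMajor = false)
    if (denseSizeInBytes(numRows, numCols) < math.min(cscSize, csrSize)) {
      // sizes for a dense matrix in row major or column major order are the same
      "dense"
    } else if (cscSize <= csrSize) {
      "sparse-csc"
    } else {
      "sparse-csr"
    }
  }
}
```

For example, a 10x10 matrix with all 100 entries active is cheaper dense (800 bytes vs. ~1.2 KB sparse), while a 100x100 matrix with 5 active entries is far cheaper sparse.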
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r105025830

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -300,6 +318,10 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)

+  /** @group setParam */
+  @Since("2.2.0")
+  def setInitialModel(value: KMeansModel): this.type = set(initialModel, value)

--- End diff --

Can you elaborate on this? I don't fully understand why we cannot overwrite the setting in the set method. Thanks.
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r105025280

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -335,17 +369,70 @@ class KMeans @Since("1.5.0") (
     model
   }

+  /**
+   * Check validity for interactions between parameters.
+   */
+  private def assertInitialModelValid(): Unit = {
+    if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
+      if (isSet(initialModel)) {
+        val initialModelK = $(initialModel).parentModel.k
+        if (initialModelK != $(k)) {
+          throw new IllegalArgumentException("The initial model's cluster count = " +
+            s"$initialModelK, mismatched with k = $k.")
+        }
+      } else {
+        throw new IllegalArgumentException("Users must set param initialModel if you choose " +
+          "'initialModel' as the initialization algorithm.")
+      }
+    } else {
+      if (isSet(initialModel)) {
+        logWarning(s"Param initialModel will take no effect when initMode is $initMode.")
+      }
+    }
+  }
+
   @Since("1.5.0")
   override def transformSchema(schema: StructType): StructType = {
+    assertInitialModelValid()

--- End diff --

`transformSchema` is called in the `transform` method, and `model.transform` is called when computing the summary. I think we should fail earlier instead of checking at the very end. Also, it is only implicit that this is checked when computing the summary; we should check it explicitly. Since the checking logic is small, I'd put that checking code in the `fit` method.
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r105002771

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -37,9 +37,9 @@ import org.apache.spark.storage.StorageLevel
 import org.apache.spark.util.VersionUtils.majorVersion

 /**
- * Common params for KMeans and KMeansModel
+ * Common params for KMeans and KMeansModel.
  */
-private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFeaturesCol
+private[clustering] trait KMeansModelParams extends Params with HasMaxIter with HasFeaturesCol
   with HasSeed with HasPredictionCol with HasTol {

--- End diff --

Fair enough.
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r105002617

--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -152,6 +158,35 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR
     val kmeans = new KMeans()
     testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData)
   }
+
+  test("training with initial model") {
+    val kmeans = new KMeans().setK(2).setSeed(1)
+    val model1 = kmeans.fit(rData)
+    val model2 = kmeans.setInitMode("initialModel").setInitialModel(model1).fit(rData)
+    model2.clusterCenters.zip(model1.clusterCenters)
+      .foreach { case (center2, center1) => assert(center2 ~== center1 absTol 1E-8) }
+  }
+
+  test("training with initial model, error cases") {
+    val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+    // Sets initMode with 'initialModel', but does not specify initial model.
+    intercept[IllegalArgumentException] {

--- End diff --

@sethah +1 on the behavior you propose. The only thing I would add is that `setK` should throw an `IllegalArgumentException`:

```scala
// initialModel sets k and init mode
assert(km.getInitMode === MLlibKMeans.K_MEANS_INITIAL_MODEL)
assert(km.getK === initialK)
assert(km.getInitialModel.getK === initialK)

// setting k will throw an exception
intercept[IllegalArgumentException] {
  km.setK(initialK + 1)
}
```
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r104820054

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -335,17 +369,70 @@ class KMeans @Since("1.5.0") (
     model
   }

+  /**
+   * Check validity for interactions between parameters.
+   */
+  private def assertInitialModelValid(): Unit = {
+    if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
+      if (isSet(initialModel)) {
+        val initialModelK = $(initialModel).parentModel.k
+        if (initialModelK != $(k)) {
+          throw new IllegalArgumentException("The initial model's cluster count = " +
+            s"$initialModelK, mismatched with k = $k.")
+        }
+      } else {
+        throw new IllegalArgumentException("Users must set param initialModel if you choose " +
+          "'initialModel' as the initialization algorithm.")
+      }
+    } else {
+      if (isSet(initialModel)) {

--- End diff --

Also, this is not needed if we do the overwriting work in `setInitialModel`.
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r104819439

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -335,17 +369,70 @@ class KMeans @Since("1.5.0") (
     model
   }

+  /**
+   * Check validity for interactions between parameters.
+   */
+  private def assertInitialModelValid(): Unit = {
+    if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
+      if (isSet(initialModel)) {
+        val initialModelK = $(initialModel).parentModel.k
+        if (initialModelK != $(k)) {
+          throw new IllegalArgumentException("The initial model's cluster count = " +
+            s"$initialModelK, mismatched with k = $k.")
+        }
+      } else {
+        throw new IllegalArgumentException("Users must set param initialModel if you choose " +
+          "'initialModel' as the initialization algorithm.")
+      }
+    } else {
+      if (isSet(initialModel)) {
+        logWarning(s"Param initialModel will take no effect when initMode is $initMode.")
+      }
+    }
+  }
+
   @Since("1.5.0")
   override def transformSchema(schema: StructType): StructType = {
+    assertInitialModelValid()

--- End diff --

Why is this not checked in `fit`?
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r104821332

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -335,17 +369,70 @@ class KMeans @Since("1.5.0") (
     model
   }

+  /**
+   * Check validity for interactions between parameters.
+   */
+  private def assertInitialModelValid(): Unit = {

--- End diff --

I think with the overwriting above, the only thing we need to check is:

```scala
if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL && !isSet(initialModel)) {
  throw new IllegalArgumentException("Users must set param initialModel if you choose " +
    "'initialModel' as the initialization.")
}
```

We can just have it in the body of the `fit` method.
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r104803180

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -37,9 +37,9 @@ import org.apache.spark.storage.StorageLevel
 import org.apache.spark.util.VersionUtils.majorVersion

 /**
- * Common params for KMeans and KMeansModel
+ * Common params for KMeans and KMeansModel.
  */
-private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFeaturesCol
+private[clustering] trait KMeansModelParams extends Params with HasMaxIter with HasFeaturesCol
   with HasSeed with HasPredictionCol with HasTol {

--- End diff --

Now that `KMeansModel` mixes in `KMeansModelParams`, does that mean we cannot get the information about the `initialModel` at the model level? Also, why do we need to mix the seed into the model?
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r104819810

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -335,17 +369,70 @@ class KMeans @Since("1.5.0") (
     model
   }

+  /**
+   * Check validity for interactions between parameters.
+   */
+  private def assertInitialModelValid(): Unit = {
+    if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
+      if (isSet(initialModel)) {
+        val initialModelK = $(initialModel).parentModel.k
+        if (initialModelK != $(k)) {

--- End diff --

I don't think this check is needed if we overwrite `k` when `initialModel` is set.
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r104815453

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -300,6 +318,10 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)

+  /** @group setParam */
+  @Since("2.2.0")
+  def setInitialModel(value: KMeansModel): this.type = set(initialModel, value)

--- End diff --

How about

```scala
def setInitialModel(value: KMeansModel): this.type = {
  if (getK != value.getK) {
    // log the warning
    set(k, value.getK)
  }
  set(initMode, MLlibKMeans.K_MEANS_INITIAL_MODEL)  // We may log, but I don't really care for this one.
  set(initialModel, value)
}
```
[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/17117#discussion_r104800325

--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -152,6 +158,35 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR
     val kmeans = new KMeans()
     testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData)
   }
+
+  test("training with initial model") {
+    val kmeans = new KMeans().setK(2).setSeed(1)
+    val model1 = kmeans.fit(rData)
+    val model2 = kmeans.setInitMode("initialModel").setInitialModel(model1).fit(rData)
+    model2.clusterCenters.zip(model1.clusterCenters)
+      .foreach { case (center2, center1) => assert(center2 ~== center1 absTol 1E-8) }
+  }
+
+  test("training with initial model, error cases") {
+    val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+    // Sets initMode with 'initialModel', but does not specify initial model.
+    intercept[IllegalArgumentException] {

--- End diff --

My 2 cents: the later configuration should be able to overwrite the earlier and related settings, with warning messages. In your example, when `kmeans.setInitMode("k-means||")` is performed, the earlier `setInitialModel` should be ignored with a warning message. Likewise, even if we do `setK(3)` and later do `.setInitialModel(initialModel)`, we should ignore the earlier `setK(3)` with a warning.
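The "latest setter wins, with a warning" semantics argued for above can be sketched as a tiny standalone model of the parameter interactions. This is a hypothetical illustration, not Spark's `Param` machinery: `KMeansConfig` is an invented class, and the initial model is simplified to just its cluster count.

```scala
class KMeansConfig {
  private var kValue: Int = 2                      // default k, as in Spark's KMeans
  private var initMode: String = "k-means||"       // default init mode
  private var initialModelK: Option[Int] = None    // stands in for a full KMeansModel

  def setK(value: Int): this.type = { kValue = value; this }

  def setInitMode(mode: String): this.type = {
    // Moving away from "initialModel" drops a previously set initial model.
    if (mode != "initialModel" && initialModelK.isDefined) {
      println(s"warning: initMode=$mode ignores the previously set initial model")
      initialModelK = None
    }
    initMode = mode
    this
  }

  def setInitialModel(modelK: Int): this.type = {
    // The initial model's k overrides any earlier setK.
    if (kValue != modelK) {
      println(s"warning: overwriting k=$kValue with the initial model's k=$modelK")
      kValue = modelK
    }
    initMode = "initialModel"
    initialModelK = Some(modelK)
    this
  }

  def k: Int = kValue
  def mode: String = initMode
  def hasInitialModel: Boolean = initialModelK.isDefined
}
```

With this ordering rule there is nothing left to validate at `fit` time except the one case where `initMode` is "initialModel" but no model was ever supplied.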
[GitHub] spark issue #17078: [SPARK-19746][ML] Faster indexing for logistic aggregato...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/17078

Thanks. Merged into master.