[GitHub] spark pull request: [SPARK-1892][MLLIB] Adding OWL-QN optimizer fo...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/840#issuecomment-57439459 @debasish83 and @codedeft The weighted method for OWLQN in breeze has been merged: https://github.com/scalanlp/breeze/commit/2570911026aa05aa1908ccf7370bc19cd8808a4c I will submit a PR to Spark to use the newer version of breeze with this feature once @dlwh publishes it to Maven. There is still some work on the MLlib side to get it working properly; I'll work on this once I'm back from vacation.
[GitHub] spark pull request: [SPARK-3119] Re-implementation of TorrentBroad...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/2030#issuecomment-58183559 We had a build against the Spark master from Oct 2, and when we ran our application with around 600GB of data, we got the following exception. Does this PR fix this issue, which was also seen by @JoshRosen?

Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 8312, ams03-002.ff): java.io.IOException: PARSING_ERROR(2)
    org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
    org.xerial.snappy.SnappyNative.uncompressedLength(Native Method)
    org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:594)
    org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:125)
    org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
    org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58)
    org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
    org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1004)
    org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
    org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
    org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
    org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
    scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
    org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:89)
    org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
    org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    org.apache.spark.scheduler.Task.run(Task.scala:56)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:744)
Driver stacktrace:
[GitHub] spark pull request: [SPARK-3832][MLlib] Upgrade Breeze dependency ...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/2693 [SPARK-3832][MLlib] Upgrade Breeze dependency to 0.10 In Breeze 0.10, the L1 regParam can be configured through an anonymous function in OWLQN, so each component can be penalized differently. This is required for GLMNET in MLlib with L1/L2 regularization. https://github.com/scalanlp/breeze/commit/2570911026aa05aa1908ccf7370bc19cd8808a4c You can merge this pull request into a Git repository by running: $ git pull https://github.com/dbtsai/spark breeze0.10 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2693.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2693 commit 7a0c45cda7d388152774722a2f6728294cc81b4e Author: DB Tsai dbt...@dbtsai.com Date: 2014-10-07T14:20:41Z In Breeze 0.10, the L1 regParam can be configured through an anonymous function in OWLQN, and each component can be penalized differently. This is required for GLMNET in MLlib with L1/L2 regularization. https://github.com/scalanlp/breeze/commit/2570911026aa05aa1908ccf7370bc19cd8808a4c
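For illustration, a rough sketch of the per-component penalty this enables — assuming Breeze's OWLQN constructor that accepts an index-to-penalty function (as added in the commit above); the toy objective and penalty values here are made up:

```scala
import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, OWLQN}

// Toy smooth objective f(x) = ||x - 3||^2 with gradient 2(x - 3).
val f = new DiffFunction[DenseVector[Double]] {
  def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
    val diff = x - 3.0
    (diff.dot(diff), diff * 2.0)
  }
}

// Per-component L1 penalty: leave component 0 (e.g. the intercept)
// unpenalized, and penalize the remaining components with 0.1.
val l1reg: Int => Double = i => if (i == 0) 0.0 else 0.1

val owlqn = new OWLQN[Int, DenseVector[Double]](100, 10, l1reg)
val optimum = owlqn.minimize(f, DenseVector.zeros[Double](5))
```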
[GitHub] spark pull request: [SPARK-3119] Re-implementation of TorrentBroad...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/2030#issuecomment-58214186 I thought it was a closed issue, so I moved my comment to JIRA. I ran into this issue in spark-shell, not in a standalone application; does SPARK-3762 apply in this situation? Thanks. Sent from my Google Nexus 5

On Oct 7, 2014 5:17 PM, Davies Liu notificati...@github.com wrote:
> It could be fixed by https://github.com/apache/spark/pull/2624
> It's strange that I can not see this comment on PR #2030.
>
> On Tue, Oct 7, 2014 at 6:28 AM, DB Tsai notificati...@github.com wrote:
> [quote of the earlier comment and its Snappy PARSING_ERROR stack trace elided; see above]
>
> -- Davies
[GitHub] spark pull request: [SPARK-3832][MLlib] Upgrade Breeze dependency ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/2693#issuecomment-58276308 @dlwh David, do you know if there are any dependency changes in breeze-0.10, and is it compatible with both Scala 2.10 and 2.11? Thanks.
[GitHub] spark pull request: [SPARK-2505][MLlib] Weighted Regularizer for G...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/1518 [SPARK-2505][MLlib] Weighted Regularizer for Generalized Linear Model (Note: this is not ready to be merged. It needs documentation, and we need to make sure it's backward compatible with the Spark 1.0 APIs.) The current implementation of regularization in the linear model uses `Updater`, and this design has a couple of issues: 1) It penalizes all the weights, including the intercept. In the machine learning training process, people typically don't penalize the intercept. 2) `Updater` contains the adaptive step size logic for gradient descent, and we would like to clean this up by separating the regularization logic out of the updater into a regularizer, so that in the LBFGS optimizer we don't need the trick for getting the loss and gradient of the objective function. In this work, a weighted regularizer will be implemented, and users can exclude the intercept or any weight from regularization by setting that term's penalty weight to zero. Since the regularizer will return a tuple of loss and gradient, the adaptive step size logic and the soft thresholding for L1 in `Updater` will be moved into the SGD optimizer. You can merge this pull request into a Git repository by running: $ git pull https://github.com/AlpineNow/spark SPARK-2505_regularizer Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1518.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1518 commit 2946930ec3de0e0a34e07d065c954d7aabacd4ba Author: DB Tsai dbt...@alpinenow.com Date: 2014-07-19T02:15:37Z initial work
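To make the proposal concrete, here is a minimal plain-Scala sketch of a weighted L2 regularizer returning the (loss, gradient) tuple described above — the class and parameter names are hypothetical, not the PR's actual code:

```scala
// Coefficient i is penalized by regParam * penaltyWeights(i); setting
// penaltyWeights(i) = 0.0 (e.g. for the intercept) excludes that term.
class WeightedL2Regularizer(regParam: Double, penaltyWeights: Array[Double]) {

  /** Returns the (loss, gradient) contribution of the regularizer. */
  def compute(weights: Array[Double]): (Double, Array[Double]) = {
    require(weights.length == penaltyWeights.length)
    var loss = 0.0
    val gradient = new Array[Double](weights.length)
    var i = 0
    while (i < weights.length) {
      loss += 0.5 * regParam * penaltyWeights(i) * weights(i) * weights(i)
      gradient(i) = regParam * penaltyWeights(i) * weights(i)
      i += 1
    }
    (loss, gradient)
  }
}
```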
[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-49682150 I think it fails because the Apache license header is not in the test file. As you suggested, I'll move the file to be generated at runtime. I'd like to hear general feedback; I'll make the test pass tomorrow.
[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1425#issuecomment-49682436 `!~==` will be used in the tests, since `!(a ~== b)` will not work: `(a ~== b)` does not return false but instead throws an exception to produce a message. I will replace almostEquals with `~==`. Thanks.
[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1425#issuecomment-49954543 @srowen @mengxr and @dorx Based on our discussion, I've implemented two different APIs, one for relative error and one for absolute error. It makes sense that test writers should know which one they need depending on their circumstances. Developers also need to explicitly specify the eps now; there is no default value, which sometimes caused confusion. When comparing against zero using relative error, an exception will be raised to warn users that it's meaningless. For relative error in percentage, users can now write

```scala
assert(23.1 ~== 23.52 %+- 2.0)
assert(23.1 ~== 22.74 %+- 2.0)
assert(23.1 ~= 23.52 %+- 2.0)
assert(23.1 ~= 22.74 %+- 2.0)
assert(!(23.1 !~= 23.52 %+- 2.0))
assert(!(23.1 !~= 22.74 %+- 2.0))

// This will throw an exception with the following message:
// Did not expect 23.1 and 23.52 to be within 2.0% using relative error.
assert(23.1 !~== 23.52 %+- 2.0)

// Expected 23.1 and 22.34 to be within 2.0% using relative error.
assert(23.1 ~== 22.34 %+- 2.0)
```

For absolute error,

```scala
assert(17.8 ~== 17.99 +- 0.2)
assert(17.8 ~== 17.61 +- 0.2)
assert(17.8 ~= 17.99 +- 0.2)
assert(17.8 ~= 17.61 +- 0.2)
assert(!(17.8 !~= 17.99 +- 0.2))
assert(!(17.8 !~= 17.61 +- 0.2))

// This will throw an exception with the following message:
// Did not expect 17.8 and 17.99 to be within 0.2 using absolute error.
assert(17.8 !~== 17.99 +- 0.2)

// Expected 17.8 and 17.59 to be within 0.2 using absolute error.
assert(17.8 ~== 17.59 +- 0.2)
```
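For reference, the semantics behind the two operators reduce to two simple predicates — a sketch under one common convention (max-based relative error), not the PR's exact code:

```scala
// Absolute error: |a - b| < eps.
def absToleranceEq(a: Double, b: Double, eps: Double): Boolean =
  math.abs(a - b) < eps

// Relative error: |a - b| / max(|a|, |b|) < eps; comparing against zero
// is rejected because relative error is meaningless there.
def relToleranceEq(a: Double, b: Double, eps: Double): Boolean = {
  val m = math.max(math.abs(a), math.abs(b))
  require(m != 0.0, "Relative error is meaningless when comparing against zero.")
  math.abs(a - b) / m < eps
}
```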
[GitHub] spark pull request: [SPARK-2479 (partial)][MLLIB] fix binary metri...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1576#issuecomment-50057950 @mengxr Feel free to merge this one first. After you merge, I'll rebase #1425 against the current master and address the conflicts.
[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1425#issuecomment-50064963 @mengxr `%+-` is used as an operator to indicate relative error. Users can write `assert(a ~== b %+- 1E-10)` for relative error, and `assert(a ~== b +- 1E-10)` for absolute error. As a result, the syntactic sugar is the same as scalatest's for absolute error, except that scalatest uses `===` instead of `~==`. On the other hand, using `absErr`/`relErr` seems easier to remember. I'm open to both, and it's easy to change.
[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1425#issuecomment-50081864 @mengxr I just rebased against master, and it passes the tests. Depending on whether we want to use `absErr`/`relErr`, `+-`/`%+-`, or both, I can make further modifications. Thanks.
[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1425#discussion_r15443103 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---

@@ -40,27 +41,51 @@ class KMeansSuite extends FunSuite with LocalSparkContext {
   // No matter how many runs or iterations we use, we should get one cluster,
   // centered at the mean of the points
+<<<<<<< HEAD

--- End diff --

I tried to rebase against master and hit conflicts (hence the leftover conflict marker). I addressed them in the next push.
[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1425#issuecomment-50293096 @mengxr Resolved all the conflicts after rebasing, and all the unit tests pass. Thanks.
[GitHub] spark pull request: [SPARK-2505][MLlib] Weighted Regularizer for G...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1518#issuecomment-50663418 I tried making the bias really big so the intercept becomes smaller and effectively avoids being regularized. The result is still quite different from R, and very sensitive to the strength of the bias. Users may re-scale the features to improve the convergence of the optimization process, and in order to get the same coefficients without scaling, each component has to be penalized differently. Also, users may know which features are less important and want to penalize them more. As a result, I still want to implement the full weighted regularizer, and de-couple the adaptive learning rate from the updater. Let's talk in detail when we meet tomorrow. Thanks.
[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-50982699 @mengxr Is there any problem with asfgit? This is not finished yet, so why did asfgit say it's merged into apache:master?
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1207#discussion_r15733217 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---

@@ -0,0 +1,58 @@
+/* Apache License 2.0 header elided */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^n norm
+ *
+ * @param n L^2 norm by default. Normalization in L^n space.
+ */
+@DeveloperApi
+class Normalizer(n: Int) extends VectorTransformer with Serializable {
+
+  def this() = this(2)
+
+  require(n > 0)

--- End diff --

This is an Int. As long as we require p > 0, it implies p >= 1.
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1207#discussion_r15733221 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala --- (same hunk as in the previous comment, at `require(n > 0)`) --- End diff -- I made it more explicit rather than trying to save one CPU cycle.
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1207#discussion_r15733244 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---

@@ -0,0 +1,94 @@
+/* Apache License 2.0 header elided */
+
+package org.apache.spark.mllib.feature
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: DeveloperApi ::
+ * Standardizes features by removing the mean and scaling to unit variance using column summary
+ * statistics on the samples in the training set.
+ *
+ * @param withMean True by default. Centers the data with mean before scaling. It will build a dense
+ *                 output, so this does not work on sparse input and will raise an exception.
+ * @param withStd True by default. Scales the data to unit standard deviation.

--- End diff --

sklearn.preprocessing.StandardScaler has this API. If we want to minimize the set of parameters now, we can remove it for this release. http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1207#discussion_r15733248 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/VectorTransformer.scala ---

@@ -0,0 +1,47 @@
+/* Apache License 2.0 header elided */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: DeveloperApi ::
+ * Trait for transformation of a vector
+ */
+@DeveloperApi
+trait VectorTransformer {
+
+  /**
+   * Applies transformation on a vector.
+   *
+   * @param vector vector to be transformed.
+   * @return transformed vector.
+   */
+  def transform(vector: Vector): Vector
+
+  /**
+   * Applies transformation on an RDD[Vector].
+   *
+   * @param data RDD[Vector] to be transformed.
+   * @return transformed RDD[Vector].
+   */
+  def transform(data: RDD[Vector]): RDD[Vector] = data.map(x => this.transform(x))

--- End diff --

Can you elaborate on this?
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1207#discussion_r15738936 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---

@@ -0,0 +1,108 @@
+/* Apache License 2.0 header elided */
+
+package org.apache.spark.mllib.feature
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV}
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^p norm
+ *
+ * For any 1 <= p < Double.Infinity, normalizes samples using sum(abs(vector).^p)^(1/p) as norm.
+ * For p = Double.Infinity, max(abs(vector)) will be used as norm for normalization.
+ * For p = Double.NegativeInfinity, min(abs(vector)) will be used as norm for normalization.

--- End diff --

Matlab has L_{-inf} (http://www.mathworks.com/help/matlab/ref/norm.html) for min(abs(X)). I agree that it's not useful for sparse data. I'm going to remove it.
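For reference, a minimal plain-Scala sketch of the L^p normalization described in the Javadoc above (not the PR's implementation; the helper name is made up):

```scala
// Normalize v to unit L^p norm: divide by (sum_i |v_i|^p)^(1/p);
// for p = Double.PositiveInfinity, use max_i |v_i| as the norm instead.
def normalizeLp(v: Array[Double], p: Double): Array[Double] = {
  require(p >= 1.0)
  val norm =
    if (p == Double.PositiveInfinity) v.map(math.abs).max
    else math.pow(v.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)
  // Leave the all-zero vector unchanged to avoid dividing by zero.
  if (norm == 0.0) v.clone() else v.map(_ / norm)
}

// e.g. normalizeLp(Array(3.0, -4.0), 2.0) returns Array(0.6, -0.8)
```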
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1207#discussion_r15740021 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---

@@ -0,0 +1,77 @@
+/* Apache License 2.0 header elided */
+
+package org.apache.spark.mllib.feature
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV}
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^p^ norm

--- End diff --

lol...
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1207#discussion_r15740240 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/StandardScalerSuite.scala ---

@@ -0,0 +1,208 @@
+/* Apache License 2.0 header elided */
+
+package org.apache.spark.mllib.feature
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}
+import org.apache.spark.mllib.util.LocalSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.mllib.rdd.RDDFunctions._
+import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, MultivariateOnlineSummarizer}
+import org.apache.spark.rdd.RDD
+
+class StandardScalerSuite extends FunSuite with LocalSparkContext {
+
+  private def computeSummary(data: RDD[Vector]): MultivariateStatisticalSummary = {
+    data.treeAggregate(new MultivariateOnlineSummarizer)(
+      (aggregator, data) => aggregator.add(data),
+      (aggregator1, aggregator2) => aggregator1.merge(aggregator2))
+  }
+
+  test("Standardization with dense input") {
+    val data = Array(
+      Vectors.dense(-2.0, 2.3, 0),
+      Vectors.dense(0.0, -1.0, -3.0),
+      Vectors.dense(0.0, -5.1, 0.0),
+      Vectors.dense(3.8, 0.0, 1.9),
+      Vectors.dense(1.7, -0.6, 0.0),
+      Vectors.dense(0.0, 1.9, 0.0)
+    )
+
+    val dataRDD = sc.parallelize(data, 3)
+
+    val standardizer1 = new StandardScaler(withMean = true, withStd = true)
+    val standardizer2 = new StandardScaler()
+    val standardizer3 = new StandardScaler(withMean = true, withStd = false)
+
+    withClue("Using a standardizer before fitting the model should throw exception.") {
+      intercept[IllegalStateException] {
+        data.map(standardizer1.transform)
+      }
+    }
+
+    standardizer1.fit(dataRDD)
+    standardizer2.fit(dataRDD)
+    standardizer3.fit(dataRDD)
+
+    val data1 = data.map(standardizer1.transform)
+    val data2 = data.map(standardizer2.transform)
+    val data3 = data.map(standardizer3.transform)
+
+    val data1RDD = standardizer1.transform(dataRDD)
+    val data2RDD = standardizer2.transform(dataRDD)
+    val data3RDD = standardizer3.transform(dataRDD)
+
+    val summary = computeSummary(dataRDD)
+    val summary1 = computeSummary(data1RDD)
+    val summary2 = computeSummary(data2RDD)
+    val summary3 = computeSummary(data3RDD)
+
+    assert((data, data1, data1RDD.collect()).zipped.forall(
+      (v1, v2, v3) => (v1, v2, v3) match {
+        case (v1: DenseVector, v2: DenseVector, v3: DenseVector) => true
+        case (v1: SparseVector, v2: SparseVector, v3: SparseVector) => true
+        case _ => false
+      }
+    ), "The vector type should be preserved after standardization.")
+
+    assert((data, data2, data2RDD.collect()).zipped.forall(
+      (v1, v2, v3) => (v1, v2, v3) match {
+        case (v1: DenseVector, v2: DenseVector, v3: DenseVector) => true
+        case (v1: SparseVector, v2: SparseVector, v3: SparseVector) => true
+        case _ => false
+      }
+    ), "The vector type should be preserved after standardization.")
+
+    assert((data, data3, data3RDD.collect()).zipped.forall(
+      (v1, v2, v3) => (v1, v2, v3) match {
+        case (v1: DenseVector, v2: DenseVector, v3: DenseVector) => true
+        case (v1: SparseVector, v2: SparseVector, v3: SparseVector) => true
+        case _ => false
+      }
+    ), "The vector type should be preserved after standardization.")
+
+    assert((data1, data1RDD.collect()).zipped.forall((v1, v2) => v1 ~== v2 absTol 1E-5))

--- End diff --

For each RDD, I just call collect() twice; I don't want to add another variable for this. (PS: the RDD version is used for computing the summary stats, so we need both
[GitHub] spark pull request: [SPARK-2505][MLlib] Weighted Regularizer for G...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1518#issuecomment-51151346 It's too late to get this into 1.1, but I'll try to make it happen in 1.2. We'll use this in the Alpine implementation first.
[GitHub] spark pull request: [MLlib] Use this.type as return type in k-mean...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/1796 [MLlib] Use this.type as the return type in k-means' builder pattern to ensure that the returned object is itself. You can merge this pull request into a Git repository by running: $ git pull https://github.com/AlpineNow/spark dbtsai-kmeans Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1796.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1796 commit 658989ef591ad28f891b275ccdc8137c5c180f46 Author: DB Tsai dbt...@alpinenow.com Date: 2014-08-06T01:30:32Z Alpine Data Labs
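For background, a small sketch of why `this.type` matters for builder chaining (the class names here are made up, not the k-means code):

```scala
class BuilderBase {
  private var k: Int = 2
  // Returning `this.type` (rather than BuilderBase) lets subclasses keep
  // chaining their own setters after calling an inherited one.
  def setK(k: Int): this.type = { this.k = k; this }
}

class ExtendedBuilder extends BuilderBase {
  private var seed: Long = 42L
  def setSeed(s: Long): this.type = { this.seed = s; this }
}

// Compiles only because setK returns this.type: the static type of the
// chain stays ExtendedBuilder, so setSeed is still available.
val b = new ExtendedBuilder().setK(10).setSeed(1L)
```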
[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1814#discussion_r15908219 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---

@@ -35,38 +35,47 @@ import org.apache.spark.rdd.RDD
  * @param withStd True by default. Scales the data to unit standard deviation.
  */
 @Experimental
-class StandardScaler(withMean: Boolean, withStd: Boolean) extends VectorTransformer {
+class StandardScaler(withMean: Boolean, withStd: Boolean) {

   def this() = this(false, true)

   require(withMean || withStd, s"withMean and withStd both equal to false. Doing nothing.")

-  private var mean: BV[Double] = _
-  private var factor: BV[Double] = _
-
   /**
    * Computes the mean and variance and stores as a model to be used for later scaling.
    *
    * @param data The data used to compute the mean and variance to build the transformation model.
-   * @return This StandardScalar object.
+   * @return a StandardScalarModel
    */
-  def fit(data: RDD[Vector]): this.type = {
+  def fit(data: RDD[Vector]): StandardScalerModel = {
     val summary = data.treeAggregate(new MultivariateOnlineSummarizer)(
       (aggregator, data) => aggregator.add(data),
       (aggregator1, aggregator2) => aggregator1.merge(aggregator2))
-    mean = summary.mean.toBreeze
-    factor = summary.variance.toBreeze
-    require(mean.length == factor.length)
+    val mean = summary.mean.toBreeze
+    val factor = summary.variance.toBreeze
+    require(mean.size == factor.size)
     var i = 0
-    while (i < factor.length) {
+    while (i < factor.size) {
       factor(i) = if (factor(i) != 0.0) 1.0 / math.sqrt(factor(i)) else 0.0
       i += 1
     }
-    this
+    new StandardScalerModel(withMean, withStd, mean, factor)
   }
+}
+
+/**
+ * :: Experimental ::
+ * Represents a StandardScaler model that can transform vectors.
+ */
+@Experimental
+class StandardScalerModel private[mllib] (
+    val withMean: Boolean,
+    val withStd: Boolean,
+    val mean: BV[Double],
+    val factor: BV[Double]) extends VectorTransformer {

--- End diff --

Since users may want to know the variance of the training set, should we have the constructor

class StandardScalerModel private[mllib] (
    val withMean: Boolean,
    val withStd: Boolean,
    val mean: BV[Double],
    val variance: BV[Double]) {

  lazy val factor = {
    val temp = variance.clone
    var i = 0
    while (i < temp.size) {
      temp(i) = if (temp(i) != 0.0) 1.0 / math.sqrt(temp(i)) else 0.0
      i += 1
    }
    temp
  }
}
[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1814#discussion_r15908318 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---

@@ -35,38 +35,47 @@
 @Experimental
-class StandardScaler(withMean: Boolean, withStd: Boolean) extends VectorTransformer {
+class StandardScaler(withMean: Boolean, withStd: Boolean) {

--- End diff --

This class is only used for keeping the state of withMean and withStd; is it possible to move those states into the fit function by overloading, and make this an object?
[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1814#discussion_r15908504 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---

@@ -177,18 +115,72 @@
     private def isEmpty: Boolean = m == 0L

     /** Returns the current IDF vector. */
-    def idf(): BDV[Double] = {
+    def idf(): Vector = {
       if (isEmpty) {
         throw new IllegalStateException("Haven't seen any document yet.")
       }
       val n = df.length
-      val inv = BDV.zeros[Double](n)
+      val inv = new Array[Double](n)
       var j = 0
       while (j < n) {
         inv(j) = math.log((m + 1.0) / (df(j) + 1.0))
         j += 1
       }
-      inv
+      Vectors.dense(inv)
     }
   }
 }
+
+/**
+ * :: Experimental ::
+ * Represents an IDF model that can transform term frequency vectors.
+ */
+@Experimental
+class IDFModel private[mllib] (val idf: Vector) extends Serializable {
+
+  /**
+   * Transforms term frequency (TF) vectors to TF-IDF vectors.
+   * @param dataset an RDD of term frequency vectors
+   * @return an RDD of TF-IDF vectors
+   */
+  def transform(dataset: RDD[Vector]): RDD[Vector] = {
+    val bcIdf = dataset.context.broadcast(idf)
+    dataset.mapPartitions { iter =>
+      val thisIdf = bcIdf.value
+      iter.map { v =>
+        val n = v.size
+        v match {
+          case sv: SparseVector =>
+            val nnz = sv.indices.size
+            val newValues = new Array[Double](nnz)
+            var k = 0
+            while (k < nnz) {
+              newValues(k) = sv.values(k) * thisIdf(sv.indices(k))
+              k += 1
+            }
+            Vectors.sparse(n, sv.indices, newValues)
+          case dv: DenseVector =>
+            val newValues = new Array[Double](n)
+            var j = 0
+            while (j < n) {
+              newValues(j) = dv.values(j) * thisIdf(j)
+              j += 1
+            }
+            Vectors.dense(newValues)
+          case other =>
+            throw new UnsupportedOperationException(

--- End diff --

A similar exception is used for unsupported vector types in appendBias and StandardScaler; maybe we could have a global definition of this in util:

case v => throw new IllegalArgumentException("Do not support vector type " + v.getClass)
[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1814#issuecomment-51511617 LGTM. Merged into both master and branch-1.1. Thanks!
[GitHub] spark pull request: [SPARK-2934][MLlib] Adding LogisticRegressionW...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/1862 [SPARK-2934][MLlib] Adding LogisticRegressionWithLBFGS: an interface for training with the LBFGS optimizer, which converges faster than SGD. You can merge this pull request into a Git repository by running: $ git pull https://github.com/AlpineNow/spark dbtsai-lbfgs-lor Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1862.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1862 commit 3cf50c207e79c5f67cd5d06ff3f85f3538c23081 Author: DB Tsai dbt...@alpinenow.com Date: 2014-08-08T23:23:21Z LogisticRegressionWithLBFGS interface
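A rough usage sketch of the new interface — the data path is illustrative, and this assumes the no-arg constructor and the inherited run method shown in the review below; labels must be {0, 1}:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

val sc = new SparkContext("local[4]", "lbfgs-lor-example")

// Load LIBSVM-format training data (the path here is made up).
val training = MLUtils.loadLibSVMFile(sc, "data/sample_binary.txt").cache()

// Train with the L-BFGS-backed logistic regression.
val model = new LogisticRegressionWithLBFGS().run(training)
println(s"weights: ${model.weights}, intercept: ${model.intercept}")
```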
[GitHub] spark pull request: [SPARK-2934][MLlib] Adding LogisticRegressionW...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1862#discussion_r16022431 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---

@@ -188,3 +188,98 @@ object LogisticRegressionWithSGD {
     train(input, numIterations, 1.0, 1.0)
   }
 }
+
+/**
+ * Train a classification model for Logistic Regression using Limited-memory BFGS.
+ * NOTE: Labels used in Logistic Regression should be {0, 1}
+ */
+class LogisticRegressionWithLBFGS private (
+    private var convergenceTol: Double,
+    private var maxNumIterations: Int,
+    private var regParam: Double)
+  extends GeneralizedLinearAlgorithm[LogisticRegressionModel] with Serializable {
+
+  private val gradient = new LogisticGradient()
+  private val updater = new SimpleUpdater()
+  override val optimizer = new LBFGS(gradient, updater)
+    .setNumCorrections(10)
+    .setConvergenceTol(convergenceTol)
+    .setMaxNumIterations(maxNumIterations)
+    .setRegParam(regParam)
+
+  override protected val validators = List(DataValidators.binaryLabelValidator)
+
+  /**
+   * Construct a LogisticRegression object with default parameters
+   */
+  def this() = this(1E-4, 100, 0.0)
+
+  override protected def createModel(weights: Vector, intercept: Double) = {
+    new LogisticRegressionModel(weights, intercept)
+  }
+}
+
+/**
+ * Top-level methods for calling Logistic Regression using Limited-memory BFGS.
+ * NOTE: Labels used in Logistic Regression should be {0, 1}
+ */
+object LogisticRegressionWithLBFGS {

--- End diff --

I don't mind this. However, it will cause an inconsistent API compared with LogisticRegressionWithSGD.
[GitHub] spark pull request: [SPARK-2934][MLlib] Adding LogisticRegressionW...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1862#discussion_r16023077 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---

@@ -188,3 +188,54 @@ object LogisticRegressionWithSGD {
     train(input, numIterations, 1.0, 1.0)
   }
 }
+
+/**
+ * Train a classification model for Logistic Regression using Limited-memory BFGS.
+ * NOTE: Labels used in Logistic Regression should be {0, 1}
+ */
+class LogisticRegressionWithLBFGS private (
+    private var convergenceTol: Double,
+    private var maxNumIterations: Int,
+    private var regParam: Double)
+  extends GeneralizedLinearAlgorithm[LogisticRegressionModel] with Serializable {
+
+  private val gradient = new LogisticGradient()
+  private val updater = new SimpleUpdater()
+  // Has to be lazy since users can change the parameters after the class is created.
+  // PS: after the first train, the optimizer variable will be computed, so the parameters
+  // cannot be changed anymore.
+  override lazy val optimizer = new LBFGS(gradient, updater)
+    .setNumCorrections(10)
+    .setConvergenceTol(convergenceTol)
+    .setMaxNumIterations(maxNumIterations)
+    .setRegParam(regParam)
+
+  override protected val validators = List(DataValidators.binaryLabelValidator)
+
+  /**
+   * Construct a LogisticRegression object with default parameters
+   */
+  def this() = this(1E-4, 100, 0.0)
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * A smaller value will lead to higher accuracy at the cost of more iterations.
+   */
+  def setConvergenceTol(tolerance: Double): this.type = {
+    this.convergenceTol = tolerance
+    this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {

--- End diff --

Agreed! Should we also change the API in the optimizer?
[GitHub] spark pull request: [SPARK-2934][MLlib] Adding LogisticRegressionW...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1862#discussion_r16023299 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- (same hunk as in the previous comment, at `setMaxNumIterations`) --- End diff -- LBFGS.setMaxNumIterations
[GitHub] spark pull request: [SPARK-2979][MLlib ]Improve the convergence ra...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/1897 [SPARK-2979][MLlib] Improve the convergence rate by minimizing the condition number

Scaling to minimize the condition number: during the optimization process, the convergence rate depends on the condition number of the training dataset. Scaling the variables often reduces this condition number, thus improving the convergence rate dramatically. Without reducing the condition number, some training datasets mixing columns with different scales may not be able to converge. The GLMNET and LIBSVM packages perform this scaling to reduce the condition number, and return the weights in the original scale. See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf

Here, if useFeatureScaling is enabled, we standardize the training features by dividing each column by its standard deviation (without subtracting the mean), and train the model in the scaled space. Then we transform the coefficients from the scaled space back to the original scale, as GLMNET and LIBSVM do. Currently, it's only enabled in LogisticRegressionWithLBFGS.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/AlpineNow/spark dbtsai-feature-scaling Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1897.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1897 commit 5257751cda9cd0cb284af06c81e1282e1bfb53f7 Author: DB Tsai dbt...@alpinenow.com Date: 2014-08-08T23:23:21Z Improve the convergence rate by minimizing the condition number in LOR with LBFGS
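To make the back-transform concrete, a small sketch under the convention described above (no mean subtraction, intercept ignored; the helper name is made up):

```scala
// Train in the scaled space x' = x / std. Since w' . x' = (w' / std) . x,
// the original-space coefficients are w'_i / std_i; a zero std means the
// column is constant and its coefficient is set to zero.
def toOriginalScale(scaledWeights: Array[Double], std: Array[Double]): Array[Double] =
  scaledWeights.zip(std).map { case (w, s) => if (s != 0.0) w / s else 0.0 }
```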
[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1897#discussion_r16153527

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala ---
@@ -137,11 +154,45 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
       throw new SparkException("Input validation failed.")
     }
+
+    /**
+     * Scaling to minimize the condition number:
+     *
+     * During the optimization process, the convergence (rate) depends on the condition number of
+     * the training dataset. Scaling the variables often reduces this condition number, thus
+     * improving the convergence rate dramatically. Without reducing the condition number,
+     * some training datasets mixing columns with different scales may not be able to converge.
+     *
+     * GLMNET and LIBSVM packages perform the scaling to reduce the condition number, and return
+     * the weights in the original scale.
+     * See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
+     *
+     * Here, if useFeatureScaling is enabled, we will standardize the training features by
+     * dividing each column by its standard deviation (without subtracting the mean), and train
+     * the model in the scaled space. Then we transform the coefficients from the scaled space
+     * to the original scale as GLMNET and LIBSVM do.
+     *
+     * Currently, it's only enabled in LogisticRegressionWithLBFGS.
+     */
+    val scaler = if (useFeatureScaling) {
+      (new StandardScaler).fit(input.map(x => x.features))
+    } else {
+      null
+    }
+
     // Prepend an extra variable consisting of all 1.0's for the intercept.
     val data = if (addIntercept) {
-      input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
+      if (useFeatureScaling) {
+        input.map(labeledPoint =>
+          (labeledPoint.label, appendBias(scaler.transform(labeledPoint.features))))
+      } else {
+        input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
+      }
     } else {
-      input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
+      if (useFeatureScaling) {
+        input.map(labeledPoint => (labeledPoint.label, scaler.transform(labeledPoint.features)))
+      } else {
+        input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
--- End diff --

It's not an identity map. It's converting a LabeledPoint into a tuple of response and feature vector for the optimizer.
[GitHub] spark pull request: Minor change in the comment of spark-defaults....
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/2709

Minor change in the comment of spark-defaults.conf.template

spark-defaults.conf is read by spark-shell as well, and this PR adds that to the template's comment.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dbtsai/spark docs
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2709.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2709

commit b3e1ff1b808380707d04277c2379bf5b03556662
Author: DB Tsai dbt...@alpinenow.com
Date: 2014-10-08T08:53:25Z
add spark-shell
[GitHub] spark pull request: [SPARK-3121] Wrong implementation of implicit ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/2712#issuecomment-58361701

Jenkins, please start the test.
[GitHub] spark pull request: [SPARK-3856][MLLIB] use norm operator after br...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/2718#issuecomment-58435304

LGTM. Thanks.
[GitHub] spark pull request: [SPARK-3121] Wrong implementation of implicit ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/2712#issuecomment-58629065

Jenkins, test this please.
[GitHub] spark pull request: [SPARK-3121] Wrong implementation of implicit ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/2712#issuecomment-58732030

It's failing at FlumeStreamSuite.scala:109, which seems unrelated to this patch.
[GitHub] spark pull request: Minor change in the comment of spark-defaults....
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/2709#issuecomment-59667207

@andrewor14 Sorry for the late reply; I was on vacation in Europe last week. I can continue working on this after I finish my talk at the IOTA conference tomorrow.
[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/2868#issuecomment-59871504

Jenkins, please start the test!
[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-60813678

@BigCrunsh I'm working on this. Let's see if we can get it merged into Spark 1.2.
[GitHub] spark pull request: [SPARK-4129][MLlib] Performance tuning in Mult...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/2992

[SPARK-4129][MLlib] Performance tuning in MultivariateOnlineSummarizer

In MultivariateOnlineSummarizer, breeze's activeIterator is used to loop through the nonzero elements in the vector. However, activeIterator doesn't perform well due to lots of overhead. In this PR, a native while loop is used for both DenseVector and SparseVector.

The benchmark result with 20 executors using the mnist8m dataset:
Before: DenseVector: 48.2 seconds, SparseVector: 16.3 seconds
After: DenseVector: 17.8 seconds, SparseVector: 11.2 seconds

Since MultivariateOnlineSummarizer is used in several places, the overall performance gain in the mllib library will be significant with this PR.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/AlpineNow/spark SPARK-4129
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2992.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2992

commit ebe3e74df70eb424aecc3170fc55008cfb6a76ec
Author: DB Tsai dbt...@alpinenow.com
Date: 2014-10-29T05:42:50Z
First commit
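The "native while loop" pattern referred to above, as a standalone sketch (a toy sum rather than the summarizer's real mean/variance updates; DenseVector and SparseVector do expose their backing arrays):

    import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

    // Sum of values, written the "while loop" way: no per-element iterator overhead.
    def sumValues(v: Vector): Double = v match {
      case dv: DenseVector =>
        var sum = 0.0
        var i = 0
        while (i < dv.values.length) {   // O(n) over the dense array
          sum += dv.values(i)
          i += 1
        }
        sum
      case sv: SparseVector =>
        var sum = 0.0
        var k = 0
        while (k < sv.indices.length) {  // O(nnz): walk the parallel index/value arrays
          sum += sv.values(k)
          k += 1
        }
        sum
    }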
[GitHub] spark pull request: [SPARK-1870] Ported from 1.0 branch to 0.9 bra...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1013#issuecomment-45551414

Tested on a PivotalHD 1.1 YARN 4-node cluster. With --addjars file:///somePath/to/jar, launching a Spark application works.
[GitHub] spark pull request: [SPARK-1870] Made deployment with --jars work ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1013#discussion_r13573544

--- Diff: yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -507,12 +508,19 @@ object Client {
       Apps.addToEnvironment(env, Environment.CLASSPATH.name,
         Environment.PWD.$() + Path.SEPARATOR + LOG4J_PROP)
     }
+
+    val cachedSecondaryJarLinks =
+      sparkConf.getOption(CONF_SPARK_YARN_SECONDARY_JARS).getOrElse("").split(",")
--- End diff --

Thanks. You are right. It will add an empty string to the array, and then add the folder without a file into the classpath. Will fix in master as well.
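The root cause is a well-known corner of String.split, worth pinning down with a tiny example:

    // "".split(",") returns Array(""), not an empty array -- the bug discussed above.
    val jars = "".split(",")
    assert(jars.sameElements(Array("")))       // one empty entry sneaks in

    // The fix: drop empty entries before using them as classpath elements.
    val cleaned = "".split(",").filter(_.nonEmpty)
    assert(cleaned.isEmpty)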
[GitHub] spark pull request: Make sure that empty string is filtered out wh...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/1027

Make sure that empty string is filtered out when we get the secondary jars from conf

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dbtsai/spark dbtsai-classloader
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1027.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1027

commit c9c7ad7fc6a2cf03503fe7b19ea1da92247196c6
Author: DB Tsai dbt...@dbtsai.com
Date: 2014-06-10T01:29:04Z
Make sure that empty string is filtered out when we get the secondary jars from conf.
[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/490#discussion_r13624385

--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -95,15 +96,18 @@ trait ClientBase extends Logging {
     // If we have requested more then the clusters max for a single resource then exit.
     if (args.executorMemory > maxMem) {
-      logError("Required executor memory (%d MB), is above the max threshold (%d MB) of this cluster.".
-        format(args.executorMemory, maxMem))
-      System.exit(1)
+      val errorMessage =
+        "Required executor memory (%d MB), is above the max threshold (%d MB) of this cluster.".
+        format(args.executorMemory, maxMem)
+      logError(errorMessage)
+      throw new IllegalArgumentException(errorMessage)
     }
     val amMem = args.amMemory + YarnAllocationHandler.MEMORY_OVERHEAD
     if (amMem > maxMem) {
-      logError("Required AM memory (%d) is above the max threshold (%d) of this cluster".
-        format(args.amMemory, maxMem))
-      System.exit(1)
+      val errorMessage ="Required AM memory (%d) is above the max threshold (%d) of this cluster".
--- End diff --

Please add a space after =
[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/490#discussion_r13624580

--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -95,15 +96,18 @@ trait ClientBase extends Logging {
     // If we have requested more then the clusters max for a single resource then exit.
     if (args.executorMemory > maxMem) {
-      logError("Required executor memory (%d MB), is above the max threshold (%d MB) of this cluster.".
-        format(args.executorMemory, maxMem))
-      System.exit(1)
+      val errorMessage =
+        "Required executor memory (%d MB), is above the max threshold (%d MB) of this cluster.".
+        format(args.executorMemory, maxMem)
--- End diff --

Move the . to the new line.
[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/490#discussion_r13624615

--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -95,15 +96,18 @@ trait ClientBase extends Logging {
     // If we have requested more then the clusters max for a single resource then exit.
     if (args.executorMemory > maxMem) {
-      logError("Required executor memory (%d MB), is above the max threshold (%d MB) of this cluster.".
-        format(args.executorMemory, maxMem))
-      System.exit(1)
+      val errorMessage =
+        "Required executor memory (%d MB), is above the max threshold (%d MB) of this cluster.".
+        format(args.executorMemory, maxMem)
+      logError(errorMessage)
+      throw new IllegalArgumentException(errorMessage)
     }
     val amMem = args.amMemory + YarnAllocationHandler.MEMORY_OVERHEAD
     if (amMem > maxMem) {
-      logError("Required AM memory (%d) is above the max threshold (%d) of this cluster".
-        format(args.amMemory, maxMem))
-      System.exit(1)
+      val errorMessage ="Required AM memory (%d) is above the max threshold (%d) of this cluster".
--- End diff --

Move the . to the new line.
[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/490#issuecomment-45835283

@mengxr Do you think it's in good shape now? This is the only issue blocking us from using vanilla Spark. Thanks.
[GitHub] spark pull request: [SPARK-2163] class LBFGS optimize with Double ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1104#discussion_r13897737

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -38,10 +38,10 @@ import org.apache.spark.mllib.linalg.{Vectors, Vector}
 class LBFGS(private var gradient: Gradient, private var updater: Updater)
   extends Optimizer with Logging {
-  private var numCorrections = 10
-  private var convergenceTol = 1E-4
-  private var maxNumIterations = 100
-  private var regParam = 0.0
+  private var numCorrections: Int = 10
+  private var convergenceTol: Double = 1E-4
+  private var maxNumIterations: Int = 100
+  private var regParam: Double = 0.0
--- End diff --

In most of the mllib codebase, we don't annotate variable types explicitly. Can you remove them?
[GitHub] spark pull request: [SPARK-2163] class LBFGS optimize with Double ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1104#discussion_r13897825

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -195,4 +195,39 @@ class LBFGSSuite extends FunSuite with LocalSparkContext with Matchers {
     assert(lossLBFGS3.length == 6)
     assert((lossLBFGS3(4) - lossLBFGS3(5)) / lossLBFGS3(4) < convergenceTol)
   }
+
--- End diff --

The bug wasn't caught because we only test the static runLBFGS method instead of the class. We could probably change all the existing tests to use the class-based API, so we don't need to add another test. @mengxr what do you think?
[GitHub] spark pull request: [SPARK-2163] class LBFGS optimize with Double ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1104#issuecomment-46393840

I think it's for legacy reasons that there are two different ways to access the API. As far as I know, @mengxr is working on consolidating the interface. He can probably say more on this topic.
[GitHub] spark pull request: [SPARK-2163] class LBFGS optimize with Double ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1104#discussion_r13905548

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -195,4 +195,39 @@ class LBFGSSuite extends FunSuite with LocalSparkContext with Matchers {
     assert(lossLBFGS3.length == 6)
     assert((lossLBFGS3(4) - lossLBFGS3(5)) / lossLBFGS3(4) < convergenceTol)
   }
+
--- End diff --

We may add the same test to SGD as well. My bad, our internal one is right. Something probably went wrong when I copied and pasted.
[GitHub] spark pull request: [SPARK-2163] class LBFGS optimize with Double ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1104#issuecomment-46412293

I think changing the signature will be a problem for MiMa (the binary compatibility checker).
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/1207

SPARK-2272 [MLlib] Feature scaling which standardizes the range of independent variables or features of data

Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.

In this work, a trait called `VectorTransformer` is defined for generic transformation of a vector. It contains two methods: `apply`, which applies a transformation to a vector, and `unapply`, which applies the inverse transformation to a vector. There are three concrete implementations of `VectorTransformer`, and they all can be easily extended with PMML transformation support.

1) `VectorStandardizer` - Standardizes a vector given the mean and variance. Since the standardization will densify the output, the output is always in dense vector format.

2) `VectorRescaler` - Rescales a vector into a target range specified by a tuple of two double values, or by two vectors as the new target minimum and maximum. Since the rescaling subtracts the minimum of each column first, the output will always be in dense format regardless of the input vector type.

3) `VectorDivider` - Transforms a vector by dividing by a constant, or by dividing by a vector element by element. This transformation preserves the type of the input vector without densifying the result.

Utility helper methods take an RDD[Vector] as input and return the transformed RDD[Vector] together with the transformer, for dividing, rescaling, normalization, and standardization. See the interface sketch after this description.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dbtsai/spark dbtsai-feature-scaling
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1207.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1207

commit d3daa997c9a51a4af8f67cbcdb3738e5ba8c4b56
Author: DB Tsai dbt...@alpinenow.com
Date: 2014-06-25T02:30:16Z
Feature scaling which standardizes the range of independent variables or features of data.
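A rough sketch of the interface described above, using the names from the description (the method bodies and the toy implementation are assumptions; in particular, the real VectorDivider is described as type-preserving, while this toy densifies for brevity):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // apply() transforms a vector; unapply() applies the inverse transformation.
    trait VectorTransformer extends Serializable {
      def apply(v: Vector): Vector
      def unapply(v: Vector): Vector
    }

    // Toy divider: divides element-wise on the way in, multiplies on the way out.
    class ToyDivider(factors: Array[Double]) extends VectorTransformer {
      def apply(v: Vector): Vector =
        Vectors.dense(v.toArray.zip(factors).map { case (x, f) => x / f })
      def unapply(v: Vector): Vector =
        Vectors.dense(v.toArray.zip(factors).map { case (x, f) => x * f })
    }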
[GitHub] spark pull request: SPARK-2281 [MLlib] Simplify the duplicate code...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/1215

SPARK-2281 [MLlib] Simplify the duplicate code in Gradient.scala

The version of Gradient.compute that returns a new (gradient: Vector, loss: Double) tuple can be implemented in terms of the in-place version of Gradient.compute. Thus, we don't need to maintain duplicate code.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dbtsai/spark dbtsai-gradient-simplification
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1215.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1215

commit b2595d334c0d6246fe904b8c00ca3d51dc88f71a
Author: DB Tsai dbt...@alpinenow.com
Date: 2014-06-25T22:08:30Z
Simplify the gradient
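The deduplication is easy to picture: the tuple-returning overload allocates a zero buffer and delegates to the in-place overload. A sketch under assumed MLlib-style signatures (not the PR's exact diff):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    abstract class GradientSketch extends Serializable {
      // In-place version: accumulates into cumGradient and returns the loss.
      def compute(data: Vector, label: Double, weights: Vector,
                  cumGradient: Vector): Double

      // Allocating version, expressed via the in-place one -- no duplicated math.
      def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = {
        val gradient = Vectors.zeros(weights.size)
        val loss = compute(data, label, weights, gradient)
        (gradient, loss)
      }
    }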
[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1099#issuecomment-47250277

Seems that Jenkins is missing the Python runtime.
[GitHub] spark pull request: [WIP][SPARK-2174][MLLIB] treeReduce and treeAg...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1110#issuecomment-47683286

We benchmarked treeReduce in our random forest implementation, and since the trees generated from each partition are fairly large (more than 100 MB), we found that treeReduce can significantly reduce the shuffle time, from 6 minutes to 2 minutes. Nice work!
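For context, treeReduce (the API under review here) merges partial results in log-depth rounds so the driver never collects every partition's output at once. A usage sketch; the import path matches where the method was being added at the time, and the depth value is illustrative:

    import org.apache.spark.mllib.rdd.RDDFunctions._
    import org.apache.spark.rdd.RDD

    // Combine large per-partition models in two levels of aggregation, so the
    // driver only receives a handful of pre-merged results.
    def mergeModels(models: RDD[Array[Double]]): Array[Double] =
      models.treeReduce(
        (a, b) => a.zip(b).map { case (x, y) => x + y },  // associative merge
        2)                                                // aggregation depth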
[GitHub] spark pull request: Upgrade junit_xml_listener to 0.5.1 which fixe...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/1333

Upgrade junit_xml_listener to 0.5.1, which fixes the following issues:
1) fix the class name to be the fully qualified class name,
2) make sure the reporting time is in seconds, not milliseconds, which was causing the JUnit HTML report to show incorrect numbers, and
3) make sure the durations of the tests are cumulative.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dbtsai/spark dbtsai-junit
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1333.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1333

commit bbeac4b1bb8635eec2b046f1c4cfd15b64d0
Author: DB Tsai dbt...@alpinenow.com
Date: 2014-07-08T18:44:47Z
Upgrade junit_xml_listener to 0.5.1 which fixes the following issues 1) fix the class name to be fully qualified classpath 2) make sure the the reporting time is in second not in miliseond, which causing JUnit HTML to report incorrect number 3) make sure the duration of the tests are accumulative.
[GitHub] spark pull request: Upgrade junit_xml_listener to 0.5.1 which fixe...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1333#issuecomment-48417558

done.
[GitHub] spark pull request: SPARK-2281 [MLlib] Simplify the duplicate code...
Github user dbtsai closed the pull request at: https://github.com/apache/spark/pull/1215
[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/955#discussion_r14796461

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/OnlineSummarizer.scala ---
@@ -0,0 +1,229 @@
+/* (standard Apache License 2.0 header) */
+
+package org.apache.spark.mllib.stat
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+import org.apache.spark.annotation.DeveloperApi
+
+/**
+ * :: DeveloperApi ::
+ * OnlineSummarizer implements [[MultivariateStatisticalSummary]] to compute the mean, variance,
+ * minimum, maximum, counts, and non-zero counts for samples in sparse or dense vector format in
+ * a streaming fashion.
+ *
+ * Two OnlineSummarizers can be merged together to have a statistical summary of the joined dataset.
+ *
+ * A numerically stable algorithm is implemented to compute sample mean and variance:
+ * Reference: [[http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance variance-wiki]]
+ * Zero elements (including explicit zero values) are skipped when calling add(),
+ * to have time complexity O(nnz) instead of O(n) for each column.
+ */
+@DeveloperApi
+class OnlineSummarizer extends MultivariateStatisticalSummary with Serializable {
--- End diff --

I actually want to change MultivariateStatisticalSummary to StatisticalSummary since it's too verbose. But for consistency, I will change it to MultivariateOnlineSummarizer.
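The numerically stable recurrence the scaladoc cites (Welford's algorithm) is compact enough to show; this is a generic single-column sketch, not the summarizer's actual implementation:

    // Online mean/variance via Welford's algorithm: one pass, numerically stable.
    final class RunningStats {
      private var n = 0L
      private var mean = 0.0
      private var m2 = 0.0   // sum of squared deviations from the running mean

      def add(x: Double): Unit = {
        n += 1
        val delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   // note: uses the *updated* mean
      }

      def currentMean: Double = mean
      def sampleVariance: Double = if (n > 1) m2 / (n - 1) else 0.0
    }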
[GitHub] spark pull request: [SPARK-1177] Allow SPARK_JAR to be set program...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/987#issuecomment-48762832

#560 is merged. Closing this PR.
[GitHub] spark pull request: [SPARK-1177] Allow SPARK_JAR to be set program...
Github user dbtsai closed the pull request at: https://github.com/apache/spark/pull/987
[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/1379

[SPARK-2309][MLlib] Generalize the binary logistic regression into multinomial logistic regression

Currently, there is no multi-class classifier in mllib. Logistic regression can be extended to a multinomial classifier straightforwardly. The following formula will be implemented: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25

Note: in multi-class mode there are multiple intercepts, so we don't use the single intercept in `GeneralizedLinearModel`, and instead fold all the intercepts into the weights. This introduces some inconsistency. For example, in binary mode the intercept cannot be specified by users, but since in multinomial mode the intercepts are combined into the weights, users can specify them. @mengxr Should we just deprecate the intercept, and have everything in weights? It makes sense from an optimization point of view, and also makes the interface cleaner. Thanks.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dbtsai/spark dbtsai-mlor
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1379.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1379

commit 82dae74135bafa5d1adeef4b2b421693c05b2778
Author: DB Tsai dbt...@alpinenow.com
Date: 2014-06-27T21:47:15Z
Multinomial Logistic Regression
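For reference, the standard pivot formulation of multinomial logistic regression (presumably what the linked slide shows; this is the textbook version, not a quote from the PR): with K classes and class 0 as the pivot,

    P(y = 0 | x) = 1 / (1 + sum_{k=1..K-1} exp(x . w_k))
    P(y = k | x) = exp(x . w_k) / (1 + sum_{j=1..K-1} exp(x . w_j)),   k = 1, ..., K-1

so K - 1 weight vectors are learned, each carrying its own intercept term, which is why the intercepts naturally end up folded into the weights.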
[GitHub] spark pull request: [SPARK-2477][MLlib] Using appendBias for addin...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/1410

[SPARK-2477][MLlib] Using appendBias for adding intercept in GeneralizedLinearAlgorithm

Instead of using prependOne as GeneralizedLinearAlgorithm currently does, we would like to use appendBias in order to 1) keep the indices of the original training set unchanged, by adding the intercept as the last element of the vector, and 2) use the same public API for consistently adding the intercept.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/AlpineNow/spark SPARK-2477_intercept_with_appendBias
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1410.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1410

commit 011432cd2f815aacd9b12e770e5c6ec16ea716aa
Author: DB Tsai dbt...@alpinenow.com
Date: 2014-07-14T22:04:01Z
From Alpine Data Labs
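The behavioral difference is easiest to see on a raw array (MLUtils.appendBias is the real, Vector-typed helper; this dense-only one-liner is just for illustration):

    // Appending the bias keeps existing feature indices 0..n-1 stable;
    // prepending a 1.0 would shift every index by one.
    def appendBiasSketch(features: Array[Double]): Array[Double] = features :+ 1.0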
[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/1425

[SPARK-2479][MLlib] Comparing floating-point numbers using relative error in UnitTests

Floating-point math is not exact, and most floating-point numbers end up being slightly imprecise due to rounding errors. Simple values like 0.1 cannot be precisely represented using binary floating-point numbers, and the limited precision of floating-point numbers means that slight changes in the order of operations or the precision of intermediates can change the result. That means that comparing two floats to see if they are equal is usually not what we want. As long as this imprecision stays small, it can usually be ignored.

See the following famous article for detail: http://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/

For example:
float a = 0.15 + 0.15
float b = 0.1 + 0.2
if (a == b) // can be false!
if (a >= b) // can also be false!

(ps, not all the tests involving floating-point comparisons are changed to use almostEquals)

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/AlpineNow/spark SPARK-2479_comparing_floating_point
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1425.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1425

commit f4da8f4f8693763b4823e36e3d270b74a7ce67bf
Author: DB Tsai dbt...@alpinenow.com
Date: 2014-07-14T23:24:11Z
Alpine Data Labs
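A minimal sketch of the relative-error predicate this PR argues for (the epsilon is illustrative, not the PR's chosen default):

    // Relative error scales the tolerance with the magnitudes, so the same eps
    // is meaningful for very small and very large values alike.
    def almostEqual(x: Double, y: Double, eps: Double = 1e-6): Boolean =
      x == y || math.abs(x - y) / math.max(math.abs(x), math.abs(y)) < eps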
[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1425#discussion_r15013544

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala ---
@@ -81,9 +82,8 @@ class LogisticRegressionSuite extends FunSuite with LocalSparkContext with Matchers {
     val model = lr.run(testRDD)

     // Test the weights
-    val weight0 = model.weights(0)
-    assert(weight0 >= -1.60 && weight0 <= -1.40, weight0 + " not in [-1.6, -1.4]")
-    assert(model.intercept >= 1.9 && model.intercept <= 2.1, model.intercept + " not in [1.9, 2.1]")
+    assert(model.weights(0).almostEquals(-1.5244128696247), "weight0 should be -1.5244128696247")
--- End diff --

We can use a higher relative error here instead. If the implementation is changed, it's also nice to have a test that can catch the slightly different behavior. Also, updating those numbers will not take much time compared with the implementation work.
[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/1425#discussion_r15013786

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetricsSuite.scala ---
@@ -20,8 +20,20 @@ package org.apache.spark.mllib.evaluation
 import org.scalatest.FunSuite
 import org.apache.spark.mllib.util.LocalSparkContext
+import org.apache.spark.mllib.util.TestingUtils._

 class BinaryClassificationMetricsSuite extends FunSuite with LocalSparkContext {
+
+  implicit class SeqDoubleWithAlmostEquals(val x: Seq[Double]) {
+    def almostEquals(y: Seq[Double], eps: Double = 1E-6): Boolean =
--- End diff --

Yeah, for one ULP it might be around 1e-15. A lot of the time I manually type the numbers, or copy only the first several digits to save line space, which is why I chose 1.0e-6: then I only have to type about 7 digits. I agree with you that in this case we may want to explicitly specify a larger epsilon.
[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1425#issuecomment-49221370

@mengxr Scalatest 2.x has the tolerance feature, but it uses absolute error, not relative error. For large numbers, the absolute error may not be meaningful. With `===`, the comparison returns false even if the difference is only one unit in the last place (ULP), which often happens when running the unit tests on a different machine architecture. For example, ARM and x86 may round differently, and we don't run any tests on anything other than x86. C++ Boost implements its numerical equality test with relative error for this reason.

I can probably add methods called `~=` and `~==` for the `Double` and `Vector` types using implicit classes, where `~==` will raise an exception with a message, the way `===` does.
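A sketch of the implicit-class idea floated here, with the proposed `~=` / `~==` spellings (the details are assumptions, not the final TestingUtils code):

    implicit class DoubleWithAlmostEquals(val x: Double) {
      private def relErr(y: Double): Double =
        if (x == y) 0.0
        else math.abs(x - y) / math.max(math.abs(x), math.abs(y))

      // Silent check, for use inside boolean expressions.
      def ~=(y: Double): Boolean = relErr(y) < 1e-6

      // Failing check: throws with a message, the way === does in scalatest.
      def ~==(y: Double): Boolean =
        if (relErr(y) < 1e-6) true
        else throw new AssertionError(
          x + " did not equal " + y + " within relative error 1e-6")
    }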
[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1425#issuecomment-49222983

I learned `almostEquals` from the Boost library. Anyway, in this case, how do we distinguish the variant that throws an exception with a message from the one that just returns true/false? `almostEquals` and `almostEqualsWithMessage`?
[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1425#issuecomment-49253108

@mengxr and @srowen What do you think of `assert((0.0001 !~== 0.0) +- 1E-5)`? We have `~=` and `~==`, where the latter carries the error message, in the latest commit from my co-worker.
[GitHub] spark pull request: SPARK-1157 L-BFGS Optimizer based on Breeze L-...
Github user dbtsai closed the pull request at: https://github.com/apache/spark/pull/53
[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/353

SPARK-1157: L-BFGS Optimizer based on Breeze's implementation.

This PR uses Breeze's L-BFGS implementation, and the Breeze dependency has already been introduced by Xiangrui's sparse input format work in SPARK-1212. Nice work, @mengxr!

When used with a regularized updater, we need to compute the regVal and regGradient (the gradient of the regularized part of the cost function), and in the current updater design, we can compute those two values as follows. Let's review how the updater works when returning newWeights given the input parameters:

w' = w - thisIterStepSize * (gradient + regGradient(w))

Note that regGradient is a function of w! If we set gradient = 0 and thisIterStepSize = 1, then regGradient(w) = w - w'. As a result, regVal can be computed by

val regVal = updater.compute(weights, new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2

and regGradient can be obtained by

val regGradient = weights.sub(updater.compute(weights, new DoubleMatrix(initialWeights.length, 1), 1, 1, regParam)._1)

The PR includes tests which compare the result with SGD with/without regularization. We did a comparison between LBFGS and SGD, and we often saw 10x fewer steps in LBFGS, while the cost of each step is the same (just computing the gradient). The following is a paper by Prof. Ng at Stanford comparing different optimizers, including LBFGS and SGD. They use them in the context of deep learning, but it is worth reading as a reference: http://cs.stanford.edu/~jngiam/papers/LeNgiamCoatesLahiriProchnowNg2011.pdf

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dbtsai/spark dbtsai-LBFGS
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/353.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #353

commit 60c83350bb77aa640edd290a26e2a20281b7a3a8
Author: DB Tsai dbt...@dbtsai.com
Date: 2014-04-05T00:06:50Z
L-BFGS Optimizer based on Breeze's implementation.
[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/353#discussion_r11404094

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,251 @@
+/* (standard Apache License 2.0 header) */
+
+package org.apache.spark.mllib.optimization
+
+import scala.Array
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every iteration.
+ */
+class LBFGS(var gradient: Gradient, var updater: Updater)
+  extends Optimizer with Logging
+{
+  private var numCorrections: Int = 10
+  private var lineSearchTolerance: Double = 0.9
+  private var convTolerance: Double = 1E-4
+  private var maxNumIterations: Int = 100
+  private var regParam: Double = 0.0
+  private var miniBatchFraction: Double = 1.0
+
+  /**
+   * Set the number of corrections used in the LBFGS update. Default 10.
+   * Values of m less than 3 are not recommended; large values of m
+   * will result in excessive computing time. 3 < m < 10 is recommended.
+   * Restriction: m > 0
+   */
+  def setNumCorrections(corrections: Int): this.type = {
+    assert(corrections > 0)
+    this.numCorrections = corrections
+    this
+  }
+
+  /**
+   * Set the tolerance to control the accuracy of the line search in the mcsrch step. Default 0.9.
+   * If the function and gradient evaluations are inexpensive with respect to the cost of
+   * the iteration (which is sometimes the case when solving very large problems) it may
+   * be advantageous to set it to a small value. A typical small value is 0.1.
+   * Restriction: should be greater than 1e-4.
+   */
+  def setLineSearchTolerance(tolerance: Double): this.type = {
+    this.lineSearchTolerance = tolerance
+    this
+  }
+
+  /**
+   * Set the fraction of data to be used for each L-BFGS iteration. Default 1.0.
+   */
+  def setMiniBatchFraction(fraction: Double): this.type = {
+    this.miniBatchFraction = fraction
+    this
+  }
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more iterations.
+   */
+  def setConvTolerance(tolerance: Int): this.type = {
+    this.convTolerance = tolerance
+    this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
+    this.maxNumIterations = iters
+    this
+  }
+
+  /**
+   * Set the regularization parameter. Default 0.0.
+   */
+  def setRegParam(regParam: Double): this.type = {
+    this.regParam = regParam
+    this
+  }
+
+  /**
+   * Set the gradient function (of the loss function of one single data example)
+   * to be used for L-BFGS.
+   */
+  def setGradient(gradient: Gradient): this.type = {
+    this.gradient = gradient
+    this
+  }
+
+  /**
+   * Set the updater function to actually perform a gradient step in a given direction.
+   * The updater is responsible to perform the update from the regularization term as well,
+   * and therefore determines what kind or regularization is used, if any.
+   */
+  def setUpdater(updater: Updater): this.type = {
+    this.updater = updater
+    this
+  }
+
+  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
+    val (weights, _) = LBFGS.runMiniBatchLBFGS(
+      data
[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/353#issuecomment-39895140

@mengxr As you suggested, I moved the costFun into a private CostFun class.
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/353#discussion_r11460767

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,263 @@
+/* (standard Apache License 2.0 header) */
+
+package org.apache.spark.mllib.optimization
+
+import scala.Array
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every iteration.
+ */
+class LBFGS(var gradient: Gradient, var updater: Updater)
+  extends Optimizer with Logging
+{
+  private var numCorrections: Int = 10
--- End diff --

@mengxr I know. I pretty much followed the existing coding style in GradientDescent.scala. Should I also change the ones in other places?
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/353#discussion_r11461398

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,263 @@
+/* (standard Apache License 2.0 header) */
+
+package org.apache.spark.mllib.optimization
+
+import scala.Array
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every iteration.
+ */
+class LBFGS(var gradient: Gradient, var updater: Updater)
+  extends Optimizer with Logging
+{
+  private var numCorrections: Int = 10
+  private var lineSearchTolerance: Double = 0.9
+  private var convTolerance: Double = 1E-4
+  private var maxNumIterations: Int = 100
+  private var regParam: Double = 0.0
+  private var miniBatchFraction: Double = 1.0
+
+  /**
+   * Set the number of corrections used in the LBFGS update. Default 10.
+   * Values of m less than 3 are not recommended; large values of m
+   * will result in excessive computing time. 3 < m < 10 is recommended.
+   * Restriction: m > 0
+   */
+  def setNumCorrections(corrections: Int): this.type = {
+    assert(corrections > 0)
+    this.numCorrections = corrections
+    this
+  }
+
+  /**
+   * Set the tolerance to control the accuracy of the line search in the mcsrch step. Default 0.9.
+   * If the function and gradient evaluations are inexpensive with respect to the cost of
+   * the iteration (which is sometimes the case when solving very large problems) it may
+   * be advantageous to set it to a small value. A typical small value is 0.1.
+   * Restriction: should be greater than 1e-4.
+   */
+  def setLineSearchTolerance(tolerance: Double): this.type = {
--- End diff --

Good catch! It's used in the RISO implementation. I'll just remove them. Thanks.
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11463764

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+import org.scalatest.matchers.ShouldMatchers
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
+  @transient private var sc: SparkContext = _
+  var dataRDD: RDD[(Double, Vector)] = _
+
+  val nPoints = 10000
+  val A = 2.0
+  val B = -1.5
+
+  val initialB = -1.0
+  val initialWeights = Array(initialB)
+
+  val gradient = new LogisticGradient()
+  val numCorrections = 10
+  val lineSearchTolerance = 0.9
+  var convTolerance = 1e-12
+  var maxNumIterations = 10
+  val miniBatchFrac = 1.0
+
+  val simpleUpdater = new SimpleUpdater()
+  val squaredL2Updater = new SquaredL2Updater()
+
+  // Add an extra variable consisting of all 1.0's for the intercept.
+  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
+  val data = testData.map { case LabeledPoint(label, features) =>
+    label -> Vectors.dense(1.0, features.toArray: _*)
+  }
+
+  override def beforeAll() {
+    sc = new SparkContext("local", "test")
+    dataRDD = sc.parallelize(data, 2).cache()
+  }
+
+  override def afterAll() {
+    sc.stop()
+    System.clearProperty("spark.driver.port")
+  }
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("Assert LBFGS loss is decreasing and matches the result of Gradient Descent.") {
+    val updater = new SimpleUpdater()
+    val regParam = 0
+
+    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
+
+    val (_, loss) = LBFGS.runMiniBatchLBFGS(
+      dataRDD,
+      gradient,
+      updater,
+      numCorrections,
+      lineSearchTolerance,
+      convTolerance,
+      maxNumIterations,
+      regParam,
+      miniBatchFrac,
+      initialWeightsWithIntercept)
+
+    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
+
+    val lossDiff = loss.init.zip(loss.tail).map {
+      case (lhs, rhs) => lhs - rhs
+    }
+    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
--- End diff --

This 0.8 bound is copied from GradientDescentSuite, and L-BFGS should have at least the same performance.
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11464280

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+import org.scalatest.matchers.ShouldMatchers
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+class LBFGSSuite extends FunSuite with BeforeAndAfterAll with ShouldMatchers {
+  @transient private var sc: SparkContext = _
+  var dataRDD: RDD[(Double, Vector)] = _
+
+  val nPoints = 10000
+  val A = 2.0
+  val B = -1.5
+
+  val initialB = -1.0
+  val initialWeights = Array(initialB)
+
+  val gradient = new LogisticGradient()
+  val numCorrections = 10
+  val lineSearchTolerance = 0.9
+  var convTolerance = 1e-12
+  var maxNumIterations = 10
+  val miniBatchFrac = 1.0
+
+  val simpleUpdater = new SimpleUpdater()
+  val squaredL2Updater = new SquaredL2Updater()
+
+  // Add an extra variable consisting of all 1.0's for the intercept.
+  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
+  val data = testData.map { case LabeledPoint(label, features) =>
+    label -> Vectors.dense(1.0, features.toArray: _*)
+  }
+
+  override def beforeAll() {
+    sc = new SparkContext("local", "test")
+    dataRDD = sc.parallelize(data, 2).cache()
+  }
+
+  override def afterAll() {
+    sc.stop()
+    System.clearProperty("spark.driver.port")
+  }
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("Assert LBFGS loss is decreasing and matches the result of Gradient Descent.") {
+    val updater = new SimpleUpdater()
+    val regParam = 0
+
+    val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: _*)
+
+    val (_, loss) = LBFGS.runMiniBatchLBFGS(
+      dataRDD,
+      gradient,
+      updater,
+      numCorrections,
+      lineSearchTolerance,
+      convTolerance,
+      maxNumIterations,
+      regParam,
+      miniBatchFrac,
+      initialWeightsWithIntercept)
+
+    assert(loss.last - loss.head < 0, "loss isn't decreasing.")
+
+    val lossDiff = loss.init.zip(loss.tail).map {
+      case (lhs, rhs) => lhs - rhs
+    }
+    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
+
+    val stepSize = 1.0
+    // Well, GD converges slower, so it requires more iterations!
+    val numGDIterations = 50
+    val (_, lossGD) = GradientDescent.runMiniBatchSGD(
+      dataRDD,
+      gradient,
+      updater,
+      stepSize,
+      numGDIterations,
+      regParam,
+      miniBatchFrac,
+      initialWeightsWithIntercept)
+
+    assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
+      "LBFGS should match GD result within 5% error.")
+  }
+
+  test("Assert that LBFGS and Gradient Descent with L2 regularization get the same result.") {
+    val regParam = 0.2
+
+    // Prepare another non-zero weights to compare the loss in the first iteration.
+    val initialWeightsWithIntercept = Vectors.dense(0.3, 0.12)
+
+    val (weightLBFGS, lossLBFGS) = LBFGS.runMiniBatchLBFGS(
+      dataRDD,
+      gradient,
+      squaredL2Updater,
+      numCorrections,
+      lineSearchTolerance,
+      convTolerance,
+      maxNumIterations,
+      regParam,
+      miniBatchFrac,
+      initialWeightsWithIntercept)
+
+    // With regularization, GD converges faster now!
+    // So we only need 20 iterations to get
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/353#discussion_r11464736 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/353#discussion_r11605070

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV, axpy}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every iteration.
+ */
+class LBFGS(private var gradient: Gradient, private var updater: Updater)
+  extends Optimizer with Logging {
+
+  private var numCorrections = 10
+  private var convergenceTol = 1E-4
+  private var maxNumIterations = 100
+  private var regParam = 0.0
+  private var miniBatchFraction = 1.0
+
+  /**
+   * Set the number of corrections used in the LBFGS update. Default 10.
+   * Values of numCorrections less than 3 are not recommended; large values
+   * of numCorrections will result in excessive computing time.
+   * 3 < numCorrections < 10 is recommended.
+   * Restriction: numCorrections > 0
+   */
+  def setNumCorrections(corrections: Int): this.type = {
+    assert(corrections > 0)
+    this.numCorrections = corrections
+    this
+  }
+
+  /**
+   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
+   */
+  def setMiniBatchFraction(fraction: Double): this.type = {
+    this.miniBatchFraction = fraction
+    this
+  }
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more iterations.
+   */
+  def setConvergenceTol(tolerance: Double): this.type = {
+    this.convergenceTol = tolerance
+    this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
+    this.maxNumIterations = iters
+    this
+  }
+
+  /**
+   * Set the regularization parameter. Default 0.0.
+   */
+  def setRegParam(regParam: Double): this.type = {
+    this.regParam = regParam
+    this
+  }
+
+  /**
+   * Set the gradient function (of the loss function of one single data example)
+   * to be used for L-BFGS.
+   */
+  def setGradient(gradient: Gradient): this.type = {
+    this.gradient = gradient
+    this
+  }
+
+  /**
+   * Set the updater function to actually perform a gradient step in a given direction.
+   * The updater is responsible to perform the update from the regularization term as well,
+   * and therefore determines what kind of regularization is used, if any.
+   */
+  def setUpdater(updater: Updater): this.type = {
+    this.updater = updater
+    this
+  }
+
+  override def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
+    val (weights, _) = LBFGS.runMiniBatchLBFGS(
+      data,
+      gradient,
+      updater,
+      numCorrections,
+      convergenceTol,
+      maxNumIterations,
+      regParam,
+      miniBatchFraction,
+      initialWeights)
+    weights
+  }
+
+}
+
+/**
+ * Top-level method to run LBFGS.
+ */
+object LBFGS extends Logging {
+  /**
+   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
+   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
+   * in order
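For readers following the diff, here is a minimal sketch of how the class above is meant to be driven once merged. It is an illustration only: `data` (an RDD[(Double, Vector)] of label/feature pairs whose features already include the intercept column), `initialWeights`, and the parameter values are assumptions, not part of the diff.

    import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

    // Configure the optimizer via the setters defined in the diff above.
    val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
      .setNumCorrections(10)      // history size of the L-BFGS update
      .setConvergenceTol(1e-4)    // smaller => higher accuracy, more iterations
      .setMaxNumIterations(50)
      .setRegParam(0.1)
      .setMiniBatchFraction(1.0)  // use the full data set in each iteration

    // `data: RDD[(Double, Vector)]` and `initialWeights: Vector` assumed in scope.
    val weights = lbfgs.optimize(data, initialWeights)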
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user dbtsai closed the pull request at: https://github.com/apache/spark/pull/353
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/353#issuecomment-40434555 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
GitHub user dbtsai reopened a pull request: https://github.com/apache/spark/pull/353

[SPARK-1157][MLlib] L-BFGS Optimizer based on Breeze's implementation.

This PR uses Breeze's L-BFGS implementation, and the Breeze dependency has already been introduced by Xiangrui's sparse input format work in SPARK-1212. Nice work, @mengxr!

When used with a regularized updater, we need to compute the regVal and regGradient (the gradient of the regularized part of the cost function), and in the current updater design, we can compute those two values as follows. Recall how the updater returns newWeights given the input parameters:

    w' = w - thisIterStepSize * (gradient + regGradient(w))

Note that regGradient is a function of w! If we set gradient = 0 and thisIterStepSize = 1, then

    regGradient(w) = w - w'

As a result, regVal can be computed by

    val regVal = updater.compute(
      weights, new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2

and regGradient can be obtained by

    val regGradient = weights.sub(
      updater.compute(weights, new DoubleMatrix(initialWeights.length, 1), 1, 1, regParam)._1)

The PR includes tests which compare the result with SGD with/without regularization. We did a comparison between LBFGS and SGD, and we often saw 10x fewer steps in LBFGS, while the cost of each step is the same (just computing the gradient). The following paper by Prof. Ng at Stanford compares different optimizers, including LBFGS and SGD. They use them in the context of deep learning, but it is worth reading as a reference: http://cs.stanford.edu/~jngiam/papers/LeNgiamCoatesLahiriProchnowNg2011.pdf

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dbtsai/spark dbtsai-LBFGS

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/353.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #353

commit 984b18e21396eae84656e15da3539ff3b5f3bf4a
Author: DB Tsai dbt...@alpinenow.com
Date: 2014-04-05T00:06:50Z

    L-BFGS Optimizer based on Breeze's implementation. Also fixed indentation issue in GradientDescent optimizer.
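To make the updater trick above concrete, here is a small sketch that packages the two calls, written against the jblas-based Updater API quoted in the description; `updater`, `weights` (a DoubleMatrix), and `regParam` are assumed to be in scope.

    import org.jblas.DoubleMatrix

    // Updater contract: w' = w - thisIterStepSize * (gradient + regGradient(w)).
    val zeroGradient = new DoubleMatrix(weights.length, 1)

    // With gradient = 0 and stepSize = 0, w' = w, and the second element of the
    // returned tuple is the regularization value regVal(w).
    val regVal = updater.compute(weights, zeroGradient, 0, 1, regParam)._2

    // With gradient = 0 and stepSize = 1, w' = w - regGradient(w),
    // so regGradient(w) = w - w'.
    val regGradient = weights.sub(
      updater.compute(weights, zeroGradient, 1, 1, regParam)._1)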
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/353#issuecomment-40434626 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/353#issuecomment-40434691 Timeout for the latest Jenkins run. It seems that CI is not stable right now.
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user dbtsai closed the pull request at: https://github.com/apache/spark/pull/353
[GitHub] spark pull request: MLlib doc update for breeze dependency
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/481

MLlib doc update for breeze dependency

MLlib now uses the Breeze linear algebra library instead of jblas; this PR updates the doc to help users install the native BLAS libraries for better performance in netlib-java, which Breeze depends on.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dbtsai/spark dbtsai-LBFGSdocs

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/481.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #481

commit eddb3ddfd036035b4b8c639450e4d48db6afd4d4
Author: DB Tsai dbt...@dbtsai.com
Date: 2014-04-22T07:35:44Z

    Fixed MLlib doc
[GitHub] spark pull request: MLlib doc update for breeze dependency
Github user dbtsai closed the pull request at: https://github.com/apache/spark/pull/481
[GitHub] spark pull request: [SPARK-1506][MLLIB] Documentation improvements...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/422#discussion_r11841916

--- Diff: docs/mllib-guide.md ---
@@ -3,63 +3,120 @@
 layout: global
 title: Machine Learning Library (MLlib)
 ---

+MLlib is a Spark implementation of some common machine learning algorithms and utilities,
+including classification, regression, clustering, collaborative
+filtering, dimensionality reduction, as well as underlying optimization primitives:

-MLlib is a Spark implementation of some common machine learning (ML)
-functionality, as well as associated tests and data generators. MLlib
-currently supports four common types of machine learning problem settings,
-namely classification, regression, clustering and collaborative filtering,
-as well as an underlying gradient descent optimization primitive and several
-linear algebra methods.
-
-# Available Methods
-The following links provide a detailed explanation of the methods and usage examples for each of them:
-
-* <a href="mllib-classification-regression.html">Classification and Regression</a>
-  * Binary Classification
-    * SVM (L1 and L2 regularized)
-    * Logistic Regression (L1 and L2 regularized)
-  * Linear Regression
-    * Least Squares
-    * Lasso
-    * Ridge Regression
-  * Decision Tree (for classification and regression)
-* <a href="mllib-clustering.html">Clustering</a>
-  * k-Means
-* <a href="mllib-collaborative-filtering.html">Collaborative Filtering</a>
-  * Matrix Factorization using Alternating Least Squares
-* <a href="mllib-optimization.html">Optimization</a>
-  * Gradient Descent and Stochastic Gradient Descent
-* <a href="mllib-linear-algebra.html">Linear Algebra</a>
-  * Singular Value Decomposition
-  * Principal Component Analysis
-
-# Data Types
-
-Most MLlib algorithms operate on RDDs containing vectors. In Java and Scala, the
-[Vector](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) class is used to
-represent vectors. You can create either dense or sparse vectors using the
-[Vectors](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) factory.
-
-In Python, MLlib can take the following vector types:
-
-* [NumPy](http://www.numpy.org) arrays
-* Standard Python lists (e.g. `[1, 2, 3]`)
-* The MLlib [SparseVector](api/pyspark/pyspark.mllib.linalg.SparseVector-class.html) class
-* [SciPy sparse matrices](http://docs.scipy.org/doc/scipy/reference/sparse.html)
-
-For efficiency, we recommend using NumPy arrays over lists, and using the
-[CSC format](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
-for SciPy matrices, or MLlib's own SparseVector class.
-
-Several other simple data types are used throughout the library, e.g. the LabeledPoint
-class ([Java/Scala](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint),
-[Python](api/pyspark/pyspark.mllib.regression.LabeledPoint-class.html)) for labeled data.
-
-# Dependencies
-MLlib uses the [jblas](https://github.com/mikiobraun/jblas) linear algebra library, which itself
-depends on native Fortran routines. You may need to install the
-[gfortran runtime library](https://github.com/mikiobraun/jblas/wiki/Missing-Libraries)
-if it is not already present on your nodes. MLlib will throw a linking error if it cannot
-detect these libraries automatically.
+* [Basics](mllib-basics.html)
+  * data types
+  * summary statistics
+* Classification and regression
+  * [linear support vector machine (SVM)](mllib-linear-methods.html#linear-support-vector-machine-svm)
+  * [logistic regression](mllib-linear-methods.html#logistic-regression)
+  * [linear least squares, Lasso, and ridge regression](mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression)
+  * [decision tree](mllib-decision-tree.html)
+  * [naive Bayes](mllib-naive-bayes.html)
+* [Collaborative filtering](mllib-collaborative-filtering.html)
+  * alternating least squares (ALS)
+* [Clustering](mllib-clustering.html)
+  * k-means
+* [Dimensionality reduction](mllib-dimensionality-reduction.html)
+  * singular value decomposition (SVD)
+  * principal component analysis (PCA)
+* [Optimization](mllib-optimization.html)
+  * stochastic gradient descent
+  * limited-memory BFGS (L-BFGS)
+
+MLlib is currently a beta component under active development.
+The APIs may be changed in the future releases, and we will provide migration guide between releases.
+
+## Dependencies
+
+MLlib uses linear algebra packages [Breeze](http://www.scalanlp.org/), which depends on
+[netlib-java](https://github.com/fommil/netlib-java), and
+[jblas](https
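As a quick illustration of the dense/sparse vector construction that the data-types section refers to, a minimal sketch using the MLlib factory methods:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Dense vector: all entries stored explicitly.
    val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)

    // Sparse vector of size 3 with non-zeros at indices 0 and 2.
    val sv: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))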
[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/490#discussion_r11883381

--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -77,7 +78,8 @@ trait ClientBase extends Logging {
     ).foreach { case (cond, errStr) =>
       if (cond) {
         logError(errStr)
-        args.printUsageAndExit(1)
+        throw new IllegalArgumentException(args.getUsageMessage())
+
--- End diff --

Remove this empty line.
[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/490#issuecomment-41114289 Jenkins, add to whitelist.
[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1897#issuecomment-52149162 It seems that Jenkins is not stable; it is failing on issues related to Akka.
[GitHub] spark pull request: [SPARK-3078][MLLIB] Make LRWithLBFGS API consi...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1973#discussion_r16319946

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -69,8 +69,17 @@ class LBFGS(private var gradient: Gradient, private var updater: Updater)

   /**
    * Set the maximal number of iterations for L-BFGS. Default 100.
+   * @deprecated use [[setNumIterations()]] instead
    */
+  @deprecated("use setNumIterations instead", "1.1.0")
   def setMaxNumIterations(iters: Int): this.type = {
+    this.setNumCorrections(iters)
--- End diff --

Should it be `this.setNumIterations(iters)`?
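In other words, the deprecated method presumably should forward to the renamed setter rather than to setNumCorrections; a sketch of the fix being suggested:

    /**
     * Set the maximal number of iterations for L-BFGS. Default 100.
     * @deprecated use [[setNumIterations()]] instead
     */
    @deprecated("use setNumIterations instead", "1.1.0")
    def setMaxNumIterations(iters: Int): this.type = {
      // Forward to the renamed setter; the diff above accidentally
      // forwards to setNumCorrections.
      this.setNumIterations(iters)
    }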
[GitHub] spark pull request: [SPARK-3078][MLLIB] Make LRWithLBFGS API consi...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1973#issuecomment-52381503 LGTM. Merged into both master and branch-1.1. Thanks!!
[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/2068

[SPARK-2841][MLlib] Documentation for feature transformations

Documentation for newly added feature transformations:

1. TF-IDF
2. StandardScaler
3. Normalizer

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/AlpineNow/spark transformer-documentation

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2068.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2068

commit e339f64fbc35ad97a1ba021a6bf03bb6d0e06f31
Author: DB Tsai dbt...@alpinenow.com
Date: 2014-08-20T22:21:26Z

    documentation
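Of the three transformers being documented, the Normalizer is the simplest to illustrate; a minimal sketch against the MLlib feature API (the input values are illustrative only):

    import org.apache.spark.mllib.feature.Normalizer
    import org.apache.spark.mllib.linalg.Vectors

    // L^2 normalization (p = 2 by default): scales each vector to unit norm.
    val normalizer = new Normalizer()
    val unit = normalizer.transform(Vectors.dense(3.0, 4.0))
    // unit == [0.6, 0.8]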
[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2068#discussion_r16561045

--- Diff: docs/mllib-feature-extraction.md ---
@@ -70,4 +70,110 @@ for ((synonym, cosineSimilarity) <- synonyms) {
 </div>
 </div>

-## TFIDF
\ No newline at end of file
+## TFIDF
+
+## StandardScaler
+
+Standardizes features by scaling to unit variance and/or removing the mean using column summary
+statistics on the samples in the training set. For example, RBF kernel of Support Vector Machines
+or the L1 and L2 regularized linear models typically assume that all features have unit variance
+and/or zero mean.
--- End diff --

How about I say: "For example, the RBF kernel of Support Vector Machines or the L1 and L2 regularized linear models typically work better when all features have unit variance and/or zero mean."? I actually took this statement from the scikit-learn documentation: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
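For context, a short sketch of the StandardScaler usage the documentation describes, assuming a live SparkContext `sc`; the sample values are illustrative only:

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.linalg.Vectors

    // Assumes an existing SparkContext `sc`.
    val data = sc.parallelize(Seq(
      Vectors.dense(1.0, 10.0, 100.0),
      Vectors.dense(2.0, 20.0, 200.0),
      Vectors.dense(3.0, 30.0, 300.0)))

    // Fit column summary statistics, then scale to zero mean and unit variance.
    val scaler = new StandardScaler(withMean = true, withStd = true).fit(data)
    val scaled = data.map(scaler.transform)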