[GitHub] spark pull request: [SPARK-1892][MLLIB] Adding OWL-QN optimizer fo...

2014-10-01 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/840#issuecomment-57439459
  
@debasish83 and @codedeft The weighted method for OWLQN in breeze has been merged: 
https://github.com/scalanlp/breeze/commit/2570911026aa05aa1908ccf7370bc19cd8808a4c

I will submit a PR to Spark to use the newer version of breeze with this 
feature once @dlwh publishes it to Maven. But there is still some work on the 
mllib side to get it working properly. I'll work on this once I'm back from 
vacation.  
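
As a rough sketch of what the new per-component penalty allows, the snippet below 
builds an OWLQN instance whose L1 strength is a function of the coefficient index 
(here the last component, e.g. an intercept, is left unpenalized). The constructor 
form (maxIter, memory, per-index l1reg function, tolerance) follows the form Spark 
later uses with Breeze and may differ in the 0.10 release; the objective is a toy 
quadratic, so treat this as an illustration only.

    import breeze.linalg.DenseVector
    import breeze.optimize.{DiffFunction, OWLQN}

    object WeightedOWLQNSketch {
      def main(args: Array[String]): Unit = {
        val dim = 4

        // Index-dependent L1 strength: leave the last component (e.g. the intercept) unpenalized.
        val l1reg: Int => Double = i => if (i == dim - 1) 0.0 else 0.1

        // Toy smooth objective 0.5 * ||w - target||^2 and its gradient.
        val target = DenseVector(1.0, -2.0, 3.0, 0.5)
        val f = new DiffFunction[DenseVector[Double]] {
          override def calculate(w: DenseVector[Double]): (Double, DenseVector[Double]) = {
            val diff = w - target
            (0.5 * (diff dot diff), diff)
          }
        }

        // Assumed constructor form: (maxIter, memory, per-index l1reg, tolerance).
        val owlqn = new OWLQN[Int, DenseVector[Double]](100, 10, l1reg, 1e-6)
        println(owlqn.minimize(f, DenseVector.zeros[Double](dim)))
      }
    }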



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3119] Re-implementation of TorrentBroad...

2014-10-07 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/2030#issuecomment-58183559
  
We had a build against the Spark master on Oct 2, and when we ran our 
application with around 600GB of data, we got the following exception. Does this 
PR fix the issue, which was also seen by @JoshRosen?

Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 8312, ams03-002.ff): java.io.IOException: PARSING_ERROR(2)
org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
org.xerial.snappy.SnappyNative.uncompressedLength(Native Method)
org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:594)
org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:125)
org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58)
org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1004)
org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:89)
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
org.apache.spark.scheduler.Task.run(Task.scala:56)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)
Driver stacktrace:






[GitHub] spark pull request: [SPARK-3832][MLlib] Upgrade Breeze dependency ...

2014-10-07 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/2693

[SPARK-3832][MLlib] Upgrade Breeze dependency to 0.10

In Breeze 0.10, the L1regParam can be configured through an anonymous function 
in OWLQN, so that each component can be penalized differently. This is required for 
GLMNET in MLlib with L1/L2 regularization.

https://github.com/scalanlp/breeze/commit/2570911026aa05aa1908ccf7370bc19cd8808a4c


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark breeze0.10

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2693.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2693


commit 7a0c45cda7d388152774722a2f6728294cc81b4e
Author: DB Tsai dbt...@dbtsai.com
Date:   2014-10-07T14:20:41Z

In Breeze 0.10, the L1regParam can be configured through an anonymous function 
in OWLQN, so that each component can be penalized differently. This is required for 
GLMNET in MLlib with L1/L2 regularization.

https://github.com/scalanlp/breeze/commit/2570911026aa05aa1908ccf7370bc19cd8808a4c







[GitHub] spark pull request: [SPARK-3119] Re-implementation of TorrentBroad...

2014-10-07 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/2030#issuecomment-58214186
  
I thought it was a closed issue, so I moved my comment to JIRA. I ran into
this issue in spark-shell, not a standalone application; does SPARK-3762
apply in this situation? Thanks.

Sent from my Google Nexus 5
On Oct 7, 2014 5:17 PM, Davies Liu notificati...@github.com wrote:

 It could be fixed by https://github.com/apache/spark/pull/2624

 It's strange that I can not see this comment on PR #2030.

 On Tue, Oct 7, 2014 at 6:28 AM, DB Tsai notificati...@github.com wrote:

  We had a build against the spark master on Oct 2, and when ran our
  application with data around 600GB, we got the following exception. Does
  this PR fix this issue which is seen by @JoshRosen
  https://github.com/JoshRosen
 
  [stack trace identical to the one quoted in the comment above, elided]
 
  --
  Reply to this email directly or view it on GitHub
  https://github.com/apache/spark/pull/2030#issuecomment-58183559.
 



 --
 - Davies

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/2030#issuecomment-58201237.






[GitHub] spark pull request: [SPARK-3832][MLlib] Upgrade Breeze dependency ...

2014-10-07 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/2693#issuecomment-58276308
  
@dlwh David, do you know if there are any dependency changes in breeze-0.10, and 
is it compatible with both Scala 2.10 and 2.11? Thanks.





[GitHub] spark pull request: [SPARK-2505][MLlib] Weighted Regularizer for G...

2014-07-21 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1518

[SPARK-2505][MLlib] Weighted Regularizer for Generalized Linear Model 

(Note: This is not ready to be merged. It needs documentation, and we need to make 
sure it's backward compatible with the Spark 1.0 APIs.) 

The current implementation of regularization in the linear models uses 
`Updater`, and this design has a couple of issues:
1) It penalizes all the weights, including the intercept. In a typical machine 
learning training process, the intercept is not penalized. 
2) The `Updater` contains the adaptive step-size logic for gradient descent, 
and we would like to clean this up by separating the regularization logic out of 
the updater into a regularizer, so that the LBFGS optimizer doesn't need the trick 
for getting the loss and gradient of the objective function.
In this work, a weighted regularizer will be implemented, and users can 
exclude the intercept or any weight from regularization by giving that term 
a zero penalty weight. Since the regularizer will return a tuple of loss 
and gradient, the adaptive step-size logic and the soft thresholding for L1 in 
Updater will be moved to the SGD optimizer.
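
As a rough illustration of the idea (names and signatures here are hypothetical, 
not the ones in this PR), a weighted regularizer could return the (loss, gradient) 
pair and scale the penalty per component, so a zero weight excludes the intercept:

    import breeze.linalg.{DenseVector => BDV, Vector => BV}

    // Hypothetical sketch: per-component penalty weights; 0.0 excludes a term.
    class WeightedL2Regularizer(penaltyWeights: BV[Double]) {
      def compute(weights: BV[Double], regParam: Double): (Double, BV[Double]) = {
        var loss = 0.0
        val gradient = BDV.zeros[Double](weights.length)
        var i = 0
        while (i < weights.length) {
          val w = weights(i)
          loss += 0.5 * regParam * penaltyWeights(i) * w * w
          gradient(i) = regParam * penaltyWeights(i) * w
          i += 1
        }
        (loss, gradient)
      }
    }

    // Usage: zero out the penalty on the last component (the intercept).
    // val penalty = BDV(1.0, 1.0, 1.0, 0.0)
    // val (regLoss, regGrad) = new WeightedL2Regularizer(penalty).compute(weights, 0.1)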


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark SPARK-2505_regularizer

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1518.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1518


commit 2946930ec3de0e0a34e07d065c954d7aabacd4ba
Author: DB Tsai dbt...@alpinenow.com
Date:   2014-07-19T02:15:37Z

initial work






[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-07-21 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-49682150
  
I think it fails because the Apache license header is not in the test file. As you 
suggested, I'll move it so that it is generated at runtime. I would like to get 
general feedback first; I'll make the test pass tomorrow.




[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-21 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1425#issuecomment-49682436
  
`!~==` will be used in the tests since `!(a ~== b)` will not work: `(a ~== b)` does 
not return false but throws an exception so that it can carry a message. I will 
replace the almostEquals with `~==`. Thanks.




[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-23 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1425#issuecomment-49954543
  
@srowen @mengxr and @dorx

Based on our discussion, I've implemented two different APIs for relative 
error and absolute error. It makes sense that test writers should know which 
one they need depending on their circumstances. 

Developers also need to explicitly specify the eps now; there is no 
default value, which sometimes caused confusion. 

When comparing against zero using relative error, an exception will be 
raised to warn users that it's meaningless.

For relative error in percentage, users can now write 

assert(23.1 ~== 23.52 %+- 2.0)
assert(23.1 ~== 22.74 %+- 2.0)
assert(23.1 ~= 23.52 %+- 2.0)
assert(23.1 ~= 22.74 %+- 2.0)
assert(!(23.1 !~= 23.52 %+- 2.0))
assert(!(23.1 !~= 22.74 %+- 2.0))

// This will throw an exception with the following message:
// Did not expect 23.1 and 23.52 to be within 2.0% using relative error.
assert(23.1 !~== 23.52 %+- 2.0)

// Expected 23.1 and 22.34 to be within 2.0% using relative error.
assert(23.1 ~== 22.34 %+- 2.0)
  
For absolute error, 

assert(17.8 ~== 17.99 +- 0.2)
assert(17.8 ~== 17.61 +- 0.2)
assert(17.8 ~= 17.99 +- 0.2)
assert(17.8 ~= 17.61 +- 0.2)
assert(!(17.8 !~= 17.99 +- 0.2))
assert(!(17.8 !~= 17.61 +- 0.2))

// This will throw an exception with the following message:
// Did not expect 17.8 and 17.99 to be within 0.2 using absolute error.
assert(17.8 !~== 17.99 +- 0.2)
 
// Expected 17.8 and 17.59 to be within 0.2 using absolute error.
assert(17.8 ~== 17.59 +- 0.2)





[GitHub] spark pull request: [SPARK-2479 (partial)][MLLIB] fix binary metri...

2014-07-24 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1576#issuecomment-50057950
  
@mengxr Feel free to merge this one first. After you merge, I'll rebase 
#1425 against current master, and address the conflicts.




[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-24 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1425#issuecomment-50064963
  
@mengxr `%+-` is used as an operator to indicate that this is a relative error. 
Users can write `assert(a ~== b %+- 1E-10)` for relative error, and `assert(a 
~== b +- 1E-10)` for absolute error. 

As a result, the syntactic sugar is the same as ScalaTest's for 
absolute error, except that ScalaTest uses `===` instead of `~==`. 

On the other hand, using `absErr`/`relErr` seems to be easier to 
remember. I'm open to both, and it's easy to change.




[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-24 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1425#issuecomment-50081864
  
@mengxr I just rebased against master, and it passes the tests. Depending on 
whether we want to use `absErr`/`relErr`, `+-`/`%+-`, or both, I can make further 
modifications. Thanks.




[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1425#discussion_r15443103
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -40,27 +41,51 @@ class KMeansSuite extends FunSuite with LocalSparkContext {
 // No matter how many runs or iterations we use, we should get one cluster,
 // centered at the mean of the points
 
+<<<<<<< HEAD
--- End diff --

Tried to rebase against master with conflicts. I addressed them in the next 
push. 




[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-27 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1425#issuecomment-50293096
  
@mengxr Resolved all the conflicts after rebasing, and all the unit tests 
pass. Thanks.




[GitHub] spark pull request: [SPARK-2505][MLlib] Weighted Regularizer for G...

2014-07-30 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1518#issuecomment-50663418
  
I tried to make the bias term really big so that the intercept weight becomes 
small enough that regularization barely affects it. The result is still quite 
different from R's, and very sensitive to the strength of the bias.

Users may re-scale the features to improve the convergence of the optimization 
process, and in order to get the same coefficients as without scaling, each 
component has to be penalized differently. Also, users may know which features 
are less important, and want to penalize them more. 

As a result, I still want to implement the full weighted regularizer, and 
de-couple the adaptive learning rate from the updater. Let's talk in detail when we 
meet tomorrow. Thanks. 




[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-08-02 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-50982699
  
@mengxr  Is there any problem with asfgit? This PR is not finished yet, so why 
does asfgit say it has been merged into apache:master?





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-03 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733217
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^n norm
+ *
+ * @param n  L^2 norm by default. Normalization in L^n space.
+ */
+@DeveloperApi
+class Normalizer(n: Int) extends VectorTransformer with Serializable {
+
+  def this() = this(2)
+
+  require(n > 0)
--- End diff --

This is an Int. As long as we require p > 0, it implies p >= 1.





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-03 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733221
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^n norm
+ *
+ * @param n  L^2 norm by default. Normalization in L^n space.
+ */
+@DeveloperApi
+class Normalizer(n: Int) extends VectorTransformer with Serializable {
+
+  def this() = this(2)
+
+  require(n > 0)
--- End diff --

I made it more explicit rather than saving one CPU cycle. 





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-03 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733244
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: DeveloperApi ::
+ * Standardizes features by removing the mean and scaling to unit variance 
using column summary
+ * statistics on the samples in the training set.
+ *
+ * @param withMean True by default. Centers the data with mean before 
scaling. It will build a dense
+ * output, so this does not work on sparse input and will 
raise an exception.
+ * @param withStd True by default. Scales the data to unit standard 
deviation.
--- End diff --

sklearn.preprocessing.StandardScaler has this API. If we want to minimize 
the set of parameters now, we can remove it for this release.


http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-03 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15733248
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/VectorTransformer.scala ---
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: DeveloperApi ::
+ * Trait for transformation of a vector
+ */
+@DeveloperApi
+trait VectorTransformer {
+
+  /**
+   * Applies transformation on a vector.
+   *
+   * @param vector vector to be transformed.
+   * @return transformed vector.
+   */
+  def transform(vector: Vector): Vector
+
+  /**
+   * Applies transformation on a RDD[Vector].
+   *
+   * @param data RDD[Vector] to be transformed.
+   * @return transformed RDD[Vector].
+   */
+  def transform(data: RDD[Vector]): RDD[Vector] = data.map(x => this.transform(x))
--- End diff --

Can you elaborate on this?  





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-03 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15738936
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV}
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^p norm
+ *
+ * For any 1 <= p < Double.Infinity, normalizes samples using sum(abs(vector).^p)^(1/p) as norm.
+ * For p = Double.Infinity, max(abs(vector)) will be used as norm for 
normalization.
+ * For p = Double.NegativeInfinity, min(abs(vector)) will be used as norm 
for normalization.
--- End diff --

MATLAB has L_{-inf} (http://www.mathworks.com/help/matlab/ref/norm.html) for 
min(abs(X)). I agree that it's not useful for sparse data. I'm going to remove it.
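
For reference, a minimal stand-alone sketch of the norms described in the doc 
comment above (illustrative only, not MLlib's implementation):

    // p-norm with the two infinity special cases from the doc comment.
    def pNorm(v: Array[Double], p: Double): Double =
      if (p == Double.PositiveInfinity) v.map(math.abs).max
      else if (p == Double.NegativeInfinity) v.map(math.abs).min
      else math.pow(v.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)

    // Normalizing a sample to unit L^p norm; zero vectors are returned unchanged.
    def normalize(v: Array[Double], p: Double): Array[Double] = {
      val norm = pNorm(v, p)
      if (norm == 0.0) v else v.map(_ / norm)
    }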






[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-03 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15740021
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV}
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^p^ norm
--- End diff --

lol...





[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-08-03 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1207#discussion_r15740240
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/feature/StandardScalerSuite.scala 
---
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, 
Vectors}
+import org.apache.spark.mllib.util.LocalSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.mllib.rdd.RDDFunctions._
+import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, 
MultivariateOnlineSummarizer}
+import org.apache.spark.rdd.RDD
+
+class StandardScalerSuite extends FunSuite with LocalSparkContext {
+
+  private def computeSummary(data: RDD[Vector]): 
MultivariateStatisticalSummary = {
+data.treeAggregate(new MultivariateOnlineSummarizer)(
+  (aggregator, data) => aggregator.add(data),
+  (aggregator1, aggregator2) => aggregator1.merge(aggregator2))
+  }
+
+  test(Standardization with dense input) {
+val data = Array(
+  Vectors.dense(-2.0, 2.3, 0),
+  Vectors.dense(0.0, -1.0, -3.0),
+  Vectors.dense(0.0, -5.1, 0.0),
+  Vectors.dense(3.8, 0.0, 1.9),
+  Vectors.dense(1.7, -0.6, 0.0),
+  Vectors.dense(0.0, 1.9, 0.0)
+)
+
+val dataRDD = sc.parallelize(data, 3)
+
+val standardizer1 = new StandardScaler(withMean = true, withStd = true)
+val standardizer2 = new StandardScaler()
+val standardizer3 = new StandardScaler(withMean = true, withStd = false)
+
+withClue("Using a standardizer before fitting the model should throw exception.") {
+  intercept[IllegalStateException] {
+data.map(standardizer1.transform)
+  }
+}
+
+standardizer1.fit(dataRDD)
+standardizer2.fit(dataRDD)
+standardizer3.fit(dataRDD)
+
+val data1 = data.map(standardizer1.transform)
+val data2 = data.map(standardizer2.transform)
+val data3 = data.map(standardizer3.transform)
+
+val data1RDD = standardizer1.transform(dataRDD)
+val data2RDD = standardizer2.transform(dataRDD)
+val data3RDD = standardizer3.transform(dataRDD)
+
+val summary = computeSummary(dataRDD)
+val summary1 = computeSummary(data1RDD)
+val summary2 = computeSummary(data2RDD)
+val summary3 = computeSummary(data3RDD)
+
+assert((data, data1, data1RDD.collect()).zipped.forall(
+(v1, v2, v3) => (v1, v2, v3) match {
+  case (v1: DenseVector, v2: DenseVector, v3: DenseVector) => true
+  case (v1: SparseVector, v2: SparseVector, v3: SparseVector) => true
+  case _ => false
+}
+  ), "The vector type should be preserved after standardization.")
+
+assert((data, data2, data2RDD.collect()).zipped.forall(
+(v1, v2, v3) => (v1, v2, v3) match {
+  case (v1: DenseVector, v2: DenseVector, v3: DenseVector) => true
+  case (v1: SparseVector, v2: SparseVector, v3: SparseVector) => true
+  case _ => false
+}
+  ), "The vector type should be preserved after standardization.")
+
+assert((data, data3, data3RDD.collect()).zipped.forall(
+(v1, v2, v3) => (v1, v2, v3) match {
+  case (v1: DenseVector, v2: DenseVector, v3: DenseVector) => true
+  case (v1: SparseVector, v2: SparseVector, v3: SparseVector) => true
+  case _ => false
+}
+  ), "The vector type should be preserved after standardization.")
+
+assert((data1, data1RDD.collect()).zipped.forall((v1, v2) => v1 ~== v2 absTol 1E-5))
--- End diff --

For each RDD, I just call collect() twice. I don't want to add another 
variable for this. (PS: the RDD version is used for computing the summary stats, so 
we need both

[GitHub] spark pull request: [SPARK-2505][MLlib] Weighted Regularizer for G...

2014-08-04 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1518#issuecomment-51151346
  
It's too late to get this into 1.1, but I'll try to make it happen in 1.2. We'll 
use this in our Alpine implementation first.





[GitHub] spark pull request: [MLlib] Use this.type as return type in k-mean...

2014-08-05 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1796

[MLlib] Use this.type as return type in k-means' builder pattern

to ensure that the returned object is the instance itself.
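
A minimal sketch of why `this.type` matters for a builder pattern (class names here 
are illustrative, not the actual MLlib ones): with `this.type`, chained setters keep 
the most specific type of the receiver, so a subclass's own setters remain callable.

    class KMeansLike {
      private var k: Int = 2
      // Returning `this.type` (instead of `KMeansLike`) preserves the runtime type.
      def setK(k: Int): this.type = { this.k = k; this }
    }

    class ExtendedKMeans extends KMeansLike {
      private var seed: Long = 0L
      def setSeed(s: Long): this.type = { this.seed = s; this }
    }

    // Compiles because setK returns ExtendedKMeans here, not the base class:
    val model: ExtendedKMeans = new ExtendedKMeans().setK(3).setSeed(42L)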

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark dbtsai-kmeans

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1796.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1796


commit 658989ef591ad28f891b275ccdc8137c5c180f46
Author: DB Tsai dbt...@alpinenow.com
Date:   2014-08-06T01:30:32Z

Alpine Data Labs







[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

2014-08-06 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1814#discussion_r15908219
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
@@ -35,38 +35,47 @@ import org.apache.spark.rdd.RDD
  * @param withStd True by default. Scales the data to unit standard 
deviation.
  */
 @Experimental
-class StandardScaler(withMean: Boolean, withStd: Boolean) extends 
VectorTransformer {
+class StandardScaler(withMean: Boolean, withStd: Boolean) {
 
   def this() = this(false, true)
 
    require(withMean || withStd, s"withMean and withStd both equal to false. Doing nothing.")
 
-  private var mean: BV[Double] = _
-  private var factor: BV[Double] = _
-
   /**
* Computes the mean and variance and stores as a model to be used for 
later scaling.
*
* @param data The data used to compute the mean and variance to build 
the transformation model.
-   * @return This StandardScalar object.
+   * @return a StandardScalarModel
*/
-  def fit(data: RDD[Vector]): this.type = {
+  def fit(data: RDD[Vector]): StandardScalerModel = {
 val summary = data.treeAggregate(new MultivariateOnlineSummarizer)(
   (aggregator, data) => aggregator.add(data),
   (aggregator1, aggregator2) => aggregator1.merge(aggregator2))
 
-mean = summary.mean.toBreeze
-factor = summary.variance.toBreeze
-require(mean.length == factor.length)
+val mean = summary.mean.toBreeze
+val factor = summary.variance.toBreeze
+require(mean.size == factor.size)
 
 var i = 0
-while (i < factor.length) {
+while (i < factor.size) {
   factor(i) = if (factor(i) != 0.0) 1.0 / math.sqrt(factor(i)) else 0.0
   i += 1
 }
 
-this
+new StandardScalerModel(withMean, withStd, mean, factor)
   }
+}
+
+/**
+ * :: Experimental ::
+ * Represents a StandardScaler model that can transform vectors.
+ */
+@Experimental
+class StandardScalerModel private[mllib] (
+val withMean: Boolean,
+val withStd: Boolean,
+val mean: BV[Double],
+val factor: BV[Double]) extends VectorTransformer {
 
--- End diff --

Since users may want to know the variance of the training set, should we 
have a constructor like the following, with the factor derived lazily from the 
variance? 

class StandardScalerModel private[mllib] (
    val withMean: Boolean,
    val withStd: Boolean,
    val mean: BV[Double],
    val variance: BV[Double]) {

  lazy val factor = {
    val temp = variance.copy
    var i = 0
    while (i < temp.size) {
      temp(i) = if (temp(i) != 0.0) 1.0 / math.sqrt(temp(i)) else 0.0
      i += 1
    }
    temp
  }
}





[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

2014-08-06 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1814#discussion_r15908318
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
@@ -35,38 +35,47 @@ import org.apache.spark.rdd.RDD
  * @param withStd True by default. Scales the data to unit standard 
deviation.
  */
 @Experimental
-class StandardScaler(withMean: Boolean, withStd: Boolean) extends 
VectorTransformer {
+class StandardScaler(withMean: Boolean, withStd: Boolean) {
 
--- End diff --

This class is only used for keeping the state of withMean and withStd. Is it 
possible to move those parameters into an overloaded fit function and make this 
an object?





[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

2014-08-06 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1814#discussion_r15908504
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -177,18 +115,72 @@ private object IDF {
 private def isEmpty: Boolean = m == 0L
 
 /** Returns the current IDF vector. */
-def idf(): BDV[Double] = {
+def idf(): Vector = {
   if (isEmpty) {
 throw new IllegalStateException("Haven't seen any document yet.")
   }
   val n = df.length
-  val inv = BDV.zeros[Double](n)
+  val inv = new Array[Double](n)
   var j = 0
   while (j < n) {
 inv(j) = math.log((m + 1.0)/ (df(j) + 1.0))
 j += 1
   }
-  inv
+  Vectors.dense(inv)
 }
   }
 }
+
+/**
+ * :: Experimental ::
+ * Represents an IDF model that can transform term frequency vectors.
+ */
+@Experimental
+class IDFModel private[mllib] (val idf: Vector) extends Serializable {
+
+  /**
+   * Transforms term frequency (TF) vectors to TF-IDF vectors.
+   * @param dataset an RDD of term frequency vectors
+   * @return an RDD of TF-IDF vectors
+   */
+  def transform(dataset: RDD[Vector]): RDD[Vector] = {
+val bcIdf = dataset.context.broadcast(idf)
+dataset.mapPartitions { iter =>
+  val thisIdf = bcIdf.value
+  iter.map { v =>
+val n = v.size
+v match {
+  case sv: SparseVector =>
+val nnz = sv.indices.size
+val newValues = new Array[Double](nnz)
+var k = 0
+while (k < nnz) {
+  newValues(k) = sv.values(k) * thisIdf(sv.indices(k))
+  k += 1
+}
+Vectors.sparse(n, sv.indices, newValues)
+  case dv: DenseVector =>
+val newValues = new Array[Double](n)
+var j = 0
+while (j < n) {
+  newValues(j) = dv.values(j) * thisIdf(j)
+  j += 1
+}
+Vectors.dense(newValues)
+  case other =>
+throw new UnsupportedOperationException(
--- End diff --

The following exception is used for unsupported vector types in appendBias and 
StandardScaler; maybe we could have a global definition of this in util.
case v => throw new IllegalArgumentException("Do not support vector type " + v.getClass)
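
A sketch of what such a shared helper could look like (the object name and location 
here are hypothetical, not an existing MLlib util):

    private[mllib] object VectorTypeErrors {
      // Single place to report an unsupported Vector subtype.
      def unsupported(v: Any): Nothing =
        throw new IllegalArgumentException("Do not support vector type " + v.getClass)
    }

    // Call sites would then become:
    // case other => VectorTypeErrors.unsupported(other)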





[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

2014-08-07 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1814#issuecomment-51511617
  
LGTM. Merged into both master and branch-1.1. Thanks! 





[GitHub] spark pull request: [SPARK-2934][MLlib] Adding LogisticRegressionW...

2014-08-08 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1862

[SPARK-2934][MLlib] Adding LogisticRegressionWithLBFGS Interface

for training with the L-BFGS optimizer, which converges faster than SGD.
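
A minimal sketch of how the new interface could be called (assuming the public 
zero-argument constructor added in this PR and the run method inherited from 
GeneralizedLinearAlgorithm; the tiny dataset is only for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    object LBFGSLoRExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("lbfgs-lor").setMaster("local[2]"))
        // Toy binary-labeled data; labels must be {0, 1}.
        val data = sc.parallelize(Seq(
          LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
          LabeledPoint(1.0, Vectors.dense(1.0, 0.0))))
        val model = new LogisticRegressionWithLBFGS().run(data)
        println(model.weights)
        sc.stop()
      }
    }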

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark dbtsai-lbfgs-lor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1862.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1862


commit 3cf50c207e79c5f67cd5d06ff3f85f3538c23081
Author: DB Tsai dbt...@alpinenow.com
Date:   2014-08-08T23:23:21Z

LogisticRegressionWithLBFGS interface







[GitHub] spark pull request: [SPARK-2934][MLlib] Adding LogisticRegressionW...

2014-08-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1862#discussion_r16022431
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -188,3 +188,98 @@ object LogisticRegressionWithSGD {
 train(input, numIterations, 1.0, 1.0)
   }
 }
+
+/**
+ * Train a classification model for Logistic Regression using 
Limited-memory BFGS.
+ * NOTE: Labels used in Logistic Regression should be {0, 1}
+ */
+class LogisticRegressionWithLBFGS private (
+private var convergenceTol: Double,
+private var maxNumIterations: Int,
+private var regParam: Double)
+  extends GeneralizedLinearAlgorithm[LogisticRegressionModel] with 
Serializable {
+
+  private val gradient = new LogisticGradient()
+  private val updater = new SimpleUpdater()
+  override val optimizer = new LBFGS(gradient, updater)
+.setNumCorrections(10)
+.setConvergenceTol(convergenceTol)
+.setMaxNumIterations(maxNumIterations)
+.setRegParam(regParam)
+
+  override protected val validators = 
List(DataValidators.binaryLabelValidator)
+
+  /**
+   * Construct a LogisticRegression object with default parameters
+   */
+  def this() = this(1E-4, 100, 0.0)
+
+  override protected def createModel(weights: Vector, intercept: Double) = 
{
+new LogisticRegressionModel(weights, intercept)
+  }
+}
+
+/**
+ * Top-level methods for calling Logistic Regression using Limited-memory 
BFGS.
+ * NOTE: Labels used in Logistic Regression should be {0, 1}
+ */
+object LogisticRegressionWithLBFGS {
--- End diff --

I don't mind this. However, it will cause an inconsistent API compared 
with LogisticRegressionWithSGD.





[GitHub] spark pull request: [SPARK-2934][MLlib] Adding LogisticRegressionW...

2014-08-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1862#discussion_r16023077
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -188,3 +188,54 @@ object LogisticRegressionWithSGD {
 train(input, numIterations, 1.0, 1.0)
   }
 }
+
+/**
+ * Train a classification model for Logistic Regression using 
Limited-memory BFGS.
+ * NOTE: Labels used in Logistic Regression should be {0, 1}
+ */
+class LogisticRegressionWithLBFGS private (
+private var convergenceTol: Double,
+private var maxNumIterations: Int,
+private var regParam: Double)
+  extends GeneralizedLinearAlgorithm[LogisticRegressionModel] with 
Serializable {
+
+  private val gradient = new LogisticGradient()
+  private val updater = new SimpleUpdater()
+  // Have to be lazy since users can change the parameters after the class 
is created.
+  // PS, after the first train, the optimizer variable will be computed, 
so the parameters
+  // can not be changed anymore.
+  override lazy val optimizer = new LBFGS(gradient, updater)
+.setNumCorrections(10)
+.setConvergenceTol(convergenceTol)
+.setMaxNumIterations(maxNumIterations)
+.setRegParam(regParam)
+
+  override protected val validators = 
List(DataValidators.binaryLabelValidator)
+
+  /**
+   * Construct a LogisticRegression object with default parameters
+   */
+  def this() = this(1E-4, 100, 0.0)
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   */
+  def setConvergenceTol(tolerance: Double): this.type = {
+this.convergenceTol = tolerance
+this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
--- End diff --

Agreed! Should we also change the API in the optimizer?





[GitHub] spark pull request: [SPARK-2934][MLlib] Adding LogisticRegressionW...

2014-08-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1862#discussion_r16023299
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -188,3 +188,54 @@ object LogisticRegressionWithSGD {
 train(input, numIterations, 1.0, 1.0)
   }
 }
+
+/**
+ * Train a classification model for Logistic Regression using 
Limited-memory BFGS.
+ * NOTE: Labels used in Logistic Regression should be {0, 1}
+ */
+class LogisticRegressionWithLBFGS private (
+private var convergenceTol: Double,
+private var maxNumIterations: Int,
+private var regParam: Double)
+  extends GeneralizedLinearAlgorithm[LogisticRegressionModel] with 
Serializable {
+
+  private val gradient = new LogisticGradient()
+  private val updater = new SimpleUpdater()
+  // Have to be lazy since users can change the parameters after the class 
is created.
+  // PS, after the first train, the optimizer variable will be computed, 
so the parameters
+  // can not be changed anymore.
+  override lazy val optimizer = new LBFGS(gradient, updater)
+.setNumCorrections(10)
+.setConvergenceTol(convergenceTol)
+.setMaxNumIterations(maxNumIterations)
+.setRegParam(regParam)
+
+  override protected val validators = 
List(DataValidators.binaryLabelValidator)
+
+  /**
+   * Construct a LogisticRegression object with default parameters
+   */
+  def this() = this(1E-4, 100, 0.0)
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   */
+  def setConvergenceTol(tolerance: Double): this.type = {
+this.convergenceTol = tolerance
+this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
--- End diff --

LBFGS.setMaxNumIterations





[GitHub] spark pull request: [SPARK-2979][MLlib ]Improve the convergence ra...

2014-08-11 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1897

[SPARK-2979][MLlib ]Improve the convergence rate by minimize the condition 
number

Scaling to minimize the condition number:
During the optimization process, the convergence rate depends on the condition 
number of the training dataset. Scaling the variables often reduces this 
condition number, thus improving the convergence rate dramatically. Without 
reducing the condition number, some training datasets that mix columns with 
very different scales may not be able to converge at all.
The GLMNET and LIBSVM packages perform this scaling to reduce the condition 
number, and return the weights in the original scale.
See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
Here, if useFeatureScaling is enabled, we standardize the training features by 
dividing each column by its variance (without subtracting the mean), and train 
the model in the scaled space. We then transform the coefficients from the 
scaled space back to the original scale, as GLMNET and LIBSVM do.
Currently, this is only enabled in LogisticRegressionWithLBFGS.
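
As a rough illustration of the scaling and back-transformation (a hand-rolled sketch with made-up names, not the PR's code, and it ignores the intercept):

object ConditionNumberScalingSketch extends App {
  // Toy design matrix with two columns on very different scales.
  val rows = Array(Array(1.0, 100.0), Array(2.0, 300.0), Array(3.0, 500.0))
  val n = rows.length.toDouble
  val dim = rows.head.length

  // Column-wise standard deviations (population formula, enough for a sketch).
  val mean = Array.tabulate(dim)(j => rows.map(_(j)).sum / n)
  val std  = Array.tabulate(dim)(j => math.sqrt(rows.map(r => math.pow(r(j) - mean(j), 2)).sum / n))

  // Train in the scaled space: xScaled(j) = x(j) / std(j).
  val scaled = rows.map(r => Array.tabulate(dim)(j => if (std(j) == 0) r(j) else r(j) / std(j)))

  // Suppose the optimizer returned these weights in the scaled space (placeholder values);
  // mapping back to the original scale divides by the same per-column factors.
  val weightsScaled = Array(0.8, 0.6)
  val weightsOriginal = Array.tabulate(dim)(j => if (std(j) == 0) weightsScaled(j) else weightsScaled(j) / std(j))

  println(scaled.map(_.mkString(",")).mkString(" | "))
  println(weightsOriginal.mkString(", "))
}

Since wScaled . (x / std) = (wScaled / std) . x, dividing the learned weights by the same per-column factors recovers coefficients in the original scale.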


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark dbtsai-feature-scaling

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1897.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1897


commit 5257751cda9cd0cb284af06c81e1282e1bfb53f7
Author: DB Tsai dbt...@alpinenow.com
Date:   2014-08-08T23:23:21Z

Improve the convergence rate by minimize the condition number in LOR with 
LBFGS







[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1897#discussion_r16153527
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
 ---
@@ -137,11 +154,45 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
      throw new SparkException("Input validation failed.")
    }
 
+/**
+ * Scaling to minimize the condition number:
+ *
+ * During the optimization process, the convergence (rate) depends on 
the condition number of
+ * the training dataset. Scaling the variables often reduces this 
condition number, thus
+ * improving the convergence rate dramatically. Without reducing the 
condition number,
+ * some training datasets mixing the columns with different scales may 
not be able to converge.
+ *
+ * GLMNET and LIBSVM packages perform the scaling to reduce the 
condition number, and return
+ * the weights in the original scale.
+ * See page 9 in 
http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
+ *
+ * Here, if useFeatureScaling is enabled, we will standardize the 
training features by dividing
+ * the variance of each column (without subtracting the mean), and 
train the model in the
+ * scaled space. Then we transform the coefficients from the scaled 
space to the original scale
+ * as GLMNET and LIBSVM do.
+ *
+ * Currently, it's only enabled in LogisticRegressionWithLBFGS
+ */
+    val scaler = if (useFeatureScaling) {
+      (new StandardScaler).fit(input.map(x => x.features))
+    } else {
+      null
+    }
+
     // Prepend an extra variable consisting of all 1.0's for the intercept.
     val data = if (addIntercept) {
-      input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
+      if (useFeatureScaling) {
+        input.map(labeledPoint =>
+          (labeledPoint.label, appendBias(scaler.transform(labeledPoint.features))))
+      } else {
+        input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
+      }
     } else {
-      input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
+      if (useFeatureScaling) {
+        input.map(labeledPoint => (labeledPoint.label, scaler.transform(labeledPoint.features)))
+      } else {
+        input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
--- End diff --

It's not an identity map; it converts each labeledPoint into a tuple of the 
response and the feature vector for the optimizer.





[GitHub] spark pull request: Minor change in the comment of spark-defaults....

2014-10-08 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/2709

Minor change in the comment of spark-defaults.conf.template

spark-defaults.conf is used by spark-shell as well, and this PR adds that note 
to the comment in the template.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark docs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2709.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2709


commit b3e1ff1b808380707d04277c2379bf5b03556662
Author: DB Tsai dbt...@alpinenow.com
Date:   2014-10-08T08:53:25Z

add spark-shell







[GitHub] spark pull request: [SPARK-3121] Wrong implementation of implicit ...

2014-10-08 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/2712#issuecomment-58361701
  
Jenkins, please start the test. 





[GitHub] spark pull request: [SPARK-3856][MLLIB] use norm operator after br...

2014-10-08 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/2718#issuecomment-58435304
  
LGTM  Thanks.





[GitHub] spark pull request: [SPARK-3121] Wrong implementation of implicit ...

2014-10-10 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/2712#issuecomment-58629065
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-3121] Wrong implementation of implicit ...

2014-10-10 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/2712#issuecomment-58732030
  
It's failing at FlumeStreamSuite.scala:109, which seems unrelated to this patch.





[GitHub] spark pull request: Minor change in the comment of spark-defaults....

2014-10-19 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/2709#issuecomment-59667207
  
@andrewor14 Sorry for the late reply; I was on vacation in Europe last week. I 
can continue working on this after I finish my talk at the IOTA conference tomorrow.





[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/2868#issuecomment-59871504
  
Jenkins, please start the test!





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-10-28 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-60813678
  
@BigCrunsh I'm working on this. Let's see if we can get it merged into Spark 1.2.





[GitHub] spark pull request: [SPARK-4129][MLlib] Performance tuning in Mult...

2014-10-28 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/2992

[SPARK-4129][MLlib] Performance tuning in MultivariateOnlineSummarizer

In MultivariateOnlineSummarizer, breeze's activeIterator is used to loop 
through the non-zero elements in the vector. However, activeIterator doesn't 
perform well due to a lot of overhead. In this PR, a native while loop is used 
for both DenseVector and SparseVector.
The benchmark results with 20 executors on the mnist8m dataset:
Before:
DenseVector: 48.2 seconds
SparseVector: 16.3 seconds
After:
DenseVector: 17.8 seconds
SparseVector: 11.2 seconds
Since MultivariateOnlineSummarizer is used in several places, the overall 
performance gain in the mllib library will be significant with this PR.
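
A rough sketch of what the "native while loop" means here (the accumulator and names below are made up for illustration, not the PR's code):

final class RunningSumsSketch(dim: Int) {
  private val sum = new Array[Double](dim)
  private val nnz = new Array[Long](dim)

  // A plain while loop over the sparse vector's parallel arrays, avoiding the
  // per-element overhead of activeIterator mentioned in the description.
  def addSparse(indices: Array[Int], values: Array[Double]): Unit = {
    var k = 0
    while (k < indices.length) {
      val v = values(k)
      if (v != 0.0) { sum(indices(k)) += v; nnz(indices(k)) += 1 }
      k += 1
    }
  }

  def totals: Array[Double] = sum
}

object RunningSumsSketchDemo extends App {
  val s = new RunningSumsSketch(4)
  s.addSparse(Array(0, 2), Array(1.5, -2.0))
  s.addSparse(Array(2, 3), Array(4.0, 0.5))
  println(s.totals.mkString(", "))  // 1.5, 0.0, 2.0, 0.5
}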


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark SPARK-4129

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2992.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2992


commit ebe3e74df70eb424aecc3170fc55008cfb6a76ec
Author: DB Tsai dbt...@alpinenow.com
Date:   2014-10-29T05:42:50Z

First commit







[GitHub] spark pull request: [SPARK-1870] Ported from 1.0 branch to 0.9 bra...

2014-06-09 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1013#issuecomment-45551414
  
Tested on a PivotalHD 1.1 YARN 4-node cluster. With --addjars 
file:///somePath/to/jar, launching the Spark application works.




[GitHub] spark pull request: [SPARK-1870] Made deployment with --jars work ...

2014-06-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1013#discussion_r13573544
  
--- Diff: 
yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -507,12 +508,19 @@ object Client {
   Apps.addToEnvironment(env, Environment.CLASSPATH.name, 
Environment.PWD.$() +
 Path.SEPARATOR + LOG4J_PROP)
 }
+
+    val cachedSecondaryJarLinks =
+      sparkConf.getOption(CONF_SPARK_YARN_SECONDARY_JARS).getOrElse("").split(",")
--- End diff --

Thanks, you are right. It will add an empty string to the array, and then put 
the folder, without the file, on the classpath. Will fix in master as well.
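
A tiny sketch of the behavior being fixed (illustrative, not the patch itself):

object EmptySecondaryJarsSketch extends App {
  val raw = ""                                  // e.g. the secondary-jars config is unset
  println(raw.split(",").toList)                // List("") -- a bogus empty entry
  val jars = raw.split(",").filter(_.nonEmpty)  // filtering empty strings avoids putting a
  println(jars.toList)                          // bare directory on the classpath; prints List()
}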




[GitHub] spark pull request: Make sure that empty string is filtered out wh...

2014-06-09 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1027

Make sure that empty string is filtered out when we get the secondary jars 
from conf



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-classloader

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1027.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1027


commit c9c7ad7fc6a2cf03503fe7b19ea1da92247196c6
Author: DB Tsai dbt...@dbtsai.com
Date:   2014-06-10T01:29:04Z

Make sure that empty string is filtered out when we get the secondary jars 
from conf.






[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-06-10 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/490#discussion_r13624385
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -95,15 +96,18 @@ trait ClientBase extends Logging {
 
 // If we have requested more then the clusters max for a single 
resource then exit.
if (args.executorMemory > maxMem) {
-  logError(Required executor memory (%d MB), is above the max 
threshold (%d MB) of this cluster..
-format(args.executorMemory, maxMem))
-  System.exit(1)
+  val errorMessage =
+Required executor memory (%d MB), is above the max threshold (%d 
MB) of this cluster..
+format(args.executorMemory, maxMem)
+  logError(errorMessage)
+  throw new IllegalArgumentException(errorMessage)
 }
 val amMem = args.amMemory + YarnAllocationHandler.MEMORY_OVERHEAD
if (amMem > maxMem) {
-  logError(Required AM memory (%d) is above the max threshold (%d) of 
this cluster.
-format(args.amMemory, maxMem))
-  System.exit(1)
+  val errorMessage =Required AM memory (%d) is above the max 
threshold (%d) of this cluster.
--- End diff --

Please add a space after =




[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-06-10 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/490#discussion_r13624580
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -95,15 +96,18 @@ trait ClientBase extends Logging {
 
 // If we have requested more then the clusters max for a single 
resource then exit.
if (args.executorMemory > maxMem) {
-  logError(Required executor memory (%d MB), is above the max 
threshold (%d MB) of this cluster..
-format(args.executorMemory, maxMem))
-  System.exit(1)
+  val errorMessage =
+Required executor memory (%d MB), is above the max threshold (%d 
MB) of this cluster..
+format(args.executorMemory, maxMem)
--- End diff --

Move the . to the new line




[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-06-10 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/490#discussion_r13624615
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -95,15 +96,18 @@ trait ClientBase extends Logging {
 
 // If we have requested more then the clusters max for a single 
resource then exit.
if (args.executorMemory > maxMem) {
-  logError(Required executor memory (%d MB), is above the max 
threshold (%d MB) of this cluster..
-format(args.executorMemory, maxMem))
-  System.exit(1)
+  val errorMessage =
+Required executor memory (%d MB), is above the max threshold (%d 
MB) of this cluster..
+format(args.executorMemory, maxMem)
+  logError(errorMessage)
+  throw new IllegalArgumentException(errorMessage)
 }
 val amMem = args.amMemory + YarnAllocationHandler.MEMORY_OVERHEAD
if (amMem > maxMem) {
-  logError(Required AM memory (%d) is above the max threshold (%d) of 
this cluster.
-format(args.amMemory, maxMem))
-  System.exit(1)
+  val errorMessage =Required AM memory (%d) is above the max 
threshold (%d) of this cluster.
--- End diff --

Move the . to the new line.




[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-06-12 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/490#issuecomment-45835283
  
@mengxr Do you think it's in good shape now? This is the only issue 
blocking us using vanilla spark. Thanks.




[GitHub] spark pull request: [SPARK-2163] class LBFGS optimize with Double ...

2014-06-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1104#discussion_r13897737
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -38,10 +38,10 @@ import org.apache.spark.mllib.linalg.{Vectors, Vector}
 class LBFGS(private var gradient: Gradient, private var updater: Updater)
   extends Optimizer with Logging {
 
-  private var numCorrections = 10
-  private var convergenceTol = 1E-4
-  private var maxNumIterations = 100
-  private var regParam = 0.0
+  private var numCorrections: Int = 10
+  private var convergenceTol: Double = 1E-4
+  private var maxNumIterations: Int = 100
+  private var regParam: Double = 0.0
 
--- End diff --

In most of the mllib codebase, we don't specify variable types explicitly. Can 
you remove them?




[GitHub] spark pull request: [SPARK-2163] class LBFGS optimize with Double ...

2014-06-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1104#discussion_r13897825
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -195,4 +195,39 @@ class LBFGSSuite extends FunSuite with 
LocalSparkContext with Matchers {
 assert(lossLBFGS3.length == 6)
 assert((lossLBFGS3(4) - lossLBFGS3(5)) / lossLBFGS3(4)  
convergenceTol)
   }
+
--- End diff --

The bug wasn't caught because we only test the static runLBFGS method instead 
of the class. We could probably change all the existing tests to use the 
class-based API, so we don't need to add another test.

@mengxr what do you think?




[GitHub] spark pull request: [SPARK-2163] class LBFGS optimize with Double ...

2014-06-17 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1104#issuecomment-46393840
  
I think it's for legacy reasons that there are two different ways to access the 
API. As far as I know, @mengxr is working on consolidating the interface; he 
can probably say more on this topic.




[GitHub] spark pull request: [SPARK-2163] class LBFGS optimize with Double ...

2014-06-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1104#discussion_r13905548
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -195,4 +195,39 @@ class LBFGSSuite extends FunSuite with 
LocalSparkContext with Matchers {
 assert(lossLBFGS3.length == 6)
 assert((lossLBFGS3(4) - lossLBFGS3(5)) / lossLBFGS3(4)  
convergenceTol)
   }
+
--- End diff --

We may add the same test to SGD as well. My bad; our internal version is right. 
I probably made a mistake when copying and pasting.




[GitHub] spark pull request: [SPARK-2163] class LBFGS optimize with Double ...

2014-06-18 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1104#issuecomment-46412293
  
I think changing the signature will be a problem for the MiMa binary-compatibility checks.




[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...

2014-06-24 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1207

SPARK-2272 [MLlib] Feature scaling which standardizes the range of 
independent variables or features of data

Feature scaling is a method used to standardize the range of independent 
variables or features of data. In data processing, it is also known as data 
normalization and is generally performed during the data preprocessing step.

In this work, a trait called `VectorTransformer` is defined for generic 
transformation of a vector. It contains two methods: `apply`, which applies a 
transformation to a vector, and `unapply`, which applies the inverse 
transformation to a vector.

There are three concrete implementations of `VectorTransformer`, and they all 
can easily be extended with PMML transformation support.

1) `VectorStandardizer` - Standardises a vector given the mean and variance. 
Since the standardization densifies the output, the output is always in dense 
vector format.

2) `VectorRescaler` - Rescales a vector into a target range specified either by 
a tuple of two double values or by two vectors giving the new target minimum 
and maximum. Since the rescaling subtracts the minimum of each column first, 
the output will always be a dense vector regardless of the input vector type.

3) `VectorDivider` - Transforms a vector by dividing it by a constant or by 
another vector on an element-by-element basis. This transformation preserves 
the type of the input vector without densifying the result.

Utility helper methods are implemented that take an RDD[Vector] as input and 
return the transformed RDD[Vector] together with the transformer, for dividing, 
rescaling, normalization, and standardization.
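
A compact sketch of the proposed interface (using plain Array[Double] instead of mllib's Vector so it runs standalone; the trait and standardizer names follow the description above, everything else is illustrative):

trait VectorTransformerSketch {
  def apply(v: Array[Double]): Array[Double]    // forward transformation
  def unapply(v: Array[Double]): Array[Double]  // inverse transformation
}

// Standardizes a vector given a mean and variance, as VectorStandardizer is described to do.
class StandardizerSketch(mean: Array[Double], variance: Array[Double]) extends VectorTransformerSketch {
  private val std = variance.map(math.sqrt)
  def apply(v: Array[Double]): Array[Double] =
    Array.tabulate(v.length)(i => if (std(i) == 0) 0.0 else (v(i) - mean(i)) / std(i))
  def unapply(v: Array[Double]): Array[Double] =
    Array.tabulate(v.length)(i => v(i) * std(i) + mean(i))
}

object StandardizerSketchDemo extends App {
  val t = new StandardizerSketch(Array(1.0), Array(4.0))
  val z = t(Array(5.0))                          // (5 - 1) / 2 = 2.0
  println(s"${z.toList} ${t.unapply(z).toList}") // List(2.0) List(5.0)
}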


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-feature-scaling

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1207.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1207


commit d3daa997c9a51a4af8f67cbcdb3738e5ba8c4b56
Author: DB Tsai dbt...@alpinenow.com
Date:   2014-06-25T02:30:16Z

Feature scaling which standardizes the range of independent variables or 
features of data.






[GitHub] spark pull request: SPARK-2281 [MLlib] Simplify the duplicate code...

2014-06-25 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1215

SPARK-2281 [MLlib] Simplify the duplicate code in Gradient.scala

The Gradient.compute overload that returns a new tuple of (gradient: Vector, 
loss: Double) can be implemented in terms of the in-place version of 
Gradient.compute, so we don't need to maintain duplicated code.
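
The idea, roughly (types simplified to Array[Double] for this sketch; the real API works on mllib Vectors):

abstract class GradientSketch {
  // In-place overload: accumulates the gradient into cumGradient and returns the loss.
  def compute(data: Array[Double], label: Double, weights: Array[Double],
              cumGradient: Array[Double]): Double

  // Tuple-returning overload expressed via the in-place one, so the math lives in one place.
  def compute(data: Array[Double], label: Double, weights: Array[Double]): (Array[Double], Double) = {
    val gradient = new Array[Double](weights.length)
    val loss = compute(data, label, weights, gradient)
    (gradient, loss)
  }
}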


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-gradient-simplification

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1215.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1215


commit b2595d334c0d6246fe904b8c00ca3d51dc88f71a
Author: DB Tsai dbt...@alpinenow.com
Date:   2014-06-25T22:08:30Z

Simplify the gradient






[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-06-26 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1099#issuecomment-47250277
  
It seems that Jenkins is missing the Python runtime.




[GitHub] spark pull request: [WIP][SPARK-2174][MLLIB] treeReduce and treeAg...

2014-07-01 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1110#issuecomment-47683286
  
We benchmarked treeReduce in our random forest implementation, and since the 
trees generated from each partition are fairly large (more than 100 MB), we 
found that treeReduce significantly reduces the shuffle time, from 6 minutes to 
2 minutes. Nice work!




[GitHub] spark pull request: Upgrade junit_xml_listener to 0.5.1 which fixe...

2014-07-08 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1333

Upgrade junit_xml_listener to 0.5.1 which fixes the following issues

1) fix the class name to be the fully qualified class path
2) make sure the reporting time is in seconds, not milliseconds, which was 
causing the JUnit HTML report to show incorrect numbers
3) make sure the durations of the tests are accumulated.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-junit

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1333.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1333


commit bbeac4b1bb8635eec2b046f1c4cfd15b64d0
Author: DB Tsai dbt...@alpinenow.com
Date:   2014-07-08T18:44:47Z

Upgrade junit_xml_listener to 0.5.1 which fixes the following issues

1) fix the class name to be fully qualified classpath
2)  make sure the the reporting time is in second not in miliseond, which 
causing JUnit HTML to report incorrect number
3)  make sure the duration of the tests are accumulative.






[GitHub] spark pull request: Upgrade junit_xml_listener to 0.5.1 which fixe...

2014-07-08 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1333#issuecomment-48417558
  
done.




[GitHub] spark pull request: SPARK-2281 [MLlib] Simplify the duplicate code...

2014-07-09 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/1215




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-10 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/955#discussion_r14796461
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/OnlineSummarizer.scala ---
@@ -0,0 +1,229 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+import org.apache.spark.annotation.DeveloperApi
+
+/**
+ * :: DeveloperApi ::
+ * OnlineSummarizer implements [[MultivariateStatisticalSummary]] to 
compute the mean, variance,
+ * minimum, maximum, counts, and non-zero counts for samples in sparse or 
dense vector format in
+ * a streaming fashion.
+ *
+ * Two OnlineSummarizers can be merged together to have a statistical 
summary of a jointed dataset.
+ *
+ * A numerically stable algorithm is implemented to compute sample mean 
and variance:
+ * Reference: 
[[http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance 
variance-wiki]]
+ * Zero elements (including explicit zero values) are skipped when calling 
add(),
+ * to have time complexity O(nnz) instead of O(n) for each column.
+ */
+@DeveloperApi
+class OnlineSummarizer extends MultivariateStatisticalSummary with 
Serializable {
--- End diff --

I actually wanted to change MultivariateStatisticalSummary to 
StatisticalSummary since it's too verbose, but for consistency I will rename 
this class to MultivariateOnlineSummarizer.




[GitHub] spark pull request: [SPARK-1177] Allow SPARK_JAR to be set program...

2014-07-11 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/987#issuecomment-48762832
  
#560 is merged. Close this PR.




[GitHub] spark pull request: [SPARK-1177] Allow SPARK_JAR to be set program...

2014-07-11 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/987




[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-07-11 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1379

[SPARK-2309][MLlib] Generalize the binary logistic regression into 
multinomial logistic regression

Currently, there is no multi-class classifier in mllib. Logistic regression 
can be extended to a multinomial classifier straightforwardly.
The following formulation will be implemented:
http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25

Note: In multi-class mode, there are multiple intercepts, so we don't use the 
single intercept in `GeneralizedLinearModel` and instead fold all the 
intercepts into the weights. This introduces some inconsistency. For example, 
in binary mode the intercept cannot be specified by users, but since in 
multinomial mode the intercepts are combined into the weights, users can 
specify them.

@mengxr Should we just deprecate the intercept and keep everything in the 
weights? It makes sense from an optimization point of view, and it also makes 
the interface cleaner. Thanks.
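
For reference, a small standalone sketch of the multinomial model in the slides (class 0 as the pivot, K-1 weight vectors; the values below are illustrative only):

object MultinomialLogisticSketch extends App {
  // weights(k - 1) holds the coefficients for class k, k = 1 .. K-1; class 0 is the pivot.
  def probabilities(x: Array[Double], weights: Array[Array[Double]]): Array[Double] = {
    val expMargins = weights.map(w => math.exp(w.zip(x).map { case (wi, xi) => wi * xi }.sum))
    val denom = 1.0 + expMargins.sum
    (1.0 +: expMargins).map(_ / denom)
  }

  val p = probabilities(Array(1.0, 2.0), Array(Array(0.1, 0.2), Array(-0.3, 0.4)))
  println(p.mkString(", "))  // three probabilities that sum to 1.0
}

With the intercepts folded into the weights as discussed above, each of the K-1 weight vectors simply carries one extra component that multiplies a constant 1.0 feature.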

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-mlor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1379.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1379


commit 82dae74135bafa5d1adeef4b2b421693c05b2778
Author: DB Tsai dbt...@alpinenow.com
Date:   2014-06-27T21:47:15Z

Multinomial Logistic Regression






[GitHub] spark pull request: [SPARK-2477][MLlib] Using appendBias for addin...

2014-07-14 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1410

[SPARK-2477][MLlib] Using appendBias for adding intercept in 
GeneralizedLinearAlgorithm

Instead of using prependOne, as GeneralizedLinearAlgorithm currently does, we 
would like to use appendBias: 1) it keeps the indices of the original training 
set unchanged by adding the intercept as the last element of the vector, and 
2) it uses the same public API for consistently adding the intercept.
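
A tiny illustration of the difference (hand-rolled, not the mllib helper itself):

object AppendBiasSketch extends App {
  // Appending the bias term keeps the original feature indices unchanged
  // and makes the intercept the last element of the weight vector.
  def appendBias(features: Array[Double]): Array[Double] = features :+ 1.0

  println(appendBias(Array(3.0, 5.0)).toList)  // List(3.0, 5.0, 1.0)
}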

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark 
SPARK-2477_intercept_with_appendBias

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1410.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1410


commit 011432cd2f815aacd9b12e770e5c6ec16ea716aa
Author: DB Tsai dbt...@alpinenow.com
Date:   2014-07-14T22:04:01Z

From Alpine Data Labs






[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-15 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/1425

[SPARK-2479][MLlib] Comparing floating-point numbers using relative error 
in UnitTests

Floating point math is not exact, and most floating-point numbers end up 
being slightly imprecise due to rounding errors. Simple values like 0.1 cannot 
be precisely represented using binary floating point numbers, and the limited 
precision of floating point numbers means that slight changes in the order of 
operations or the precision of intermediates can change the result. That means 
that comparing two floats to see if they are equal is usually not what we want. 
As long as this imprecision stays small, it can usually be ignored.
See the following famous article for detail.

http://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/
For example:
float a = 0.15 + 0.15
float b = 0.1 + 0.2
if (a == b) // can be false!
if (a >= b) // can also be false!

(PS: not all the tests involving floating-point comparisons have been changed 
to use almostEquals.)
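
The same phenomenon in a couple of lines of Scala, with a relative-error check alongside (a sketch, not the PR's test utilities):

object FloatingPointCompareSketch extends App {
  val a = 0.1 + 0.2
  println(a == 0.3)  // false: a is 0.30000000000000004
  val relativeError = math.abs(a - 0.3) / math.max(math.abs(a), math.abs(0.3))
  println(relativeError < 1e-9)  // true: equal up to a small relative error
}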

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark 
SPARK-2479_comparing_floating_point

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1425.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1425


commit f4da8f4f8693763b4823e36e3d270b74a7ce67bf
Author: DB Tsai dbt...@alpinenow.com
Date:   2014-07-14T23:24:11Z

Alpine Data Labs






[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-16 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1425#discussion_r15013544
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
 ---
@@ -81,9 +82,8 @@ class LogisticRegressionSuite extends FunSuite with 
LocalSparkContext with Match
 val model = lr.run(testRDD)
 
 // Test the weights
-val weight0 = model.weights(0)
-assert(weight0 = -1.60  weight0 = -1.40, weight0 +  not in [-1.6, 
-1.4])
-assert(model.intercept = 1.9  model.intercept = 2.1, 
model.intercept +  not in [1.9, 2.1])
+assert(model.weights(0).almostEquals(-1.5244128696247), weight0 
should be -1.5244128696247)
--- End diff --

We can use a higher relative error here instead. If the implementation is 
changed, it's also nice to have a test that can catch the slightly different 
behavior. Also, updating those numbers will not take much time compared with 
the implementation work.




[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-16 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1425#discussion_r15013786
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetricsSuite.scala
 ---
@@ -20,8 +20,20 @@ package org.apache.spark.mllib.evaluation
 import org.scalatest.FunSuite
 
 import org.apache.spark.mllib.util.LocalSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
 
 class BinaryClassificationMetricsSuite extends FunSuite with 
LocalSparkContext {
+
+  implicit class SeqDoubleWithAlmostEquals(val x: Seq[Double]) {
+def almostEquals(y: Seq[Double], eps: Double = 1E-6): Boolean =
--- End diff --

Yeah, for one ULP it might be 10e-15. A lot of the time I manually type the 
numbers, or just copy the first couple of digits to save line space, which is 
why I chose 1.0e-6; then I only have to type about 7 digits.

I agree with you that in this case we may want to explicitly specify a larger 
epsilon.




[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-16 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1425#issuecomment-49221370
  
@mengxr  ScalaTest 2.x has the tolerance feature, but it uses absolute error, 
not relative error. For large numbers, the absolute error may not be 
meaningful. With `===`, it will return false even if the difference is only one 
unit in the last place (ULP), and that often happens when running the unit 
tests on a different machine architecture. For example, ARM and x86 may round 
differently, and we don't run tests on anything other than x86. C++ Boost 
provides its numerical equality test with relative error for this reason.

I can probably add `~=` and `~==` methods for the `Double` and `Vector` types 
using an implicit class, where `~==` raises an exception with a message, like 
`===` does.
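
A rough sketch of what those helpers could look like for Double (the operator names come from the comment above; the implementation is only illustrative):

object AlmostEqualsSketch {
  implicit class DoubleWithAlmostEquals(val x: Double) extends AnyVal {
    // ~= returns a Boolean based on relative error.
    def ~=(y: Double, eps: Double = 1E-6): Boolean = {
      val denom = math.max(math.abs(x), math.abs(y))
      denom == 0.0 || math.abs(x - y) / denom < eps
    }
    // ~== behaves like === and fails loudly with a message.
    def ~==(y: Double, eps: Double = 1E-6): Boolean = {
      if (!this.~=(y, eps)) throw new AssertionError(s"$x did not equal $y within relative error $eps")
      true
    }
  }

  def main(args: Array[String]): Unit = {
    println(1.0000001 ~= 1.0)  // true with the default 1E-6 tolerance
  }
}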




[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-16 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1425#issuecomment-49222983
  
I learned `almostEquals` from the Boost library. Anyway, in this case, how do 
we distinguish the variant that throws with a message from the one that just 
returns true/false?

`almostEquals` and `almostEqualsWithMessage`?




[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-16 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1425#issuecomment-49253108
  
@mengxr and @srowen What do you think of `assert((0.0001 !~== 0.0) +- 1E-5)`? 
We have `~==` and `!~==`, which include the error message, in the latest commit 
from my co-worker.




[GitHub] spark pull request: SPARK-1157 L-BFGS Optimizer based on Breeze L-...

2014-04-07 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/53




[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

2014-04-07 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/353

SPARK-1157: L-BFGS Optimizer based on Breeze's implementation.

This PR uses Breeze's L-BFGS implementation, and the Breeze dependency has 
already been introduced by Xiangrui's sparse input format work in SPARK-1212. 
Nice work, @mengxr !

When used with a regularized updater, we need to compute regVal and 
regGradient (the gradient of the regularized part of the cost function), and 
with the current updater design we can compute those two values in the 
following way.

Let's review how the updater works when returning newWeights given the input 
parameters.

w' = w - thisIterStepSize * (gradient + regGradient(w))  Note that 
regGradient is a function of w!
If we set gradient = 0 and thisIterStepSize = 1, then w' = w - regGradient(w), 
so
regGradient(w) = w - w'

As a result, regVal can be computed by

val regVal = updater.compute(
  weights,
  new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2

and regGradient can be obtained by

val regGradient = weights.sub(
  updater.compute(weights, new DoubleMatrix(initialWeights.length, 1), 1, 1, regParam)._1)

The PR includes tests which compare the result with SGD, with and without 
regularization.

We did a comparison between L-BFGS and SGD, and we often saw 10x fewer steps 
with L-BFGS while the cost per step is the same (just computing the gradient).

The following is the paper by Prof. Ng at Stanford comparing different 
optimizers, including L-BFGS and SGD. They use them in the context of deep 
learning, but it is worth reading as a reference.
http://cs.stanford.edu/~jngiam/papers/LeNgiamCoatesLahiriProchnowNg2011.pdf
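
A quick numeric check of that trick with a toy squared-L2 updater (plain arrays, not the real Updater API; the values are illustrative):

object RegGradientTrickSketch extends App {
  val regParam = 0.5

  // Toy updater step: w' = w - step * (gradient + regParam * w).
  def compute(w: Array[Double], gradient: Array[Double], step: Double): Array[Double] =
    Array.tabulate(w.length)(i => w(i) - step * (gradient(i) + regParam * w(i)))

  val w = Array(1.0, -2.0)
  val zero = Array(0.0, 0.0)

  // gradient = 0 and step = 1 give w' = w - regGradient(w), so w - w' recovers regGradient.
  val wPrime = compute(w, zero, 1.0)
  val regGradient = Array.tabulate(w.length)(i => w(i) - wPrime(i))
  println(regGradient.toList)  // List(0.5, -1.0), i.e. regParam * w, the gradient of 0.5 * regParam * ||w||^2
}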

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-LBFGS

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/353.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #353


commit 60c83350bb77aa640edd290a26e2a20281b7a3a8
Author: DB Tsai dbt...@dbtsai.com
Date:   2014-04-05T00:06:50Z

L-BFGS Optimizer based on Breeze's implementation.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

2014-04-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11404094
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,251 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.Array
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every 
iteration.
+ */
+class LBFGS(var gradient: Gradient, var updater: Updater)
+  extends Optimizer with Logging
+{
+  private var numCorrections: Int = 10
+  private var lineSearchTolerance: Double = 0.9
+  private var convTolerance: Double = 1E-4
+  private var maxNumIterations: Int = 100
+  private var regParam: Double = 0.0
+  private var miniBatchFraction: Double = 1.0
+
+  /**
+   * Set the number of corrections used in the LBFGS update. Default 10.
+   * Values of m less than 3 are not recommended; large values of m
+   * will result in excessive computing time. 3 < m < 10 is recommended.
+   * Restriction: m > 0
+   */
+  def setNumCorrections(corrections: Int): this.type = {
+    assert(corrections > 0)
+this.numCorrections = corrections
+this
+  }
+
+  /**
+   * Set the tolerance to control the accuracy of the line search in 
mcsrch step. Default 0.9.
+   * If the function and gradient evaluations are inexpensive with respect 
to the cost of
+   * the iteration (which is sometimes the case when solving very large 
problems) it may
+   * be advantageous to set to a small value. A typical small value is 0.1.
+   * Restriction: should be greater than 1e-4.
+   */
+  def setLineSearchTolerance(tolerance: Double): this.type = {
+this.lineSearchTolerance = tolerance
+this
+  }
+
+  /**
+   * Set fraction of data to be used for each L-BFGS iteration. Default 
1.0.
+   */
+  def setMiniBatchFraction(fraction: Double): this.type = {
+this.miniBatchFraction = fraction
+this
+  }
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   */
+  def setConvTolerance(tolerance: Int): this.type = {
+this.convTolerance = tolerance
+this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
+this.maxNumIterations = iters
+this
+  }
+
+  /**
+   * Set the regularization parameter. Default 0.0.
+   */
+  def setRegParam(regParam: Double): this.type = {
+this.regParam = regParam
+this
+  }
+
+  /**
+   * Set the gradient function (of the loss function of one single data 
example)
+   * to be used for L-BFGS.
+   */
+  def setGradient(gradient: Gradient): this.type = {
+this.gradient = gradient
+this
+  }
+
+  /**
+   * Set the updater function to actually perform a gradient step in a 
given direction.
+   * The updater is responsible to perform the update from the 
regularization term as well,
+   * and therefore determines what kind of regularization is used, if any.
+   */
+  def setUpdater(updater: Updater): this.type = {
+this.updater = updater
+this
+  }
+
+  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): 
Vector = {
+val (weights, _) = LBFGS.runMiniBatchLBFGS(
+  data

[GitHub] spark pull request: SPARK-1157: L-BFGS Optimizer based on Breeze's...

2014-04-08 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/353#issuecomment-39895140
  
@mengxr  As you suggested, I moved the costFun into a private CostFun class.
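
For readers following the Breeze side of this change: the cost function handed 
to Breeze's L-BFGS is just a value-and-gradient callback (DiffFunction). Below 
is a minimal, self-contained sketch on a toy quadratic; it is illustrative 
only and is not the PR's CostFun, and the positional constructor arguments 
(maxIter, numCorrections, tolerance) mirror how this PR wires Breeze rather 
than a documented guarantee.

    import breeze.linalg.DenseVector
    import breeze.optimize.{DiffFunction, LBFGS}

    object ToyLBFGS {
      def main(args: Array[String]): Unit = {
        val target = DenseVector(3.0, 3.0, 3.0)

        // f(x) = ||x - target||^2 with gradient 2 * (x - target).
        val costFun = new DiffFunction[DenseVector[Double]] {
          def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
            val diff = x - target
            (diff dot diff, diff * 2.0)
          }
        }

        // maxIter = 100, numCorrections (m) = 10, convergence tolerance = 1e-4.
        val lbfgs = new LBFGS[DenseVector[Double]](100, 10, 1e-4)
        val xOpt = lbfgs.minimize(costFun, DenseVector.zeros[Double](3))
        println(xOpt)  // approximately DenseVector(3.0, 3.0, 3.0)
      }
    }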


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11460767
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,263 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.Array
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every 
iteration.
+ */
+class LBFGS(var gradient: Gradient, var updater: Updater)
+  extends Optimizer with Logging
+{
+  private var numCorrections: Int = 10
--- End diff --

@mengxr  
I know. I pretty much followed the existing coding style in 
GradientDescent.scala. 
Should I also change the ones in the other places?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11461398
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,263 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.Array
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every 
iteration.
+ */
+class LBFGS(var gradient: Gradient, var updater: Updater)
+  extends Optimizer with Logging
+{
+  private var numCorrections: Int = 10
+  private var lineSearchTolerance: Double = 0.9
+  private var convTolerance: Double = 1E-4
+  private var maxNumIterations: Int = 100
+  private var regParam: Double = 0.0
+  private var miniBatchFraction: Double = 1.0
+
+  /**
+   * Set the number of corrections used in the LBFGS update. Default 10.
+   * Values of m less than 3 are not recommended; large values of m
+   * will result in excessive computing time. 3 < m < 10 is recommended.
+   * Restriction: m > 0
+   */
+  def setNumCorrections(corrections: Int): this.type = {
+assert(corrections > 0)
+this.numCorrections = corrections
+this
+  }
+
+  /**
+   * Set the tolerance to control the accuracy of the line search in 
mcsrch step. Default 0.9.
+   * If the function and gradient evaluations are inexpensive with respect 
to the cost of
+   * the iteration (which is sometimes the case when solving very large 
problems) it may
+   * be advantageous to set to a small value. A typical small value is 0.1.
+   * Restriction: should be greater than 1e-4.
+   */
+  def setLineSearchTolerance(tolerance: Double): this.type = {
--- End diff --

Good catch! It was used in the RISO implementation. I'll just remove them. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11463764
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+import org.scalatest.matchers.ShouldMatchers
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+class LBFGSSuite extends FunSuite with BeforeAndAfterAll with 
ShouldMatchers {
+  @transient private var sc: SparkContext = _
+  var dataRDD:RDD[(Double, Vector)] = _
+
+  val nPoints = 1
+  val A = 2.0
+  val B = -1.5
+
+  val initialB = -1.0
+  val initialWeights = Array(initialB)
+
+  val gradient = new LogisticGradient()
+  val numCorrections = 10
+  val lineSearchTolerance = 0.9
+  var convTolerance = 1e-12
+  var maxNumIterations = 10
+  val miniBatchFrac = 1.0
+
+  val simpleUpdater = new SimpleUpdater()
+  val squaredL2Updater = new SquaredL2Updater()
+
+  // Add a extra variable consisting of all 1.0's for the intercept.
+  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
+  val data = testData.map { case LabeledPoint(label, features) =>
+label -> Vectors.dense(1.0, features.toArray: _*)
+  }
+
+  override def beforeAll() {
+sc = new SparkContext("local", "test")
+dataRDD = sc.parallelize(data, 2).cache()
+  }
+
+  override def afterAll() {
+sc.stop()
+System.clearProperty("spark.driver.port")
+  }
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("Assert LBFGS loss is decreasing and matches the result of Gradient 
Descent.") {
+val updater = new SimpleUpdater()
+val regParam = 0
+
+val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: 
_*)
+
+val (_, loss) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  updater,
+  numCorrections,
+  lineSearchTolerance,
+  convTolerance,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(loss.last - loss.head < 0, "loss isn't decreasing.")
+
+val lossDiff = loss.init.zip(loss.tail).map {
+  case (lhs, rhs) => lhs - rhs
+}
+assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
--- End diff --

This 0.8 bound is copied from GradientDescentSuite, and L-BFGS should have at 
least the same performance.
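
To make the bound concrete, here is a tiny made-up illustration (not from the 
PR) of what the decreasing-fraction check measures:

    // Hypothetical loss history; a positive consecutive difference means the loss decreased.
    val loss = Array(10.0, 7.0, 5.5, 5.6, 4.0, 3.0, 2.5)
    val lossDiff = loss.init.zip(loss.tail).map { case (prev, next) => prev - next }
    // 5 of the 6 differences are positive, so the fraction is ~0.83 and the 0.8 bound passes.
    assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)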



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11464280
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+import org.scalatest.matchers.ShouldMatchers
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+class LBFGSSuite extends FunSuite with BeforeAndAfterAll with 
ShouldMatchers {
+  @transient private var sc: SparkContext = _
+  var dataRDD:RDD[(Double, Vector)] = _
+
+  val nPoints = 1
+  val A = 2.0
+  val B = -1.5
+
+  val initialB = -1.0
+  val initialWeights = Array(initialB)
+
+  val gradient = new LogisticGradient()
+  val numCorrections = 10
+  val lineSearchTolerance = 0.9
+  var convTolerance = 1e-12
+  var maxNumIterations = 10
+  val miniBatchFrac = 1.0
+
+  val simpleUpdater = new SimpleUpdater()
+  val squaredL2Updater = new SquaredL2Updater()
+
+  // Add a extra variable consisting of all 1.0's for the intercept.
+  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
+  val data = testData.map { case LabeledPoint(label, features) =>
+label -> Vectors.dense(1.0, features.toArray: _*)
+  }
+
+  override def beforeAll() {
+sc = new SparkContext("local", "test")
+dataRDD = sc.parallelize(data, 2).cache()
+  }
+
+  override def afterAll() {
+sc.stop()
+System.clearProperty("spark.driver.port")
+  }
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("Assert LBFGS loss is decreasing and matches the result of Gradient 
Descent.") {
+val updater = new SimpleUpdater()
+val regParam = 0
+
+val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: 
_*)
+
+val (_, loss) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  updater,
+  numCorrections,
+  lineSearchTolerance,
+  convTolerance,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(loss.last - loss.head < 0, "loss isn't decreasing.")
+
+val lossDiff = loss.init.zip(loss.tail).map {
+  case (lhs, rhs) => lhs - rhs
+}
+assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
+
+val stepSize = 1.0
+// Well, GD converges slower, so it requires more iterations!
+val numGDIterations = 50
+val (_, lossGD) = GradientDescent.runMiniBatchSGD(
+  dataRDD,
+  gradient,
+  updater,
+  stepSize,
+  numGDIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
+  "LBFGS should match GD result within 5% error.")
+  }
+
+  test("Assert that LBFGS and Gradient Descent with L2 regularization get 
the same result.") {
+val regParam = 0.2
+
+// Prepare another non-zero weights to compare the loss in the first 
iteration.
+val initialWeightsWithIntercept = Vectors.dense(0.3, 0.12)
+
+val (weightLBFGS, lossLBFGS) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  squaredL2Updater,
+  numCorrections,
+  lineSearchTolerance,
+  convTolerance,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+// With regularization, GD converges faster now!
+// So we only need 20 iterations to get

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11464736
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala ---
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+import org.scalatest.matchers.ShouldMatchers
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+class LBFGSSuite extends FunSuite with BeforeAndAfterAll with 
ShouldMatchers {
+  @transient private var sc: SparkContext = _
+  var dataRDD:RDD[(Double, Vector)] = _
+
+  val nPoints = 1
+  val A = 2.0
+  val B = -1.5
+
+  val initialB = -1.0
+  val initialWeights = Array(initialB)
+
+  val gradient = new LogisticGradient()
+  val numCorrections = 10
+  val lineSearchTolerance = 0.9
+  var convTolerance = 1e-12
+  var maxNumIterations = 10
+  val miniBatchFrac = 1.0
+
+  val simpleUpdater = new SimpleUpdater()
+  val squaredL2Updater = new SquaredL2Updater()
+
+  // Add a extra variable consisting of all 1.0's for the intercept.
+  val testData = GradientDescentSuite.generateGDInput(A, B, nPoints, 42)
+  val data = testData.map { case LabeledPoint(label, features) =>
+label -> Vectors.dense(1.0, features.toArray: _*)
+  }
+
+  override def beforeAll() {
+sc = new SparkContext("local", "test")
+dataRDD = sc.parallelize(data, 2).cache()
+  }
+
+  override def afterAll() {
+sc.stop()
+System.clearProperty("spark.driver.port")
+  }
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("Assert LBFGS loss is decreasing and matches the result of Gradient 
Descent.") {
+val updater = new SimpleUpdater()
+val regParam = 0
+
+val initialWeightsWithIntercept = Vectors.dense(1.0, initialWeights: 
_*)
+
+val (_, loss) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  updater,
+  numCorrections,
+  lineSearchTolerance,
+  convTolerance,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(loss.last - loss.head < 0, "loss isn't decreasing.")
+
+val lossDiff = loss.init.zip(loss.tail).map {
+  case (lhs, rhs) => lhs - rhs
+}
+assert(lossDiff.count(_ > 0).toDouble / lossDiff.size > 0.8)
+
+val stepSize = 1.0
+// Well, GD converges slower, so it requires more iterations!
+val numGDIterations = 50
+val (_, lossGD) = GradientDescent.runMiniBatchSGD(
+  dataRDD,
+  gradient,
+  updater,
+  stepSize,
+  numGDIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+assert(Math.abs((lossGD.last - loss.last) / loss.last) < 0.05,
+  "LBFGS should match GD result within 5% error.")
+  }
+
+  test("Assert that LBFGS and Gradient Descent with L2 regularization get 
the same result.") {
+val regParam = 0.2
+
+// Prepare another non-zero weights to compare the loss in the first 
iteration.
+val initialWeightsWithIntercept = Vectors.dense(0.3, 0.12)
+
+val (weightLBFGS, lossLBFGS) = LBFGS.runMiniBatchLBFGS(
+  dataRDD,
+  gradient,
+  squaredL2Updater,
+  numCorrections,
+  lineSearchTolerance,
+  convTolerance,
+  maxNumIterations,
+  regParam,
+  miniBatchFrac,
+  initialWeightsWithIntercept)
+
+// With regularization, GD converges faster now!
+// So we only need 20 iterations to get

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/353#discussion_r11605070
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV, axpy}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every 
iteration.
+ */
+class LBFGS(private var gradient: Gradient, private var updater: Updater)
+  extends Optimizer with Logging {
+
+  private var numCorrections = 10
+  private var convergenceTol = 1E-4
+  private var maxNumIterations = 100
+  private var regParam = 0.0
+  private var miniBatchFraction = 1.0
+
+  /**
+   * Set the number of corrections used in the LBFGS update. Default 10.
+   * Values of numCorrections less than 3 are not recommended; large values
+   * of numCorrections will result in excessive computing time.
+   * 3 < numCorrections < 10 is recommended.
+   * Restriction: numCorrections > 0
+   */
+  def setNumCorrections(corrections: Int): this.type = {
+assert(corrections > 0)
+this.numCorrections = corrections
+this
+  }
+
+  /**
+   * Set fraction of data to be used for each L-BFGS iteration. Default 
1.0.
+   */
+  def setMiniBatchFraction(fraction: Double): this.type = {
+this.miniBatchFraction = fraction
+this
+  }
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more 
iterations.
+   */
+  def setConvergenceTol(tolerance: Int): this.type = {
+this.convergenceTol = tolerance
+this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
+this.maxNumIterations = iters
+this
+  }
+
+  /**
+   * Set the regularization parameter. Default 0.0.
+   */
+  def setRegParam(regParam: Double): this.type = {
+this.regParam = regParam
+this
+  }
+
+  /**
+   * Set the gradient function (of the loss function of one single data 
example)
+   * to be used for L-BFGS.
+   */
+  def setGradient(gradient: Gradient): this.type = {
+this.gradient = gradient
+this
+  }
+
+  /**
+   * Set the updater function to actually perform a gradient step in a 
given direction.
+   * The updater is responsible to perform the update from the 
regularization term as well,
+   * and therefore determines what kind of regularization is used, if any.
+   */
+  def setUpdater(updater: Updater): this.type = {
+this.updater = updater
+this
+  }
+
+  override def optimize(data: RDD[(Double, Vector)], initialWeights: 
Vector): Vector = {
+val (weights, _) = LBFGS.runMiniBatchLBFGS(
+  data,
+  gradient,
+  updater,
+  numCorrections,
+  convergenceTol,
+  maxNumIterations,
+  regParam,
+  miniBatchFraction,
+  initialWeights)
+weights
+  }
+
+}
+
+/**
+ * Top-level method to run LBFGS.
+ */
+object LBFGS extends Logging {
+  /**
+   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
+   * In each iteration, we sample a subset (fraction miniBatchFraction) of 
the total data
+   * in order

[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/353


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/353#issuecomment-40434555
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
GitHub user dbtsai reopened a pull request:

https://github.com/apache/spark/pull/353

[SPARK-1157][MLlib] L-BFGS Optimizer based on Breeze's implementation.

This PR uses Breeze's L-BFGS implementation, and the Breeze dependency has 
already been introduced by Xiangrui's sparse input format work in SPARK-1212. 
Nice work, @mengxr!

When used with a regularized updater, we need to compute the regVal and the 
regGradient (the gradient of the regularized part of the cost function), and 
with the current updater design we can compute those two values in the 
following way.

Let's review how the updater works when returning newWeights given the input 
parameters.

w' = w - thisIterStepSize * (gradient + regGradient(w))  Note that 
regGradient is a function of w!
If we set gradient = 0 and thisIterStepSize = 1, then
regGradient(w) = w - w'

As a result, for regVal, it can be computed by 

val regVal = updater.compute(
  weights,
  new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2
and for regGradient, it can be obtained by

  val regGradient = weights.sub(
updater.compute(weights, new DoubleMatrix(initialWeights.length, 
1), 1, 1, regParam)._1)
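
To make the trick concrete, here is a small self-contained sketch with a 
simplified stand-in for an L2 updater (not the actual MLlib Updater API; the 
iter argument is dropped) showing how the two calls above recover regVal and 
regGradient:

    object RegTrickSketch {
      // Simplified L2 updater: w' = w - step * (gradient + regParam * w),
      // and the second returned value is the regularization term at w'.
      def compute(w: Array[Double], gradient: Array[Double],
                  step: Double, regParam: Double): (Array[Double], Double) = {
        val newW = w.zip(gradient).map { case (wi, gi) => wi - step * (gi + regParam * wi) }
        val regVal = 0.5 * regParam * newW.map(x => x * x).sum
        (newW, regVal)
      }

      def main(args: Array[String]): Unit = {
        val w = Array(0.3, -1.2)
        val zero = Array(0.0, 0.0)
        val regParam = 0.2

        // gradient = 0 and step = 0 leave w unchanged, so ._2 is regVal at w.
        val regVal = compute(w, zero, 0.0, regParam)._2

        // gradient = 0 and step = 1 give w' = w - regGradient(w), so regGradient = w - w'.
        val wPrime = compute(w, zero, 1.0, regParam)._1
        val regGradient = w.zip(wPrime).map { case (wi, wpi) => wi - wpi }

        println("regVal = " + regVal)                            // 0.5 * 0.2 * ||w||^2
        println("regGradient = " + regGradient.mkString(", "))   // 0.2 * w
      }
    }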

The PR includes tests that compare the results with SGD, with and without 
regularization.

We did a comparison between LBFGS and SGD, and we often saw 10x fewer
steps with LBFGS, while the cost per step is the same (just computing
the gradient).

The following is a paper by Prof. Ng's group at Stanford comparing different
optimizers, including LBFGS and SGD. They use them in the context of
deep learning, but it is worth referencing.
http://cs.stanford.edu/~jngiam/papers/LeNgiamCoatesLahiriProchnowNg2011.pdf

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-LBFGS

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/353.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #353


commit 984b18e21396eae84656e15da3539ff3b5f3bf4a
Author: DB Tsai dbt...@alpinenow.com
Date:   2014-04-05T00:06:50Z

L-BFGS Optimizer based on Breeze's implementation. Also fixed indentation 
issue in GradientDescent optimizer.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/353#issuecomment-40434626
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-14 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/353#issuecomment-40434691
  
Timeout for the latest Jenkins run. It seems that CI is not stable right now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...

2014-04-15 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/353


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: MLlib doc update for breeze dependency

2014-04-22 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/481

MLlib doc update for breeze dependency

MLlib now uses the Breeze linear algebra library instead of jblas; this PR 
updates the doc to help users install the native BLAS libraries for better 
performance with netlib-java, which Breeze depends on. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-LBFGSdocs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/481.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #481


commit eddb3ddfd036035b4b8c639450e4d48db6afd4d4
Author: DB Tsai dbt...@dbtsai.com
Date:   2014-04-22T07:35:44Z

Fixed MLlib doc




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: MLlib doc update for breeze dependency

2014-04-22 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/481


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1506][MLLIB] Documentation improvements...

2014-04-22 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/422#discussion_r11841916
  
--- Diff: docs/mllib-guide.md ---
@@ -3,63 +3,120 @@ layout: global
 title: Machine Learning Library (MLlib)
 ---
 
+MLlib is a Spark implementation of some common machine learning algorithms 
and utilities,
+including classification, regression, clustering, collaborative
+filtering, dimensionality reduction, as well as underlying optimization 
primitives:
 
-MLlib is a Spark implementation of some common machine learning (ML)
-functionality, as well associated tests and data generators.  MLlib
-currently supports four common types of machine learning problem settings,
-namely classification, regression, clustering and collaborative filtering,
-as well as an underlying gradient descent optimization primitive and 
several
-linear algebra methods.
-
-# Available Methods
-The following links provide a detailed explanation of the methods and 
usage examples for each of them:
-
-* <a href="mllib-classification-regression.html">Classification and 
Regression</a>
-  * Binary Classification
-* SVM (L1 and L2 regularized)
-* Logistic Regression (L1 and L2 regularized)
-  * Linear Regression
-* Least Squares
-* Lasso
-* Ridge Regression
-  * Decision Tree (for classification and regression)
-* <a href="mllib-clustering.html">Clustering</a>
-  * k-Means
-* <a href="mllib-collaborative-filtering.html">Collaborative Filtering</a>
-  * Matrix Factorization using Alternating Least Squares
-* <a href="mllib-optimization.html">Optimization</a>
-  * Gradient Descent and Stochastic Gradient Descent
-* <a href="mllib-linear-algebra.html">Linear Algebra</a>
-  * Singular Value Decomposition
-  * Principal Component Analysis
-
-# Data Types
-
-Most MLlib algorithms operate on RDDs containing vectors. In Java and 
Scala, the
-[Vector](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) class 
is used to
-represent vectors. You can create either dense or sparse vectors using the
-[Vectors](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) 
factory.
-
-In Python, MLlib can take the following vector types:
-
-* [NumPy](http://www.numpy.org) arrays
-* Standard Python lists (e.g. `[1, 2, 3]`)
-* The MLlib 
[SparseVector](api/pyspark/pyspark.mllib.linalg.SparseVector-class.html) class
-* [SciPy sparse 
matrices](http://docs.scipy.org/doc/scipy/reference/sparse.html)
-
-For efficiency, we recommend using NumPy arrays over lists, and using the
-[CSC 
format](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
-for SciPy matrices, or MLlib's own SparseVector class.
-
-Several other simple data types are used throughout the library, e.g. the 
LabeledPoint
-class 
([Java/Scala](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint),
-[Python](api/pyspark/pyspark.mllib.regression.LabeledPoint-class.html)) 
for labeled data.
-
-# Dependencies
-MLlib uses the [jblas](https://github.com/mikiobraun/jblas) linear algebra 
library, which itself
-depends on native Fortran routines. You may need to install the
-[gfortran runtime 
library](https://github.com/mikiobraun/jblas/wiki/Missing-Libraries)
-if it is not already present on your nodes. MLlib will throw a linking 
error if it cannot
-detect these libraries automatically.
+* [Basics](mllib-basics.html)
+  * data types 
+  * summary statistics
+* Classification and regression
+  * [linear support vector machine 
(SVM)](mllib-linear-methods.html#linear-support-vector-machine-svm)
+  * [logistic regression](mllib-linear-methods.html#logistic-regression)
+  * [linear least squares, Lasso, and ridge 
regression](mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression)
+  * [decision tree](mllib-decision-tree.html)
+  * [naive Bayes](mllib-naive-bayes.html)
+* [Collaborative filtering](mllib-collaborative-filtering.html)
+  * alternating least squares (ALS)
+* [Clustering](mllib-clustering.html)
+  * k-means
+* [Dimensionality reduction](mllib-dimensionality-reduction.html)
+  * singular value decomposition (SVD)
+  * principal component analysis (PCA)
+* [Optimization](mllib-optimization.html)
+  * stochastic gradient descent
+  * limited-memory BFGS (L-BFGS)
+
+MLlib is currently a beta component under active development.
+The APIs may be changed in the future releases, and we will provide 
migration guide between releases.
+
+## Dependencies
+
+MLlib uses linear algebra packages [Breeze](http://www.scalanlp.org/), 
which depends on
+[netlib-java](https://github.com/fommil/netlib-java), and
+[jblas](https

[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-04-22 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/490#discussion_r11883381
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -77,7 +78,8 @@ trait ClientBase extends Logging {
).foreach { case(cond, errStr) =>
   if (cond) {
 logError(errStr)
-args.printUsageAndExit(1)
+throw new IllegalArgumentException(args.getUsageMessage())
+
--- End diff --

Remove this empty line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1516]Throw exception in yarn client ins...

2014-04-22 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/490#issuecomment-41114289
  
Jenkins, add to whitelist.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2979][MLlib] Improve the convergence ra...

2014-08-14 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1897#issuecomment-52149162
  
It seems that Jenkins is not stable; it's failing on issues related to Akka.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3078][MLLIB] Make LRWithLBFGS API consi...

2014-08-15 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1973#discussion_r16319946
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -69,8 +69,17 @@ class LBFGS(private var gradient: Gradient, private var 
updater: Updater)
 
   /**
* Set the maximal number of iterations for L-BFGS. Default 100.
+   * @deprecated use [[setNumIterations()]] instead
*/
+  @deprecated("use setNumIterations instead", "1.1.0")
   def setMaxNumIterations(iters: Int): this.type = {
+this.setNumCorrections(iters)
--- End diff --

Should it be 

this.setNumIterations(iters)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3078][MLLIB] Make LRWithLBFGS API consi...

2014-08-15 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1973#issuecomment-52381503
  
LGTM. Merged into both master and branch-1.1. Thanks!!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-20 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/2068

[SPARK-2841][MLlib] Documentation for feature transformations

Documentation for newly added feature transformations:
1. TF-IDF
2. StandardScaler
3. Normalizer

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark transformer-documentation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2068.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2068


commit e339f64fbc35ad97a1ba021a6bf03bb6d0e06f31
Author: DB Tsai dbt...@alpinenow.com
Date:   2014-08-20T22:21:26Z

documentation




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-21 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/2068#discussion_r16561045
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -70,4 +70,110 @@ for((synonym, cosineSimilarity) <- synonyms) {
 </div>
 </div>
 
-## TFIDF
\ No newline at end of file
+## TFIDF
+
+## StandardScaler
+
+Standardizes features by scaling to unit variance and/or removing the mean 
using column summary
+statistics on the samples in the training set. For example, RBF kernel of 
Support Vector Machines
+or the L1 and L2 regularized linear models typically assume that all 
features have unit variance
+and/or zero mean.
--- End diff --

How about I say:
"For example, the RBF kernel of Support Vector Machines
or the L1 and L2 regularized linear models typically work better when all 
features have unit variance
and/or zero mean."

I actually took this statement from the scikit-learn documentation.  

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
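
For context, a minimal usage sketch of the scaler being documented. It assumes 
the org.apache.spark.mllib.feature.StandardScaler API that this PR documents 
(fit on an RDD[Vector], then transform each vector); the data here is made up.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.linalg.Vectors

    val sc = new SparkContext("local", "scaler-example")
    val data = sc.parallelize(Seq(
      Vectors.dense(1.0, 10.0, 100.0),
      Vectors.dense(2.0, 20.0, 200.0),
      Vectors.dense(3.0, 30.0, 300.0)))

    // Compute column means/variances on the training set, then scale each
    // feature to zero mean and unit variance.
    val scalerModel = new StandardScaler(withMean = true, withStd = true).fit(data)
    val scaled = data.map(v => scalerModel.transform(v))
    scaled.collect().foreach(println)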





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


