[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-17 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10691012 --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-17 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10691105 --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-17 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10691283 --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileRecordReader.java --- @@ -0,0 +1,117 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-18 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10692570 --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileRecordReader.java --- @@ -0,0 +1,117 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [WIP] [MLLIB-28] An optimized GradientDescent ...

2014-03-18 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/166#issuecomment-37930965 In fact, if we set `numInnerIteration = 1`, which is the default setting, then `GradientDescentWithLocalUpdate` is identical to `GradientDescent`. However, I
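A rough sketch of the local-update idea being discussed (the data layout, helper name, and squared-loss gradient below are assumptions for illustration, not the PR's actual code): each partition takes `numInnerIteration` gradient steps on its local data before the per-partition weights are averaged, and with a single inner iteration this collapses back to an ordinary averaged update.

```scala
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.rdd.RDD

// data: (label, features) pairs; per-partition weights are averaged after the
// local passes, mirroring the GradientDescentWithLocalUpdate description above.
def localUpdate(
    data: RDD[(Double, BDV[Double])],
    weights: BDV[Double],
    stepSize: Double,
    numInnerIteration: Int): BDV[Double] = {
  val perPartition = data.mapPartitions { iter =>
    val points = iter.toArray
    var w = weights.copy
    for (_ <- 0 until numInnerIteration; (label, x) <- points) {
      val grad = x * ((w dot x) - label)  // hypothetical squared-loss gradient
      w -= grad * stepSize
    }
    Iterator((w, 1L))
  }
  val (sum, count) = perPartition.reduce {
    case ((w1, c1), (w2, c2)) => (w1 + w2, c1 + c2)
  }
  sum / count.toDouble
}
```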

[GitHub] spark pull request: [WIP] [MLLIB-28] An optimized GradientDescent ...

2014-03-18 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/166#discussion_r10734823 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescentWithLocalUpdate.scala --- @@ -0,0 +1,147 @@ +/* + * Licensed

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-19 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38128624 @mengxr Your advice makes sense. I removed the merge process from `smallTextFiles()` and rewrote the reading logic in `RecordReader`. --- If your project is set

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-19 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38129529 Ah... It seems that Jenkins is causing a problem. The tests for the last two commits failed due to this error: Fetching upstream changes from https://github.com/apache

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-20 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10786231 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-20 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10786751 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/SmallTextFilesSuite.scala --- @@ -0,0 +1,218 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [WIP] [MLLIB-28] An optimized GradientDescent ...

2014-03-20 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/166#discussion_r10787371 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescentWithLocalUpdate.scala --- @@ -0,0 +1,147 @@ +/* + * Licensed

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-20 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38245425 @mengxr There are 2 Java files in my PR, and another 2 Scala files - MLUtils.scala and the test suite. I just found the Scala code style in the [style page](https

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-21 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10829857 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileSuite.scala --- @@ -0,0 +1,218 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [WIP] [MLLIB-28] An optimized GradientDescent ...

2014-03-21 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/166#issuecomment-38296681 I used the new method to enlarge the local update. Tests on SVM and LogisticRegression look as good as the first version, without the worry of OOM. This method can get better

[GitHub] spark pull request: [WIP] [MLLIB-28] An optimized GradientDescent ...

2014-03-22 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/166#issuecomment-38345315 I have tested the original/1-version/2-version LR and SVM; here is the result: (Note that the original version runs 100 iterations, while the other two run 10

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-23 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10871469 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java --- @@ -0,0 +1,53 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-24 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38519654 @mengxr I talked to @liancheng about the placement of the WholeTextFiles interface; at that time we had no idea whether it would be a commonly used interface, so

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10966132 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java --- @@ -0,0 +1,53 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38761534 Sure, let me update it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38765347 Oh... Is that OK? That's strange... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread yinxusen
Github user yinxusen closed the pull request at: https://github.com/apache/spark/pull/164 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature

[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

2014-03-27 Thread yinxusen
GitHub user yinxusen opened a pull request: https://github.com/apache/spark/pull/252 [SPARK-1133] Add whole text files reader in MLlib Here is a pointer to the former [PR164](https://github.com/apache/spark/pull/164). I opened this pull request for the JIRA issue [SPARK-1133

[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

2014-03-27 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/252#issuecomment-38773117 It seems that the test process was suddenly aborted. Can we retest it? --- If your project is set up for it, you can reply to this email and have your reply appear

[GitHub] spark pull request: [SPARK-1212, Part II] [WIP] Support sparse dat...

2014-03-27 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/245#discussion_r11013983 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/RidgeRegression.scala --- @@ -67,44 +70,50 @@ class RidgeRegressionWithSGD private

[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

2014-03-27 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/252#issuecomment-38871767 I think it is OK, @rxin shall we merge it? :) On 2014-3-27 4:40 PM, UCB AMPLab notificati...@github.com wrote: All automated tests passed. Refer

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-03-28 Thread yinxusen
GitHub user yinxusen opened a pull request: https://github.com/apache/spark/pull/268 [WIP] [SPARK-1328] Add vector statistics With the new vector system in MLlib, we find it useful to add some new APIs to process `RDD[Vector]`. Besides, the former implementation

[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

2014-03-30 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/252#issuecomment-39047305 Hi @mateiz , here is my explanation: * Hadoop has no such input format, but Mahout does. It is called `org.apache.mahout.text.SequenceFilesFromDirectory

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-03-31 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/268#issuecomment-39110100 @mengxr I am not very sure about the concept of a sparse vector here. In your example, do you mean the column is `Vector(1.0, 0.0, 2.0, 0.0, 3.0, 0.0, 0.0)` or `RDD
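For readers following the thread, the two encodings under discussion can be written with MLlib's `Vectors` factory; a small illustrative example (not taken from the PR):

```scala
import org.apache.spark.mllib.linalg.Vectors

// The same 7-element column, once dense and once sparse (size, indices, values).
val dense  = Vectors.dense(1.0, 0.0, 2.0, 0.0, 3.0, 0.0, 0.0)
val sparse = Vectors.sparse(7, Array(0, 2, 4), Array(1.0, 2.0, 3.0))
```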

[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

2014-03-31 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/252#issuecomment-39122031 @mengxr I added `sc.hadoopConfiguration.setLong("fs.local.block.size", 32)` to the test code, which limits the block size to 32 bytes, while the `fileLengths = Array(10, 100
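A minimal sketch of the test setup described above (using the property name and size stated in the comment; not necessarily the PR's exact test code): shrinking the local block size forces even tiny generated files to span several blocks, which is the case the whole-text-file reader has to handle.

```scala
// Limit the local file system block size to 32 bytes so that even the small
// generated test files span multiple blocks.
sc.hadoopConfiguration.setLong("fs.local.block.size", 32)
```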

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-04-01 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/268#issuecomment-39174557 @mengxr Ah... I totally understand what you mean. Code is on the way. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub

[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

2014-04-01 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/252#issuecomment-39208378 Hi @mateiz @mengxr , what do you think about the test? Besides, we could also judge it from the hadoop-common code of [`CombineFileInputFormat`](https://github.com

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-04-01 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/268#discussion_r11161910 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/rdd/VectorRDDFunctionsSuite.scala --- @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: Merge Hadoop Into Spark

2014-04-01 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/286#issuecomment-39281046 Yep, I find that each time I do `sbt clean gen-idea` or `sbt update` or even `sbt testOnly xxx`, I can do the cooking, take a shower, and have a rest. --- If your

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-04-01 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/268#discussion_r11190187 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/VectorRDDFunctions.scala --- @@ -0,0 +1,170 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-04-01 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/268#discussion_r11191344 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/VectorRDDFunctions.scala --- @@ -0,0 +1,156 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

2014-04-01 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/252#issuecomment-39286892 Sorry for the slip just now; I almost deleted the wrong file. --- If your project is set up for it, you can reply to this email and have your reply appear

[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

2014-04-02 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/252#discussion_r11193752 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -372,6 +373,37 @@ class SparkContext( } /** + * Read

[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

2014-04-02 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/252#issuecomment-39294300 How about textFiles()? @liancheng recommended it just now. On 2014-4-2 2:40 PM, Patrick Wendell notificati...@github.com wrote: sc.textFileRecords

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-04-02 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/268#discussion_r11196822 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/rdd/VectorRDDFunctionsSuite.scala --- @@ -0,0 +1,87 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-04-02 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/268#discussion_r11197597 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/VectorRDDFunctions.scala --- @@ -0,0 +1,156 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-04-02 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/268#discussion_r11210688 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/rdd/VectorRDDFunctionsSuite.scala --- @@ -0,0 +1,87 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

2014-04-02 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/252#issuecomment-39396075 Yep. Vote for `wholeTextFiles` too. Let me fix these now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-04-02 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/268#discussion_r11238923 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/VectorRDDFunctions.scala --- @@ -0,0 +1,179 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-04-03 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/268#issuecomment-39527499 @mengxr Yes, I think `RowRDDMatrix` is a good place. Just put this method together with SVD and PCA. Indeed, `RDD[Vector]` is a kind of matrix. What should I

[GitHub] spark pull request: [SPARK-1133] Add whole text files reader in ML...

2014-04-04 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/252#issuecomment-39623475 Thanks @mateiz and @mengxr ! I'll take care of the new issue. --- If your project is set up for it, you can reply to this email and have your reply appear

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-04-08 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/268#issuecomment-39822920 @mengxr Yep, I have replaced the population variance with the sample variance. See line 97 in VectorRDDStatistics. --- If your project is set up for it, you can reply
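As a reminder of what that switch amounts to, here is a hedged sketch (not the PR's `VectorRDDStatistics` code) of computing the unbiased sample variance from running sums, i.e. applying the n/(n-1) correction on top of the population variance:

```scala
// Unbiased (sample) variance from running sums; the n / (n - 1) factor is the
// correction applied on top of the population variance sumSq/n - mean^2.
def sampleVariance(sum: Double, sumOfSquares: Double, n: Long): Double = {
  require(n > 1, "sample variance needs at least two observations")
  val mean = sum / n
  val populationVariance = sumOfSquares / n - mean * mean
  populationVariance * n / (n - 1.0)
}
```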

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-04-08 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/268#discussion_r11382104 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/VectorRDDFunctions.scala --- @@ -0,0 +1,208 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-04-09 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/268#issuecomment-39941843 Sure, I'll do it now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-04-09 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/268#discussion_r11427699 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/VectorRDDFunctions.scala --- @@ -0,0 +1,208 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-04-09 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/268#issuecomment-39954694 Well, the `git rebase` is very tricky... @mengxr You can have a look. --- If your project is set up for it, you can reply to this email and have your reply appear

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-04-09 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/268#discussion_r11431836 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala --- @@ -28,6 +28,171 @@ import org.apache.spark.rdd.RDD import

[GitHub] spark pull request: [SPARK-1415] Hadoop min split for wholeTextFil...

2014-04-09 Thread yinxusen
GitHub user yinxusen opened a pull request: https://github.com/apache/spark/pull/376 [SPARK-1415] Hadoop min split for wholeTextFiles() JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-1415). The new Hadoop `InputFormat` API does not provide the `minSplits

[GitHub] spark pull request: [SPARK-1415] Hadoop min split for wholeTextFil...

2014-04-09 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/376#issuecomment-40043078 @mateiz , I have to modify some APIs so as to add the `minSplits`. I am not sure whether the modification is good or not. Could you have a look at it? --- If your

[GitHub] spark pull request: [SPARK-1415] Hadoop min split for wholeTextFil...

2014-04-09 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/376#discussion_r11469988 --- Diff: core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala --- @@ -44,4 +47,15 @@ private[spark] class WholeTextFileInputFormat

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-04-09 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/268#discussion_r11470053 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala --- @@ -19,13 +19,144 @@ package

[GitHub] spark pull request: [SPARK-1415] Hadoop min split for wholeTextFil...

2014-04-09 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/376#discussion_r11470152 --- Diff: core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala --- @@ -44,4 +47,15 @@ private[spark] class WholeTextFileInputFormat

[GitHub] spark pull request: [SPARK-1415] Hadoop min split for wholeTextFil...

2014-04-09 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/376#discussion_r11470220 --- Diff: core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala --- @@ -44,4 +47,15 @@ private[spark] class WholeTextFileInputFormat

[GitHub] spark pull request: [SPARK-1415] Hadoop min split for wholeTextFil...

2014-04-09 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/376#issuecomment-40044469 How about adding a subclass called `WholeTextFileRDD` that extends `NewHadoopRDD`, and using `setMaxSplitSize` only for this subclass? --- If your project is set up
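A minimal sketch of how a maximum split size could be derived from a desired minimum number of splits on the `CombineFileInputFormat` side (the class and method `setMinSplits` are hypothetical; only Hadoop's `listStatus` and `setMaxSplitSize` are real API, and this is not the PR's final design):

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat

abstract class ConfigurableWholeTextFileInputFormat
  extends CombineFileInputFormat[String, String] {

  /** Cap the split size so that at least `minSplits` splits are produced. */
  def setMinSplits(context: JobContext, minSplits: Int): Unit = {
    val totalLen = listStatus(context).asScala.map(_.getLen).sum
    val maxSplitSize = math.ceil(totalLen.toDouble / math.max(minSplits, 1)).toLong
    setMaxSplitSize(maxSplitSize) // protected setter on CombineFileInputFormat
  }
}
```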

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-04-10 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/268#issuecomment-40051673 @mateiz I have fixed the issues. You can merge it if it looks good to you. --- If your project is set up for it, you can reply to this email and have your reply appear

[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

2014-04-10 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/268#issuecomment-40158443 There was a conflict in MLUtils and RowMatrix. I think it is OK now. On 2014-4-11 5:31 AM, Matei Zaharia notificati...@github.com wrote: Hey, unfortunately

[GitHub] spark pull request: [SPARK-1415] Hadoop min split for wholeTextFil...

2014-04-12 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/376#issuecomment-40273076 @mateiz I have to admit that I ignored the importance of providing `minSplits`. I encountered a problem just now: I have 20,000 files and call `wholeTextFiles(dir

[GitHub] spark pull request: [SPARK-1415] Hadoop min split for wholeTextFil...

2014-04-12 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/376#issuecomment-40299867 @mateiz Yep, I agree with you. The test failure was caused by `org.apache.spark.streaming.CheckpointSuite`. Is it an occasional error? Maybe I should rebase

[GitHub] spark pull request: [SPARK-1415] Hadoop min split for wholeTextFil...

2014-04-13 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/376#issuecomment-40307008 Well... I got these two weird errors. Build timed out. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well

[GitHub] spark pull request: [WIP] [MLLIB-28] An optimized GradientDescent ...

2014-04-16 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/166#issuecomment-40668271 I rewrote the 2 versions of `GradientDescent` with `Vector` instead of `Array`. Lasso is easy to test now, thanks to @mengxr 's refactoring of the code. I ran

[GitHub] spark pull request: Fixed broken pyspark shell.

2014-04-17 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/444#issuecomment-40788214 Yep, I think the Python shell's documentation should be updated at the same time. sys.version_info only became a named tuple in 2.7. To get this to work in 2.6, it needs to be accessed

[GitHub] spark pull request: [WIP] [MLLIB-28] An optimized GradientDescent ...

2014-04-21 Thread yinxusen
Github user yinxusen closed the pull request at: https://github.com/apache/spark/pull/166 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature

[GitHub] spark pull request: [WIP] [MLLIB-28] An optimized GradientDescent ...

2014-04-21 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/166#issuecomment-40922790 I'd like to close the PR, per the offline discussion with @mengxr . The code will stay in my GitHub repo for those who are still interested in it. --- If your project

[GitHub] spark pull request: fix bugs of dot in python

2014-04-21 Thread yinxusen
GitHub user yinxusen opened a pull request: https://github.com/apache/spark/pull/463 fix bugs of dot in python If there is no `transpose()` on `self.theta`, a *ValueError: matrices are not aligned* occurs. The former test case just ignores this situation

[GitHub] spark pull request: [SPARK-1506][MLLIB] Documentation improvements...

2014-04-21 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/422#issuecomment-40984061 Several comments: == code here (scala code) http://54.82.240.23:4000/mllib-linear-methods.html#linear-support-vector-machine-svm

[GitHub] spark pull request: [SPARK-1506][MLLIB] Documentation improvements...

2014-04-21 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/422#issuecomment-40985126 Appending 2 unsolved problems: Code here (python code): http://54.82.240.23:4000/mllib-clustering.html `clusters = KMeans.train(parsedData, 2, maxIterations

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-21 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/458#discussion_r11833024 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/ADMMLasso.scala --- @@ -0,0 +1,217 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-21 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/458#discussion_r11833060 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/ADMMLasso.scala --- @@ -0,0 +1,217 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-21 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/458#discussion_r11833194 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/Lasso.scala --- @@ -87,6 +85,49 @@ class LassoWithSGD private

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-21 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/458#discussion_r11833231 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/Lasso.scala --- @@ -189,3 +230,70 @@ object LassoWithSGD { sc.stop

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-21 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/458#discussion_r11833249 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/Lasso.scala --- @@ -189,3 +230,70 @@ object LassoWithSGD { sc.stop

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-21 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/458#discussion_r11833279 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/Lasso.scala --- @@ -189,3 +230,70 @@ object LassoWithSGD { sc.stop

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-21 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/458#discussion_r11833298 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/regression/LassoSuite.scala --- @@ -44,8 +44,11 @@ class LassoSuite extends FunSuite

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-21 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/458#issuecomment-41000185 Cool, could you share your data-generator code with me, and let me take care of the `NaN` problem? Besides, could you provide the total running time of SGD and ADMM when

[GitHub] spark pull request: JIRA issue: [SPARK-1405](https://issues.apache...

2014-04-21 Thread yinxusen
GitHub user yinxusen opened a pull request: https://github.com/apache/spark/pull/476 JIRA issue: [SPARK-1405](https://issues.apache.org/jira/browse/SPARK-1405) Gibbs sampling based Latent Dirichlet Allocation (LDA) for MLlib (This PR is based on joint work with @liancheng

[GitHub] spark pull request: MLlib doc update for breeze dependency

2014-04-22 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/481#issuecomment-41012462 @dbtsai , @mengxr is improving the MLlib documentation for Spark 1.0, so the documents will be ready soon. See https://github.com/apache/spark/pull/422 . --- If your

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-23 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/458#issuecomment-41148756 @coderxiang I did some experiments on your dataset. * For MLlib, you should first rewrite your labels {+1, -1} into {+1, 0}. [Reference here](http://54.82.240.23:4000
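For anyone reproducing the experiment, the label remapping mentioned above is a one-line `map`; a small illustrative sketch (the `data` RDD of `LabeledPoint`s is assumed, not taken from the thread):

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Remap {-1, +1} labels to the {0, 1} encoding expected by MLlib's classifiers.
def remapLabels(data: RDD[LabeledPoint]): RDD[LabeledPoint] =
  data.map(lp => LabeledPoint(if (lp.label > 0) 1.0 else 0.0, lp.features))
```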

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-24 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/458#issuecomment-41262172 I preprocessed your data to zero mean and unit norm, but Lasso still performs poorly, with Infinity results or rising losses. Since Lasso

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-26 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/458#discussion_r12023334 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/ADMMLasso.scala --- @@ -0,0 +1,217 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-26 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/458#discussion_r12023346 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/ADMMLasso.scala --- @@ -0,0 +1,217 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-26 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/458#discussion_r12023345 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/ADMMLasso.scala --- @@ -0,0 +1,217 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-26 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/458#discussion_r12023354 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/ADMMLasso.scala --- @@ -0,0 +1,217 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-26 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/458#discussion_r12023367 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/ADMMLasso.scala --- @@ -0,0 +1,217 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-26 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/458#discussion_r12023374 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/ADMMLasso.scala --- @@ -0,0 +1,217 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-26 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/458#discussion_r12023381 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/ADMMLasso.scala --- @@ -0,0 +1,217 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-26 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/458#discussion_r12023406 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/ADMMLasso.scala --- @@ -0,0 +1,217 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-1543][MLlib] Add ADMM for solving Lasso...

2014-04-26 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/458#discussion_r12023410 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/ADMMLasso.scala --- @@ -0,0 +1,217 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12127051 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12127214 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-30 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12132841 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/expectation/GibbsSampling.scala --- @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-30 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/476#issuecomment-41772255 Yep, thanks @jegonzal and @etrain , I'll try to fix these issues and look forward to the next round of updates and discussion. --- If your project is set up for it, you

[GitHub] spark pull request: [SPARK-6226][MLLIB] add save/load in PySpark's...

2015-03-16 Thread yinxusen
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/5049#issuecomment-82011218 @mengxr Don't we need an extra unit test? Is doctest good enough? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub

[GitHub] spark pull request: [SPARK-6226][MLLIB] add save/load in PySpark's...

2015-03-16 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/5049#discussion_r26542049 --- Diff: python/pyspark/mllib/common.py --- @@ -70,8 +70,8 @@ def _py2java(sc, obj): obj = _to_java_object_rdd(obj) elif isinstance

[GitHub] spark pull request: [SPARK-5986][MLLib] Add save/load for k-means

2015-03-09 Thread yinxusen
GitHub user yinxusen opened a pull request: https://github.com/apache/spark/pull/4951 [SPARK-5986][MLLib] Add save/load for k-means This PR adds save/load for K-means as described in SPARK-5986. Python version will be added in another PR. You can merge this pull request into a Git

[GitHub] spark pull request: [SPARK-5986][MLLib] Add save/load for k-means

2015-03-09 Thread yinxusen
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/4951#discussion_r26083656 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala --- @@ -58,4 +66,59 @@ class KMeansModel (val clusterCenters: Array

[GitHub] spark pull request: [SPARK-6526][ML] Add Normalizer transformer in...

2015-03-25 Thread yinxusen
GitHub user yinxusen opened a pull request: https://github.com/apache/spark/pull/5181 [SPARK-6526][ML] Add Normalizer transformer in ML package See [SPARK-6526](https://issues.apache.org/jira/browse/SPARK-6526). @mengxr Should we add a test suite for this transformer
