[GitHub] spark pull request: Add normalizeByCol method to mllib.util.MLUtils

2014-08-30 Thread asfgit
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1698

2014-08-29 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1698#issuecomment-53944079 @andy327 Do you mind closing this PR for now? I'm definitely buying the idea of freeing up the master, but the current set of Core APIs doesn't provide an easy and efficient […]

2014-08-01 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1698#issuecomment-50900786 Yes, I tried to implement AllReduce without having the driver in the middle in https://github.com/apache/spark/pull/506, but it introduced complex dependencies. So I fell back […]

2014-08-01 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/1698#issuecomment-50900320 i can see your point about 10M columns. it would be really nice if we had a lazy and efficient `allReduce(RDD[T], (T, T) => T): RDD[T]` as an RDD transform
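
[Editor's note: a minimal sketch of what such a transform could look like, built naively from existing core operations. The name `allReduce` and this strategy (one partial aggregate per partition, merged in a single partition, then replicated via `cartesian`) are illustrative assumptions, not an existing Spark API.]

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical lazy allReduce: every element of the result carries the
// globally reduced value, and nothing runs until an action is called.
def allReduce[T: ClassTag](rdd: RDD[T], f: (T, T) => T): RDD[T] = {
  // One small partial aggregate per partition.
  val partials: RDD[T] = rdd.mapPartitions(it => it.reduceOption(f).iterator)
  // Merge the partials in a single partition, still lazily.
  val global: RDD[T] =
    partials.coalesce(1).mapPartitions(it => it.reduceOption(f).iterator)
  // Replicate the single global value next to every input element.
  rdd.cartesian(global).map { case (_, g) => g }
}
```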

2014-08-01 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1698#issuecomment-50899568 What if you have 10M columns? I agree that not sending data to the driver is good practice. But the current operations `reduceByKey` and `cartesian` are not optimized for […]

2014-08-01 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/1698#issuecomment-50896323 why do you use treeReduce + broadcast? the data per partition is small, no? only a few aggregates per partition

2014-08-01 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1698#issuecomment-50892213 They are not the same. We use treeReduce to avoid having all executors send data to the driver at once, which is not possible with `reduceByKey`. Broadcast is also different from `cartesian` […]
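
[Editor's note: for reference, a minimal sketch of the treeReduce + broadcast pattern being described, applied to per-column sums. At the time of this thread `treeReduce` was provided by MLlib's `RDDFunctions`; it later became a core `RDD` method, which the sketch uses for brevity. The function name is illustrative.]

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// treeReduce combines partial results up a tree, so the driver only ever
// receives a handful of aggregates; broadcast then ships the small result
// back to all executors once, for use in the subsequent map.
def scaleByColumnSums(sc: SparkContext, data: RDD[Array[Double]]): RDD[Array[Double]] = {
  val sums: Array[Double] = data.treeReduce { (a, b) =>
    a.zip(b).map { case (x, y) => x + y }
  }
  val bcSums = sc.broadcast(sums)  // one small array per executor
  data.map(row => row.zip(bcSums.value).map { case (x, s) => x / s })
}
```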

2014-08-01 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/1698#issuecomment-50890613 reduceByKey being the same as reduce, and cartesian being the same as broadcast, is the whole point, the difference being that reduceByKey and cartesian are evaluated lazily […]

2014-07-31 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1698#issuecomment-50849409 Your implementation calls `reduceByKey` and `cartesian`. Those are not cheap, streamlined operations. `map(x => (1, x)).reduceByKey` is the same as `reduce`, except that it […]
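
[Editor's note: the comparison in concrete form, as a self-contained illustration. Both expressions compute the same sum; the first is an action whose result lands on the driver, while the second stays inside the RDD graph at the cost of a shuffle.]

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD implicits, needed in Spark 1.x
import org.apache.spark.rdd.RDD

val sc = new SparkContext(
  new SparkConf().setAppName("reduce-demo").setMaster("local[*]"))
val nums: RDD[Int] = sc.parallelize(1 to 1000)

// Action: partial sums from every partition are returned to the driver.
val total: Int = nums.reduce(_ + _)

// Transformation: the same sum, but it stays distributed in a one-element
// RDD and goes through a shuffle instead of through the driver.
val totalRdd: RDD[Int] = nums.map(x => (1, x)).reduceByKey(_ + _).values
```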

2014-07-31 Thread andy327
Github user andy327 commented on the pull request: https://github.com/apache/spark/pull/1698#issuecomment-50821196 I see that #1207 covers re-scaling in mllib.util.FeatureScaling, but from what I can tell, it calls RowMatrix.computeColumnSummaryStatistics, making it not a lazy transformation. […]
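
[Editor's note: for context, the eager path being referred to looks roughly as follows, assuming an existing SparkContext `sc` and made-up input data. `computeColumnSummaryStatistics` triggers a Spark job immediately rather than deferring work as a transformation.]

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary

val rows = sc.parallelize(Seq(Vectors.dense(1.0, 10.0), Vectors.dense(3.0, 30.0)))
val mat = new RowMatrix(rows)

// This call runs a job right away and materializes the statistics on the
// driver -- the non-lazy step pointed out above.
val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
val (means, variances) = (summary.mean, summary.variance)
```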

2014-07-31 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1698#issuecomment-50814109 @andy327 This is covered in @dbtsai's PR: https://github.com/apache/spark/pull/1207, which is in review.

2014-07-31 Thread andy327
Github user andy327 commented on the pull request: https://github.com/apache/spark/pull/1698#issuecomment-50804551 See Jira issue: https://issues.apache.org/jira/browse/SPARK-2776

2014-07-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1698#issuecomment-50801763 Can one of the admins verify this patch?

2014-07-31 Thread andy327
GitHub user andy327 opened a pull request: https://github.com/apache/spark/pull/1698 Add normalizeByCol method to mllib.util.MLUtils. Adds the ability to compute the mean and standard deviation of each vector (LabeledPoint) component and normalize each vector in the RDD, using only lazy RDD transformations. […]
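
[Editor's note: a minimal sketch in the spirit of the PR as described in the thread, assuming the design under debate: a single-key `reduceByKey` gathers per-column count/sum/sum-of-squares, and `cartesian` rejoins the one-row statistics with the data, keeping everything lazy. This is illustrative, not the PR's actual code; the standard deviation here is the population one.]

```scala
import org.apache.spark.SparkContext._  // pair-RDD implicits, needed in Spark 1.x
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def normalizeByCol(data: RDD[LabeledPoint]): RDD[LabeledPoint] = {
  // One global aggregate of (count, per-column sum, per-column sum of squares);
  // the constant key turns reduceByKey into a lazy whole-RDD reduce.
  val stats = data
    .map { p =>
      val v = p.features.toArray
      (0, (1L, v, v.map(x => x * x)))
    }
    .reduceByKey { case ((c1, s1, q1), (c2, s2, q2)) =>
      (c1 + c2,
        s1.zip(s2).map { case (a, b) => a + b },
        q1.zip(q2).map { case (a, b) => a + b })
    }
    .values

  // cartesian pairs the single statistics record with every point, still lazily.
  data.cartesian(stats).map { case (p, (n, sum, sumSq)) =>
    val mean = sum.map(_ / n)
    val std  = sumSq.zip(mean).map { case (q, m) =>
      math.sqrt(math.max(0.0, q / n - m * m))
    }
    val scaled = p.features.toArray.zip(mean.zip(std)).map {
      case (x, (m, s)) => if (s > 0) (x - m) / s else 0.0
    }
    LabeledPoint(p.label, Vectors.dense(scaled))
  }
}
```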