Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/1698
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well.
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/1698#issuecomment-53944079
@andy327 Do you mind closing this PR for now? I'm definitely buying the
idea of freeing up the master, but the current set of Core APIs doesn't provide
an easy and efficie
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/1698#issuecomment-50900786
Yes, I tried to implement AllReduce without having driver in the middle in
https://github.com/apache/spark/pull/506 but it introduced complex
dependencies. So I fall back
Github user koertkuipers commented on the pull request:
https://github.com/apache/spark/pull/1698#issuecomment-50900320
I can see your point about 10M columns.
It would be really nice if we had a lazy and efficient `allReduce(RDD[T], (T,
T) => T): RDD[T]`
as an RDD transform
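To make the proposal concrete, here is a hedged sketch of what a naive implementation of that signature could look like (`allReduceNaive` is an illustrative name, not code from this PR). Note that this version is neither lazy nor driver-free: `reduce` eagerly runs a job and lands the result on the driver, which is exactly the bottleneck discussed in this thread.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical sketch: every element of the result is the global aggregate.
def allReduceNaive[T: ClassTag](rdd: RDD[T], f: (T, T) => T): RDD[T] = {
  val total = rdd.reduce(f)                    // eager action, result lands on the driver
  val bc = rdd.sparkContext.broadcast(total)   // ship it back out once, not per task
  rdd.map(_ => bc.value)                       // replace each element with the aggregate
}
```

A truly lazy, driver-free variant would need executor-to-executor communication that, as noted elsewhere in this thread, the Core APIs of the time did not expose.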
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/1698#issuecomment-50899568
What if you have 10M columns? I agree that not sending data to the driver
is a good practice. But the current operations `reduceByKey` and `cartesian`
are not optimized fo
Github user koertkuipers commented on the pull request:
https://github.com/apache/spark/pull/1698#issuecomment-50896323
Why do you use treeReduce + broadcast? The data per partition is small, no?
Only a few aggregates per partition.
---
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/1698#issuecomment-50892213
They are not the same. We use treeReduce to avoid having all executors
sending data to the driver, which is not available in reduceByKey. Broadcast is
also different from
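The pattern being defended here can be sketched as follows (assumed shape; the PR's actual code may differ). `treeReduce` merges partial results pairwise in a tree, so the driver only combines the last few partial sums instead of receiving one message from every executor, and the resulting statistics are broadcast once rather than copied into every task closure.

```scala
import org.apache.spark.rdd.RDD

// Sketch: per-column means via treeReduce (illustrative, not the PR's code).
def columnMeans(rows: RDD[Array[Double]], numCols: Int): Array[Double] = {
  val n = rows.count().toDouble
  val sums = rows.map(_.clone()).treeReduce { (a, b) =>
    var j = 0
    while (j < numCols) { a(j) += b(j); j += 1 }
    a
  }
  sums.map(_ / n)
}

// Usage: broadcast the statistics so each task reads one shared copy.
// val bcMeans = rows.sparkContext.broadcast(columnMeans(rows, numCols))
// val centered = rows.map(r => Array.tabulate(r.length)(j => r(j) - bcMeans.value(j)))
```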
Github user koertkuipers commented on the pull request:
https://github.com/apache/spark/pull/1698#issuecomment-50890613
`reduceByKey` being the same as `reduce`, and `cartesian` being the same as
broadcast, is the whole point; the difference being that `reduceByKey` and
`cartesian` are evaluated
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/1698#issuecomment-50849409
Your implementation calls `reduceByKey` and `cartesian`. Those are not
cheap, streamlined operations. `map(x => (1, x)).reduceByKey` is the same as
`reduce` except that it r
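A minimal illustration of the equivalence being pointed out, assuming a live SparkContext `sc`. Both expressions compute the same value, but the `reduceByKey` version pays for a full shuffle and still needs an action to bring the result back.

```scala
val nums = sc.parallelize(1 to 100)

// reduce: per-partition partial sums go straight to the driver
val viaReduce: Int = nums.reduce(_ + _)

// map + reduceByKey: keying everything by 1 shuffles all partial sums
// to the single reducer that owns key 1, yielding a one-element RDD
val viaReduceByKey: Int = nums.map(x => (1, x)).reduceByKey(_ + _).values.first()

// viaReduce == viaReduceByKey == 5050
```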
Github user andy327 commented on the pull request:
https://github.com/apache/spark/pull/1698#issuecomment-50821196
I see that #1207 covers re-scaling in mllib.util.FeatureScaling, but from
what I can tell, it calls RowMatrix.computeColumnSummaryStatistics, making it
not a lazy transfo
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/1698#issuecomment-50814109
@andy327 This is covered in @dbtsai's PR:
https://github.com/apache/spark/pull/1207 , which is in review.
---
Github user andy327 commented on the pull request:
https://github.com/apache/spark/pull/1698#issuecomment-50804551
See Jira issue: https://issues.apache.org/jira/browse/SPARK-2776
---
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1698#issuecomment-50801763
Can one of the admins verify this patch?
---
GitHub user andy327 opened a pull request:
https://github.com/apache/spark/pull/1698
Add normalizeByCol method to mllib.util.MLUtils.
Adds the ability to compute the mean and standard deviations of each vector
(LabeledPoint) component and normalize each vector in the RDD, using only
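The computation the PR describes can be sketched on plain Scala collections for clarity (my sketch, not the PR's code, which operates on `RDD[LabeledPoint]`): compute the per-column mean and population standard deviation, then standardize each row, mapping zero-variance columns to 0.0 to avoid division by zero.

```scala
// Illustrative local version of column-wise standardization.
def normalizeByColLocal(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
  val n = rows.length.toDouble
  val numCols = rows.head.length
  val means = Array.tabulate(numCols)(j => rows.map(_(j)).sum / n)
  val stds = Array.tabulate(numCols) { j =>
    math.sqrt(rows.map(r => math.pow(r(j) - means(j), 2)).sum / n)
  }
  rows.map(r => Array.tabulate(numCols) { j =>
    if (stds(j) == 0.0) 0.0 else (r(j) - means(j)) / stds(j)
  })
}
```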