[GitHub] spark pull request: [SPARK-2993] [MLLib] colStats (wrapper around ...

2014-08-12 Thread dorx
GitHub user dorx opened a pull request: https://github.com/apache/spark/pull/1911 [SPARK-2993] [MLLib] colStats (wrapper around MultivariateStatisticalSummary) in Statistics For both Scala and Python. The ser/de util functions were moved out of `PythonMLLibAPI

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-11 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1733#discussion_r16075494 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala --- @@ -0,0 +1,220 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2937] Separate out samplyByKeyExact as ...

2014-08-09 Thread dorx
GitHub user dorx opened a pull request: https://github.com/apache/spark/pull/1866 [SPARK-2937] Separate out samplyByKeyExact as its own API in PairRDDFunction To enable Python consistency and `Experimental` label of the `sampleByKeyExact` API. You can merge this pull request

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-09 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1733#discussion_r16024688 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala --- @@ -0,0 +1,220 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-09 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1733#discussion_r16024698 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala --- @@ -0,0 +1,220 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-09 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1733#discussion_r16024886 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala --- @@ -0,0 +1,220 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1733#issuecomment-51570348 @mengxr @jkbradley @falaki In case you guys haven't noticed, the latest version implements the discussed APIs. --- If your project is set up for it, you can reply

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1733#discussion_r16009653 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1733#discussion_r16009835 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/stat/HypothesisTestSuite.scala --- @@ -0,0 +1,128 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1733#discussion_r16015802 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-05 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1733#issuecomment-51286506 @mengxr @jkbradley @falaki PR ready for review now.

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-05 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1733#discussion_r15854474 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala --- @@ -89,4 +90,76 @@ object Statistics { */ @Experimental

[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

2014-08-01 Thread dorx
GitHub user dorx opened a pull request: https://github.com/apache/spark/pull/1713 [SPARK-2786][mllib] Python correlations You can merge this pull request into a Git repository by running: $ git pull https://github.com/dorx/spark pythonCorrelation Alternatively you can review

[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

2014-08-01 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1713#discussion_r15709151 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -456,6 +458,37 @@ class PythonMLLibAPI extends Serializable

[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

2014-08-01 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1713#discussion_r15717671 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala --- @@ -49,43 +49,48 @@ private[stat] trait Correlation

[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

2014-08-01 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1713#discussion_r15717902 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -456,6 +458,37 @@ class PythonMLLibAPI extends Serializable

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-01 Thread dorx
GitHub user dorx opened a pull request: https://github.com/apache/spark/pull/1733 [SPARK-2515][mllib] Chi Squared test You can merge this pull request into a Git repository by running: $ git pull https://github.com/dorx/spark chisquare Alternatively you can review and apply

[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

2014-07-31 Thread dorx
GitHub user dorx opened a pull request: https://github.com/apache/spark/pull/1710 [SPARK-2782][mllib] Bug fix for getRanks in SpearmanCorrelation Before this patch, getRanks computes the wrong rank when numPartition = size in the input RDDs. Added unit tests to cover this bug. You can

[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

2014-07-31 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1710#issuecomment-50845017 @mengxr I'd really appreciate it if we can get this merged ASAP so I can send out my python correlation PR before the code freeze. Thanks!

[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

2014-07-31 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1710#discussion_r15681388 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala --- @@ -89,20 +89,17 @@ private[stat] object

[GitHub] spark pull request: [SPARK-2724] Python version of RandomRDDGenera...

2014-07-30 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1628#issuecomment-50663192 @JoshRosen any suggestions on what to do for the `random` name collision issue?

[GitHub] spark pull request: [SPARK-2724] Python version of RandomRDDGenera...

2014-07-30 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1628#issuecomment-50664227 The simple, but perhaps not most elegant solution is adding the following inside of pyspark/__init__.py: ``` import sys, importlib s = sys.path.pop(0
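
The snippet in the comment above is cut off, so the following is only a minimal sketch of the general idea it seems to start from: temporarily removing the first `sys.path` entry so that a same-named local module cannot shadow the standard-library `random`. The helper name `import_skipping_first_path_entry` is hypothetical and is not taken from the pull request.

```python
import importlib
import sys


def import_skipping_first_path_entry(module_name):
    """Import module_name while sys.path[0] is temporarily removed.

    Hypothetical helper: it only illustrates how a local module with the
    same name can be kept from shadowing a standard-library module.
    """
    if module_name in sys.modules:
        # Already imported; a plain import would return this cached module.
        return sys.modules[module_name]
    first_entry = sys.path.pop(0)
    try:
        return importlib.import_module(module_name)
    finally:
        # Always restore the search path, even if the import fails.
        sys.path.insert(0, first_entry)


stdlib_random = import_skipping_first_path_entry("random")
```

This only helps when the shadowing module is found through `sys.path[0]`; it does nothing for implicit relative imports inside a package, which may be why the thread kept looking for other options.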

[GitHub] spark pull request: [SPARK-2724] Python version of RandomRDDGenera...

2014-07-30 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1628#issuecomment-50679008 @JoshRosen tried it inside mllib/__init__.py and pyspark/__init__.py and still get the import error when trying to run anything inside of mllib.

[GitHub] spark pull request: [SPARK-2724] Python version of RandomRDDGenera...

2014-07-30 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1628#issuecomment-50689664 @JoshRosen yep I was also able to force it to work with an unnecessary import from pyspark.context to force it to import python's random first. The problem is now importing

[GitHub] spark pull request: [SPARK-2724] Python version of RandomRDDGenera...

2014-07-29 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1628#issuecomment-50525518 In NumPy's source, they had a directory named random: https://github.com/numpy/numpy/tree/master/numpy/random It seems like having directory hierarchy is the only way

[GitHub] spark pull request: [SPARK-2724] Python version of RandomRDDGenera...

2014-07-29 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1628#discussion_r15563853 --- Diff: python/pyspark/mllib/random/RandomRDDGenerators.py --- @@ -0,0 +1,201 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one

[GitHub] spark pull request: [SPARK-2724] Python version of RandomRDDGenera...

2014-07-29 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1628#issuecomment-50567408 Btw `from pyspark.mllib import random` now works with the latest commit in the pyspark shell.

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-28 Thread dorx
GitHub user dorx reopened a pull request: https://github.com/apache/spark/pull/1025 [SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact sample size Implemented stratified sampling that guarantees exact sample size using ScaRSR with two passes over the RDD
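
To make "exact sample size" concrete, here is a small in-memory sketch. It is not ScaRSR and not the code from this pull request: it simply groups values by key and draws without replacement with `random.sample`. The function name `exact_stratified_sample` and the choice of `ceil(fraction * count)` as the per-key sample size are assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def exact_stratified_sample(pairs, fractions, seed=None):
    """Return exactly ceil(fraction * count) pairs for every key.

    Hypothetical local helper, not a Spark API. Keys missing from
    `fractions` are treated as fraction 0.0 and contribute nothing.
    """
    groups = defaultdict(list)
    for key, value in pairs:  # first pass: collect the values of each key
        groups[key].append(value)
    rng = random.Random(seed)
    sample = []
    for key, values in groups.items():  # second pass: sample each stratum
        size = int(math.ceil(fractions.get(key, 0.0) * len(values)))
        sample.extend((key, v) for v in rng.sample(values, size))
    return sample


pairs = [("a", i) for i in range(4)] + [("b", i) for i in range(6)]
picked = exact_stratified_sample(pairs, {"a": 0.5, "b": 0.5}, seed=7)
```

Unlike per-record Bernoulli sampling, whose stratum sizes vary from run to run, this always returns the same number of records per key; the price is that each stratum has to be counted before it can be sampled.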

[GitHub] spark pull request: [SPARK-2724] Python version of RandomRDDGenera...

2014-07-28 Thread dorx
GitHub user dorx opened a pull request: https://github.com/apache/spark/pull/1628 [SPARK-2724] Python version of RandomRDDGenerators RandomRDDGenerators but without support for randomRDD and randomVectorRDD, which take in arbitrary DistributionGenerator. `randomRDD.py

[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-25 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15423644 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/random/DistributionGeneratorSuite.scala --- @@ -0,0 +1,91 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-25 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1520#issuecomment-50213475 Jenkins, retest this please.

[GitHub] spark pull request: [SPARK-2679] [MLLib] Ser/De for Double

2014-07-25 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1581#issuecomment-50213453 Jenkins, retest this please.

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-25 Thread dorx
Github user dorx closed the pull request at: https://github.com/apache/spark/pull/1025

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-24 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1025#issuecomment-50071767 Looks like there are some API changes from Xiangrui's updates. @mateiz @pwendell

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-24 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1025#issuecomment-50073867 Also, seems like there wasn't a single line of code preserved from before the updates. We should probably close this PR and let Xiangrui submit his version in a separate PR

[GitHub] spark pull request: [SPARK-2679] [MLLib] Ser/De for Double

2014-07-24 Thread dorx
GitHub user dorx opened a pull request: https://github.com/apache/spark/pull/1581 [SPARK-2679] [MLLib] Ser/De for Double Added a set of serializer/deserializer for Double in _common.py and PythonMLLibAPI in MLLib. You can merge this pull request into a Git repository by running

[GitHub] spark pull request: [SPARK-2679] [MLLib] Ser/De for Double

2014-07-24 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1581#issuecomment-50091128 @falaki @mengxr Created a separate PR for this so I can use it in both the python correlation and python randomRDD additions.

[GitHub] spark pull request: [SPARK-2679] [MLLib] Ser/De for Double

2014-07-24 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1581#issuecomment-50092950 @mengxr Given the current list of supported types, no, but if someone down the road adds Long or arrays of chars/shorts, etc, which isn't far-fetched, then it becomes

[GitHub] spark pull request: [SPARK-2679] [MLLib] Ser/De for Double

2014-07-24 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1581#issuecomment-50101155 The issue is what other things we can reasonably serialize into 8 bytes. Not sure how other types of doubles are relevant here since the size would be different and cause
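
The 8-byte concern in this thread can be illustrated with Python's `struct` module. This is a sketch under assumptions, not the wire format used by `_common.py` or `PythonMLLibAPI`: the little-endian byte order and the function names `serialize_double` and `deserialize_double` are chosen here for illustration only.

```python
import struct


def serialize_double(value):
    """Pack a Python float into exactly 8 bytes (IEEE 754 binary64)."""
    # "<d" is a little-endian double; the byte order is an assumption.
    return struct.pack("<d", value)


def deserialize_double(payload):
    """Unpack 8 bytes back into a float, rejecting any other length."""
    if len(payload) != 8:
        raise ValueError("expected 8 bytes, got %d" % len(payload))
    return struct.unpack("<d", payload)[0]


encoded = serialize_double(3.25)
decoded = deserialize_double(encoded)

# The same 8 bytes also unpack cleanly as a 64-bit integer ("<q"), so the
# payload length alone cannot tell a double apart from a long.
as_long = struct.unpack("<q", encoded)[0]
```

The last line shows why a length check is only a partial safeguard: any other 8-byte type added later would pass it, and telling the types apart would need something extra, such as a type tag.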

[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-23 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15305960 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala --- @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-23 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15307029 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDGenerators.scala --- @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2656] Python version of stratified samp...

2014-07-23 Thread dorx
GitHub user dorx opened a pull request: https://github.com/apache/spark/pull/1554 [SPARK-2656] Python version of stratified sampling exact sample size not supported for now. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dorx

[GitHub] spark pull request: [SPARK-2656] Python version of stratified samp...

2014-07-23 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1554#issuecomment-49951984 @mengxr @falaki

[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15262023 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/random/DistributionGenerator.scala --- @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15265484 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/random/DistributionGenerator.scala --- @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15265778 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala --- @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15265929 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala --- @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15266470 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala --- @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15266943 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDGenerators.scala --- @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-21 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1425#issuecomment-49683184 @dbtsai this is awesome! I actually created a JIRA on this after trying to use TestUtils in one of my unit suites, but it looks like you're already taking care

[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-21 Thread dorx
GitHub user dorx opened a pull request: https://github.com/apache/spark/pull/1520 [SPARK-2514] [mllib] Random RDD generator Utilities for generating random RDDs. RandomRDD and RandomVectorRDD are created instead of using `sc.parallelize(range:Range)` because `Range

[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-21 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1520#issuecomment-49695577 @falaki @jkbradley @mengxr

[GitHub] spark pull request: Fixed a typo in the comments in RangePartition...

2014-07-20 Thread dorx
Github user dorx closed the pull request at: https://github.com/apache/spark/pull/1473

[GitHub] spark pull request: Fixed a typo in the comments in RangePartition...

2014-07-18 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1473#discussion_r15098158 --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala --- @@ -135,7 +135,7 @@ class RangePartitioner[K : Ordering : ClassTag, V]( val k

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-18 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15098446 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala --- @@ -0,0 +1,125 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-18 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15098523 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala --- @@ -0,0 +1,125 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: Fixed a typo in the comments in RangePartition...

2014-07-18 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1473#issuecomment-49466619 A superficial look at the failed unit tests seems to suggest some Spark SQL optimizations rely on the fact that 1000 is set as the sequential scan threshold. @rxin

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-18 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15135411 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala --- @@ -0,0 +1,128 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: Fixed a typo in the comments in RangePartition...

2014-07-17 Thread dorx
GitHub user dorx opened a pull request: https://github.com/apache/spark/pull/1473 Fixed a typo in the comments in RangePartitioner Checked with Holden, the original author as per the log, and was told the code is right and the comment is wrong. You can merge this pull request into a Git

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15017783 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15017975 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15020400 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15022756 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15028681 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15030155 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15033178 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15036448 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15036918 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r14896742 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r14896912 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r14897552 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/PearsonCorrelation.scala --- @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r14899055 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r14899116 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r14900326 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r14900429 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-14 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1367#issuecomment-48953501 @mengxr Thanks for the feedback. Can you respond to my followup questions before I update my PR?

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1025#discussion_r14906065 --- Diff: pom.xml --- @@ -257,6 +257,11 @@ <version>1.5</version> </dependency> <dependency> + <groupId>org.apache.commons

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1025#discussion_r14906349 --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala --- @@ -195,6 +193,37 @@ class PairRDDFunctions[K, V](self: RDD[(K, V

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1025#discussion_r14906412 --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala --- @@ -195,6 +193,37 @@ class PairRDDFunctions[K, V](self: RDD[(K, V

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1025#discussion_r14906680 --- Diff: core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala --- @@ -0,0 +1,311 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1025#discussion_r14906754 --- Diff: core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala --- @@ -0,0 +1,311 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1025#discussion_r14906825 --- Diff: core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala --- @@ -0,0 +1,311 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1025#discussion_r14906919 --- Diff: core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala --- @@ -0,0 +1,311 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1025#discussion_r14907155 --- Diff: core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala --- @@ -0,0 +1,311 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1025#discussion_r14907202 --- Diff: core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala --- @@ -0,0 +1,311 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1025#discussion_r14907670 --- Diff: core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala --- @@ -0,0 +1,311 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1025#discussion_r14907870 --- Diff: core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala --- @@ -0,0 +1,311 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1025#discussion_r14907896 --- Diff: core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala --- @@ -0,0 +1,311 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-11 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r14836617 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-11 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r14836715 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-11 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r14836922 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-11 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r14846509 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-10 Thread dorx
GitHub user dorx opened a pull request: https://github.com/apache/spark/pull/1367 [SPARK-2359][MLlib] Correlations Implementation for Pearson and Spearman's correlation. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dorx/spark

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-10 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1367#issuecomment-48682999 @mengrx, @falaki, @jkbradley please take a look --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-08 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1025#discussion_r14672589 --- Diff: core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala --- @@ -0,0 +1,335 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-08 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1025#issuecomment-48386518 Jenkins, retest this please.

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-08 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1025#discussion_r14688550 --- Diff: core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala --- @@ -45,11 +50,75 @@ private[spark] object SamplingUtils { val

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-08 Thread dorx
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1025#discussion_r14688633 --- Diff: core/src/main/scala/org/apache/spark/util/random/StratifiedSampler.scala --- @@ -0,0 +1,310 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-08 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1025#issuecomment-48419179 Holding out on updating the docs until the Python version is supported. For the Python version, any objections to using _jrdd to invoke the Java version of sampleByKey

[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-08 Thread dorx
Github user dorx commented on the pull request: https://github.com/apache/spark/pull/1025#issuecomment-48419578 Jenkins, retest this please.
