[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2819#issuecomment-59319145

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21790/consoleFull) for PR 2819 at commit [`a405ae7`](https://github.com/apache/spark/commit/a405ae7b967a1a9398e3cdbb812149be7314f29e).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class VectorTransformer(object):`
  * `class Normalizer(VectorTransformer):`
  * `class JavaModelWrapper(VectorTransformer):`
  * `class StandardScalerModel(JavaModelWrapper):`
  * `class StandardScaler(object):`
  * `class HashingTF(object):`
  * `class IDFModel(JavaModelWrapper):`
  * `class IDF(object):`
  * `class Word2VecModel(JavaModelWrapper):`

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2819#issuecomment-59319148 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21790/ Test PASSed.
[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature
GitHub user davies opened a pull request:

https://github.com/apache/spark/pull/2819

[SPARK-3961] Python API for mllib.feature

Added the complete Python API for MLlib.feature:

Normalizer
StandardScalerModel
StandardScaler
HashTF
IDFModel
IDF

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark feature

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2819.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2819

commit 8a50584ed6ea38b5fccc64e6da3fc18d4513c9c5
Author: Davies Liu davies@gmail.com
Date: 2014-10-16T00:02:16Z

    Python API for mllib.feature
[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2819#issuecomment-59296810 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21784/consoleFull) for PR 2819 at commit [`8a50584`](https://github.com/apache/spark/commit/8a50584ed6ea38b5fccc64e6da3fc18d4513c9c5). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2819#discussion_r18932327

--- Diff: python/pyspark/mllib/feature.py ---
@@ -95,90 +360,46 @@ class Word2Vec(object):
     >>> sentence = "a b " * 100 + "a c " * 10
     >>> localDoc = [sentence, sentence]
     >>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
-    >>> model = Word2Vec().setVectorSize(10).setSeed(42L).fit(doc)
+    >>> model = Word2Vec(vectorSize=10).fit(doc)
+
     >>> syms = model.findSynonyms("a", 2)
-    >>> str(syms[0][0])
-    'b'
-    >>> str(syms[1][0])
-    'c'
-    >>> len(syms)
-    2
+    >>> [s[0] for s in syms]
+    [u'b', u'c']
     >>> vec = model.transform("a")
-    >>> len(vec)
-    10
     >>> syms = model.findSynonyms(vec, 2)
-    >>> str(syms[0][0])
-    'b'
-    >>> str(syms[1][0])
-    'c'
-    >>> len(syms)
-    2
+    >>> [s[0] for s in syms]
+    [u'b', u'c']
-    def __init__(self):
+    def __init__(self, vectorSize=100, learningRate=0.025, numPartitions=1,
+                 numIterations=1, seed=42L):
         Construct Word2Vec instance
-
-        self.vectorSize = 100
-        self.learningRate = 0.025
-        self.numPartitions = 1
-        self.numIterations = 1
-        self.seed = 42L
-    def setVectorSize(self, vectorSize):
-
-        Sets vector size (default: 100).
+        :param vectorSize: vector size (default: 100).
+        :param learningRate: initial learning rate (default: 0.025).
+        :param numPartitions: number of partitions (default: 1). Use
+                              a small number for accuracy.
+        :param numIterations: number of iterations (default: 1), which should
+                              be smaller than or equal to number of partitions.
--- End diff --

In the Scala/Java Word2Vec implementation, we use setters to set parameters. Should we keep the same interface on the Python side?
[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2819#issuecomment-59298672 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21785/consoleFull) for PR 2819 at commit [`486795f`](https://github.com/apache/spark/commit/486795f1d8792c15c9f97b22b1015b23fb7c8d81). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2819#discussion_r18932968

--- Diff: python/pyspark/mllib/feature.py ---
@@ -95,90 +360,46 @@ class Word2Vec(object):
     >>> sentence = "a b " * 100 + "a c " * 10
     >>> localDoc = [sentence, sentence]
     >>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
-    >>> model = Word2Vec().setVectorSize(10).setSeed(42L).fit(doc)
+    >>> model = Word2Vec(vectorSize=10).fit(doc)
+
     >>> syms = model.findSynonyms("a", 2)
-    >>> str(syms[0][0])
-    'b'
-    >>> str(syms[1][0])
-    'c'
-    >>> len(syms)
-    2
+    >>> [s[0] for s in syms]
+    [u'b', u'c']
     >>> vec = model.transform("a")
-    >>> len(vec)
-    10
     >>> syms = model.findSynonyms(vec, 2)
-    >>> str(syms[0][0])
-    'b'
-    >>> str(syms[1][0])
-    'c'
-    >>> len(syms)
-    2
+    >>> [s[0] for s in syms]
+    [u'b', u'c']
-    def __init__(self):
+    def __init__(self, vectorSize=100, learningRate=0.025, numPartitions=1,
+                 numIterations=1, seed=42L):
         Construct Word2Vec instance
-
-        self.vectorSize = 100
-        self.learningRate = 0.025
-        self.numPartitions = 1
-        self.numIterations = 1
-        self.seed = 42L
-    def setVectorSize(self, vectorSize):
-
-        Sets vector size (default: 100).
+        :param vectorSize: vector size (default: 100).
+        :param learningRate: initial learning rate (default: 0.025).
+        :param numPartitions: number of partitions (default: 1). Use
+                              a small number for accuracy.
+        :param numIterations: number of iterations (default: 1), which should
+                              be smaller than or equal to number of partitions.
--- End diff --

It's good to have the same interface across languages, but sometimes it looks weird to have an API that was designed for Java. I'd like to simplify the Python API a little bit (without introducing confusion), so that Python programmers can feel better (in this case). We can find several similar cases in the APIs of pyspark.RDD. Does that make sense?
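The two parameter-setting styles debated in the review comments on this diff can be compared side by side. What follows is only a minimal sketch, assuming a hypothetical `Word2VecParams` class that stores parameters and does no training; it is not the pyspark `Word2Vec` class. The parameter names and defaults are taken from the diff, and the seed is written as `42` instead of `42L` so the sketch also runs on Python 3.

```python
class Word2VecParams(object):
    """Hypothetical parameter holder for illustration; not the pyspark class."""

    def __init__(self, vectorSize=100, learningRate=0.025,
                 numPartitions=1, numIterations=1, seed=42):
        # Keyword-argument style: every parameter is set in the constructor,
        # with the defaults shown in the diff.
        self.vectorSize = vectorSize
        self.learningRate = learningRate
        self.numPartitions = numPartitions
        self.numIterations = numIterations
        self.seed = seed

    # Setter style, mirroring the Scala/Java interface: each setter
    # returns self so that calls can be chained.
    def setVectorSize(self, vectorSize):
        self.vectorSize = vectorSize
        return self

    def setSeed(self, seed):
        self.seed = seed
        return self


# Both styles end up with the same configuration.
by_kwargs = Word2VecParams(vectorSize=10)
by_setters = Word2VecParams().setVectorSize(10).setSeed(42)
```

Nothing prevents a class from offering both styles as this sketch does; the question raised in the review is which one the Python API should expose.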
[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2819#issuecomment-59300862

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21784/consoleFull) for PR 2819 at commit [`8a50584`](https://github.com/apache/spark/commit/8a50584ed6ea38b5fccc64e6da3fc18d4513c9c5).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class VectorTransformer(object):`
  * `class Normalizer(VectorTransformer):`
  * `class JavaModelWrapper(VectorTransformer):`
  * `class StandardScalerModel(JavaModelWrapper):`
  * `class StandardScaler(object):`
  * `class HashTF(object):`
  * `class IDFModel(JavaModelWrapper):`
  * `class IDF(object):`
  * `class Word2VecModel(JavaModelWrapper):`
[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2819#issuecomment-59300866 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21784/ Test FAILed.
[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2819#issuecomment-59302533

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21785/consoleFull) for PR 2819 at commit [`486795f`](https://github.com/apache/spark/commit/486795f1d8792c15c9f97b22b1015b23fb7c8d81).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class VectorTransformer(object):`
  * `class Normalizer(VectorTransformer):`
  * `class JavaModelWrapper(VectorTransformer):`
  * `class StandardScalerModel(JavaModelWrapper):`
  * `class StandardScaler(object):`
  * `class HashingTF(object):`
  * `class IDFModel(JavaModelWrapper):`
  * `class IDF(object):`
  * `class Word2VecModel(JavaModelWrapper):`
[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2819#issuecomment-59302540 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21785/ Test FAILed.
[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2819#issuecomment-59310073 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21789/consoleFull) for PR 2819 at commit [`7a1891a`](https://github.com/apache/spark/commit/7a1891abe6647a5f9dc82c21add907fe2d4b9aa8). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2819#issuecomment-59313284

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21789/consoleFull) for PR 2819 at commit [`7a1891a`](https://github.com/apache/spark/commit/7a1891abe6647a5f9dc82c21add907fe2d4b9aa8).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class VectorTransformer(object):`
  * `class Normalizer(VectorTransformer):`
  * `class JavaModelWrapper(VectorTransformer):`
  * `class StandardScalerModel(JavaModelWrapper):`
  * `class StandardScaler(object):`
  * `class HashingTF(object):`
  * `class IDFModel(JavaModelWrapper):`
  * `class IDF(object):`
  * `class Word2VecModel(JavaModelWrapper):`
[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2819#issuecomment-59313288 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21789/ Test FAILed.
[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2819#issuecomment-59314990 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21790/consoleFull) for PR 2819 at commit [`a405ae7`](https://github.com/apache/spark/commit/a405ae7b967a1a9398e3cdbb812149be7314f29e). * This patch merges cleanly.