[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

2014-10-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2819#issuecomment-59319145
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21790/consoleFull) for PR 2819 at commit [`a405ae7`](https://github.com/apache/spark/commit/a405ae7b967a1a9398e3cdbb812149be7314f29e).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class VectorTransformer(object):`
  * `class Normalizer(VectorTransformer):`
  * `class JavaModelWrapper(VectorTransformer):`
  * `class StandardScalerModel(JavaModelWrapper):`
  * `class StandardScaler(object):`
  * `class HashingTF(object):`
  * `class IDFModel(JavaModelWrapper):`
  * `class IDF(object):`
  * `class Word2VecModel(JavaModelWrapper):`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

2014-10-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2819#issuecomment-59319148
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21790/
Test PASSed.





[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

2014-10-15 Thread davies
GitHub user davies opened a pull request:

https://github.com/apache/spark/pull/2819

[SPARK-3961] Python API for mllib.feature

Added a complete Python API for mllib.feature:

Normalizer
StandardScalerModel
StandardScaler
HashTF
IDFModel
IDF


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark feature

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2819.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2819


commit 8a50584ed6ea38b5fccc64e6da3fc18d4513c9c5
Author: Davies Liu davies@gmail.com
Date:   2014-10-16T00:02:16Z

Python API for mllib.feature







[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

2014-10-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2819#issuecomment-59296810
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21784/consoleFull) for PR 2819 at commit [`8a50584`](https://github.com/apache/spark/commit/8a50584ed6ea38b5fccc64e6da3fc18d4513c9c5).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

2014-10-15 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2819#discussion_r18932327
  
--- Diff: python/pyspark/mllib/feature.py ---
@@ -95,90 +360,46 @@ class Word2Vec(object):
     >>> sentence = "a b " * 100 + "a c " * 10
     >>> localDoc = [sentence, sentence]
     >>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
-    >>> model = Word2Vec().setVectorSize(10).setSeed(42L).fit(doc)
+    >>> model = Word2Vec(vectorSize=10).fit(doc)
+
     >>> syms = model.findSynonyms("a", 2)
-    >>> str(syms[0][0])
-    'b'
-    >>> str(syms[1][0])
-    'c'
-    >>> len(syms)
-    2
+    >>> [s[0] for s in syms]
+    [u'b', u'c']
     >>> vec = model.transform("a")
-    >>> len(vec)
-    10
     >>> syms = model.findSynonyms(vec, 2)
-    >>> str(syms[0][0])
-    'b'
-    >>> str(syms[1][0])
-    'c'
-    >>> len(syms)
-    2
+    >>> [s[0] for s in syms]
+    [u'b', u'c']
     """
-    def __init__(self):
+    def __init__(self, vectorSize=100, learningRate=0.025, numPartitions=1,
+                 numIterations=1, seed=42L):
         """
         Construct Word2Vec instance
-        """
-        self.vectorSize = 100
-        self.learningRate = 0.025
-        self.numPartitions = 1
-        self.numIterations = 1
-        self.seed = 42L
 
-    def setVectorSize(self, vectorSize):
-        """
-        Sets vector size (default: 100).
+        :param vectorSize: vector size (default: 100).
+        :param learningRate:  initial learning rate (default: 0.025).
+        :param numPartitions: number of partitions (default: 1). Use
+                              a small number for accuracy.
+        :param numIterations: number of iterations (default: 1), which should
+                              be smaller than or equal to number of partitions.
         """
--- End diff --

In the Scala/Java Word2Vec implementation, we use setters to set parameters.
Should we keep the same interface on the Python side?





[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

2014-10-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2819#issuecomment-59298672
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21785/consoleFull) for PR 2819 at commit [`486795f`](https://github.com/apache/spark/commit/486795f1d8792c15c9f97b22b1015b23fb7c8d81).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

2014-10-15 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2819#discussion_r18932968
  
--- Diff: python/pyspark/mllib/feature.py ---
@@ -95,90 +360,46 @@ class Word2Vec(object):
     >>> sentence = "a b " * 100 + "a c " * 10
     >>> localDoc = [sentence, sentence]
     >>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
-    >>> model = Word2Vec().setVectorSize(10).setSeed(42L).fit(doc)
+    >>> model = Word2Vec(vectorSize=10).fit(doc)
+
     >>> syms = model.findSynonyms("a", 2)
-    >>> str(syms[0][0])
-    'b'
-    >>> str(syms[1][0])
-    'c'
-    >>> len(syms)
-    2
+    >>> [s[0] for s in syms]
+    [u'b', u'c']
     >>> vec = model.transform("a")
-    >>> len(vec)
-    10
     >>> syms = model.findSynonyms(vec, 2)
-    >>> str(syms[0][0])
-    'b'
-    >>> str(syms[1][0])
-    'c'
-    >>> len(syms)
-    2
+    >>> [s[0] for s in syms]
+    [u'b', u'c']
     """
-    def __init__(self):
+    def __init__(self, vectorSize=100, learningRate=0.025, numPartitions=1,
+                 numIterations=1, seed=42L):
         """
         Construct Word2Vec instance
-        """
-        self.vectorSize = 100
-        self.learningRate = 0.025
-        self.numPartitions = 1
-        self.numIterations = 1
-        self.seed = 42L
 
-    def setVectorSize(self, vectorSize):
-        """
-        Sets vector size (default: 100).
+        :param vectorSize: vector size (default: 100).
+        :param learningRate:  initial learning rate (default: 0.025).
+        :param numPartitions: number of partitions (default: 1). Use
+                              a small number for accuracy.
+        :param numIterations: number of iterations (default: 1), which should
+                              be smaller than or equal to number of partitions.
         """
--- End diff --

It's good to have the same interface across languages, but an API that was
designed for Java sometimes looks weird in Python.

I'd like to simplify the Python API a little (without introducing
confusion), so that Python programmers feel more at home in this case. There
are several similar cases in the pyspark.RDD APIs.

Does that make sense?
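
For readers comparing the two styles discussed here, the following is a minimal sketch of how one class could accept keyword arguments in its constructor and still offer chainable setters. `Word2VecParams` is a hypothetical stand-in written for illustration only; it is not the pyspark class and does no training. The defaults mirror the ones shown in the diff, except that the seed is written as `42` rather than `42L` so that the sketch also runs on Python 3.

```python
class Word2VecParams(object):
    """Hypothetical parameter holder used only to compare the two API styles."""

    def __init__(self, vectorSize=100, learningRate=0.025,
                 numPartitions=1, numIterations=1, seed=42):
        # Keyword-argument style: every parameter has a default and can be
        # overridden at construction time.
        self.vectorSize = vectorSize
        self.learningRate = learningRate
        self.numPartitions = numPartitions
        self.numIterations = numIterations
        self.seed = seed

    def setVectorSize(self, vectorSize):
        # Setter style: return self so that calls can be chained.
        self.vectorSize = vectorSize
        return self

    def setSeed(self, seed):
        self.seed = seed
        return self


# Keyword-argument style, as proposed for the Python API.
a = Word2VecParams(vectorSize=10)

# Chained-setter style, mirroring the Scala/Java interface.
b = Word2VecParams().setVectorSize(10).setSeed(42)

# Both styles end up with the same configuration.
assert (a.vectorSize, a.seed) == (b.vectorSize, b.seed)
```

Supporting both is one possible compromise: the constructor reads naturally in Python, while the setters keep code ported from Scala or Java working.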





[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

2014-10-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2819#issuecomment-59300862
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21784/consoleFull) for PR 2819 at commit [`8a50584`](https://github.com/apache/spark/commit/8a50584ed6ea38b5fccc64e6da3fc18d4513c9c5).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class VectorTransformer(object):`
  * `class Normalizer(VectorTransformer):`
  * `class JavaModelWrapper(VectorTransformer):`
  * `class StandardScalerModel(JavaModelWrapper):`
  * `class StandardScaler(object):`
  * `class HashTF(object):`
  * `class IDFModel(JavaModelWrapper):`
  * `class IDF(object):`
  * `class Word2VecModel(JavaModelWrapper):`






[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

2014-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2819#issuecomment-59300866
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21784/
Test FAILed.





[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

2014-10-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2819#issuecomment-59302533
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21785/consoleFull) for PR 2819 at commit [`486795f`](https://github.com/apache/spark/commit/486795f1d8792c15c9f97b22b1015b23fb7c8d81).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class VectorTransformer(object):`
  * `class Normalizer(VectorTransformer):`
  * `class JavaModelWrapper(VectorTransformer):`
  * `class StandardScalerModel(JavaModelWrapper):`
  * `class StandardScaler(object):`
  * `class HashingTF(object):`
  * `class IDFModel(JavaModelWrapper):`
  * `class IDF(object):`
  * `class Word2VecModel(JavaModelWrapper):`






[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

2014-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2819#issuecomment-59302540
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21785/
Test FAILed.





[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

2014-10-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2819#issuecomment-59310073
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21789/consoleFull) for PR 2819 at commit [`7a1891a`](https://github.com/apache/spark/commit/7a1891abe6647a5f9dc82c21add907fe2d4b9aa8).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

2014-10-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2819#issuecomment-59313284
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21789/consoleFull) for PR 2819 at commit [`7a1891a`](https://github.com/apache/spark/commit/7a1891abe6647a5f9dc82c21add907fe2d4b9aa8).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class VectorTransformer(object):`
  * `class Normalizer(VectorTransformer):`
  * `class JavaModelWrapper(VectorTransformer):`
  * `class StandardScalerModel(JavaModelWrapper):`
  * `class StandardScaler(object):`
  * `class HashingTF(object):`
  * `class IDFModel(JavaModelWrapper):`
  * `class IDF(object):`
  * `class Word2VecModel(JavaModelWrapper):`






[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

2014-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2819#issuecomment-59313288
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21789/
Test FAILed.





[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

2014-10-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2819#issuecomment-59314990
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21790/consoleFull) for PR 2819 at commit [`a405ae7`](https://github.com/apache/spark/commit/a405ae7b967a1a9398e3cdbb812149be7314f29e).
 * This patch merges cleanly.

