[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

2018-03-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20777


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

2018-03-16 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20777#discussion_r175184951
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -70,19 +70,21 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit
   def getMinDF: Double = $(minDF)
 
   /**
-   * Specifies the maximum number of different documents a term must appear in to be included
-   * in the vocabulary.
-   * If this is an integer greater than or equal to 1, this specifies the number of documents
-   * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
-   * documents.
+   * Specifies the maximum number of different documents a term could appear in to be included
+   * in the vocabulary. A term that appears more than the threshold will be ignored. If this is an
+   * integer greater than or equal to 1, this specifies the maximum number of documents the term
+   * could appear in; if this is a double in [0,1), then this specifies the maximum fraction of
+   * documents the term could appear in.
--- End diff --

Thanks, @srowen!


---




[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

2018-03-16 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20777#discussion_r175184795
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -70,19 +70,21 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit
   def getMinDF: Double = $(minDF)
 
   /**
-   * Specifies the maximum number of different documents a term must appear in to be included
-   * in the vocabulary.
-   * If this is an integer greater than or equal to 1, this specifies the number of documents
-   * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
-   * documents.
+   * Specifies the maximum number of different documents a term could appear in to be included
+   * in the vocabulary. A term that appears more than the threshold will be ignored. If this is an
+   * integer greater than or equal to 1, this specifies the maximum number of documents the term
+   * could appear in; if this is a double in [0,1), then this specifies the maximum fraction of
+   * documents the term could appear in.
    *
-   * Default: (2^64^) - 1
+   * Default: (2^63) - 1
--- End diff --

If you can, it's always a good idea to generate the docs to make sure of any changes.


---




[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

2018-03-16 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20777#discussion_r175184503
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -70,19 +70,21 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit
   def getMinDF: Double = $(minDF)
 
   /**
-   * Specifies the maximum number of different documents a term must appear in to be included
-   * in the vocabulary.
-   * If this is an integer greater than or equal to 1, this specifies the number of documents
-   * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
-   * documents.
+   * Specifies the maximum number of different documents a term could appear in to be included
+   * in the vocabulary. A term that appears more than the threshold will be ignored. If this is an
+   * integer greater than or equal to 1, this specifies the maximum number of documents the term
+   * could appear in; if this is a double in [0,1), then this specifies the maximum fraction of
+   * documents the term could appear in.
    *
-   * Default: (2^64^) - 1
+   * Default: (2^63) - 1
--- End diff --

I think the scaladoc format actually needs the extra '^' to display right; see the `vocabSize` default.


---




[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

2018-03-15 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20777#discussion_r174935305
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -465,26 +473,26 @@ class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable,
         " Default False", typeConverter=TypeConverters.toBoolean)
 
     @keyword_only
-    def __init__(self, minTF=1.0, minDF=1.0, vocabSize=1 << 18, binary=False, inputCol=None,
-                 outputCol=None):
+    def __init__(self, minTF=1.0, minDF=1.0, maxDF=sys.maxsize, vocabSize=1 << 18, binary=False,
--- End diff --

Will make the change now. Thanks!


---




[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

2018-03-15 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20777#discussion_r174911085
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -70,19 +70,21 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit
   def getMinDF: Double = $(minDF)
 
   /**
-   * Specifies the maximum number of different documents a term must appear in to be included
-   * in the vocabulary.
-   * If this is an integer greater than or equal to 1, this specifies the number of documents
-   * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
-   * documents.
+   * Specifies the maximum number of different documents a term could appear in to be included
+   * in the vocabulary. A term that appears more than the threshold will be ignored. If this is an
+   * integer greater than or equal to 1, this specifies the maximum number of documents the term
+   * could appear in; if this is a double in [0,1), then this specifies the maximum fraction of
+   * documents the term could appear in.
--- End diff --

Agree, your wording is clearer.


---




[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

2018-03-15 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20777#discussion_r174864748
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -465,26 +473,26 @@ class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable,
         " Default False", typeConverter=TypeConverters.toBoolean)
 
     @keyword_only
-    def __init__(self, minTF=1.0, minDF=1.0, vocabSize=1 << 18, binary=False, inputCol=None,
-                 outputCol=None):
+    def __init__(self, minTF=1.0, minDF=1.0, maxDF=sys.maxsize, vocabSize=1 << 18, binary=False,
--- End diff --

I think it's best just to hardcode the value like you did before; `sys.maxsize` can be 32-bit on some systems: https://docs.python.org/3/library/sys.html#sys.maxsize
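Bryan's point is easy to check in plain Python (no Spark required). A minimal sketch, assuming the goal is to mirror Scala's `Long.MaxValue` exactly on every platform:

```python
import sys

# Scala's Long.MaxValue, which the Scala CountVectorizer uses as the maxDF
# default. Hardcoding the literal keeps the Python default identical on
# every platform, which is the motivation for avoiding sys.maxsize here.
SCALA_LONG_MAX = (1 << 63) - 1

print(SCALA_LONG_MAX)  # 9223372036854775807

# sys.maxsize is the interpreter's maximum container size, not a fixed
# constant: 2**63 - 1 on 64-bit CPython builds, 2**31 - 1 on 32-bit ones,
# so the comparison below is platform-dependent.
print(sys.maxsize == SCALA_LONG_MAX)
```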


---




[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

2018-03-15 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20777#discussion_r174863155
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -70,19 +70,21 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit
   def getMinDF: Double = $(minDF)
 
   /**
-   * Specifies the maximum number of different documents a term must appear in to be included
-   * in the vocabulary.
-   * If this is an integer greater than or equal to 1, this specifies the number of documents
-   * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
-   * documents.
+   * Specifies the maximum number of different documents a term could appear in to be included
+   * in the vocabulary. A term that appears more than the threshold will be ignored. If this is an
+   * integer greater than or equal to 1, this specifies the maximum number of documents the term
+   * could appear in; if this is a double in [0,1), then this specifies the maximum fraction of
+   * documents the term could appear in.
--- End diff --

@srowen do these doc changes look ok to you? It was a little confusing before, saying that the term "must appear" when it's a max value.


---




[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

2018-03-14 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20777#discussion_r174636559
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -679,6 +679,29 @@ def test_count_vectorizer_with_binary(self):
             feature, expected = r
             self.assertEqual(feature, expected)
 
+    def test_count_vectorizer_with_maxDF(self):
+        dataset = self.spark.createDataFrame([
+            (0, "a b c d".split(' '), SparseVector(3, {0: 1.0, 1: 1.0, 2: 1.0}),),
+            (1, "a b c".split(' '), SparseVector(3, {0: 1.0, 1: 1.0}),),
+            (2, "a b".split(' '), SparseVector(3, {0: 1.0}),),
+            (3, "a".split(' '), SparseVector(3,  {}),)], ["id", "words", "expected"])
+        cv = CountVectorizer(inputCol="words", outputCol="features")
+        model1 = cv.setMaxDF(3).fit(dataset)
--- End diff --

Hi Bryan, Thanks for your comments. I will change these. 


---




[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

2018-03-14 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20777#discussion_r174627578
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -679,6 +679,29 @@ def test_count_vectorizer_with_binary(self):
             feature, expected = r
             self.assertEqual(feature, expected)
 
+    def test_count_vectorizer_with_maxDF(self):
+        dataset = self.spark.createDataFrame([
+            (0, "a b c d".split(' '), SparseVector(3, {0: 1.0, 1: 1.0, 2: 1.0}),),
+            (1, "a b c".split(' '), SparseVector(3, {0: 1.0, 1: 1.0}),),
+            (2, "a b".split(' '), SparseVector(3, {0: 1.0}),),
+            (3, "a".split(' '), SparseVector(3,  {}),)], ["id", "words", "expected"])
+        cv = CountVectorizer(inputCol="words", outputCol="features")
+        model1 = cv.setMaxDF(3).fit(dataset)
--- End diff --

Could you also add an assert that the vocabulary is equal to something? I think it would be ['b', 'c', 'd'].


---




[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

2018-03-14 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20777#discussion_r174626899
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -679,6 +679,29 @@ def test_count_vectorizer_with_binary(self):
             feature, expected = r
             self.assertEqual(feature, expected)
 
+    def test_count_vectorizer_with_maxDF(self):
+        dataset = self.spark.createDataFrame([
+            (0, "a b c d".split(' '), SparseVector(3, {0: 1.0, 1: 1.0, 2: 1.0}),),
+            (1, "a b c".split(' '), SparseVector(3, {0: 1.0, 1: 1.0}),),
+            (2, "a b".split(' '), SparseVector(3, {0: 1.0}),),
+            (3, "a".split(' '), SparseVector(3,  {}),)], ["id", "words", "expected"])
+        cv = CountVectorizer(inputCol="words", outputCol="features")
+        model1 = cv.setMaxDF(3).fit(dataset)
--- End diff --

Actually, I still don't think setting the `maxDF` value is doing anything different to the model. You want the test to fail if you do not set the value to 3. To do that, I think you will also need to assert that the vocabulary equals an expected value.
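The expected vocabulary can be worked out without Spark. Below is a plain-Python sketch (a hypothetical helper, not Spark's implementation) of the document-frequency filtering that `maxDF` is described to perform, applied to the exact dataset from the test:

```python
from collections import Counter

def vocab_with_max_df(docs, max_df):
    # Count document frequency: each term is counted once per document.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Keep terms appearing in at most max_df documents, ordered by descending
    # frequency with alphabetical tie-breaking (an assumption that roughly
    # matches how CountVectorizer orders its vocabulary).
    kept = [t for t, n in df.items() if n <= max_df]
    return sorted(kept, key=lambda t: (-df[t], t))

# The four documents from test_count_vectorizer_with_maxDF:
docs = ["a b c d".split(), "a b c".split(), "a b".split(), "a".split()]
print(vocab_with_max_df(docs, 3))  # ['b', 'c', 'd'] -- 'a' (df=4) exceeds maxDF=3
```

This matches Bryan's earlier guess of ['b', 'c', 'd']: with `maxDF=3`, only 'a', which occurs in all four documents, is dropped.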


---




[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

2018-03-14 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20777#discussion_r174624206
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -70,19 +70,22 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit
   def getMinDF: Double = $(minDF)
 
   /**
-   * Specifies the maximum number of different documents a term must appear in to be included
-   * in the vocabulary.
-   * If this is an integer greater than or equal to 1, this specifies the number of documents
-   * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
-   * documents.
+   * maxDF is used for removing terms that appear too frequently. It specifies the maximum number
+   * of different documents a term could appear in to be included in the vocabulary.
+   * If this is an integer greater than or equal to 1, this specifies the maximum number of
+   * documents the term could appear in; if this is a double in [0,1), then this specifies the
+   * maximum fraction of documents the term could appear in. A term appears more frequently
+   * than maxDF will be removed.
--- End diff --

This sounds much better, but it should probably use "ignore" instead of "remove", and it might be good to reorder the sentence like this:

```
Specifies the maximum number of different documents a term could appear in to be included
in the vocabulary. A term that appears more than the threshold will be ignored. If this is an
integer greater than or equal to 1, this specifies the maximum number of documents the term
could appear in; if this is a double in [0,1), then this specifies the maximum fraction of
documents the term could appear in.
```
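Read literally, that wording maps to a simple threshold rule. As a hedged plain-Python illustration of the doc text (not Spark's actual code):

```python
def max_df_threshold(max_df, num_docs):
    # Per the wording above: a value >= 1.0 is an absolute document count;
    # a double in [0, 1) is a fraction of the total number of documents.
    return max_df if max_df >= 1.0 else max_df * num_docs

# With the 4-document dataset used elsewhere in this thread, an integer
# maxDF=3 and a fractional maxDF=0.75 express the same cutoff:
print(max_df_threshold(3, 4))     # 3
print(max_df_threshold(0.75, 4))  # 3.0
```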


---




[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

2018-03-14 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20777#discussion_r174625203
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -70,19 +70,22 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit
   def getMinDF: Double = $(minDF)
 
   /**
-   * Specifies the maximum number of different documents a term must appear in to be included
-   * in the vocabulary.
-   * If this is an integer greater than or equal to 1, this specifies the number of documents
-   * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
-   * documents.
+   * maxDF is used for removing terms that appear too frequently. It specifies the maximum number
+   * of different documents a term could appear in to be included in the vocabulary.
+   * If this is an integer greater than or equal to 1, this specifies the maximum number of
+   * documents the term could appear in; if this is a double in [0,1), then this specifies the
+   * maximum fraction of documents the term could appear in. A term appears more frequently
+   * than maxDF will be removed.
    *
-   * Default: (2^64^) - 1
+   * Default: (2^63) - 1
--- End diff --

good catch!


---




[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

2018-03-08 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20777#discussion_r173369004
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -465,26 +522,26 @@ class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable,
         " Default False", typeConverter=TypeConverters.toBoolean)
 
     @keyword_only
-    def __init__(self, minTF=1.0, minDF=1.0, vocabSize=1 << 18, binary=False, inputCol=None,
-                 outputCol=None):
+    def __init__(self, minTF=1.0, minDF=1.0, maxDF=2 ** 63 - 1, vocabSize=1 << 18, binary=False,
--- End diff --

Thank you very much for the comments. Will make changes. 


---




[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

2018-03-08 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20777#discussion_r173336643
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -465,26 +522,26 @@ class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable,
         " Default False", typeConverter=TypeConverters.toBoolean)
 
     @keyword_only
-    def __init__(self, minTF=1.0, minDF=1.0, vocabSize=1 << 18, binary=False, inputCol=None,
-                 outputCol=None):
+    def __init__(self, minTF=1.0, minDF=1.0, maxDF=2 ** 63 - 1, vocabSize=1 << 18, binary=False,
--- End diff --

I'm not crazy about hardcoding a value here, since in Scala it is `Long.MaxValue`, but I'm not sure there is another way.


---




[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

2018-03-08 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20777#discussion_r173336451
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -455,6 +506,12 @@ class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable,
         " If this is an integer >= 1, this specifies the number of documents the term must" +
         " appear in; if this is a double in [0,1), then this specifies the fraction of documents." +
         " Default 1.0", typeConverter=TypeConverters.toFloat)
+    maxDF = Param(
+        Params._dummy(), "maxDF", "Specifies the minimum number of" +
+        " different documents a term must appear in to be included in the vocabulary." +
+        " If this is an integer >= 1, this specifies the number of documents the term must" +
+        " appear in; if this is a double in [0,1), then this specifies the fraction of documents." +
+        " Default (2^63) - 1", typeConverter=TypeConverters.toFloat)
--- End diff --

I think this documentation is exactly the same as `minDF`; please refer to the Scala docs. Actually, I think the Scala doc is a little confusing and could be clearer. Would you like to take a shot at rewording it?


---




[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

2018-03-08 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20777#discussion_r173335895
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -408,35 +408,86 @@ class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable,
     """
     Extracts a vocabulary from document collections and generates a :py:attr:`CountVectorizerModel`.
 
->>> df = spark.createDataFrame(
+>>> df1 = spark.createDataFrame(
 ...    [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
 ...    ["label", "raw"])
->>> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
->>> model = cv.fit(df)
->>> model.transform(df).show(truncate=False)
+>>> cv1 = CountVectorizer(inputCol="raw", outputCol="vectors")
+>>> model1 = cv1.fit(df1)
+>>> model1.transform(df1).show(truncate=False)
 +-----+---------------+-------------------------+
 |label|raw            |vectors                  |
 +-----+---------------+-------------------------+
 |0    |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
 |1    |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
 +-----+---------------+-------------------------+
 ...
->>> sorted(model.vocabulary) == ['a', 'b', 'c']
+>>> sorted(model1.vocabulary) == ['a', 'b', 'c']
 True
 >>> countVectorizerPath = temp_path + "/count-vectorizer"
->>> cv.save(countVectorizerPath)
+>>> cv1.save(countVectorizerPath)
 >>> loadedCv = CountVectorizer.load(countVectorizerPath)
->>> loadedCv.getMinDF() == cv.getMinDF()
+>>> loadedCv.getMinDF() == cv1.getMinDF()
 True
->>> loadedCv.getMinTF() == cv.getMinTF()
+>>> loadedCv.getMinTF() == cv1.getMinTF()
 True
->>> loadedCv.getVocabSize() == cv.getVocabSize()
+>>> loadedCv.getVocabSize() == cv1.getVocabSize()
 True
 >>> modelPath = temp_path + "/count-vectorizer-model"
->>> model.save(modelPath)
+>>> model1.save(modelPath)
 >>> loadedModel = CountVectorizerModel.load(modelPath)
->>> loadedModel.vocabulary == model.vocabulary
+>>> loadedModel.vocabulary == model1.vocabulary
 True
+>>> df2 = spark.createDataFrame(
+...    [(0, ["a", "b", "c", "d"]), (1, ["a", "b", "c",]),(2, ["a", "b"]),(3, ["a"]),],
+...    ["label", "raw"])
+>>> cv2 = CountVectorizer(inputCol="raw", outputCol="vectors", maxDF=3)
+>>> model2 = cv2.fit(df2)
+>>> model2.transform(df2).show(truncate=False)
++-----+------------+-------------------------+
+|label|raw         |vectors                  |
++-----+------------+-------------------------+
+|0    |[a, b, c, d]|(3,[0,1,2],[1.0,1.0,1.0])|
+|1    |[a, b, c]   |(3,[0,1],[1.0,1.0])      |
+|2    |[a, b]      |(3,[0],[1.0])            |
+|3    |[a]         |(3,[],[])                |
++-----+------------+-------------------------+
+...
+>>> cv3 = CountVectorizer(inputCol="raw", outputCol="vectors", maxDF=0.75)
+>>> model3 = cv3.fit(df2)
+>>> model3.transform(df2).show(truncate=False)
++-----+------------+-------------------------+
+|label|raw         |vectors                  |
++-----+------------+-------------------------+
+|0    |[a, b, c, d]|(3,[0,1,2],[1.0,1.0,1.0])|
+|1    |[a, b, c]   |(3,[0,1],[1.0,1.0])      |
+|2    |[a, b]      |(3,[0],[1.0])            |
+|3    |[a]         |(3,[],[])                |
++-----+------------+-------------------------+
+...
+>>> cv4 = CountVectorizer(inputCol="raw", outputCol="vectors", minDF=2, maxDF=3)
+>>> model4 = cv4.fit(df2)
+>>> model4.transform(df2).show(truncate=False)
++-----+------------+-------------------+
+|label|raw         |vectors            |
++-----+------------+-------------------+
+|0    |[a, b, c, d]|(2,[0,1],[1.0,1.0])|
+|1    |[a, b, c]   |(2,[0,1],[1.0,1.0])|
+|2    |[a, b]      |(2,[0],[1.0])      |
+|3    |[a]         |(2,[],[])          |
++-----+------------+-------------------+
+...
+>>> cv5 = CountVectorizer(inputCol="raw", outputCol="vectors", minDF=0.5, maxDF=0.75)
+>>> model5 = cv5.fit(df2)
+>>> model5.transform(df2).show(truncate=False)
++-----+------------+-------------------+
+|label|raw         |vectors            |
++-----+------------+-------------------+
+|0    |[a, b, c, d]|(2,[0,1],[1.0,1.0])|
+|1    |[a, b, c]   |(2,[0,1],[1.0,1.0])|
+|2    |[a, b]      |(2,[0],[1.0])      |
+|3    |[a]         |(2,[],[])          |
++-----+------------+-------------------+
+...
--- End diff --

I thi

[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

2018-03-08 Thread huaxingao
GitHub user huaxingao opened a pull request:

https://github.com/apache/spark/pull/20777

[SPARK-23615][ML][PYSPARK]Add maxDF Parameter to Python CountVectorizer

## What changes were proposed in this pull request?

The maxDF parameter is for filtering out frequently occurring terms. This 
param was recently added to the Scala CountVectorizer and needs to be added to 
Python also.

## How was this patch tested?

Added doctests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/huaxingao/spark spark-23615

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20777.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20777


commit cbf70bb9ff874af3b6fa76871798767c0174c266
Author: Huaxin Gao 
Date:   2018-03-08T22:29:32Z

[SPARK-23615][ML][PYSPARK]Add maxDF Parameter to Python CountVectorizer




---
