[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/20777

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20777#discussion_r175184951

Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala

@@ -70,19 +70,21 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit

   def getMinDF: Double = $(minDF)

   /**
-   * Specifies the maximum number of different documents a term must appear in to be included
-   * in the vocabulary.
-   * If this is an integer greater than or equal to 1, this specifies the number of documents
-   * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
-   * documents.
+   * Specifies the maximum number of different documents a term could appear in to be included
+   * in the vocabulary. A term that appears more than the threshold will be ignored. If this is an
+   * integer greater than or equal to 1, this specifies the maximum number of documents the term
+   * could appear in; if this is a double in [0,1), then this specifies the maximum fraction of
+   * documents the term could appear in.

End diff:

Thanks @srowen !
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20777#discussion_r175184951

Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala

@@ -70,19 +70,21 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit

   def getMinDF: Double = $(minDF)

   /**
-   * Specifies the maximum number of different documents a term must appear in to be included
-   * in the vocabulary.
-   * If this is an integer greater than or equal to 1, this specifies the number of documents
-   * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
-   * documents.
+   * Specifies the maximum number of different documents a term could appear in to be included
+   * in the vocabulary. A term that appears more than the threshold will be ignored. If this is an
+   * integer greater than or equal to 1, this specifies the maximum number of documents the term
+   * could appear in; if this is a double in [0,1), then this specifies the maximum fraction of
+   * documents the term could appear in.
    *
-   * Default: (2^64^) - 1
+   * Default: (2^63) - 1

End diff:

If you can, it's always a good idea to generate the docs to make sure of any changes.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20777#discussion_r175184503

Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala

@@ -70,19 +70,21 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit

   def getMinDF: Double = $(minDF)

   /**
-   * Specifies the maximum number of different documents a term must appear in to be included
-   * in the vocabulary.
-   * If this is an integer greater than or equal to 1, this specifies the number of documents
-   * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
-   * documents.
+   * Specifies the maximum number of different documents a term could appear in to be included
+   * in the vocabulary. A term that appears more than the threshold will be ignored. If this is an
+   * integer greater than or equal to 1, this specifies the maximum number of documents the term
+   * could appear in; if this is a double in [0,1), then this specifies the maximum fraction of
+   * documents the term could appear in.
    *
-   * Default: (2^64^) - 1
+   * Default: (2^63) - 1

End diff:

I think the format for scaladoc actually needs the extra '^' to display right; see the `vocabSize` default.
Github user huaxingao commented on a diff in the pull request:
https://github.com/apache/spark/pull/20777#discussion_r174935305

Diff: python/pyspark/ml/feature.py

@@ -465,26 +473,26 @@ class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable,
             " Default False", typeConverter=TypeConverters.toBoolean)

     @keyword_only
-    def __init__(self, minTF=1.0, minDF=1.0, vocabSize=1 << 18, binary=False, inputCol=None,
-                 outputCol=None):
+    def __init__(self, minTF=1.0, minDF=1.0, maxDF=sys.maxsize, vocabSize=1 << 18, binary=False,

End diff:

Will make the change now. Thanks!
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/20777#discussion_r174911085

Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala

@@ -70,19 +70,21 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit

   def getMinDF: Double = $(minDF)

   /**
-   * Specifies the maximum number of different documents a term must appear in to be included
-   * in the vocabulary.
-   * If this is an integer greater than or equal to 1, this specifies the number of documents
-   * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
-   * documents.
+   * Specifies the maximum number of different documents a term could appear in to be included
+   * in the vocabulary. A term that appears more than the threshold will be ignored. If this is an
+   * integer greater than or equal to 1, this specifies the maximum number of documents the term
+   * could appear in; if this is a double in [0,1), then this specifies the maximum fraction of
+   * documents the term could appear in.

End diff:

Agree, your wording is clearer.
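[Editor's note] The integer-versus-fraction semantics agreed on above can be sketched in plain Python. This is an illustration of the documented contract only, not Spark's actual implementation, and the helper name `max_df_threshold` is hypothetical:

```python
def max_df_threshold(max_df, num_docs):
    """Resolve a maxDF setting into an effective document count.

    A value >= 1 is treated as an absolute number of documents;
    a double in [0, 1) is treated as a fraction of the corpus size.
    """
    if max_df >= 1.0:
        return max_df
    return max_df * num_docs

# With a 4-document corpus, maxDF=3 and maxDF=0.75 resolve to the
# same effective threshold of 3 documents, which is why the two
# doctest examples in this PR produce identical vocabularies.
print(max_df_threshold(3, 4))     # 3
print(max_df_threshold(0.75, 4))  # 3.0
```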
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20777#discussion_r174864748

Diff: python/pyspark/ml/feature.py

@@ -465,26 +473,26 @@ class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable,
             " Default False", typeConverter=TypeConverters.toBoolean)

     @keyword_only
-    def __init__(self, minTF=1.0, minDF=1.0, vocabSize=1 << 18, binary=False, inputCol=None,
-                 outputCol=None):
+    def __init__(self, minTF=1.0, minDF=1.0, maxDF=sys.maxsize, vocabSize=1 << 18, binary=False,

End diff:

I think it's best just to hardcode the value like you did before; `sys.maxsize` can be 32-bit on some systems: https://docs.python.org/3/library/sys.html#sys.maxsize
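[Editor's note] A small sketch of the portability point being made here. `sys.maxsize` reflects the platform's `Py_ssize_t` width, while Scala's `Long.MaxValue` is always 2^63 - 1, which is why a hardcoded default keeps the Python and Scala sides in sync:

```python
import sys

# Scala's Long.MaxValue is 2^63 - 1 on every platform.
SCALA_LONG_MAX = 2 ** 63 - 1
print(SCALA_LONG_MAX)  # 9223372036854775807

# sys.maxsize is 2^63 - 1 only on 64-bit builds of CPython;
# on a 32-bit build it is 2^31 - 1, so using it as the maxDF
# default would silently change behavior across platforms.
print(sys.maxsize == SCALA_LONG_MAX)  # True on 64-bit, False on 32-bit
```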
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20777#discussion_r174863155

Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala

@@ -70,19 +70,21 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit

   def getMinDF: Double = $(minDF)

   /**
-   * Specifies the maximum number of different documents a term must appear in to be included
-   * in the vocabulary.
-   * If this is an integer greater than or equal to 1, this specifies the number of documents
-   * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
-   * documents.
+   * Specifies the maximum number of different documents a term could appear in to be included
+   * in the vocabulary. A term that appears more than the threshold will be ignored. If this is an
+   * integer greater than or equal to 1, this specifies the maximum number of documents the term
+   * could appear in; if this is a double in [0,1), then this specifies the maximum fraction of
+   * documents the term could appear in.

End diff:

@srowen do these doc changes look ok to you? It was a little confusing before, saying that the term "must appear" when it's a max value.
Github user huaxingao commented on a diff in the pull request:
https://github.com/apache/spark/pull/20777#discussion_r174636559

Diff: python/pyspark/ml/tests.py

@@ -679,6 +679,29 @@ def test_count_vectorizer_with_binary(self):
             feature, expected = r
             self.assertEqual(feature, expected)

+    def test_count_vectorizer_with_maxDF(self):
+        dataset = self.spark.createDataFrame([
+            (0, "a b c d".split(' '), SparseVector(3, {0: 1.0, 1: 1.0, 2: 1.0}),),
+            (1, "a b c".split(' '), SparseVector(3, {0: 1.0, 1: 1.0}),),
+            (2, "a b".split(' '), SparseVector(3, {0: 1.0}),),
+            (3, "a".split(' '), SparseVector(3, {}),)], ["id", "words", "expected"])
+        cv = CountVectorizer(inputCol="words", outputCol="features")
+        model1 = cv.setMaxDF(3).fit(dataset)

End diff:

Hi Bryan, thanks for your comments. I will change these.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20777#discussion_r174627578

Diff: python/pyspark/ml/tests.py

@@ -679,6 +679,29 @@ def test_count_vectorizer_with_binary(self):
             feature, expected = r
             self.assertEqual(feature, expected)

+    def test_count_vectorizer_with_maxDF(self):
+        dataset = self.spark.createDataFrame([
+            (0, "a b c d".split(' '), SparseVector(3, {0: 1.0, 1: 1.0, 2: 1.0}),),
+            (1, "a b c".split(' '), SparseVector(3, {0: 1.0, 1: 1.0}),),
+            (2, "a b".split(' '), SparseVector(3, {0: 1.0}),),
+            (3, "a".split(' '), SparseVector(3, {}),)], ["id", "words", "expected"])
+        cv = CountVectorizer(inputCol="words", outputCol="features")
+        model1 = cv.setMaxDF(3).fit(dataset)

End diff:

Could you also add an assert that the vocabulary is equal to something? I think it would be ['b', 'c', 'd'].
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20777#discussion_r174626899

Diff: python/pyspark/ml/tests.py

@@ -679,6 +679,29 @@ def test_count_vectorizer_with_binary(self):
             feature, expected = r
             self.assertEqual(feature, expected)

+    def test_count_vectorizer_with_maxDF(self):
+        dataset = self.spark.createDataFrame([
+            (0, "a b c d".split(' '), SparseVector(3, {0: 1.0, 1: 1.0, 2: 1.0}),),
+            (1, "a b c".split(' '), SparseVector(3, {0: 1.0, 1: 1.0}),),
+            (2, "a b".split(' '), SparseVector(3, {0: 1.0}),),
+            (3, "a".split(' '), SparseVector(3, {}),)], ["id", "words", "expected"])
+        cv = CountVectorizer(inputCol="words", outputCol="features")
+        model1 = cv.setMaxDF(3).fit(dataset)

End diff:

Actually, I still don't think setting the `maxDF` value is doing anything different to the model. You want the test to fail if you do not set the value to 3. I think to do this you will need to also assert that the vocabulary is equal to something.
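[Editor's note] The expected vocabulary can be checked without Spark by computing document frequencies directly. This pure-Python sketch mirrors the maxDF filtering on the test's dataset; it is an illustration, not the CountVectorizer implementation:

```python
from collections import Counter

docs = [
    "a b c d".split(" "),
    "a b c".split(" "),
    "a b".split(" "),
    "a".split(" "),
]

# Document frequency: the number of documents each term appears in.
df = Counter(term for doc in docs for term in set(doc))
# df == {'a': 4, 'b': 3, 'c': 2, 'd': 1}

# With maxDF=3, 'a' (present in 4 documents, above the threshold)
# is dropped; 'b', 'c', and 'd' remain in the vocabulary.
max_df = 3
vocab = sorted(term for term, count in df.items() if count <= max_df)
print(vocab)  # ['b', 'c', 'd']
```

This matches the ['b', 'c', 'd'] vocabulary Bryan expects, so asserting on it would make the test fail whenever maxDF is not applied.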
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20777#discussion_r174624206

Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala

@@ -70,19 +70,22 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit

   def getMinDF: Double = $(minDF)

   /**
-   * Specifies the maximum number of different documents a term must appear in to be included
-   * in the vocabulary.
-   * If this is an integer greater than or equal to 1, this specifies the number of documents
-   * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
-   * documents.
+   * maxDF is used for removing terms that appear too frequently. It specifies the maximum number
+   * of different documents a term could appear in to be included in the vocabulary.
+   * If this is an integer greater than or equal to 1, this specifies the maximum number of
+   * documents the term could appear in; if this is a double in [0,1), then this specifies the
+   * maximum fraction of documents the term could appear in. A term appears more frequently
+   * than maxDF will be removed.

End diff:

This sounds much better, but it probably should use "ignore" instead of "remove", and it might be good to just change the order of the sentence like this:

    Specifies the maximum number of different documents a term could appear in to be included
    in the vocabulary. A term that appears more than the threshold will be ignored. If this is an
    integer greater than or equal to 1, this specifies the maximum number of documents the term
    could appear in; if this is a double in [0,1), then this specifies the maximum fraction of
    documents the term could appear in.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20777#discussion_r174625203

Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala

@@ -70,19 +70,22 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit

   def getMinDF: Double = $(minDF)

   /**
-   * Specifies the maximum number of different documents a term must appear in to be included
-   * in the vocabulary.
-   * If this is an integer greater than or equal to 1, this specifies the number of documents
-   * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
-   * documents.
+   * maxDF is used for removing terms that appear too frequently. It specifies the maximum number
+   * of different documents a term could appear in to be included in the vocabulary.
+   * If this is an integer greater than or equal to 1, this specifies the maximum number of
+   * documents the term could appear in; if this is a double in [0,1), then this specifies the
+   * maximum fraction of documents the term could appear in. A term appears more frequently
+   * than maxDF will be removed.
    *
-   * Default: (2^64^) - 1
+   * Default: (2^63) - 1

End diff:

Good catch!
Github user huaxingao commented on a diff in the pull request:
https://github.com/apache/spark/pull/20777#discussion_r173369004

Diff: python/pyspark/ml/feature.py

@@ -465,26 +522,26 @@ class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable,
             " Default False", typeConverter=TypeConverters.toBoolean)

     @keyword_only
-    def __init__(self, minTF=1.0, minDF=1.0, vocabSize=1 << 18, binary=False, inputCol=None,
-                 outputCol=None):
+    def __init__(self, minTF=1.0, minDF=1.0, maxDF=2 ** 63 - 1, vocabSize=1 << 18, binary=False,

End diff:

Thank you very much for the comments. Will make changes.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20777#discussion_r173336643

Diff: python/pyspark/ml/feature.py

@@ -465,26 +522,26 @@ class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable,
             " Default False", typeConverter=TypeConverters.toBoolean)

     @keyword_only
-    def __init__(self, minTF=1.0, minDF=1.0, vocabSize=1 << 18, binary=False, inputCol=None,
-                 outputCol=None):
+    def __init__(self, minTF=1.0, minDF=1.0, maxDF=2 ** 63 - 1, vocabSize=1 << 18, binary=False,

End diff:

I'm not crazy about hardcoding a value here, since in Scala it is `Long.MaxValue`, but I'm not sure there is another way.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20777#discussion_r173336451

Diff: python/pyspark/ml/feature.py

@@ -455,6 +506,12 @@ class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable,
         " If this is an integer >= 1, this specifies the number of documents the term must" +
         " appear in; if this is a double in [0,1), then this specifies the fraction of documents." +
         " Default 1.0", typeConverter=TypeConverters.toFloat)
+    maxDF = Param(
+        Params._dummy(), "maxDF", "Specifies the minimum number of" +
+        " different documents a term must appear in to be included in the vocabulary." +
+        " If this is an integer >= 1, this specifies the number of documents the term must" +
+        " appear in; if this is a double in [0,1), then this specifies the fraction of documents." +
+        " Default (2^63) - 1", typeConverter=TypeConverters.toFloat)

End diff:

I think this documentation is exactly the same as `minDF`; please refer to the Scala docs. Actually, I think the Scala doc is a little confusing and could be clearer. Would you like to take a shot at rewording it?
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20777#discussion_r173335895

Diff: python/pyspark/ml/feature.py

@@ -408,35 +408,86 @@ class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable,
     """
     Extracts a vocabulary from document collections and generates a :py:attr:`CountVectorizerModel`.

-    >>> df = spark.createDataFrame(
+    >>> df1 = spark.createDataFrame(
     ...    [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
     ...    ["label", "raw"])
-    >>> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
-    >>> model = cv.fit(df)
-    >>> model.transform(df).show(truncate=False)
+    >>> cv1 = CountVectorizer(inputCol="raw", outputCol="vectors")
+    >>> model1 = cv1.fit(df1)
+    >>> model1.transform(df1).show(truncate=False)
     +-----+---------------+-------------------------+
     |label|raw            |vectors                  |
     +-----+---------------+-------------------------+
     |0    |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
     |1    |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
     +-----+---------------+-------------------------+
     ...
-    >>> sorted(model.vocabulary) == ['a', 'b', 'c']
+    >>> sorted(model1.vocabulary) == ['a', 'b', 'c']
     True
     >>> countVectorizerPath = temp_path + "/count-vectorizer"
-    >>> cv.save(countVectorizerPath)
+    >>> cv1.save(countVectorizerPath)
     >>> loadedCv = CountVectorizer.load(countVectorizerPath)
-    >>> loadedCv.getMinDF() == cv.getMinDF()
+    >>> loadedCv.getMinDF() == cv1.getMinDF()
     True
-    >>> loadedCv.getMinTF() == cv.getMinTF()
+    >>> loadedCv.getMinTF() == cv1.getMinTF()
     True
-    >>> loadedCv.getVocabSize() == cv.getVocabSize()
+    >>> loadedCv.getVocabSize() == cv1.getVocabSize()
     True
     >>> modelPath = temp_path + "/count-vectorizer-model"
-    >>> model.save(modelPath)
+    >>> model1.save(modelPath)
     >>> loadedModel = CountVectorizerModel.load(modelPath)
-    >>> loadedModel.vocabulary == model.vocabulary
+    >>> loadedModel.vocabulary == model1.vocabulary
     True
+    >>> df2 = spark.createDataFrame(
+    ...    [(0, ["a", "b", "c", "d"]), (1, ["a", "b", "c"]), (2, ["a", "b"]), (3, ["a"])],
+    ...    ["label", "raw"])
+    >>> cv2 = CountVectorizer(inputCol="raw", outputCol="vectors", maxDF=3)
+    >>> model2 = cv2.fit(df2)
+    >>> model2.transform(df2).show(truncate=False)
+    +-----+------------+-------------------------+
+    |label|raw         |vectors                  |
+    +-----+------------+-------------------------+
+    |0    |[a, b, c, d]|(3,[0,1,2],[1.0,1.0,1.0])|
+    |1    |[a, b, c]   |(3,[0,1],[1.0,1.0])      |
+    |2    |[a, b]      |(3,[0],[1.0])            |
+    |3    |[a]         |(3,[],[])                |
+    +-----+------------+-------------------------+
+    ...
+    >>> cv3 = CountVectorizer(inputCol="raw", outputCol="vectors", maxDF=0.75)
+    >>> model3 = cv3.fit(df2)
+    >>> model3.transform(df2).show(truncate=False)
+    +-----+------------+-------------------------+
+    |label|raw         |vectors                  |
+    +-----+------------+-------------------------+
+    |0    |[a, b, c, d]|(3,[0,1,2],[1.0,1.0,1.0])|
+    |1    |[a, b, c]   |(3,[0,1],[1.0,1.0])      |
+    |2    |[a, b]      |(3,[0],[1.0])            |
+    |3    |[a]         |(3,[],[])                |
+    +-----+------------+-------------------------+
+    ...
+    >>> cv4 = CountVectorizer(inputCol="raw", outputCol="vectors", minDF=2, maxDF=3)
+    >>> model4 = cv4.fit(df2)
+    >>> model4.transform(df2).show(truncate=False)
+    +-----+------------+-------------------+
+    |label|raw         |vectors            |
+    +-----+------------+-------------------+
+    |0    |[a, b, c, d]|(2,[0,1],[1.0,1.0])|
+    |1    |[a, b, c]   |(2,[0,1],[1.0,1.0])|
+    |2    |[a, b]      |(2,[0],[1.0])      |
+    |3    |[a]         |(2,[],[])          |
+    +-----+------------+-------------------+
+    ...
+    >>> cv5 = CountVectorizer(inputCol="raw", outputCol="vectors", minDF=0.5, maxDF=0.75)
+    >>> model5 = cv5.fit(df2)
+    >>> model5.transform(df2).show(truncate=False)
+    +-----+------------+-------------------+
+    |label|raw         |vectors            |
+    +-----+------------+-------------------+
+    |0    |[a, b, c, d]|(2,[0,1],[1.0,1.0])|
+    |1    |[a, b, c]   |(2,[0,1],[1.0,1.0])|
+    |2    |[a, b]      |(2,[0],[1.0])      |
+    |3    |[a]         |(2,[],[])          |
+    +-----+------------+-------------------+
+    ...

End diff:

I thi
GitHub user huaxingao opened a pull request:
https://github.com/apache/spark/pull/20777

[SPARK-23615][ML][PYSPARK] Add maxDF Parameter to Python CountVectorizer

## What changes were proposed in this pull request?

The maxDF parameter is for filtering out frequently occurring terms. This param was recently added to the Scala CountVectorizer and needs to be added to Python as well.

## How was this patch tested?

Added a doctest.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark-23615

Alternatively, you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20777.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20777

commit cbf70bb9ff874af3b6fa76871798767c0174c266
Author: Huaxin Gao
Date: 2018-03-08T22:29:32Z

    [SPARK-23615][ML][PYSPARK] Add maxDF Parameter to Python CountVectorizer