[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14830 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user Stibbons commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r101263394

--- Diff: examples/src/main/python/mllib/naive_bayes_example.py ---
@@ -24,16 +24,17 @@
from __future__ import print_function

+# $example on$
import shutil
+# $example off$

from pyspark import SparkContext
# $example on$
from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.util import MLUtils
-
-
--- End diff --

Yes, because the two empty lines come after # $example off$.
Github user Stibbons commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r101263279

--- Diff: examples/src/main/python/mllib/streaming_linear_regression_example.py ---
@@ -25,13 +25,14 @@
# $example off$

from pyspark import SparkContext
-from pyspark.streaming import StreamingContext
# $example on$
from pyspark.mllib.linalg import Vectors
-from pyspark.mllib.regression import LabeledPoint
-from pyspark.mllib.regression import StreamingLinearRegressionWithSGD
+from pyspark.mllib.regression import (LabeledPoint,
+                                      StreamingLinearRegressionWithSGD)
--- End diff --

I actually prefer a single import per line (it greatly simplifies file management, multi-branch merges, ...). I can revert this change.
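For readers following the import-style debate above: both forms are PEP 8-compliant, so the choice is a trade-off, not a correctness issue. A minimal sketch using stdlib names as stand-ins for the pyspark imports in the diff:

```python
# Style A: one import per line. Each name occupies its own line, so a
# branch that adds or removes one import touches exactly one line,
# which keeps diffs and multi-branch merges small (the preference
# voiced above).
from os.path import basename
from os.path import dirname

# Style B: parenthesized multi-name import, the form sorting tools
# typically emit when a combined import would exceed the line limit.
from os.path import (join,
                     split)

print(basename("x/y.txt"))  # the imported names work identically
```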
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r101263268

--- Diff: examples/src/main/python/mllib/naive_bayes_example.py ---
@@ -24,16 +24,17 @@
from __future__ import print_function

+# $example on$
import shutil
+# $example off$

from pyspark import SparkContext
# $example on$
from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.util import MLUtils
-
-
--- End diff --

If you are unable to build the Python doc, I will check tomorrow to help.
Github user Stibbons commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r101263182

--- Diff: examples/src/main/python/streaming/network_wordjoinsentiments.py ---
@@ -54,22 +54,25 @@ def print_happiest_words(rdd):
# Read in the word-sentiment list and create a static RDD from it
word_sentiments_file_path = "data/streaming/AFINN-111.txt"
-word_sentiments = ssc.sparkContext.textFile(word_sentiments_file_path) \
-    .map(lambda line: tuple(line.split("\t")))
+word_sentiments = (ssc.sparkContext
+                   .textFile(word_sentiments_file_path)
+                   .map(lambda line: tuple(line.split("\t"))))

lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))

-word_counts = lines.flatMap(lambda line: line.split(" ")) \
-    .map(lambda word: (word, 1)) \
-    .reduceByKey(lambda a, b: a + b)
+word_counts = (lines
+               .flatMap(lambda line: line.split(" "))
+               .map(lambda word: (word, 1))
+               .reduceByKey(lambda a, b: a + b))

# Determine the words with the highest sentiment values by joining the streaming RDD
# with the static RDD inside the transform() method and then multiplying
# the frequency of the words by its sentiment value
-happiest_words = word_counts.transform(lambda rdd: word_sentiments.join(rdd)) \
-    .map(lambda word_tuples: (word_tuples[0], float(word_tuples[1][0]) * word_tuples[1][1])) \
-    .map(lambda word_happiness: (word_happiness[1], word_happiness[0])) \
-    .transform(lambda rdd: rdd.sortByKey(False))
+happiest_words = (word_counts
+                  .map(lambda word_tuples: (word_tuples[0],
+                                            float(word_tuples[1][0]) * word_tuples[1][1]))
--- End diff --

I agree; if you prefer, I can change them all at once. But as I said, I don't know of any autoformatter that does this automatically.
Github user Stibbons commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r101262948

--- Diff: examples/src/main/python/ml/decision_tree_classification_example.py ---
@@ -65,8 +67,9 @@
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
-evaluator = MulticlassClassificationEvaluator(
-    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
--- End diff --

The `pep8` tool can automatically insert this line break when a line is > 100 characters. There is indeed no preference between this format and:

evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel",
                                              predictionCol="prediction",
                                              metricName="accuracy")
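The two equally acceptable wrappings discussed above can be sketched with a hypothetical stand-in for the evaluator constructor (tools like `autopep8` only rewrap when a line exceeds the configured limit, which is 100 for Spark):

```python
# Hypothetical stand-in for MulticlassClassificationEvaluator; only the
# wrapping style is the point here, not the API.
def make_evaluator(labelCol=None, predictionCol=None, metricName=None):
    return (labelCol, predictionCol, metricName)

# Wrapping 1: break right after the opening parenthesis, with a
# hanging indent on the continuation line.
evaluator_a = make_evaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")

# Wrapping 2: keep the first argument on the opening line and align
# continuation lines with the opening delimiter.
evaluator_b = make_evaluator(labelCol="indexedLabel",
                             predictionCol="prediction",
                             metricName="accuracy")

assert evaluator_a == evaluator_b  # same call, different layout
```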
Github user Stibbons commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r101262606

--- Diff: examples/src/main/python/ml/count_vectorizer_example.py ---
@@ -17,23 +17,26 @@
from __future__ import print_function

-from pyspark.sql import SparkSession
# $example on$
from pyspark.ml.feature import CountVectorizer
# $example off$
+from pyspark.sql import SparkSession
+

if __name__ == "__main__":
-    spark = SparkSession\
-        .builder\
-        .appName("CountVectorizerExample")\
-        .getOrCreate()
+    spark = (SparkSession
+             .builder
+             .appName("CountVectorizerExample")
+             .getOrCreate())

    # $example on$
    # Input data: Each row is a bag of words with a ID.
-    df = spark.createDataFrame([
-        (0, "a b c".split(" ")),
-        (1, "a b b c a".split(" "))
-    ], ["id", "words"])
+    df = spark.createDataFrame(
+        [
+            (0, "a b c".split(" ")),
+            (1, "a b b c a".split(" "))
+        ],
--- End diff --

Indeed, this is a recommendation, not an obligation. I find it looks more like Scala multi-line code, and I prefer it. That is a personal opinion, and I don't think there is a pylint/pep8 check to prevent using \.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r101255529

--- Diff: docs/streaming-programming-guide.md ---
@@ -1626,10 +1626,10 @@ See the full [source code]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_
# Lazily instantiated global instance of SparkSession
def getSparkSessionInstance(sparkConf):
    if ("sparkSessionSingletonInstance" not in globals()):
-        globals()["sparkSessionSingletonInstance"] = SparkSession \
-            .builder \
-            .config(conf=sparkConf) \
-            .getOrCreate()
+        globals()["sparkSessionSingletonInstance"] = (SparkSession
--- End diff --

Maybe I am wrong. Could you maybe provide the reference?

> recommended by pep8

Do you refer to this passage?

> The preferred way of wrapping long lines is by using Python's implied line continuation inside parentheses, brackets and braces. Long lines can be broken over multiple lines by wrapping expressions in parentheses. These should be used in preference to using a backslash for line continuation.

I know the binary-operator rule follows this, but I guess this case is not disallowed. I am not sure it is worth sweeping them all. The new forms look preferred but do not break pep8.
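The PEP 8 passage quoted above can be illustrated with a toy builder chain (a stand-in for `SparkSession.builder`, not the real API): both continuations parse identically; PEP 8 merely prefers the implied form over backslashes.

```python
class Builder:
    """Toy stand-in for a SparkSession-style builder chain."""
    def __init__(self, parts=()):
        self.parts = tuple(parts)

    def config(self, **kwargs):
        # Record the configured keys so the two chains can be compared.
        return Builder(self.parts + tuple(sorted(kwargs)))

    def getOrCreate(self):
        return self.parts

# Backslash continuation: the style the diff replaces.
session_a = Builder() \
    .config(conf="sparkConf") \
    .getOrCreate()

# Implied continuation inside parentheses: the style PEP 8 prefers.
session_b = (Builder()
             .config(conf="sparkConf")
             .getOrCreate())

assert session_a == session_b  # identical semantics, different syntax
```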
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r101256136

--- Diff: examples/src/main/python/ml/count_vectorizer_example.py ---
@@ -17,23 +17,26 @@
from __future__ import print_function

-from pyspark.sql import SparkSession
# $example on$
from pyspark.ml.feature import CountVectorizer
# $example off$
+from pyspark.sql import SparkSession
+

if __name__ == "__main__":
-    spark = SparkSession\
-        .builder\
-        .appName("CountVectorizerExample")\
-        .getOrCreate()
+    spark = (SparkSession
+             .builder
+             .appName("CountVectorizerExample")
+             .getOrCreate())

    # $example on$
    # Input data: Each row is a bag of words with a ID.
-    df = spark.createDataFrame([
-        (0, "a b c".split(" ")),
-        (1, "a b b c a".split(" "))
-    ], ["id", "words"])
+    df = spark.createDataFrame(
+        [
+            (0, "a b c".split(" ")),
+            (1, "a b b c a".split(" "))
+        ],
--- End diff --

Could you double-check whether it really violates pep8? I have seen the removed syntax more often (e.g., in `numpy`).
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r101257317

--- Diff: examples/src/main/python/streaming/network_wordjoinsentiments.py ---
@@ -54,22 +54,25 @@ def print_happiest_words(rdd):
# Read in the word-sentiment list and create a static RDD from it
word_sentiments_file_path = "data/streaming/AFINN-111.txt"
-word_sentiments = ssc.sparkContext.textFile(word_sentiments_file_path) \
-    .map(lambda line: tuple(line.split("\t")))
+word_sentiments = (ssc.sparkContext
+                   .textFile(word_sentiments_file_path)
+                   .map(lambda line: tuple(line.split("\t"))))

lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))

-word_counts = lines.flatMap(lambda line: line.split(" ")) \
-    .map(lambda word: (word, 1)) \
-    .reduceByKey(lambda a, b: a + b)
+word_counts = (lines
+               .flatMap(lambda line: line.split(" "))
+               .map(lambda word: (word, 1))
+               .reduceByKey(lambda a, b: a + b))

# Determine the words with the highest sentiment values by joining the streaming RDD
# with the static RDD inside the transform() method and then multiplying
# the frequency of the words by its sentiment value
-happiest_words = word_counts.transform(lambda rdd: word_sentiments.join(rdd)) \
-    .map(lambda word_tuples: (word_tuples[0], float(word_tuples[1][0]) * word_tuples[1][1])) \
-    .map(lambda word_happiness: (word_happiness[1], word_happiness[0])) \
-    .transform(lambda rdd: rdd.sortByKey(False))
+happiest_words = (word_counts
+                  .map(lambda word_tuples: (word_tuples[0],
+                                            float(word_tuples[1][0]) * word_tuples[1][1]))
--- End diff --

Multi-line `lambda`s? Is this really recommended in pep8?
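PEP 8 indeed says nothing specific about wrapping a lambda body across lines. One alternative that sidesteps the question entirely (a suggestion here, not something proposed in the thread) is to extract a named function; the sketch below uses plain lists instead of RDDs:

```python
# Sample (word, (sentiment, count)) pairs standing in for the joined RDD.
pairs = [("happy", ("3", 2)), ("sad", ("-2", 1))]

# Wrapped lambda, in the spirit of the diff above: the body continues
# on a second line inside the enclosing parentheses.
scored_a = list(map(lambda word_tuple: (word_tuple[0],
                                        float(word_tuple[1][0]) * word_tuple[1][1]),
                    pairs))

# Named helper: same result, with tuple unpacking for readability and
# no multi-line lambda to argue about.
def weight(word_tuple):
    word, (sentiment, count) = word_tuple
    return (word, float(sentiment) * count)

scored_b = list(map(weight, pairs))

assert scored_a == scored_b == [("happy", 6.0), ("sad", -2.0)]
```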
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r101256694

--- Diff: examples/src/main/python/ml/decision_tree_classification_example.py ---
@@ -65,8 +67,9 @@
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
-evaluator = MulticlassClassificationEvaluator(
-    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
--- End diff --

Hm.. does pep8 have a different argument-placement rule for classes vs. functions? This one already seems fine, and seems inconsistent with https://github.com/apache/spark/pull/14830/files#diff-82fe155d22aaaf433e949193d262c736R43
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r101259038

--- Diff: examples/src/main/python/mllib/linear_regression_with_sgd_example.py ---
@@ -22,9 +22,11 @@
from pyspark import SparkContext
# $example on$
-from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
+from pyspark.mllib.regression import (LabeledPoint, LinearRegressionModel,
+                                      LinearRegressionWithSGD)
--- End diff --

This one also does not look like it exceeds the 100-character limit.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r101258928

--- Diff: examples/src/main/python/mllib/naive_bayes_example.py ---
@@ -24,16 +24,17 @@
from __future__ import print_function

+# $example on$
import shutil
+# $example off$

from pyspark import SparkContext
# $example on$
from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.util import MLUtils
-
-
--- End diff --

Could I ask you to check whether the rendered example still complies with pep8?
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r101258974

--- Diff: examples/src/main/python/mllib/logistic_regression_with_lbfgs_example.py ---
@@ -22,9 +22,11 @@
from pyspark import SparkContext
# $example on$
-from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
+from pyspark.mllib.classification import (LogisticRegressionModel,
+                                          LogisticRegressionWithLBFGS)
--- End diff --

This one also does not look like it exceeds the 100-character limit.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r101258832

--- Diff: examples/src/main/python/mllib/power_iteration_clustering_example.py ---
@@ -19,7 +19,9 @@
from pyspark import SparkContext
# $example on$
-from pyspark.mllib.clustering import PowerIterationClustering, PowerIterationClusteringModel
+from pyspark.mllib.clustering import (PowerIterationClustering,
+                                      PowerIterationClusteringModel)
--- End diff --

This one also seems not to exceed 100.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r101258524

--- Diff: examples/src/main/python/mllib/streaming_linear_regression_example.py ---
@@ -25,13 +25,14 @@
# $example off$

from pyspark import SparkContext
-from pyspark.streaming import StreamingContext
# $example on$
from pyspark.mllib.linalg import Vectors
-from pyspark.mllib.regression import LabeledPoint
-from pyspark.mllib.regression import StreamingLinearRegressionWithSGD
+from pyspark.mllib.regression import (LabeledPoint,
+                                      StreamingLinearRegressionWithSGD)
--- End diff --

This does not exceed the 100-character line length, does it? To my knowledge, Spark limits lines to 100 characters (not the default 80).
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r101245188

--- Diff: examples/src/main/python/ml/cross_validator.py ---
@@ -42,20 +42,22 @@
# $example on$
# Prepare training documents, which are labeled.
-training = spark.createDataFrame([
-    (0, "a b c d e spark", 1.0),
-    (1, "b d", 0.0),
-    (2, "spark f g h", 1.0),
-    (3, "hadoop mapreduce", 0.0),
-    (4, "b spark who", 1.0),
-    (5, "g d a y", 0.0),
-    (6, "spark fly", 1.0),
-    (7, "was mapreduce", 0.0),
-    (8, "e spark program", 1.0),
-    (9, "a e c l", 0.0),
-    (10, "spark compile", 1.0),
-    (11, "hadoop software", 0.0)
-], ["id", "text", "label"])
+training = spark.createDataFrame(
+    [
--- End diff --

It'd be great if we had some references or quotes.
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r93996278

--- Diff: examples/src/main/python/logistic_regression.py ---
@@ -29,7 +29,6 @@
import numpy as np

from pyspark.sql import SparkSession
--- End diff --

Why did you remove the double newlines after the end of the imports?
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r93996113

--- Diff: examples/src/main/python/mllib/bisecting_k_means_example.py ---
@@ -20,11 +20,11 @@
# $example on$
from numpy import array
# $example off$
-
from pyspark import SparkContext
# $example on$
from pyspark.mllib.clustering import BisectingKMeans, BisectingKMeansModel
# $example off$
+#
--- End diff --

What's this for?
Github user Stibbons commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r95129312

--- Diff: examples/src/main/python/ml/generalized_linear_regression_example.py ---
@@ -17,9 +17,10 @@
from __future__ import print_function

-from pyspark.sql import SparkSession
# $example on$
from pyspark.ml.regression import GeneralizedLinearRegression
+from pyspark.sql import SparkSession
+
--- End diff --

Fixed.
Github user Stibbons commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r93977216

--- Diff: examples/src/main/python/ml/cross_validator.py ---
@@ -42,20 +42,22 @@
# $example on$
# Prepare training documents, which are labeled.
-training = spark.createDataFrame([
-    (0, "a b c d e spark", 1.0),
-    (1, "b d", 0.0),
-    (2, "spark f g h", 1.0),
-    (3, "hadoop mapreduce", 0.0),
-    (4, "b spark who", 1.0),
-    (5, "g d a y", 0.0),
-    (6, "spark fly", 1.0),
-    (7, "was mapreduce", 0.0),
-    (8, "e spark program", 1.0),
-    (9, "a e c l", 0.0),
-    (10, "spark compile", 1.0),
-    (11, "hadoop software", 0.0)
-], ["id", "text", "label"])
+training = spark.createDataFrame(
+    [
--- End diff --

I cannot remember whether I did that manually or whether the `pep8` tool did it. I cannot work on it until next week.
Github user Stibbons commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r93977062

--- Diff: examples/src/main/python/ml/generalized_linear_regression_example.py ---
@@ -17,9 +17,10 @@
from __future__ import print_function

-from pyspark.sql import SparkSession
# $example on$
from pyspark.ml.regression import GeneralizedLinearRegression
+from pyspark.sql import SparkSession
+
--- End diff --

Indeed.
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r93974494

--- Diff: examples/src/main/python/ml/generalized_linear_regression_example.py ---
@@ -17,9 +17,10 @@
from __future__ import print_function

-from pyspark.sql import SparkSession
# $example on$
from pyspark.ml.regression import GeneralizedLinearRegression
+from pyspark.sql import SparkSession
+
--- End diff --

It's weird to have one blank line inside the example and one outside; generally we seem to be leaving the blank lines out of the others.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r93613663

--- Diff: examples/src/main/python/ml/vector_slicer_example.py ---
@@ -20,15 +20,18 @@
# $example on$
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors
-from pyspark.sql.types import Row
# $example off$
from pyspark.sql import SparkSession
+# $example on$
--- End diff --

Oh OK, if this is for a particular reason and not just redundant, fine.
Github user Stibbons commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r93613427

--- Diff: examples/src/main/python/ml/vector_slicer_example.py ---
@@ -20,15 +20,18 @@
# $example on$
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors
-from pyspark.sql.types import Row
# $example off$
from pyspark.sql import SparkSession
+# $example on$
--- End diff --

The SparkSession is usually hidden in the other examples. I can add it if you really want.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r93474130

--- Diff: examples/src/main/python/ml/vector_slicer_example.py ---
@@ -20,15 +20,18 @@
# $example on$
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors
-from pyspark.sql.types import Row
# $example off$
from pyspark.sql import SparkSession
+# $example on$
--- End diff --

I think we can wrap the whole block in one set of example on/off tags.
[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...
Github user Stibbons commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r93424188

--- Diff: docs/streaming-programming-guide.md ---
@@ -2105,7 +2105,7 @@ documentation), or set the `spark.default.parallelism`
 {:.no_toc}
 The overheads of data serialization can be reduced by tuning the serialization formats. In the case of streaming, there are two types of data that are being serialized.
-* **Input data**: By default, the input data received through Receivers is stored in the executors' memory with [StorageLevel.MEMORY_AND_DISK_SER_2](api/scala/index.html#org.apache.spark.storage.StorageLevel$). That is, the data is serialized into bytes to reduce GC overheads, and replicated for tolerating executor failures. Also, the data is kept first in memory, and spilled over to disk only if the memory is insufficient to hold all of the input data necessary for the streaming computation. This serialization obviously has overheads -- the receiver must deserialize the received data and re-serialize it using Spark's serialization format. 
+* **Input data**: By default, the input data received through Receivers is stored in the executors' memory with [StorageLevel.MEMORY_AND_DISK_SER_2](api/scala/index.html#org.apache.spark.storage.StorageLevel$). That is, the data is serialized into bytes to reduce GC overheads, and replicated for tolerating executor failures. Also, the data is kept first in memory, and spilled over to disk only if the memory is insufficient to hold all of the input data necessary for the streaming computation. This serialization obviously has overheads -- the receiver must deserialize the received data and re-serialize it using Spark's serialization format.
--- End diff --

There is an extra space at the end of the line; GitHub doesn't display it.
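Since GitHub's diff view hides trailing whitespace, a quick local check can make such changes visible before review. A small Python sketch (grep with `'[[:space:]]$'` or an editor plugin works just as well):

```python
import re

# Flag line numbers that end in trailing whitespace, which diff viewers
# often render invisibly.
text = "clean line\ntrailing line \n"
flagged = [i for i, line in enumerate(text.splitlines(), 1)
           if re.search(r"[ \t]+$", line)]
print(flagged)  # [2]
```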
[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...
Github user Stibbons commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r93424082

--- Diff: docs/streaming-programming-guide.md ---
@@ -1626,10 +1626,10 @@ See the full [source code]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_
 # Lazily instantiated global instance of SparkSession
 def getSparkSessionInstance(sparkConf):
     if ("sparkSessionSingletonInstance" not in globals()):
-        globals()["sparkSessionSingletonInstance"] = SparkSession \
-            .builder \
-            .config(conf=sparkConf) \
-            .getOrCreate()
+        globals()["sparkSessionSingletonInstance"] = (SparkSession
--- End diff --

My point was about the use of parentheses instead of the backslash, which is recommended by PEP 8. I can keep the indentation.
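PEP 8 does prefer implicit line continuation inside parentheses over backslashes (a backslash silently breaks if a space follows it). A minimal sketch of the two spellings, using a hypothetical `Builder` stand-in for `SparkSession.builder` so it runs without PySpark:

```python
class Builder:
    """Hypothetical stand-in for SparkSession.builder's fluent chain."""
    def __init__(self):
        self.conf = {}

    def config(self, **kw):
        self.conf.update(kw)
        return self

    def get_or_create(self):
        return self.conf

# Backslash continuation (the style the example used before):
result_a = Builder() \
    .config(master="local") \
    .get_or_create()

# Parenthesized continuation (PEP 8's preferred style, same result):
result_b = (Builder()
            .config(master="local")
            .get_or_create())

print(result_a == result_b)  # True
```

Both spellings evaluate identically; the parenthesized form just tolerates trailing whitespace and reflows more safely.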
[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...
Github user Stibbons commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r93423843

--- Diff: examples/src/main/python/ml/dct_example.py ---
@@ -23,6 +23,7 @@
 # $example off$
 from pyspark.sql import SparkSession
+
--- End diff --

PEP 8 recommends two empty lines after the import statements; `pep8` does it automatically.
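More precisely, PEP 8's rule is two blank lines around top-level definitions, which is what puts two blank lines between the import block and the first definition when a formatter runs. A minimal illustration:

```python
import math


def unit_circle_area():
    # the two blank lines above separate this top-level def from the imports,
    # per PEP 8's blank-line convention
    return math.pi * 1 ** 2


print(round(unit_circle_area(), 2))  # 3.14
```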
[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...
Github user Stibbons commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r93423576

--- Diff: docs/streaming-programming-guide.md ---
@@ -1099,7 +1099,7 @@
 joinedStream = stream1.join(stream2)
 {% endhighlight %}
-Here, in each batch interval, the RDD generated by `stream1` will be joined with the RDD generated by `stream2`. You can also do `leftOuterJoin`, `rightOuterJoin`, `fullOuterJoin`. Furthermore, it is often very useful to do joins over windows of the streams. That is pretty easy as well. 
+Here, in each batch interval, the RDD generated by `stream1` will be joined with the RDD generated by `stream2`. You can also do `leftOuterJoin`, `rightOuterJoin`, `fullOuterJoin`. Furthermore, it is often very useful to do joins over windows of the streams. That is pretty easy as well.
--- End diff --

Indeed, it is my Sublime Text that does it automatically.
[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r93419478

--- Diff: examples/src/main/python/ml/dct_example.py ---
@@ -23,6 +23,7 @@
 # $example off$
 from pyspark.sql import SparkSession
+
--- End diff --

Is this needed?
[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r93419273

--- Diff: docs/streaming-programming-guide.md ---
@@ -1099,7 +1099,7 @@
 joinedStream = stream1.join(stream2)
 {% endhighlight %}
-Here, in each batch interval, the RDD generated by `stream1` will be joined with the RDD generated by `stream2`. You can also do `leftOuterJoin`, `rightOuterJoin`, `fullOuterJoin`. Furthermore, it is often very useful to do joins over windows of the streams. That is pretty easy as well. 
+Here, in each batch interval, the RDD generated by `stream1` will be joined with the RDD generated by `stream2`. You can also do `leftOuterJoin`, `rightOuterJoin`, `fullOuterJoin`. Furthermore, it is often very useful to do joins over windows of the streams. That is pretty easy as well.
--- End diff --

I'd avoid just taking whitespace off the end of lines. It's not related to your change, and it just makes it a little harder to see what your changes are.
[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r93419341

--- Diff: docs/streaming-programming-guide.md ---
@@ -1626,10 +1626,10 @@ See the full [source code]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_
 # Lazily instantiated global instance of SparkSession
 def getSparkSessionInstance(sparkConf):
     if ("sparkSessionSingletonInstance" not in globals()):
-        globals()["sparkSessionSingletonInstance"] = SparkSession \
-            .builder \
-            .config(conf=sparkConf) \
-            .getOrCreate()
+        globals()["sparkSessionSingletonInstance"] = (SparkSession
--- End diff --

OK. I don't feel qualified to judge that, but I'll take your word for it. However, do you really want to indent this so much?
[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r93419445

--- Diff: examples/src/main/python/ml/cross_validator.py ---
@@ -42,20 +42,22 @@
 # $example on$
 # Prepare training documents, which are labeled.
-training = spark.createDataFrame([
-    (0, "a b c d e spark", 1.0),
-    (1, "b d", 0.0),
-    (2, "spark f g h", 1.0),
-    (3, "hadoop mapreduce", 0.0),
-    (4, "b spark who", 1.0),
-    (5, "g d a y", 0.0),
-    (6, "spark fly", 1.0),
-    (7, "was mapreduce", 0.0),
-    (8, "e spark program", 1.0),
-    (9, "a e c l", 0.0),
-    (10, "spark compile", 1.0),
-    (11, "hadoop software", 0.0)
-], ["id", "text", "label"])
+training = spark.createDataFrame(
+    [
--- End diff --

Is this really a PEP 8 recommendation or just preference? I think the way it was is more consistent with other Spark code.
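Both layouts are equivalent at runtime, so the disagreement is purely stylistic. A minimal sketch using a hypothetical `create_frame` stand-in for `spark.createDataFrame` (so it runs without PySpark):

```python
def create_frame(data, schema):
    # Hypothetical stand-in for spark.createDataFrame: pair each row
    # tuple with the column names.
    return [dict(zip(schema, row)) for row in data]

# Layout the example originally used: the data literal opens on the call
# line, and the closing "], [...]" carries the schema.
training = create_frame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
], ["id", "text", "label"])

# Layout produced by the auto-formatter: every argument on its own line.
training_reformatted = create_frame(
    [
        (0, "a b c d e spark", 1.0),
        (1, "b d", 0.0),
    ],
    ["id", "text", "label"],
)

print(training == training_reformatted)  # True
```

PEP 8 allows both forms of hanging indent; neither is mandated, which supports keeping whichever is consistent with surrounding Spark code.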
[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14830#discussion_r93419361

--- Diff: docs/streaming-programming-guide.md ---
@@ -2105,7 +2105,7 @@ documentation), or set the `spark.default.parallelism`
 {:.no_toc}
 The overheads of data serialization can be reduced by tuning the serialization formats. In the case of streaming, there are two types of data that are being serialized.
-* **Input data**: By default, the input data received through Receivers is stored in the executors' memory with [StorageLevel.MEMORY_AND_DISK_SER_2](api/scala/index.html#org.apache.spark.storage.StorageLevel$). That is, the data is serialized into bytes to reduce GC overheads, and replicated for tolerating executor failures. Also, the data is kept first in memory, and spilled over to disk only if the memory is insufficient to hold all of the input data necessary for the streaming computation. This serialization obviously has overheads -- the receiver must deserialize the received data and re-serialize it using Spark's serialization format. 
+* **Input data**: By default, the input data received through Receivers is stored in the executors' memory with [StorageLevel.MEMORY_AND_DISK_SER_2](api/scala/index.html#org.apache.spark.storage.StorageLevel$). That is, the data is serialized into bytes to reduce GC overheads, and replicated for tolerating executor failures. Also, the data is kept first in memory, and spilled over to disk only if the memory is insufficient to hold all of the input data necessary for the streaming computation. This serialization obviously has overheads -- the receiver must deserialize the received data and re-serialize it using Spark's serialization format.
--- End diff --

What is the change here? I can't make it out.