[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-08-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14830





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-02-15 Thread Stibbons
Github user Stibbons commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r101263394
  
--- Diff: examples/src/main/python/mllib/naive_bayes_example.py ---
@@ -24,16 +24,17 @@
 
 from __future__ import print_function
 
+# $example on$
 import shutil
+# $example off$
 
 from pyspark import SparkContext
 # $example on$
 from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
 from pyspark.mllib.util import MLUtils
-
-
--- End diff --

Yes, because the two empty lines come after

 # $example off$





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-02-15 Thread Stibbons
Github user Stibbons commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r101263279
  
--- Diff: examples/src/main/python/mllib/streaming_linear_regression_example.py ---
@@ -25,13 +25,14 @@
 # $example off$
 
 from pyspark import SparkContext
-from pyspark.streaming import StreamingContext
 # $example on$
 from pyspark.mllib.linalg import Vectors
-from pyspark.mllib.regression import LabeledPoint
-from pyspark.mllib.regression import StreamingLinearRegressionWithSGD
+from pyspark.mllib.regression import (LabeledPoint,
+  StreamingLinearRegressionWithSGD)
--- End diff --

I actually prefer having a single import per line (it simplifies file management, 
multi-branch merges, and so on). I can revert this change.
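
For illustration, the two import styles in question, using the names from the hunk above (a sketch, not a rule):

    # One import per line: adding or removing a symbol touches exactly one line,
    # which tends to keep diffs small and multi-branch merges conflict-free.
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.regression import StreamingLinearRegressionWithSGD

    # Parenthesized grouped import: shorter, but inserting a new name can reflow
    # the whole group and touch lines that did not otherwise change.
    from pyspark.mllib.regression import (LabeledPoint,
                                          StreamingLinearRegressionWithSGD)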





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-02-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r101263268
  
--- Diff: examples/src/main/python/mllib/naive_bayes_example.py ---
@@ -24,16 +24,17 @@
 
 from __future__ import print_function
 
+# $example on$
 import shutil
+# $example off$
 
 from pyspark import SparkContext
 # $example on$
 from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
 from pyspark.mllib.util import MLUtils
-
-
--- End diff --

If you happen to be unable to build the Python docs, I will check tomorrow 
to help.





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-02-15 Thread Stibbons
Github user Stibbons commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r101263182
  
--- Diff: examples/src/main/python/streaming/network_wordjoinsentiments.py ---
@@ -54,22 +54,25 @@ def print_happiest_words(rdd):
 
 # Read in the word-sentiment list and create a static RDD from it
 word_sentiments_file_path = "data/streaming/AFINN-111.txt"
-word_sentiments = ssc.sparkContext.textFile(word_sentiments_file_path) \
-.map(lambda line: tuple(line.split("\t")))
+word_sentiments = (ssc.sparkContext
+   .textFile(word_sentiments_file_path)
+   .map(lambda line: tuple(line.split("\t"))))
 
 lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
 
-word_counts = lines.flatMap(lambda line: line.split(" ")) \
-.map(lambda word: (word, 1)) \
-.reduceByKey(lambda a, b: a + b)
+word_counts = (lines
+   .flatMap(lambda line: line.split(" "))
+   .map(lambda word: (word, 1))
+   .reduceByKey(lambda a, b: a + b))
 
# Determine the words with the highest sentiment values by joining the streaming RDD
# with the static RDD inside the transform() method and then multiplying
# the frequency of the words by its sentiment value
-happiest_words = word_counts.transform(lambda rdd: word_sentiments.join(rdd)) \
-.map(lambda word_tuples: (word_tuples[0], float(word_tuples[1][0]) * word_tuples[1][1])) \
-.map(lambda word_happiness: (word_happiness[1], word_happiness[0])) \
-.transform(lambda rdd: rdd.sortByKey(False))
+happiest_words = (word_counts
+  .map(lambda word_tuples: (word_tuples[0],
+    float(word_tuples[1][0]) * word_tuples[1][1]))
--- End diff --

I agree; if you prefer, I can change them all at once. But as I said, I don't 
know of any autoformatter that does it automatically.





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-02-15 Thread Stibbons
Github user Stibbons commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r101262948
  
--- Diff: examples/src/main/python/ml/decision_tree_classification_example.py ---
@@ -65,8 +67,9 @@
 predictions.select("prediction", "indexedLabel", "features").show(5)
 
 # Select (prediction, true label) and compute test error
-evaluator = MulticlassClassificationEvaluator(
-labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
--- End diff --

The `pep8` tool can do this line break automatically when a line is longer 
than 100 characters. There is indeed no preference between this format and:

    evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel",
                                                  predictionCol="prediction",
                                                  metricName="accuracy")
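
For completeness, a self-contained sketch of the hanging-indent form from the hunk above; as far as I can tell, both layouts pass the pep8 checker (the app name here is just a placeholder):

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("IndentStyleSketch").getOrCreate()

    # Hanging indent: nothing after the opening parenthesis, arguments on the
    # following line(s) at one extra indentation level.
    evaluator = MulticlassClassificationEvaluator(
        labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")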






[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-02-15 Thread Stibbons
Github user Stibbons commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r101262606
  
--- Diff: examples/src/main/python/ml/count_vectorizer_example.py ---
@@ -17,23 +17,26 @@
 
 from __future__ import print_function
 
-from pyspark.sql import SparkSession
 # $example on$
 from pyspark.ml.feature import CountVectorizer
 # $example off$
+from pyspark.sql import SparkSession
+
 
 if __name__ == "__main__":
-spark = SparkSession\
-.builder\
-.appName("CountVectorizerExample")\
-.getOrCreate()
+spark = (SparkSession
+ .builder
+ .appName("CountVectorizerExample")
+ .getOrCreate())
 
 # $example on$
 # Input data: Each row is a bag of words with a ID.
-df = spark.createDataFrame([
-(0, "a b c".split(" ")),
-(1, "a b b c a".split(" "))
-], ["id", "words"])
+df = spark.createDataFrame(
+[
+(0, "a b c".split(" ")),
+(1, "a b b c a".split(" "))
+],
--- End diff --

Indeed, this is a recommendation, not an obligation. I find it looks more 
like Scala multi-line code, and I prefer it. It is a personal opinion, 
and I don't think there is a pylint/pep8 check that forbids using \.
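
For reference, a small self-contained sketch of the two continuation styles being discussed, using the builder chain from this example file:

    from pyspark.sql import SparkSession

    # Backslash continuation: legal, but PEP 8 prefers implied continuation.
    spark = SparkSession \
        .builder \
        .appName("CountVectorizerExample") \
        .getOrCreate()

    # Implied continuation inside parentheses: no trailing backslashes to keep
    # in sync when lines are added, removed, or reordered.
    spark = (SparkSession
             .builder
             .appName("CountVectorizerExample")
             .getOrCreate())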





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-02-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r101255529
  
--- Diff: docs/streaming-programming-guide.md ---
@@ -1626,10 +1626,10 @@ See the full [source 
code]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_
 # Lazily instantiated global instance of SparkSession
 def getSparkSessionInstance(sparkConf):
 if ("sparkSessionSingletonInstance" not in globals()):
-globals()["sparkSessionSingletonInstance"] = SparkSession \
-.builder \
-.config(conf=sparkConf) \
-.getOrCreate()
+globals()["sparkSessionSingletonInstance"] = (SparkSession
--- End diff --

Maybe I am wrong. Could you provide the reference? 

> recommended by pep8

Do you mean this part?

> The preferred way of wrapping long lines is by using Python's implied line continuation inside parentheses, brackets and braces.
> Long lines can be broken over multiple lines by wrapping expressions in parentheses.
> These should be used in preference to using a backslash for line continuation.

I know the rule for binary operators follows this, but I don't think this case 
is disallowed. I am not sure it is worth sweeping through all of them; the 
parenthesized forms look preferred, but the backslash forms do not break pep8.





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-02-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r101256136
  
--- Diff: examples/src/main/python/ml/count_vectorizer_example.py ---
@@ -17,23 +17,26 @@
 
 from __future__ import print_function
 
-from pyspark.sql import SparkSession
 # $example on$
 from pyspark.ml.feature import CountVectorizer
 # $example off$
+from pyspark.sql import SparkSession
+
 
 if __name__ == "__main__":
-spark = SparkSession\
-.builder\
-.appName("CountVectorizerExample")\
-.getOrCreate()
+spark = (SparkSession
+ .builder
+ .appName("CountVectorizerExample")
+ .getOrCreate())
 
 # $example on$
 # Input data: Each row is a bag of words with a ID.
-df = spark.createDataFrame([
-(0, "a b c".split(" ")),
-(1, "a b b c a".split(" "))
-], ["id", "words"])
+df = spark.createDataFrame(
+[
+(0, "a b c".split(" ")),
+(1, "a b b c a".split(" "))
+],
--- End diff --

Could you double-check whether it really does not follow pep8? I have seen the 
removed syntax more often (e.g., in `numpy`).





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-02-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r101257317
  
--- Diff: examples/src/main/python/streaming/network_wordjoinsentiments.py ---
@@ -54,22 +54,25 @@ def print_happiest_words(rdd):
 
 # Read in the word-sentiment list and create a static RDD from it
 word_sentiments_file_path = "data/streaming/AFINN-111.txt"
-word_sentiments = ssc.sparkContext.textFile(word_sentiments_file_path) \
-.map(lambda line: tuple(line.split("\t")))
+word_sentiments = (ssc.sparkContext
+   .textFile(word_sentiments_file_path)
+   .map(lambda line: tuple(line.split("\t"))))
 
 lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
 
-word_counts = lines.flatMap(lambda line: line.split(" ")) \
-.map(lambda word: (word, 1)) \
-.reduceByKey(lambda a, b: a + b)
+word_counts = (lines
+   .flatMap(lambda line: line.split(" "))
+   .map(lambda word: (word, 1))
+   .reduceByKey(lambda a, b: a + b))
 
# Determine the words with the highest sentiment values by joining the streaming RDD
# with the static RDD inside the transform() method and then multiplying
# the frequency of the words by its sentiment value
-happiest_words = word_counts.transform(lambda rdd: word_sentiments.join(rdd)) \
-.map(lambda word_tuples: (word_tuples[0], float(word_tuples[1][0]) * word_tuples[1][1])) \
-.map(lambda word_happiness: (word_happiness[1], word_happiness[0])) \
-.transform(lambda rdd: rdd.sortByKey(False))
+happiest_words = (word_counts
+  .map(lambda word_tuples: (word_tuples[0],
+    float(word_tuples[1][0]) * word_tuples[1][1]))
--- End diff --

Multi-line `lambda`s? Is this really recommended by pep8?
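
One way to avoid a lambda that spans several lines is to name the mapping step instead; a sketch that reuses `word_counts` and `word_sentiments` from the example above (so it is not self-contained, and `word_sentiment_value` is just an illustrative name):

    def word_sentiment_value(word_tuples):
        # After the join, word_tuples is (word, (sentiment, count)).
        word, (sentiment, count) = word_tuples
        return (word, float(sentiment) * count)

    happiest_words = (word_counts
                      .transform(lambda rdd: word_sentiments.join(rdd))
                      .map(word_sentiment_value)
                      .map(lambda word_happiness: (word_happiness[1], word_happiness[0]))
                      .transform(lambda rdd: rdd.sortByKey(False)))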





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-02-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r101256694
  
--- Diff: examples/src/main/python/ml/decision_tree_classification_example.py ---
@@ -65,8 +67,9 @@
 predictions.select("prediction", "indexedLabel", "features").show(5)
 
 # Select (prediction, true label) and compute test error
-evaluator = MulticlassClassificationEvaluator(
-labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
--- End diff --

Hm.. does pep8 have a different argument-placement rule for classes and 
functions? This one already seems fine, and the change seems inconsistent with 
https://github.com/apache/spark/pull/14830/files#diff-82fe155d22aaaf433e949193d262c736R43





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-02-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r101259038
  
--- Diff: examples/src/main/python/mllib/linear_regression_with_sgd_example.py ---
@@ -22,9 +22,11 @@
 
 from pyspark import SparkContext
 # $example on$
-from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
+from pyspark.mllib.regression import (LabeledPoint, LinearRegressionModel,
+  LinearRegressionWithSGD)
--- End diff --

This one also does not look like it exceeds the 100-character length limit.





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-02-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r101258928
  
--- Diff: examples/src/main/python/mllib/naive_bayes_example.py ---
@@ -24,16 +24,17 @@
 
 from __future__ import print_function
 
+# $example on$
 import shutil
+# $example off$
 
 from pyspark import SparkContext
 # $example on$
 from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
 from pyspark.mllib.util import MLUtils
-
-
--- End diff --

Could I ask you to check whether the rendered example still complies with pep8?





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-02-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r101258974
  
--- Diff: examples/src/main/python/mllib/logistic_regression_with_lbfgs_example.py ---
@@ -22,9 +22,11 @@
 
 from pyspark import SparkContext
 # $example on$
-from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
+from pyspark.mllib.classification import (LogisticRegressionModel,
+  LogisticRegressionWithLBFGS)
--- End diff --

This one also does not look like it exceeds the 100-character length limit.





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-02-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r101258832
  
--- Diff: examples/src/main/python/mllib/power_iteration_clustering_example.py ---
@@ -19,7 +19,9 @@
 
 from pyspark import SparkContext
 # $example on$
-from pyspark.mllib.clustering import PowerIterationClustering, PowerIterationClusteringModel
+from pyspark.mllib.clustering import (PowerIterationClustering,
+  PowerIterationClusteringModel)
--- End diff --

This one also does not seem to exceed 100.





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-02-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r101258524
  
--- Diff: examples/src/main/python/mllib/streaming_linear_regression_example.py ---
@@ -25,13 +25,14 @@
 # $example off$
 
 from pyspark import SparkContext
-from pyspark.streaming import StreamingContext
 # $example on$
 from pyspark.mllib.linalg import Vectors
-from pyspark.mllib.regression import LabeledPoint
-from pyspark.mllib.regression import StreamingLinearRegressionWithSGD
+from pyspark.mllib.regression import (LabeledPoint,
+  StreamingLinearRegressionWithSGD)
--- End diff --

This does not exceed the 100-character line length, does it? To my knowledge, 
Spark limits lines to 100 characters (not the default 80).
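
A quick check of the import as it was written before the change (it sits at module level, so there is no indentation to add):

    line = "from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD"
    print(len(line))  # 83 characters, comfortably under the 100-character limit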





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-02-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r101245188
  
--- Diff: examples/src/main/python/ml/cross_validator.py ---
@@ -42,20 +42,22 @@
 
 # $example on$
 # Prepare training documents, which are labeled.
-training = spark.createDataFrame([
-(0, "a b c d e spark", 1.0),
-(1, "b d", 0.0),
-(2, "spark f g h", 1.0),
-(3, "hadoop mapreduce", 0.0),
-(4, "b spark who", 1.0),
-(5, "g d a y", 0.0),
-(6, "spark fly", 1.0),
-(7, "was mapreduce", 0.0),
-(8, "e spark program", 1.0),
-(9, "a e c l", 0.0),
-(10, "spark compile", 1.0),
-(11, "hadoop software", 0.0)
-], ["id", "text", "label"])
+training = spark.createDataFrame(
+[
--- End diff --

It'd be great if we had some references or quotes.





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-02-13 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r93996278
  
--- Diff: examples/src/main/python/logistic_regression.py ---
@@ -29,7 +29,6 @@
 import numpy as np
 from pyspark.sql import SparkSession
 
--- End diff --

Why did you remove the double newlines after the end of the imports?





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-02-13 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r93996113
  
--- Diff: examples/src/main/python/mllib/bisecting_k_means_example.py ---
@@ -20,11 +20,11 @@
 # $example on$
 from numpy import array
 # $example off$
-
 from pyspark import SparkContext
 # $example on$
 from pyspark.mllib.clustering import BisectingKMeans, BisectingKMeansModel
 # $example off$
+#
--- End diff --

What's this for?





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2017-01-09 Thread Stibbons
Github user Stibbons commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r95129312
  
--- Diff: examples/src/main/python/ml/generalized_linear_regression_example.py ---
@@ -17,9 +17,10 @@
 
 from __future__ import print_function
 
-from pyspark.sql import SparkSession
 # $example on$
 from pyspark.ml.regression import GeneralizedLinearRegression
+from pyspark.sql import SparkSession
+
--- End diff --

Fixed





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2016-12-27 Thread Stibbons
Github user Stibbons commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r93977216
  
--- Diff: examples/src/main/python/ml/cross_validator.py ---
@@ -42,20 +42,22 @@
 
 # $example on$
 # Prepare training documents, which are labeled.
-training = spark.createDataFrame([
-(0, "a b c d e spark", 1.0),
-(1, "b d", 0.0),
-(2, "spark f g h", 1.0),
-(3, "hadoop mapreduce", 0.0),
-(4, "b spark who", 1.0),
-(5, "g d a y", 0.0),
-(6, "spark fly", 1.0),
-(7, "was mapreduce", 0.0),
-(8, "e spark program", 1.0),
-(9, "a e c l", 0.0),
-(10, "spark compile", 1.0),
-(11, "hadoop software", 0.0)
-], ["id", "text", "label"])
+training = spark.createDataFrame(
+[
--- End diff --

I cannot remember if I did that manually or if it was done by the `pep8` 
tool. I cannot work on it until next week.





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2016-12-27 Thread Stibbons
Github user Stibbons commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r93977062
  
--- Diff: examples/src/main/python/ml/generalized_linear_regression_example.py ---
@@ -17,9 +17,10 @@
 
 from __future__ import print_function
 
-from pyspark.sql import SparkSession
 # $example on$
 from pyspark.ml.regression import GeneralizedLinearRegression
+from pyspark.sql import SparkSession
+
--- End diff --

indeed





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2016-12-27 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r93974494
  
--- Diff: examples/src/main/python/ml/generalized_linear_regression_example.py ---
@@ -17,9 +17,10 @@
 
 from __future__ import print_function
 
-from pyspark.sql import SparkSession
 # $example on$
 from pyspark.ml.regression import GeneralizedLinearRegression
+from pyspark.sql import SparkSession
+
--- End diff --

It's weird to have one blank line inside the example and one outside; 
generally we seem to leave the blank lines out of the others.





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2016-12-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r93613663
  
--- Diff: examples/src/main/python/ml/vector_slicer_example.py ---
@@ -20,15 +20,18 @@
 # $example on$
 from pyspark.ml.feature import VectorSlicer
 from pyspark.ml.linalg import Vectors
-from pyspark.sql.types import Row
 # $example off$
 from pyspark.sql import SparkSession
+# $example on$
--- End diff --

Oh OK, if this is for a particular reason and not just redundant, fine.





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2016-12-22 Thread Stibbons
Github user Stibbons commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r93613427
  
--- Diff: examples/src/main/python/ml/vector_slicer_example.py ---
@@ -20,15 +20,18 @@
 # $example on$
 from pyspark.ml.feature import VectorSlicer
 from pyspark.ml.linalg import Vectors
-from pyspark.sql.types import Row
 # $example off$
 from pyspark.sql import SparkSession
+# $example on$
--- End diff --

The SparkSession import is usually hidden in the other examples. I can add it 
if you really want.
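
For context, a sketch of how the tags could end up laid out in vector_slicer_example.py after this change; as I understand it, the doc build only renders the lines between `$example on$` and `$example off$`, which is why the SparkSession import stays hidden (the re-sorted Row import presumably lands in the second tagged block):

    # $example on$
    from pyspark.ml.feature import VectorSlicer
    from pyspark.ml.linalg import Vectors
    # $example off$
    from pyspark.sql import SparkSession
    # $example on$
    from pyspark.sql.types import Row
    # $example off$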





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2016-12-21 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r93474130
  
--- Diff: examples/src/main/python/ml/vector_slicer_example.py ---
@@ -20,15 +20,18 @@
 # $example on$
 from pyspark.ml.feature import VectorSlicer
 from pyspark.ml.linalg import Vectors
-from pyspark.sql.types import Row
 # $example off$
 from pyspark.sql import SparkSession
+# $example on$
--- End diff --

I think we can wrap the whole block in one set of example on/off tags





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2016-12-21 Thread Stibbons
Github user Stibbons commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r93424188
  
--- Diff: docs/streaming-programming-guide.md ---
@@ -2105,7 +2105,7 @@ documentation), or set the `spark.default.parallelism`
 {:.no_toc}
 The overheads of data serialization can be reduced by tuning the 
serialization formats. In the case of streaming, there are two types of data 
that are being serialized.
 
-* **Input data**: By default, the input data received through Receivers is 
stored in the executors' memory with 
[StorageLevel.MEMORY_AND_DISK_SER_2](api/scala/index.html#org.apache.spark.storage.StorageLevel$).
 That is, the data is serialized into bytes to reduce GC overheads, and 
replicated for tolerating executor failures. Also, the data is kept first in 
memory, and spilled over to disk only if the memory is insufficient to hold all 
of the input data necessary for the streaming computation. This serialization 
obviously has overheads -- the receiver must deserialize the received data and 
re-serialize it using Spark's serialization format. 
+* **Input data**: By default, the input data received through Receivers is 
stored in the executors' memory with 
[StorageLevel.MEMORY_AND_DISK_SER_2](api/scala/index.html#org.apache.spark.storage.StorageLevel$).
 That is, the data is serialized into bytes to reduce GC overheads, and 
replicated for tolerating executor failures. Also, the data is kept first in 
memory, and spilled over to disk only if the memory is insufficient to hold all 
of the input data necessary for the streaming computation. This serialization 
obviously has overheads -- the receiver must deserialize the received data and 
re-serialize it using Spark's serialization format.
--- End diff --

There is an extra space at the end of the line; GitHub doesn't display it.





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2016-12-21 Thread Stibbons
Github user Stibbons commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r93424082
  
--- Diff: docs/streaming-programming-guide.md ---
@@ -1626,10 +1626,10 @@ See the full [source 
code]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_
 # Lazily instantiated global instance of SparkSession
 def getSparkSessionInstance(sparkConf):
 if ("sparkSessionSingletonInstance" not in globals()):
-globals()["sparkSessionSingletonInstance"] = SparkSession \
-.builder \
-.config(conf=sparkConf) \
-.getOrCreate()
+globals()["sparkSessionSingletonInstance"] = (SparkSession
--- End diff --

My point was about the use of parentheses instead of the backslash, which is 
recommended by pep8. I can keep the indentation.





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2016-12-21 Thread Stibbons
Github user Stibbons commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r93423843
  
--- Diff: examples/src/main/python/ml/dct_example.py ---
@@ -23,6 +23,7 @@
 # $example off$
 from pyspark.sql import SparkSession
 
+
--- End diff --

pep8 recommends two empty lines after the import statements; the `pep8` tool 
does it automatically.
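
Concretely, a sketch of the layout the tool produces in dct_example.py (I am not certain pep8 itself requires the two blank lines before module-level code, but this is the result):

    from pyspark.sql import SparkSession


    if __name__ == "__main__":
        spark = SparkSession.builder.getOrCreate()  # example body continues as before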





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2016-12-21 Thread Stibbons
Github user Stibbons commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r93423576
  
--- Diff: docs/streaming-programming-guide.md ---
@@ -1099,7 +1099,7 @@ joinedStream = stream1.join(stream2)
 {% endhighlight %}
 
 
-Here, in each batch interval, the RDD generated by `stream1` will be 
joined with the RDD generated by `stream2`. You can also do `leftOuterJoin`, 
`rightOuterJoin`, `fullOuterJoin`. Furthermore, it is often very useful to do 
joins over windows of the streams. That is pretty easy as well. 
+Here, in each batch interval, the RDD generated by `stream1` will be 
joined with the RDD generated by `stream2`. You can also do `leftOuterJoin`, 
`rightOuterJoin`, `fullOuterJoin`. Furthermore, it is often very useful to do 
joins over windows of the streams. That is pretty easy as well.
--- End diff --

Indeed; it's my Sublime Text that does it automatically.





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2016-12-21 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r93419478
  
--- Diff: examples/src/main/python/ml/dct_example.py ---
@@ -23,6 +23,7 @@
 # $example off$
 from pyspark.sql import SparkSession
 
+
--- End diff --

Is this needed?





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2016-12-21 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r93419273
  
--- Diff: docs/streaming-programming-guide.md ---
@@ -1099,7 +1099,7 @@ joinedStream = stream1.join(stream2)
 {% endhighlight %}
 
 
-Here, in each batch interval, the RDD generated by `stream1` will be 
joined with the RDD generated by `stream2`. You can also do `leftOuterJoin`, 
`rightOuterJoin`, `fullOuterJoin`. Furthermore, it is often very useful to do 
joins over windows of the streams. That is pretty easy as well. 
+Here, in each batch interval, the RDD generated by `stream1` will be 
joined with the RDD generated by `stream2`. You can also do `leftOuterJoin`, 
`rightOuterJoin`, `fullOuterJoin`. Furthermore, it is often very useful to do 
joins over windows of the streams. That is pretty easy as well.
--- End diff --

I'd avoid just stripping whitespace off the ends of lines. It's not related to 
your change, and it just makes it a little harder to see what your changes are.





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2016-12-21 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r93419341
  
--- Diff: docs/streaming-programming-guide.md ---
@@ -1626,10 +1626,10 @@ See the full [source 
code]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_
 # Lazily instantiated global instance of SparkSession
 def getSparkSessionInstance(sparkConf):
 if ("sparkSessionSingletonInstance" not in globals()):
-globals()["sparkSessionSingletonInstance"] = SparkSession \
-.builder \
-.config(conf=sparkConf) \
-.getOrCreate()
+globals()["sparkSessionSingletonInstance"] = (SparkSession
--- End diff --

OK. I don't feel qualified to judge that, so I'll take your word for it. 
However, do you really want to indent this so much?





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2016-12-21 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r93419445
  
--- Diff: examples/src/main/python/ml/cross_validator.py ---
@@ -42,20 +42,22 @@
 
 # $example on$
 # Prepare training documents, which are labeled.
-training = spark.createDataFrame([
-(0, "a b c d e spark", 1.0),
-(1, "b d", 0.0),
-(2, "spark f g h", 1.0),
-(3, "hadoop mapreduce", 0.0),
-(4, "b spark who", 1.0),
-(5, "g d a y", 0.0),
-(6, "spark fly", 1.0),
-(7, "was mapreduce", 0.0),
-(8, "e spark program", 1.0),
-(9, "a e c l", 0.0),
-(10, "spark compile", 1.0),
-(11, "hadoop software", 0.0)
-], ["id", "text", "label"])
+training = spark.createDataFrame(
+[
--- End diff --

Is this really a pep8 recommendation, or just a preference? I think the way it 
was is more consistent with other Spark code.





[GitHub] spark pull request #14830: [SPARK-16992][PYSPARK][DOCS] import sort and auto...

2016-12-21 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14830#discussion_r93419361
  
--- Diff: docs/streaming-programming-guide.md ---
@@ -2105,7 +2105,7 @@ documentation), or set the `spark.default.parallelism`
 {:.no_toc}
 The overheads of data serialization can be reduced by tuning the 
serialization formats. In the case of streaming, there are two types of data 
that are being serialized.
 
-* **Input data**: By default, the input data received through Receivers is 
stored in the executors' memory with 
[StorageLevel.MEMORY_AND_DISK_SER_2](api/scala/index.html#org.apache.spark.storage.StorageLevel$).
 That is, the data is serialized into bytes to reduce GC overheads, and 
replicated for tolerating executor failures. Also, the data is kept first in 
memory, and spilled over to disk only if the memory is insufficient to hold all 
of the input data necessary for the streaming computation. This serialization 
obviously has overheads -- the receiver must deserialize the received data and 
re-serialize it using Spark's serialization format. 
+* **Input data**: By default, the input data received through Receivers is 
stored in the executors' memory with 
[StorageLevel.MEMORY_AND_DISK_SER_2](api/scala/index.html#org.apache.spark.storage.StorageLevel$).
 That is, the data is serialized into bytes to reduce GC overheads, and 
replicated for tolerating executor failures. Also, the data is kept first in 
memory, and spilled over to disk only if the memory is insufficient to hold all 
of the input data necessary for the streaming computation. This serialization 
obviously has overheads -- the receiver must deserialize the received data and 
re-serialize it using Spark's serialization format.
--- End diff --

What is the change here? I can't make it out.

