[GitHub] spark pull request #16119: [SPARK-18687][Pyspark][SQL]Backward compatibility...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16119

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16119#discussion_r93598376

--- Diff: python/pyspark/sql/tests.py ---
@@ -1851,6 +1851,71 @@ def test_hivecontext(self):
         self.assertIn("default", out.decode('utf-8'))
         self.assertTrue(os.path.exists(metastore_path))
+    def test_hivecontext_create_dataframe(self):
--- End diff --

I think we can just create two `SQLContext`s and compare whether they share the same `SparkSession` instance.
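The check viirya suggests can be sketched with plain stand-in classes (the `Session` and `Context` names here are hypothetical, not pyspark's real API): construct two context wrappers and assert they share a single session instance.

```python
# Minimal stand-in for the getOrCreate pattern under discussion.
# These classes are illustrative only -- not pyspark's SQLContext/SparkSession.

class Session:
    _instance = None  # process-wide default session

    @classmethod
    def get_or_create(cls):
        # Reuse the existing default session instead of creating a new one.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance


class Context:
    """Deprecated-style wrapper that should reuse the default session."""

    def __init__(self, session=None):
        self.session = session if session is not None else Session.get_or_create()


ctx1 = Context()
ctx2 = Context()
assert ctx1.session is ctx2.session  # only one Session was ever created
```

In real pyspark the analogous low-level assertion would compare the `sparkSession` attribute of two `SQLContext` objects for identity.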
Github user vijoshi commented on a diff in the pull request: https://github.com/apache/spark/pull/16119#discussion_r93596548

--- Diff: python/pyspark/sql/tests.py ---
@@ -1851,6 +1851,71 @@ def test_hivecontext(self):
         self.assertIn("default", out.decode('utf-8'))
         self.assertTrue(os.path.exists(metastore_path))
+    def test_hivecontext_create_dataframe(self):
--- End diff --

Did you mean to keep the same scenario, where multiple `SQLContext` objects are constructed, but make the test's success criterion verifying that no more than a single `SparkSession` object ever gets created?
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16119#discussion_r93596371

--- Diff: python/pyspark/sql/context.py ---
@@ -72,8 +72,13 @@ def __init__(self, sparkContext, sparkSession=None, jsqlContext=None):
         self._sc = sparkContext
         self._jsc = self._sc._jsc
         self._jvm = self._sc._jvm
+
         if sparkSession is None:
-            sparkSession = SparkSession(sparkContext)
+            if sparkContext is SparkContext._active_spark_context:
+                sparkSession = SparkSession.builder.getOrCreate()
--- End diff --

I found a related discussion about multiple `SparkContext`s at http://stackoverflow.com/questions/28259756/how-to-create-multiple-sparkcontexts-in-a-console. The use of multiple contexts is discouraged, even though you can do it in Scala/Java. In particular, you can't have multiple `SparkContext`s in Python. So I am not worried that a `SparkContext` other than the active one gets passed in when constructing `SQLContext`.
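The one-context rule viirya refers to can be illustrated with a small guard pattern (the `OneContext` class is hypothetical, not pyspark internals, though the real check behaves similarly and raises "Cannot run multiple SparkContexts at once"):

```python
# Illustrative guard: a second construction attempt fails fast, mirroring
# the error pyspark raises when a SparkContext already exists.

class OneContext:
    _active = None  # class-level record of the single live context

    def __init__(self):
        if OneContext._active is not None:
            raise ValueError("Cannot run multiple contexts at once")
        OneContext._active = self


first = OneContext()
try:
    OneContext()
except ValueError as err:
    second_error = err  # the second attempt is rejected
```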
Github user vijoshi commented on a diff in the pull request: https://github.com/apache/spark/pull/16119#discussion_r93594980

--- Diff: python/pyspark/sql/context.py ---
@@ -72,8 +72,13 @@ def __init__(self, sparkContext, sparkSession=None, jsqlContext=None):
         self._sc = sparkContext
         self._jsc = self._sc._jsc
         self._jvm = self._sc._jvm
+
         if sparkSession is None:
-            sparkSession = SparkSession(sparkContext)
+            if sparkContext is SparkContext._active_spark_context:
+                sparkSession = SparkSession.builder.getOrCreate()
+            else:
--- End diff --

@cloud-fan please refer to my comments above.
Github user vijoshi commented on a diff in the pull request: https://github.com/apache/spark/pull/16119#discussion_r93594802

--- Diff: python/pyspark/sql/context.py ---
@@ -72,8 +72,13 @@ def __init__(self, sparkContext, sparkSession=None, jsqlContext=None):
         self._sc = sparkContext
         self._jsc = self._sc._jsc
         self._jvm = self._sc._jvm
+
         if sparkSession is None:
-            sparkSession = SparkSession(sparkContext)
+            if sparkContext is SparkContext._active_spark_context:
+                sparkSession = SparkSession.builder.getOrCreate()
--- End diff --

1. The `SQLContext` `__init__` suggested that any `sparkSession` object could be injected, and the `SparkSession` `__init__` suggested that any `sparkContext` could be injected. If that were not intended, I would expect them both to invoke `getOrCreate()` to obtain instances of each other.
2. I found a few hits around the property `spark.driver.allowMultipleContexts`, which, though not a Python property, showed that there may be cases where someone (for testing etc.) wants the ability to bypass the one-`SparkContext` limitation.

With these two combined, I was not 100% sure someone couldn't pass in different `sparkContext` objects, or `sparkSession` objects tied to different `sparkContext`s, when constructing an `SQLContext`. I will admit I am not familiar with the Python side of things, so I hoped this PR review would be the best way to get this clarified :). If everyone agrees that we needn't worry about which `sparkSession` is passed in when constructing `SQLContext`, then I can get rid of the `if sparkContext is SparkContext._active_spark_context:` check.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16119#discussion_r93576154

--- Diff: python/pyspark/sql/context.py ---
@@ -72,8 +72,13 @@ def __init__(self, sparkContext, sparkSession=None, jsqlContext=None):
         self._sc = sparkContext
         self._jsc = self._sc._jsc
         self._jvm = self._sc._jvm
+
         if sparkSession is None:
-            sparkSession = SparkSession(sparkContext)
+            if sparkContext is SparkContext._active_spark_context:
+                sparkSession = SparkSession.builder.getOrCreate()
+            else:
--- End diff --

When will we hit the else branch? The if-else seems to be the only difference between this PR and https://github.com/apache/spark/pull/16369
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16119#discussion_r93575937

--- Diff: python/pyspark/sql/tests.py ---
@@ -1851,6 +1851,71 @@ def test_hivecontext(self):
         self.assertIn("default", out.decode('utf-8'))
         self.assertTrue(os.path.exists(metastore_path))
+    def test_hivecontext_create_dataframe(self):
--- End diff --

Can we write a more low-level unit test to check whether we create two `SparkSession`s?
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16119#discussion_r93556600

--- Diff: python/pyspark/sql/context.py ---
@@ -72,8 +72,13 @@ def __init__(self, sparkContext, sparkSession=None, jsqlContext=None):
         self._sc = sparkContext
         self._jsc = self._sc._jsc
         self._jvm = self._sc._jvm
+
         if sparkSession is None:
-            sparkSession = SparkSession(sparkContext)
+            if sparkContext is SparkContext._active_spark_context:
+                sparkSession = SparkSession.builder.getOrCreate()
--- End diff --

So, as you can see in that PR, I simply do `sparkSession = SparkSession.builder.getOrCreate()` here.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16119#discussion_r93556537

--- Diff: python/pyspark/sql/context.py ---
@@ -72,8 +72,13 @@ def __init__(self, sparkContext, sparkSession=None, jsqlContext=None):
         self._sc = sparkContext
         self._jsc = self._sc._jsc
         self._jvm = self._sc._jvm
+
         if sparkSession is None:
-            sparkSession = SparkSession(sparkContext)
+            if sparkContext is SparkContext._active_spark_context:
+                sparkSession = SparkSession.builder.getOrCreate()
--- End diff --

Yeah, I noticed this while working on the other PR. But we can have only one `SparkContext`; when you try to create another one, you get an error. So I don't see why we would need to pass in a `SparkContext` other than the active one.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16119#discussion_r93456555

--- Diff: python/pyspark/sql/context.py ---
@@ -72,8 +72,13 @@ def __init__(self, sparkContext, sparkSession=None, jsqlContext=None):
         self._sc = sparkContext
         self._jsc = self._sc._jsc
         self._jvm = self._sc._jvm
+
         if sparkSession is None:
-            sparkSession = SparkSession(sparkContext)
+            if sparkContext is SparkContext._active_spark_context:
+                sparkSession = SparkSession.builder.getOrCreate()
--- End diff --

@viirya since you opened a similar change, what do you think of this?
Github user vijoshi commented on a diff in the pull request: https://github.com/apache/spark/pull/16119#discussion_r92333556

--- Diff: python/pyspark/sql/context.py ---
@@ -72,8 +72,13 @@ def __init__(self, sparkContext, sparkSession=None, jsqlContext=None):
         self._sc = sparkContext
         self._jsc = self._sc._jsc
         self._jvm = self._sc._jvm
+
         if sparkSession is None:
-            sparkSession = SparkSession(sparkContext)
+            if sparkContext is SparkContext._active_spark_context:
+                sparkSession = SparkSession.builder.getOrCreate()
--- End diff --

Okay. I wanted to avoid adding code to the new `SparkSession` class to handle a compatibility issue arising from the now-deprecated `SQLContext` class. It looks like the Python implementation of the `SparkSession` builder does not allow a `SparkContext` to be passed in. Do we want to change the public builder interface for this?
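One way to avoid widening the public builder interface is for `getOrCreate()` to resolve the active context internally rather than accept one as a parameter. A hedged sketch with hypothetical classes (not the real pyspark `Builder`, which resolves the context via `SparkContext.getOrCreate()`):

```python
# Illustrative builder: no context parameter on get_or_create(); the
# active context is looked up internally. Hypothetical classes only.

class Context:
    _active = None

    @classmethod
    def get_or_create(cls):
        if cls._active is None:
            cls._active = cls()
        return cls._active


class Session:
    _default = None

    def __init__(self, context):
        self.context = context

    class Builder:
        def get_or_create(self):
            if Session._default is None:
                # No context parameter: look up the active one ourselves.
                Session._default = Session(Context.get_or_create())
            return Session._default

    builder = Builder()


s1 = Session.builder.get_or_create()
s2 = Session.builder.get_or_create()
assert s1 is s2 and s1.context is Context._active
```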
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16119#discussion_r92134494

--- Diff: python/pyspark/sql/context.py ---
@@ -72,8 +72,13 @@ def __init__(self, sparkContext, sparkSession=None, jsqlContext=None):
         self._sc = sparkContext
         self._jsc = self._sc._jsc
         self._jvm = self._sc._jvm
+
         if sparkSession is None:
-            sparkSession = SparkSession(sparkContext)
+            if sparkContext is SparkContext._active_spark_context:
+                sparkSession = SparkSession.builder.getOrCreate()
--- End diff --

I don't know enough to usefully review this, but would the better fix be to call the `getOrCreate` that lets you pass a context, and use that for both cases? I don't know whether the behavior would be the same, but it seems more natural to use the logic in `SparkSession` for this.