[GitHub] spark pull request #22576: [SPARK-25560][SQL] Allow FunctionInjection in Spa...
Github user RussellSpitzer commented on a diff in the pull request: https://github.com/apache/spark/pull/22576#discussion_r226622273 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala --- @@ -168,4 +173,21 @@ class SparkSessionExtensions { def injectParser(builder: ParserBuilder): Unit = { parserBuilders += builder } + + private[this] val injectedFunctions = mutable.Buffer.empty[FunctionDescription] + + private[sql] def registerFunctions(functionRegistry: FunctionRegistry) = { +for ((name, expressionInfo, function) <- injectedFunctions) { --- End diff -- Ha, we just changed a function in the opposite direction on my other commit. The project should probably pick one form and put it in the style guide. I'll make the change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...
Github user RussellSpitzer commented on a diff in the pull request: https://github.com/apache/spark/pull/21990#discussion_r225992993 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala --- @@ -1136,4 +1121,27 @@ object SparkSession extends Logging { SparkSession.clearDefaultSession() } } + + /** + * Initialize extensions if the user has defined a configurator class in their SparkConf. + * This class will be applied to the extensions passed into this function. + */ + private[sql] def applyExtensionsFromConf(conf: SparkConf, extensions: SparkSessionExtensions) { --- End diff -- Updated with replacement then :)
[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...
Github user RussellSpitzer commented on a diff in the pull request: https://github.com/apache/spark/pull/21990#discussion_r225805019 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala --- @@ -1136,4 +1121,27 @@ object SparkSession extends Logging { SparkSession.clearDefaultSession() } } + + /** + * Initialize extensions if the user has defined a configurator class in their SparkConf. + * This class will be applied to the extensions passed into this function. + */ + private[sql] def applyExtensionsFromConf(conf: SparkConf, extensions: SparkSessionExtensions) { --- End diff -- I am always a little nervous about having functions return objects they take in as parameters and then modify. It gives me the impression that they are stateless. If you think that this is clearer I can make the change.
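[Editor's note] The concern above can be sketched with a toy Python model (not Spark code; all names are hypothetical): a function that both mutates its argument and returns it can read as if it were building a fresh object, while an in-place mutator makes the side effect explicit.

```python
def apply_extensions_returning(conf, extensions):
    """Mutates `extensions` AND returns it -- callers may wrongly
    assume a fresh object was built from `conf`."""
    for cls in conf.get("spark.sql.extensions", []):
        extensions.append(cls)
    return extensions

def apply_extensions_in_place(conf, extensions):
    """Mutates `extensions` only; no return value, so the side
    effect on the argument is obvious at the call site."""
    for cls in conf.get("spark.sql.extensions", []):
        extensions.append(cls)

conf = {"spark.sql.extensions": ["com.example.MyExtensions"]}

ext = []
result = apply_extensions_returning(conf, ext)
# The "returned" object is the same mutated input, not a copy.
assert result is ext

ext2 = []
apply_extensions_in_place(conf, ext2)
assert ext2 == ["com.example.MyExtensions"]
```

Both variants do the same work; the difference is purely which contract the signature advertises.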
[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/21990 Addressed comments from @HyukjinKwon , I'm interested in @ueshin 's suggestions, but I can't figure out how to do that unless we bake it into the Extensions constructor. If we place it in the Sessions constructor then invocations of "newSession" will reapply existing extensions. I added a note in the code.
[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...
Github user RussellSpitzer commented on a diff in the pull request: https://github.com/apache/spark/pull/21990#discussion_r225613517 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala --- @@ -1136,4 +1121,27 @@ object SparkSession extends Logging { SparkSession.clearDefaultSession() } } + + /** + * Initialize extensions if the user has defined a configurator class in their SparkConf. + * This class will be applied to the extensions passed into this function. + */ + private[sql] def applyExtensionsFromConf(conf: SparkConf, extensions: SparkSessionExtensions) { --- End diff -- It's difficult here since I'm attempting to cause the least change in behavior for the old code paths :(
[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...
Github user RussellSpitzer commented on a diff in the pull request: https://github.com/apache/spark/pull/21990#discussion_r225612270 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala --- @@ -1136,4 +1121,27 @@ object SparkSession extends Logging { SparkSession.clearDefaultSession() } } + + /** + * Initialize extensions if the user has defined a configurator class in their SparkConf. + * This class will be applied to the extensions passed into this function. + */ + private[sql] def applyExtensionsFromConf(conf: SparkConf, extensions: SparkSessionExtensions) { --- End diff -- I thought about this and was worried about multiple invocations of the extensions, once every time the SparkSession is cloned.
[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...
Github user RussellSpitzer commented on a diff in the pull request: https://github.com/apache/spark/pull/21990#discussion_r225610975 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala --- @@ -1136,4 +1121,27 @@ object SparkSession extends Logging { SparkSession.clearDefaultSession() } } + + /** + * Initialize extensions if the user has defined a configurator class in their SparkConf. + * This class will be applied to the extensions passed into this function. + */ + private[sql] def applyExtensionsFromConf(conf: SparkConf, extensions: SparkSessionExtensions) { --- End diff -- The Default constructor of SparkSession?
[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...
Github user RussellSpitzer commented on a diff in the pull request: https://github.com/apache/spark/pull/21990#discussion_r225608605 --- Diff: python/pyspark/sql/session.py --- @@ -219,6 +219,7 @@ def __init__(self, sparkContext, jsparkSession=None): jsparkSession = self._jvm.SparkSession.getDefaultSession().get() else: jsparkSession = self._jvm.SparkSession(self._jsc.sc()) + --- End diff -- I'm addicted to whitespace apparently
[GitHub] spark issue #22576: [SPARK-25560][SQL] Allow FunctionInjection in SparkExten...
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/22576 Cleaned up
[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/21990 I'm fine with anything really, I still think the ideal solution is probably not to tie the creation of the py4j gateway to the SparkContext, but that's probably a much bigger refactor.
[GitHub] spark issue #22576: [SPARK-25560][SQL] Allow FunctionInjection in SparkExten...
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/22576 Ah, I was registering functions with the built-in registry, which is not reset. I've changed it to register only with a clone of the built-in registry. This would allow multiple extensions, or some sessions to use extensions and others not to.
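[Editor's note] The registry-cloning idea above can be sketched outside Spark with a toy Python model (all names hypothetical): each session gets a copy of the shared built-in registry, so injected functions never leak into the built-ins or into sessions created without extensions.

```python
# Toy stand-in for Spark's built-in function registry.
BUILTIN_FUNCTIONS = {"upper": str.upper, "lower": str.lower}

def create_session_registry(injected=()):
    # Clone the built-in registry instead of mutating it, so
    # extensions stay scoped to the session that asked for them.
    registry = dict(BUILTIN_FUNCTIONS)
    for name, builder in injected:
        registry[name] = builder
    return registry

with_ext = create_session_registry([("shout", lambda s: s.upper() + "!")])
plain = create_session_registry()

assert "shout" in with_ext
assert "shout" not in plain              # other sessions are unaffected
assert "shout" not in BUILTIN_FUNCTIONS  # built-ins are never mutated
```

Cloning buys session isolation at the cost of one copy per session, which is the trade-off the comment describes.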
[GitHub] spark issue #22576: [SPARK-25560][SQL] Allow FunctionInjection in SparkExten...
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/22576 Looks like the session with extensions from the test suite is leaking to other suites ... Investigating On Fri, Sep 28, 2018 at 11:25 AM UCB AMPLab wrote: > Test FAILed. > Refer to this link for build results (access rights to CI server needed): > > https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96757/ > Test FAILed. > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <https://github.com/apache/spark/pull/22576#issuecomment-425489950>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AAZNYTu9vCM3h2Z-SgFqxpB3PLKg5o7kks5ufk1igaJpZM4W9uz9> > .
[GitHub] spark issue #22576: [SPARK-25560][SQL] Allow FunctionInjection in SparkExten...
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/22576 @hvanhovell Made a full PR for the change we discussed. Also updated the signature to match the new defined types for the registry and Identifier.
[GitHub] spark pull request #22576: [SPARK-25560][SQL] Allow FunctionInjection in Spa...
GitHub user RussellSpitzer opened a pull request: https://github.com/apache/spark/pull/22576 [SPARK-25560][SQL] Allow FunctionInjection in SparkExtensions This allows an implementer of Spark Session Extensions to utilize a method "injectFunction" which will add a new function to the default Spark Session Catalogue. ## What changes were proposed in this pull request? Adds a new function to SparkSessionExtensions def injectFunction(functionDescription: FunctionDescription) Where function description is a new type type FunctionDescription = (FunctionIdentifier, FunctionBuilder) The functions are loaded in BaseSessionBuilder when the function registry does not have a parent function registry to get loaded from. ## How was this patch tested? New unit tests are added for the extension in SparkSessionExtensionSuite You can merge this pull request into a Git repository by running: $ git pull https://github.com/RussellSpitzer/spark SPARK-25560 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22576.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22576 commit 5fdca38e21448ccf2f24a68fbadc221c87195581 Author: Russell Spitzer Date: 2018-09-28T01:58:21Z [SPARK-25560][SQL] Allow FunctionInjection in SparkExtensions This allows an implementer of Spark Session Extensions to utilize a method "injectFunction" which will add a new function to the default Spark Session Catalogue.
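[Editor's note] The FunctionDescription plumbing described in the PR body can be modeled in a few lines of Python (a toy sketch, not the actual Spark API): extensions collect (identifier, builder) pairs via an injectFunction-style call, and a later registerFunctions pass copies them into a session's registry.

```python
class SessionExtensions:
    """Toy model of the injection hook described in the PR."""

    def __init__(self):
        self._injected_functions = []  # list of (name, builder) pairs

    def inject_function(self, function_description):
        # function_description mirrors the (FunctionIdentifier,
        # FunctionBuilder) tuple type described in the PR body.
        self._injected_functions.append(function_description)

    def register_functions(self, registry):
        # Applied once when the session's registry is built.
        for name, builder in self._injected_functions:
            registry[name] = builder

ext = SessionExtensions()
ext.inject_function(("plus_one", lambda x: x + 1))

registry = {}
ext.register_functions(registry)
assert registry["plus_one"](41) == 42
```

Deferring registration to a single register_functions pass is what lets the real implementation choose which registry (e.g. a clone of the built-ins) receives the injected functions.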
[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/21990 Added a new method of injecting extensions; this way the "getOrCreate" code from the Scala method is not needed. @HyukjinKwon
[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/21990 @HyukjinKwon so you want me to rewrite the code in Python? I will note SparkR is doing this exact same thing.
[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/21990 What I wanted was to just call the Scala methods, instead of having half the code in Scala and half in Python, but we create the JVM in the SparkContext creation code so this ends up not being a good method I think. We could just translate the rest of getOrCreate into Python, but then every time there is a patch to the code in Scala it will need a Python mod as well.
[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/21990 @HyukjinKwon So I've been staring at this for a while today, and I guess the big issue is that we always need to make a Python SparkContext to get a handle on the JavaGateway, so everything that happens before the context is made cannot just be a wrapper of SQL methods and must be reimplemented. Unless we decide to refactor the code so that the JVM is more generally available (probably not possible) we will be stuck with redoing code in Python ...
[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...
Github user RussellSpitzer commented on a diff in the pull request: https://github.com/apache/spark/pull/21990#discussion_r210941883 --- Diff: python/pyspark/sql/tests.py --- @@ -3563,6 +3563,51 @@ def test_query_execution_listener_on_collect_with_arrow(self): "The callback from the query execution listener should be called after 'toPandas'") +class SparkExtensionsTest(unittest.TestCase, SQLTestUtils): +# These tests are separate because it uses 'spark.sql.extensions' which is +# static and immutable. This can't be set or unset, for example, via `spark.conf`. + +@classmethod +def setUpClass(cls): +import glob +from pyspark.find_spark_home import _find_spark_home + +SPARK_HOME = _find_spark_home() +filename_pattern = ( +"sql/core/target/scala-*/test-classes/org/apache/spark/sql/" +"SparkSessionExtensionSuite.class") +if not glob.glob(os.path.join(SPARK_HOME, filename_pattern)): +raise unittest.SkipTest( +"'org.apache.spark.sql.SparkSessionExtensionSuite.' is not " +"available. Will skip the related tests.") + +# Note that 'spark.sql.extensions' is a static immutable configuration. +cls.spark = SparkSession.builder \ +.master("local[4]") \ +.appName(cls.__name__) \ +.config( +"spark.sql.extensions", +"org.apache.spark.sql.MyExtensions") \ --- End diff -- Reduce, reuse, recycle :)
[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...
Github user RussellSpitzer commented on a diff in the pull request: https://github.com/apache/spark/pull/21990#discussion_r210909196 --- Diff: python/pyspark/sql/tests.py --- @@ -3563,6 +3563,51 @@ def test_query_execution_listener_on_collect_with_arrow(self): "The callback from the query execution listener should be called after 'toPandas'") +class SparkExtensionsTest(unittest.TestCase, SQLTestUtils): +# These tests are separate because it uses 'spark.sql.extensions' which is +# static and immutable. This can't be set or unset, for example, via `spark.conf`. + +@classmethod +def setUpClass(cls): +import glob +from pyspark.find_spark_home import _find_spark_home + +SPARK_HOME = _find_spark_home() +filename_pattern = ( +"sql/core/target/scala-*/test-classes/org/apache/spark/sql/" +"SparkSessionExtensionSuite.class") +if not glob.glob(os.path.join(SPARK_HOME, filename_pattern)): +raise unittest.SkipTest( +"'org.apache.spark.sql.SparkSessionExtensionSuite.' is not " +"available. Will skip the related tests.") + +# Note that 'spark.sql.extensions' is a static immutable configuration. +cls.spark = SparkSession.builder \ +.master("local[4]") \ +.appName(cls.__name__) \ +.config( +"spark.sql.extensions", +"org.apache.spark.sql.MyExtensions") \ +.getOrCreate() + +@classmethod +def tearDownClass(cls): +cls.spark.stop() + +def tearDown(self): +self.spark._jvm.OnSuccessCall.clear() --- End diff -- Sounds good to me, I'll take that out.
[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...
Github user RussellSpitzer commented on a diff in the pull request: https://github.com/apache/spark/pull/21990#discussion_r210909083 --- Diff: python/pyspark/sql/tests.py --- @@ -3563,6 +3563,51 @@ def test_query_execution_listener_on_collect_with_arrow(self): "The callback from the query execution listener should be called after 'toPandas'") +class SparkExtensionsTest(unittest.TestCase, SQLTestUtils): +# These tests are separate because it uses 'spark.sql.extensions' which is +# static and immutable. This can't be set or unset, for example, via `spark.conf`. + +@classmethod +def setUpClass(cls): +import glob +from pyspark.find_spark_home import _find_spark_home + +SPARK_HOME = _find_spark_home() +filename_pattern = ( +"sql/core/target/scala-*/test-classes/org/apache/spark/sql/" +"SparkSessionExtensionSuite.class") +if not glob.glob(os.path.join(SPARK_HOME, filename_pattern)): +raise unittest.SkipTest( +"'org.apache.spark.sql.SparkSessionExtensionSuite.' is not " +"available. Will skip the related tests.") + +# Note that 'spark.sql.extensions' is a static immutable configuration. +cls.spark = SparkSession.builder \ +.master("local[4]") \ +.appName(cls.__name__) \ +.config( +"spark.sql.extensions", +"org.apache.spark.sql.MyExtensions") \ --- End diff -- I'm not sure what you mean here? This class already exists as part of the SparkSqlExtensions Suite and I'm just reusing it with no changes.
[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...
Github user RussellSpitzer commented on a diff in the pull request: https://github.com/apache/spark/pull/21990#discussion_r210428804 --- Diff: python/pyspark/sql/session.py --- @@ -218,7 +218,9 @@ def __init__(self, sparkContext, jsparkSession=None): .sparkContext().isStopped(): jsparkSession = self._jvm.SparkSession.getDefaultSession().get() else: -jsparkSession = self._jvm.SparkSession(self._jsc.sc()) +jsparkSession = self._jvm.SparkSession.builder() \ +.sparkContext(self._jsc.sc()) \ +.getOrCreate() --- End diff -- Yeah let me add in the test, and then I'll clear out all the python duplication of Scala code. I can make it more of a wrapper and less of a reimplementer.
[GitHub] spark pull request #21988: [SPARK-25003][PYSPARK][BRANCH-2.2] Use SessionExt...
Github user RussellSpitzer closed the pull request at: https://github.com/apache/spark/pull/21988
[GitHub] spark pull request #21989: [SPARK-25003][PYSPARK][BRANCH-2.3] Use SessionExt...
Github user RussellSpitzer closed the pull request at: https://github.com/apache/spark/pull/21989
[GitHub] spark issue #21988: [SPARK-25003][PYSPARK][BRANCH-2.2] Use SessionExtensions...
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/21988 @felixcheung I just didn't know what version to target so I made a PR for each one. We can just close the ones that shouldn't be merged.
[GitHub] spark issue #21989: [SPARK-25003][PYSPARK][BRANCH-2.3] Use SessionExtensions...
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/21989 @kiszk sure, it all depends on which branch the merge target should be; I wasn't sure which one was being used for changes of this nature. Technically it's a bug fix, I believe.
[GitHub] spark issue #21988: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/21988 Local PEP didn't seem to mind this code ... Fixed up the indentation so hopefully Jenkins will like it now
[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...
GitHub user RussellSpitzer opened a pull request: https://github.com/apache/spark/pull/21990 [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark Master ## What changes were proposed in this pull request? Previously PySpark used the private constructor for SparkSession when building that object. This resulted in a SparkSession without checking the sql.extensions parameter for additional session extensions. To fix this we instead use the Session.builder() path as SparkR uses; this loads the extensions and allows their use in PySpark. ## How was this patch tested? This was manually tested by passing a class to spark.sql.extensions and making sure its included strategies appeared in the spark._jsparkSession.sessionState.planner.strategies list. We could add an automatic test but I'm not very familiar with the PySpark testing framework, but I would be glad to implement that if requested. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/RussellSpitzer/spark SPARK-25003-master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21990.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21990 commit f790ae000ae1c3f4030162028186448d345e2984 Author: Russell Spitzer Date: 2018-08-03T16:04:00Z [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark Previously PySpark used the private constructor for SparkSession when building that object. This resulted in a SparkSession without checking the sql.extensions parameter for additional session extensions. To fix this we instead use the Session.builder() path as SparkR uses; this loads the extensions and allows their use in PySpark.
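[Editor's note] The bug described in the PR body above can be sketched as a toy Python model (names hypothetical, not the PySpark API): a "private constructor" path that never consults configuration versus a builder path that does, which is why switching PySpark to the builder picks up spark.sql.extensions.

```python
class Session:
    """Toy stand-in for SparkSession."""
    def __init__(self, extensions=None):
        self.extensions = extensions or []

def private_constructor(conf):
    # Old PySpark path: constructs the session directly and never
    # looks at the configured extensions.
    return Session()

def builder_get_or_create(conf):
    # Builder path: reads 'spark.sql.extensions' and applies it,
    # mirroring what the Scala builder does.
    return Session(extensions=conf.get("spark.sql.extensions", []))

conf = {"spark.sql.extensions": ["org.apache.spark.sql.MyExtensions"]}
assert private_constructor(conf).extensions == []
assert builder_get_or_create(conf).extensions == ["org.apache.spark.sql.MyExtensions"]
```

The fix is behavioral, not cosmetic: the two paths produce differently configured sessions from the same conf.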
[GitHub] spark pull request #21989: [SPARK-25003][PYSPARK] Use SessionExtensions in P...
GitHub user RussellSpitzer opened a pull request: https://github.com/apache/spark/pull/21989 [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark (Branch-2.3) ## What changes were proposed in this pull request? Previously PySpark used the private constructor for SparkSession when building that object. This resulted in a SparkSession without checking the sql.extensions parameter for additional session extensions. To fix this we instead use the Session.builder() path as SparkR uses; this loads the extensions and allows their use in PySpark. ## How was this patch tested? This was manually tested by passing a class to spark.sql.extensions and making sure its included strategies appeared in the spark._jsparkSession.sessionState.planner.strategies list. We could add an automatic test but I'm not very familiar with the PySpark testing framework, but I would be glad to implement that if requested. You can merge this pull request into a Git repository by running: $ git pull https://github.com/RussellSpitzer/spark SPARK-25003-branch-2.3 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21989.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21989
[GitHub] spark pull request #21988: [SPARK-25003][PYSPARK] Use SessionExtensions in P...
GitHub user RussellSpitzer opened a pull request: https://github.com/apache/spark/pull/21988 [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark ## What changes were proposed in this pull request? Previously PySpark used the private constructor for SparkSession when building that object. This resulted in a SparkSession without checking the sql.extensions parameter for additional session extensions. To fix this we instead use the Session.builder() path as SparkR uses; this loads the extensions and allows their use in PySpark. ## How was this patch tested? This was manually tested by passing a class to spark.sql.extensions and making sure its included strategies appeared in the spark._jsparkSession.sessionState.planner.strategies list. We could add an automatic test but I'm not very familiar with the PySpark testing framework, but I would be glad to implement that if requested. You can merge this pull request into a Git repository by running: $ git pull https://github.com/RussellSpitzer/spark SPARK-25003-branch-2.2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21988.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21988 commit 9c6e0671bf7311e2647e774c8d247c43037fc12c Author: Russell Spitzer Date: 2018-08-03T15:48:15Z [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark Previously PySpark used the private constructor for SparkSession when building that object. This resulted in a SparkSession without checking the sql.extensions parameter for additional session extensions. To fix this we instead use the Session.builder() path as SparkR uses; this loads the extensions and allows their use in PySpark.
[GitHub] spark pull request #21453: Test branch to see how Scala 2.11.12 performs
GitHub user RussellSpitzer opened a pull request: https://github.com/apache/spark/pull/21453 Test branch to see how Scala 2.11.12 performs This may be useful when Java 8 is no longer supported since Scala 2.11.12 supports later versions of Java ## What changes were proposed in this pull request? Change Scala Build Version to 2.11.12. ## How was this patch tested? This PR is made to run 2.11.12 Scala through Jenkins to see whether or not it passes cleanly. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/RussellSpitzer/spark Scala2_11_12_Test Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21453.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21453 commit 90d3842616aec94d603a68d44463eb043c5a66f9 Author: Russell Spitzer Date: 2018-05-29T18:17:23Z Test branch to see how Scala 2.11.12 performs This may be useful when Java 8 is no longer supported since Scala 2.11.12 supports later versions of Java
[GitHub] spark pull request #20190: [SPARK-22976][Core]: Cluster mode driver director...
Github user RussellSpitzer closed the pull request at: https://github.com/apache/spark/pull/20190
[GitHub] spark issue #20190: [SPARK-22976][Core]: Cluster mode driver directories can...
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/20190 @jerryshao https://github.com/apache/spark/pull/20298
[GitHub] spark pull request #20298: [SPARK-22976][Core]: Cluster mode driver dir remo...
GitHub user RussellSpitzer opened a pull request: https://github.com/apache/spark/pull/20298 [SPARK-22976][Core]: Cluster mode driver dir removed while running ## What changes were proposed in this pull request? The clean up logic on the worker previously determined the liveness of a particular application based on whether or not it had running executors. This would fail in the case that a directory was made for a driver running in cluster mode if that driver had no running executors on the same machine. To preserve driver directories we consider both executors and running drivers when checking directory liveness. ## How was this patch tested? Manually started up a two node cluster with a single core on each node. Turned on worker directory cleanup and set the interval to 1 second and liveness to one second. Without the patch the driver directory is removed immediately after the app is launched. With the patch it is not. ### Without Patch ``` INFO 2018-01-05 23:48:24,693 Logging.scala:54 - Asked to launch driver driver-20180105234824- INFO 2018-01-05 23:48:25,293 Logging.scala:54 - Changing view acls to: cassandra INFO 2018-01-05 23:48:25,293 Logging.scala:54 - Changing modify acls to: cassandra INFO 2018-01-05 23:48:25,294 Logging.scala:54 - Changing view acls groups to: INFO 2018-01-05 23:48:25,294 Logging.scala:54 - Changing modify acls groups to: INFO 2018-01-05 23:48:25,294 Logging.scala:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cassandra); groups with view permissions: Set(); users with modify permissions: Set(cassandra); groups with modify permissions: Set() INFO 2018-01-05 23:48:25,330 Logging.scala:54 - Copying user jar file:/home/automaton/writeRead-0.1.jar to /var/lib/spark/worker/driver-20180105234824-/writeRead-0.1.jar INFO 2018-01-05 23:48:25,332 Logging.scala:54 - Copying /home/automaton/writeRead-0.1.jar to /var/lib/spark/worker/driver-20180105234824-/writeRead-0.1.jar INFO 2018-01-05 23:48:25,361
Logging.scala:54 - Launch Command: "/usr/lib/jvm/jdk1.8.0_40//bin/java" INFO 2018-01-05 23:48:56,577 Logging.scala:54 - Removing directory: /var/lib/spark/worker/driver-20180105234824- ### << Cleaned up -- One minute passes while app runs (app has 1 minute sleep built in) -- WARN 2018-01-05 23:49:58,080 ShuffleSecretManager.java:73 - Attempted to unregister application app-20180105234831- when it is not registered INFO 2018-01-05 23:49:58,081 ExternalShuffleBlockResolver.java:163 - Application app-20180105234831- removed, cleanupLocalDirs = false INFO 2018-01-05 23:49:58,081 ExternalShuffleBlockResolver.java:163 - Application app-20180105234831- removed, cleanupLocalDirs = false INFO 2018-01-05 23:49:58,082 ExternalShuffleBlockResolver.java:163 - Application app-20180105234831- removed, cleanupLocalDirs = true INFO 2018-01-05 23:50:00,999 Logging.scala:54 - Driver driver-20180105234824- exited successfully ``` With Patch ``` INFO 2018-01-08 23:19:54,603 Logging.scala:54 - Asked to launch driver driver-20180108231954-0002 INFO 2018-01-08 23:19:54,975 Logging.scala:54 - Changing view acls to: automaton INFO 2018-01-08 23:19:54,976 Logging.scala:54 - Changing modify acls to: automaton INFO 2018-01-08 23:19:54,976 Logging.scala:54 - Changing view acls groups to: INFO 2018-01-08 23:19:54,976 Logging.scala:54 - Changing modify acls groups to: INFO 2018-01-08 23:19:54,976 Logging.scala:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(automaton); groups with view permissions: Set(); users with modify permissions: Set(automaton); groups with modify permissions: Set() INFO 2018-01-08 23:19:55,029 Logging.scala:54 - Copying user jar file:/home/automaton/writeRead-0.1.jar to /var/lib/spark/worker/driver-20180108231954-0002/writeRead-0.1.jar INFO 2018-01-08 23:19:55,031 Logging.scala:54 - Copying /home/automaton/writeRead-0.1.jar to /var/lib/spark/worker/driver-20180108231954-0002/writeRead-0.1.jar INFO 2018-01-08 23:19:55,038 
Logging.scala:54 - Launch Command: .. INFO 2018-01-08 23:21:28,674 ShuffleSecretManager.java:69 - Unregistered shuffle secret for application app-20180108232000- INFO 2018-01-08 23:21:28,675 ExternalShuffleBlockResolver.java:163 - Application app-20180108232000- removed, cleanupLocalDirs = false INFO 2018-01-08 23:21:28,675 ExternalShuffleBlockResolver.java:163 - Application app-20180108232000- removed, cleanupLocalDirs = false INFO 2018-01-08 23:21:28,681 ExternalShuffleBlockResolver.java:163 - Application app-20180108232000- removed, cleanupLocalDirs = true INFO 2018-01-08 23:21:31,703
[GitHub] spark issue #20201: [SPARK-22389][SQL] data source v2 partitioning reporting...
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/20201 This looks very exciting to me --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20190: [SPARK-22976][Core]: Cluster mode driver directories can...
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/20190 @zsxwing I think you were the last to touch this code, could you please review?
[GitHub] spark pull request #20190: [SPARK-22976][Core]: Cluster mode driver director...
GitHub user RussellSpitzer opened a pull request: https://github.com/apache/spark/pull/20190 [SPARK-22976][Core]: Cluster mode driver directories can be removed w… …hile running

## What changes were proposed in this pull request?

The clean up logic on the worker previously determined the liveness of a particular application based on whether or not it had running executors. This would fail in the case that a directory was made for a driver running in cluster mode if that driver had no running executors on the same machine. To preserve driver directories we consider both executors and running drivers when checking directory liveness.

## How was this patch tested?

Manually started up a two node cluster with a single core on each node. Turned on worker directory cleanup and set the interval to 1 second and liveness to one second. Without the patch the driver directory is removed immediately after the app is launched. With the patch it is not.

### Without Patch
```
INFO 2018-01-05 23:48:24,693 Logging.scala:54 - Asked to launch driver driver-20180105234824-
INFO 2018-01-05 23:48:25,293 Logging.scala:54 - Changing view acls to: cassandra
INFO 2018-01-05 23:48:25,293 Logging.scala:54 - Changing modify acls to: cassandra
INFO 2018-01-05 23:48:25,294 Logging.scala:54 - Changing view acls groups to:
INFO 2018-01-05 23:48:25,294 Logging.scala:54 - Changing modify acls groups to:
INFO 2018-01-05 23:48:25,294 Logging.scala:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cassandra); groups with view permissions: Set(); users with modify permissions: Set(cassandra); groups with modify permissions: Set()
INFO 2018-01-05 23:48:25,330 Logging.scala:54 - Copying user jar file:/home/automaton/writeRead-0.1.jar to /var/lib/spark/worker/driver-20180105234824-/writeRead-0.1.jar
INFO 2018-01-05 23:48:25,332 Logging.scala:54 - Copying /home/automaton/writeRead-0.1.jar to /var/lib/spark/worker/driver-20180105234824-/writeRead-0.1.jar
INFO 2018-01-05 23:48:25,361 Logging.scala:54 - Launch Command: "/usr/lib/jvm/jdk1.8.0_40//bin/java"
INFO 2018-01-05 23:48:56,577 Logging.scala:54 - Removing directory: /var/lib/spark/worker/driver-20180105234824-    ### << Cleaned up
-- One minute passes while app runs (app has 1 minute sleep built in) --
WARN 2018-01-05 23:49:58,080 ShuffleSecretManager.java:73 - Attempted to unregister application app-20180105234831- when it is not registered
INFO 2018-01-05 23:49:58,081 ExternalShuffleBlockResolver.java:163 - Application app-20180105234831- removed, cleanupLocalDirs = false
INFO 2018-01-05 23:49:58,081 ExternalShuffleBlockResolver.java:163 - Application app-20180105234831- removed, cleanupLocalDirs = false
INFO 2018-01-05 23:49:58,082 ExternalShuffleBlockResolver.java:163 - Application app-20180105234831- removed, cleanupLocalDirs = true
INFO 2018-01-05 23:50:00,999 Logging.scala:54 - Driver driver-20180105234824- exited successfully
```

### With Patch
```
INFO 2018-01-08 23:19:54,603 Logging.scala:54 - Asked to launch driver driver-20180108231954-0002
INFO 2018-01-08 23:19:54,975 Logging.scala:54 - Changing view acls to: automaton
INFO 2018-01-08 23:19:54,976 Logging.scala:54 - Changing modify acls to: automaton
INFO 2018-01-08 23:19:54,976 Logging.scala:54 - Changing view acls groups to:
INFO 2018-01-08 23:19:54,976 Logging.scala:54 - Changing modify acls groups to:
INFO 2018-01-08 23:19:54,976 Logging.scala:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(automaton); groups with view permissions: Set(); users with modify permissions: Set(automaton); groups with modify permissions: Set()
INFO 2018-01-08 23:19:55,029 Logging.scala:54 - Copying user jar file:/home/automaton/writeRead-0.1.jar to /var/lib/spark/worker/driver-20180108231954-0002/writeRead-0.1.jar
INFO 2018-01-08 23:19:55,031 Logging.scala:54 - Copying /home/automaton/writeRead-0.1.jar to /var/lib/spark/worker/driver-20180108231954-0002/writeRead-0.1.jar
INFO 2018-01-08 23:19:55,038 Logging.scala:54 - Launch Command: ..
INFO 2018-01-08 23:21:28,674 ShuffleSecretManager.java:69 - Unregistered shuffle secret for application app-20180108232000-
INFO 2018-01-08 23:21:28,675 ExternalShuffleBlockResolver.java:163 - Application app-20180108232000- removed, cleanupLocalDirs = false
INFO 2018-01-08 23:21:28,675 ExternalShuffleBlockResolver.java:163 - Application app-20180108232000- removed, cleanupLocalDirs = false
INFO 2018-01-08 23:21:28,681 ExternalShuffleBlockResolver.java:163 - Application app-20180108232000- removed, cleanupLocalDirs = true
```
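The change described above amounts to widening the liveness check so a directory survives if it belongs either to an app with running executors on this worker or to a driver running on this worker. A minimal sketch of that logic (hypothetical helper names; the real code lives in the Worker's cleanup task):

```scala
// Sketch of the widened directory-liveness check described above.
// Hypothetical helper names, not the actual Worker.scala code.
// Before the patch only appIdsWithExecutors was consulted, so a
// cluster-mode driver directory with no local executors was deleted.
def liveIds(appIdsWithExecutors: Set[String], runningDriverIds: Set[String]): Set[String] =
  appIdsWithExecutors ++ runningDriverIds

// A worker directory is cleaned up only if no live id matches its name.
def shouldCleanUp(dirName: String, live: Set[String]): Boolean =
  !live.exists(id => dirName.startsWith(id))
```

With only the executor set consulted, a `driver-…` directory never matches a live id and is removed mid-run; adding the running driver ids keeps it alive until the driver exits.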
[GitHub] spark pull request #19136: [DO NOT MERGE][SPARK-15689][SQL] data source v2
Github user RussellSpitzer commented on a diff in the pull request: https://github.com/apache/spark/pull/19136#discussion_r137531270

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import java.io.Serializable;
+
+/**
+ * A read task returned by a data source reader and is responsible to create the data reader.
+ * The relationship between `ReadTask` and `DataReader` is similar to `Iterable` and `Iterator`.
+ *
+ * Note that, the read task will be serialized and sent to executors, then the data reader will be
+ * created on executors and do the actual reading.
+ */
+public interface ReadTask extends Serializable {
+  /**
+   * The preferred locations for this read task to run faster, but Spark can't guarantee that this
+   * task will always run on these locations. Implementations should make sure that it can
+   * be run on any location.
+   */
+  default String[] preferredLocations() {
--- End diff --

These have previously only been ip/hostnames. To match the RDD definition I think we would have to continue with that.
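The comment above is about hostname/IP-based locality hints. A self-contained sketch of a task implementing such a hint (the trait below is a local stand-in for the interface in the diff, and all names are illustrative, not the real Spark API surface):

```scala
import java.io.Serializable

// Local stand-in for the ReadTask interface shown in the diff above;
// illustrative only, not the real Spark API surface.
trait ReadTask[T] extends Serializable {
  // Hostnames/IPs this task would prefer to run on. Spark treats these
  // purely as hints, so the task must still be runnable anywhere.
  def preferredLocations: Array[String] = Array.empty
  def createDataReader(): Iterator[T]
}

// Hypothetical task that prefers the replica nodes holding its data range.
final class ReplicaAwareReadTask(replicas: Seq[String]) extends ReadTask[String] {
  override def preferredLocations: Array[String] = replicas.toArray
  override def createDataReader(): Iterator[String] = Iterator.empty
}
```

Returning hostnames here mirrors `RDD.getPreferredLocations`, which is the precedent the comment refers to.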
[GitHub] spark pull request #19136: [DO NOT MERGE][SPARK-15689][SQL] data source v2
Github user RussellSpitzer commented on a diff in the pull request: https://github.com/apache/spark/pull/19136#discussion_r137333974

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.stream.Collectors;
+
+import org.apache.spark.annotation.Experimental;
+import org.apache.spark.annotation.InterfaceStability;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder;
+import org.apache.spark.sql.catalyst.encoders.RowEncoder;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * The main interface and minimal requirement for a data source reader. The implementations should
+ * at least implement the full scan logic, users can mix in more interfaces to implement scan
+ * optimizations like column pruning, filter push down, etc.
+ *
+ * There are mainly 2 kinds of scan optimizations:
+ * 1. push operators downward to the data source, e.g., column pruning, filter push down, etc.
+ * 2. propagate information upward to Spark, e.g., report statistics, report ordering, etc.
+ * Spark first applies all operator push down optimizations which this data source supports. Then
+ * Spark collects information this data source provides for further optimizations. Finally Spark
+ * issues the scan request and does the actual data reading.
--- End diff --

This would be really nice imho.
[GitHub] spark pull request #10655: [SPARK-12639][SQL] Improve Explain for Datasource...
Github user RussellSpitzer closed the pull request at: https://github.com/apache/spark/pull/10655 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark issue #10655: [SPARK-12639][SQL] Improve Explain for Datasources with ...
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/10655 We fixed this on a different pr https://github.com/apache/spark/pull/11317
[GitHub] spark pull request #11796: [SPARK-13579][build][test-maven] Stop building th...
Github user RussellSpitzer commented on a diff in the pull request: https://github.com/apache/spark/pull/11796#discussion_r73781398

--- Diff: assembly/pom.xml ---
@@ -69,6 +68,17 @@
  spark-repl_${scala.binary.version}
  ${project.version}
+
+
+
--- End diff --

This is a problem for the Spark Cassandra Connector. The Cassandra Java Driver requires a 16.0 or greater version of guava. This necessarily means we need to shade now. This was on our roadmap anyway; just wanted you to be aware.
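The shading mentioned above is typically done with a relocation rule, so the connector's newer Guava cannot clash with the Guava on Spark's classpath. A sketch using the Maven Shade Plugin (the shaded package prefix is illustrative; any project wanting this would pick its own):

```xml
<!-- Illustrative maven-shade-plugin fragment: relocate Guava classes so
     the connector bundles and uses its own copy under a shaded prefix. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```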
[GitHub] spark issue #11317: [SPARK-12639] [SQL] Mark Filters Fully Handled By Source...
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/11317 Updated
[GitHub] spark issue #13652: [SPARK-15613] [SQL] Fix incorrect days to millis convers...
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/13652 I would love to be able to just specify days since epoch rather than using java.sql.Date
[GitHub] spark issue #13652: [SPARK-] Fix incorrect days to millis conversion
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/13652 I think this is unfortunately the right thing to do. I wish we didn't have to use java.sql.Date :(
[GitHub] spark pull request: [SPARK-12639] [SQL] Mark Filters Fully Handled...
Github user RussellSpitzer commented on the pull request: https://github.com/apache/spark/pull/11317#issuecomment-218052162

I don't think this is because of me
```
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (sharedRuntime.cpp:834), pid=27637, tid=14737998592
# fatal error: exception happened outside interpreter, nmethods and vtable stubs at pc 0x7f54dd188231
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/jenkins/workspace/SparkPullRequestBuilder@2/hs_err_pid27637.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
Traceback (most recent call last):
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/SocketServer.py", line 655, in __init__
    self.handle()
  File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/accumulators.py", line 235, in handle
    num_updates = read_int(self.rfile)
  File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/serializers.py", line 545, in read_int
    raise EOFError
EOFError
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:40970)
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.9.2-src.zip/py4j/java_gateway.py", line 713, in start
    self.socket.connect((self.address, self.port))
  File "", line 1, in connect
error: [Errno 111] Connection refused
```
[GitHub] spark pull request: [SPARK-12639] [SQL] Mark Filters Fully Handled...
Github user RussellSpitzer commented on the pull request: https://github.com/apache/spark/pull/11317#issuecomment-218030539 @HyukjinKwon + @yhuai Sorry it took so long! Things have been busy :)
[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...
Github user RussellSpitzer commented on the pull request: https://github.com/apache/spark/pull/10655#issuecomment-217483240 Sorry I forgot about this, I'll clean this up tomorrow and get it ready
[GitHub] spark pull request: [SPARK-12639] [SQL] Mark Filters Fully Handled...
GitHub user RussellSpitzer opened a pull request: https://github.com/apache/spark/pull/11317 [SPARK-12639] [SQL] Mark Filters Fully Handled By Sources with *

## What changes were proposed in this pull request?

In order to make it clear which filters are fully handled by the underlying datasource we will mark them with a *. This will give a clear visual cue to users that the filter is being treated differently by catalyst than filters which are just presented to the underlying DataSource.

## How was this patch tested?

Manually tested with the Spark Cassandra Connector, a source which fully handles underlying filters. Now fully handled filters appear with an * next to their names. I can add an automated test as well if requested.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/RussellSpitzer/spark SPARK-12639-Star

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11317.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11317

commit c33daaee15fea96129ae91a899ec70429e4f66ec
Author: Russell Spitzer <russell.spit...@gmail.com>
Date: 2016-01-26T22:39:38Z

    SPARK-12639 SQL Mark Filters Fully Handled By Sources with *

    In order to make it clear which filters are fully handled by the underlying datasource we will mark them with a *. This will give a clear visual cue to users that the filter is being treated differently by catalyst than filters which are just presented to the underlying DataSource.
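The marking itself reduces to set membership over the pushed filters. A minimal sketch (hypothetical helper using plain strings; the real change lives in Spark's DataSourceStrategy metadata handling, keyed off which filters the source reports as fully handled):

```scala
// Sketch of the `*` marking described above: a pushed filter that the
// source fully handles gets a `*` prefix in the explain output.
// Hypothetical helper names, not the actual DataSourceStrategy code.
def markFullyHandled(pushed: Seq[String], fullyHandled: Set[String]): Seq[String] =
  pushed.map(f => if (fullyHandled.contains(f)) s"*$f" else f)
```

So a source that fully handles `b = 2` but only pre-filters `a > 1` would render the pushed list as `[a > 1, *b = 2]`.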
[GitHub] spark pull request: SPARK-12639 SQL Mark Filters Fully Handled By ...
Github user RussellSpitzer closed the pull request at: https://github.com/apache/spark/pull/10929
[GitHub] spark pull request: SPARK-12639 SQL Mark Filters Fully Handled By ...
GitHub user RussellSpitzer opened a pull request: https://github.com/apache/spark/pull/10929 SPARK-12639 SQL Mark Filters Fully Handled By Sources with *

In order to make it clear which filters are fully handled by the underlying datasource we will mark them with a *. This will give a clear visual cue to users that the filter is being treated differently by catalyst than filters which are just presented to the underlying DataSource.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/RussellSpitzer/spark SPARK-12639-Star

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10929.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10929

commit 40a75a4bd0075a42e4d28c849987c37d017c44c4
Author: Russell Spitzer <russell.spit...@gmail.com>
Date: 2016-01-26T22:39:38Z

    SPARK-12639 SQL Mark Filters Fully Handled By Sources with *

    In order to make it clear which filters are fully handled by the underlying datasource we will mark them with a *. This will give a clear visual cue to users that the filter is being treated differently by catalyst than filters which are just presented to the underlying DataSource.
[GitHub] spark pull request: [SPARK-13021][CORE] Fail fast when custom RDDs...
Github user RussellSpitzer commented on the pull request: https://github.com/apache/spark/pull/10932#issuecomment-175294425 I'm +1 on this in 2.0 :)
[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...
Github user RussellSpitzer commented on the pull request: https://github.com/apache/spark/pull/10655#issuecomment-174113049 Haven't forgotten; this will have a new PR soon :)
[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...
Github user RussellSpitzer commented on the pull request: https://github.com/apache/spark/pull/10655#issuecomment-172145152

I personally think the ambiguous `PUSHED_FILTERS` is more confusing. When we see a predicate there we have no idea whether or not it is a valid filter for the source at all. In the C* case, for example, this could contain clauses which have no way of being actually pushed down to the source. In my mind there are 3 categories of predicates:

* Those which cannot be pushed to the source at all
* Those which can be pushed to the source but may have false positives
* Those which can be pushed to the source and filter completely

Currently we can only tell whether a predicate is in one of the first two categories or in the third. This leaves us awkwardly stating that the source has had a predicate `Pushed` to it even when that is impossible. I like just stating the third category because that's the only thing we truly can be sure of given the current code. It would be better if the underlying source was able to qualify all filters into the above categories. So to me it is more confusing to say something is `Pushed` when it technically can't be than to say something is not `Pushed` when it might be. But YMMV; if you want to just go with asterisks that's fine with me too, just wanted to make my argument :D
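The three categories above could be modeled explicitly. A hypothetical sketch (Spark's actual hook, `BaseRelation.unhandledFilters`, is coarser: it only separates fully handled filters from everything else):

```scala
// Hypothetical modeling of the three predicate categories above; not a
// real Spark API. BaseRelation.unhandledFilters only distinguishes
// fully handled filters from the rest.
sealed trait PredicateSupport
case object NotPushable extends PredicateSupport            // Spark alone evaluates it
case object PushedWithFalsePositives extends PredicateSupport // source pre-filters, Spark re-checks
case object FullyHandled extends PredicateSupport           // source filters completely

// Only fully handled predicates can be dropped from Spark-side evaluation.
def needsPostFilter(support: PredicateSupport): Boolean = support match {
  case FullyHandled => false
  case _            => true
}
```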
[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...
Github user RussellSpitzer commented on the pull request: https://github.com/apache/spark/pull/10655#issuecomment-172136871 @yhuai I removed the PushedFilters and added the other examples. We could re-add the "PushedFilters" if you like; I wasn't sure if you still wanted that. I'm still not sure it's very valuable info, since everything is `Pushed` if I understood.
[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...
Github user RussellSpitzer commented on a diff in the pull request: https://github.com/apache/spark/pull/10655#discussion_r49342098

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala ---
@@ -114,6 +114,7 @@ private[sql] object PhysicalRDD {
   // Metadata keys
   val INPUT_PATHS = "InputPaths"
   val PUSHED_FILTERS = "PushedFilters"
+  val HANDLED_FILTERS = "HandledFilters"
--- End diff --

sgtm
[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...
Github user RussellSpitzer commented on the pull request: https://github.com/apache/spark/pull/10655#issuecomment-170064440 @rxin Added. Basically I think the current "PushedFilters" list isn't very valuable if everything is listed there. So instead we should just list those filters which the source can actually do something with. If there is a possibility that a source might do something (bloom-filter-like), we should have a third category, but currently there is no way of knowing (I think).
[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...
Github user RussellSpitzer commented on a diff in the pull request: https://github.com/apache/spark/pull/10655#discussion_r49153042

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala ---
@@ -114,6 +114,7 @@ private[sql] object PhysicalRDD {
   // Metadata keys
   val INPUT_PATHS = "InputPaths"
   val PUSHED_FILTERS = "PushedFilters"
+  val HANDLED_FILTERS = "HandledFilters"
--- End diff --

How about `FilteredAtSource`
[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...
Github user RussellSpitzer commented on a diff in the pull request: https://github.com/apache/spark/pull/10655#discussion_r49152876

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
@@ -321,8 +321,8 @@ private[sql] object DataSourceStrategy extends Strategy with Logging {
     val metadata: Map[String, String] = {
       val pairs = ArrayBuffer.empty[(String, String)]
-      if (pushedFilters.nonEmpty) {
-        pairs += (PUSHED_FILTERS -> pushedFilters.mkString("[", ", ", "]"))
+      if (handledPredicates.nonEmpty) {
+        pairs += (HANDLED_FILTERS -> handledPredicates.mkString("[", ", ", "]"))
--- End diff --

I thought 11663 meant all filters are pushed down regardless, so I wondered if that was redundant?
[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...
GitHub user RussellSpitzer opened a pull request: https://github.com/apache/spark/pull/10655

SPARK-12639 SQL Improve Explain for Datasources with Handled Predicates

SPARK-11661 makes all predicates get pushed down to the underlying Datasources regardless of whether the source can handle them. This makes the explain command slightly confusing, as it always lists all filters whether or not the underlying source can actually use them. Instead, we should list only those filters which are expressly handled by the underlying source; since every predicate is pushed down anyway, there is no value in listing them all.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/RussellSpitzer/spark SPARK-12639

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10655.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #10655

commit 42cc0e76148c39d84a5635058698262862a60e6f
Author: Russell Spitzer <russell.spit...@gmail.com>
Date: 2016-01-08T01:33:59Z

SPARK-12639: Improve Explain for Datasources with Handled Predicates
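The change described above boils down to building the explain metadata from the handled predicates rather than from the full pushed-down list. As a minimal, hypothetical sketch (the object name, the `buildMetadata` helper, and the example filter strings are illustrative, not Spark's actual API):

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch of the explain-metadata construction after the change:
// only predicates the source expressly handles are listed, since
// after SPARK-11661 every predicate is pushed down regardless.
object ExplainMetadataSketch {
  // Mirrors the HANDLED_FILTERS metadata key introduced in the diff.
  val HANDLED_FILTERS = "HandledFilters"

  def buildMetadata(handledPredicates: Seq[String]): Map[String, String] = {
    val pairs = ArrayBuffer.empty[(String, String)]
    // Emit the key only when the source actually handled something,
    // so explain output stays clean for sources that handle nothing.
    if (handledPredicates.nonEmpty) {
      pairs += (HANDLED_FILTERS -> handledPredicates.mkString("[", ", ", "]"))
    }
    pairs.toMap
  }
}
```

With this shape, a source that handles two filters would show `HandledFilters: [age > 21, name = 'x']` in explain output, while a source that handles none contributes no metadata entry at all.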
[GitHub] spark pull request: SPARK-11415: Remove timezone shift of Catalyst...
Github user RussellSpitzer commented on the pull request: https://github.com/apache/spark/pull/9369#issuecomment-157886376

np
[GitHub] spark pull request: SPARK-11415: Remove timezone shift of Catalyst...
Github user RussellSpitzer closed the pull request at: https://github.com/apache/spark/pull/9369