[GitHub] spark pull request #22576: [SPARK-25560][SQL] Allow FunctionInjection in Spa...

2018-10-19 Thread RussellSpitzer
Github user RussellSpitzer commented on a diff in the pull request:

https://github.com/apache/spark/pull/22576#discussion_r226622273
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala ---
@@ -168,4 +173,21 @@ class SparkSessionExtensions {
   def injectParser(builder: ParserBuilder): Unit = {
 parserBuilders += builder
   }
+
+  private[this] val injectedFunctions = mutable.Buffer.empty[FunctionDescription]
+
+  private[sql] def registerFunctions(functionRegistry: FunctionRegistry) = {
+    for ((name, expressionInfo, function) <- injectedFunctions) {
--- End diff --

Ha, we just changed a function in the opposite direction on my other commit. 
The project should probably pick one form and put it in the style guide. I'll 
make the change.


---




[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...

2018-10-17 Thread RussellSpitzer
Github user RussellSpitzer commented on a diff in the pull request:

https://github.com/apache/spark/pull/21990#discussion_r225992993
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala 
---
@@ -1136,4 +1121,27 @@ object SparkSession extends Logging {
   SparkSession.clearDefaultSession()
 }
   }
+
+  /**
+   * Initialize extensions if the user has defined a configurator class in their SparkConf.
+   * This class will be applied to the extensions passed into this function.
+   */
+  private[sql] def applyExtensionsFromConf(conf: SparkConf, extensions: SparkSessionExtensions) {
--- End diff --

Updated with replacement then :)


---




[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...

2018-10-17 Thread RussellSpitzer
Github user RussellSpitzer commented on a diff in the pull request:

https://github.com/apache/spark/pull/21990#discussion_r225805019
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala 
---
@@ -1136,4 +1121,27 @@ object SparkSession extends Logging {
   SparkSession.clearDefaultSession()
 }
   }
+
+  /**
+   * Initialize extensions if the user has defined a configurator class in their SparkConf.
+   * This class will be applied to the extensions passed into this function.
+   */
+  private[sql] def applyExtensionsFromConf(conf: SparkConf, extensions: SparkSessionExtensions) {
--- End diff --

I am always a little nervous about having functions return objects they 
take in as parameters and then modify; it gives the impression that they are 
stateless. If you think that this is clearer I can make the change. 
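
For illustration, here is a minimal sketch of the two shapes being weighed (the names and bodies are hypothetical, not the PR's actual code): a helper that mutates the extensions it is given and returns Unit, versus one that hands the same, now-modified, object back, which can read as if it were a stateless transform.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSessionExtensions

// Shape 1: mutate the passed-in extensions and return Unit; the side effect is explicit.
def applyExtensionsFromConf(conf: SparkConf, extensions: SparkSessionExtensions): Unit = {
  // e.g. look up a configurator class name in `conf` and apply it to `extensions`
}

// Shape 2: mutate and also return the same object; callers may assume they got a fresh copy.
def applyExtensionsFromConfReturning(
    conf: SparkConf,
    extensions: SparkSessionExtensions): SparkSessionExtensions = {
  applyExtensionsFromConf(conf, extensions)
  extensions
}
```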


---




[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark

2018-10-16 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/21990
  
Addressed comments from @HyukjinKwon. I'm interested in @ueshin's 
suggestions, but I can't figure out how to do that unless we bake it into the 
Extensions constructor. If we place it in the Session constructor then 
invocations of "newSession" will reapply existing extensions. I added a note in 
the code. 


---




[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...

2018-10-16 Thread RussellSpitzer
Github user RussellSpitzer commented on a diff in the pull request:

https://github.com/apache/spark/pull/21990#discussion_r225613517
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala 
---
@@ -1136,4 +1121,27 @@ object SparkSession extends Logging {
   SparkSession.clearDefaultSession()
 }
   }
+
+  /**
+   * Initialize extensions if the user has defined a configurator class in their SparkConf.
+   * This class will be applied to the extensions passed into this function.
+   */
+  private[sql] def applyExtensionsFromConf(conf: SparkConf, extensions: SparkSessionExtensions) {
--- End diff --

It's difficult here since I'm attempting to cause the least change in 
behavior for the old code paths :(


---




[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...

2018-10-16 Thread RussellSpitzer
Github user RussellSpitzer commented on a diff in the pull request:

https://github.com/apache/spark/pull/21990#discussion_r225612270
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala 
---
@@ -1136,4 +1121,27 @@ object SparkSession extends Logging {
   SparkSession.clearDefaultSession()
 }
   }
+
+  /**
+   * Initialize extensions if the user has defined a configurator class in their SparkConf.
+   * This class will be applied to the extensions passed into this function.
+   */
+  private[sql] def applyExtensionsFromConf(conf: SparkConf, extensions: SparkSessionExtensions) {
--- End diff --

I thought about this and was worried about multiple invocations of the 
extensions: once every time the SparkSession is cloned.


---




[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...

2018-10-16 Thread RussellSpitzer
Github user RussellSpitzer commented on a diff in the pull request:

https://github.com/apache/spark/pull/21990#discussion_r225610975
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala 
---
@@ -1136,4 +1121,27 @@ object SparkSession extends Logging {
   SparkSession.clearDefaultSession()
 }
   }
+
+  /**
+   * Initialize extensions if the user has defined a configurator class in their SparkConf.
+   * This class will be applied to the extensions passed into this function.
+   */
+  private[sql] def applyExtensionsFromConf(conf: SparkConf, extensions: SparkSessionExtensions) {
--- End diff --

The default constructor of SparkSession?


---




[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...

2018-10-16 Thread RussellSpitzer
Github user RussellSpitzer commented on a diff in the pull request:

https://github.com/apache/spark/pull/21990#discussion_r225608605
  
--- Diff: python/pyspark/sql/session.py ---
@@ -219,6 +219,7 @@ def __init__(self, sparkContext, jsparkSession=None):
 jsparkSession = self._jvm.SparkSession.getDefaultSession().get()
 else:
 jsparkSession = self._jvm.SparkSession(self._jsc.sc())
+
--- End diff --

I'm addicted to whitespace apparently


---




[GitHub] spark issue #22576: [SPARK-25560][SQL] Allow FunctionInjection in SparkExten...

2018-10-16 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/22576
  
Cleaned up


---




[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark

2018-09-29 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/21990
  
I'm fine with anything really, I still think the ideal solution is probably 
not to tie the creation of the py4j gateway to the SparkContext, but that's 
probably a much bigger refactor.


---




[GitHub] spark issue #22576: [SPARK-25560][SQL] Allow FunctionInjection in SparkExten...

2018-09-28 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/22576
  
Ah, I was registering functions with the built-in registry, which is not 
reset. I've changed it to register only with a clone of the built-in registry. 
This would allow multiple extensions, or some sessions to use extensions and 
others not to.
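
A rough sketch of that approach, assuming the three-element FunctionDescription shape visible in the review diff above (identifier, ExpressionInfo, builder); the function and names here are made up for illustration, not the merged code:

```scala
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.analysis.FunctionRegistry
import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionInfo, Literal}

// Hypothetical injected functions in (identifier, info, builder) form.
val injectedFunctions = Seq(
  (FunctionIdentifier("always_one"),
   new ExpressionInfo(classOf[Literal].getName, "always_one"),
   (_: Seq[Expression]) => Literal(1)))

// Register on a clone so the shared built-in registry is left untouched and
// sessions without extensions never see the injected functions.
val sessionRegistry = FunctionRegistry.builtin.clone()
for ((name, info, builder) <- injectedFunctions) {
  sessionRegistry.registerFunction(name, info, builder)
}
```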


---




[GitHub] spark issue #22576: [SPARK-25560][SQL] Allow FunctionInjection in SparkExten...

2018-09-28 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/22576
  
Looks like the session with extensions from the test suite is leaking to
other suites ... Investigating

On Fri, Sep 28, 2018 at 11:25 AM UCB AMPLab 
wrote:

> Test FAILed.
> Refer to this link for build results (access rights to CI server needed):
>
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96757/
> Test FAILed.
>



---




[GitHub] spark issue #22576: [SPARK-25560][SQL] Allow FunctionInjection in SparkExten...

2018-09-27 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/22576
  
@hvanhovell Made a full PR for the change we discussed. Also updated the 
signature to match the newly defined types for the registry and identifier. 


---




[GitHub] spark pull request #22576: [SPARK-25560][SQL] Allow FunctionInjection in Spa...

2018-09-27 Thread RussellSpitzer
GitHub user RussellSpitzer opened a pull request:

https://github.com/apache/spark/pull/22576

[SPARK-25560][SQL] Allow FunctionInjection in SparkExtensions

This allows an implementer of Spark Session Extensions to utilize a
method "injectFunction" which will add a new function to the default
Spark Session Catalogue.

## What changes were proposed in this pull request?

Adds a new function to SparkSessionExtensions 

def injectFunction(functionDescription: FunctionDescription)

Where function description is a new type

  type FunctionDescription = (FunctionIdentifier, FunctionBuilder)

The functions are loaded in BaseSessionStateBuilder when the function registry 
does not have a parent function registry to load from.
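
As a hedged sketch of how an extensions implementer might call the proposed method (following the three-element FunctionDescription that appears in the later review diff rather than the two-element form above; the class and function names are illustrative):

```scala
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionInfo, Literal}

// An extensions class is a SparkSessionExtensions => Unit, named via spark.sql.extensions.
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectFunction((
      FunctionIdentifier("always_one"),                        // SQL name of the injected function
      new ExpressionInfo(classOf[Literal].getName, "always_one"),
      (_: Seq[Expression]) => Literal(1)))                     // builder producing the expression
  }
}
```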

## How was this patch tested?

New unit tests are added for the extension in SparkSessionExtensionSuite


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/RussellSpitzer/spark SPARK-25560

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22576.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22576


commit 5fdca38e21448ccf2f24a68fbadc221c87195581
Author: Russell Spitzer 
Date:   2018-09-28T01:58:21Z

[SPARK-25560][SQL] Allow FunctionInjection in SparkExtensions

This allows an implementer of Spark Session Extensions to utilize a
method "injectFunction" which will add a new function to the default
Spark Session Catalogue.




---




[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark

2018-09-19 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/21990
  
Added a new method of injecting extensions; this way the "getOrCreate" code 
from the Scala method is not needed. @HyukjinKwon 


---




[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark

2018-09-18 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/21990
  
@HyukjinKwon So you want me to rewrite the code in Python? I will note that 
SparkR is doing this exact same thing.


---




[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark

2018-08-27 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/21990
  
What I wanted was to just call the Scala methods, instead of having half 
the code in Scala and half in Python, but we create the JVM in the SparkContext 
creation code, so this ends up not being a good approach, I think. We could just 
translate the rest of getOrCreate into Python, but then every time the Scala 
code is patched it will need a Python modification as well. 


---




[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark

2018-08-20 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/21990
  
@HyukjinKwon So I've been staring at this for a while today, and I guess 
the big issue is that we always need to make a Python SparkContext to get a 
handle on the JavaGateway, so everything that happens before the context is 
made cannot just be a wrapper of SQL methods and must be reimplemented. Unless 
we decide to refactor the code so that the JVM is more generally available 
(probably not possible) we will be stuck with redoing code in Python ... 


---




[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...

2018-08-17 Thread RussellSpitzer
Github user RussellSpitzer commented on a diff in the pull request:

https://github.com/apache/spark/pull/21990#discussion_r210941883
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3563,6 +3563,51 @@ def test_query_execution_listener_on_collect_with_arrow(self):
  "The callback from the query execution listener should be called after 'toPandas'")
 
 
+class SparkExtensionsTest(unittest.TestCase, SQLTestUtils):
+# These tests are separate because it uses 'spark.sql.extensions' which is
+# static and immutable. This can't be set or unset, for example, via `spark.conf`.
+
+@classmethod
+def setUpClass(cls):
+import glob
+from pyspark.find_spark_home import _find_spark_home
+
+SPARK_HOME = _find_spark_home()
+filename_pattern = (
+"sql/core/target/scala-*/test-classes/org/apache/spark/sql/"
+"SparkSessionExtensionSuite.class")
+if not glob.glob(os.path.join(SPARK_HOME, filename_pattern)):
+raise unittest.SkipTest(
+"'org.apache.spark.sql.SparkSessionExtensionSuite.' is not 
"
+"available. Will skip the related tests.")
+
+# Note that 'spark.sql.extensions' is a static immutable 
configuration.
+cls.spark = SparkSession.builder \
+.master("local[4]") \
+.appName(cls.__name__) \
+.config(
+"spark.sql.extensions",
+"org.apache.spark.sql.MyExtensions") \
--- End diff --

Reduce, reuse, recycle :)


---




[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...

2018-08-17 Thread RussellSpitzer
Github user RussellSpitzer commented on a diff in the pull request:

https://github.com/apache/spark/pull/21990#discussion_r210909196
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3563,6 +3563,51 @@ def test_query_execution_listener_on_collect_with_arrow(self):
  "The callback from the query execution listener should be called after 'toPandas'")
 
 
+class SparkExtensionsTest(unittest.TestCase, SQLTestUtils):
+# These tests are separate because it uses 'spark.sql.extensions' which is
+# static and immutable. This can't be set or unset, for example, via `spark.conf`.
+
+@classmethod
+def setUpClass(cls):
+import glob
+from pyspark.find_spark_home import _find_spark_home
+
+SPARK_HOME = _find_spark_home()
+filename_pattern = (
+"sql/core/target/scala-*/test-classes/org/apache/spark/sql/"
+"SparkSessionExtensionSuite.class")
+if not glob.glob(os.path.join(SPARK_HOME, filename_pattern)):
+raise unittest.SkipTest(
+"'org.apache.spark.sql.SparkSessionExtensionSuite.' is not 
"
+"available. Will skip the related tests.")
+
+# Note that 'spark.sql.extensions' is a static immutable 
configuration.
+cls.spark = SparkSession.builder \
+.master("local[4]") \
+.appName(cls.__name__) \
+.config(
+"spark.sql.extensions",
+"org.apache.spark.sql.MyExtensions") \
+.getOrCreate()
+
+@classmethod
+def tearDownClass(cls):
+cls.spark.stop()
+
+def tearDown(self):
+self.spark._jvm.OnSuccessCall.clear()
--- End diff --

Sounds good to me, I'll take that out.


---




[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...

2018-08-17 Thread RussellSpitzer
Github user RussellSpitzer commented on a diff in the pull request:

https://github.com/apache/spark/pull/21990#discussion_r210909083
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3563,6 +3563,51 @@ def test_query_execution_listener_on_collect_with_arrow(self):
  "The callback from the query execution listener should be called after 'toPandas'")
 
 
+class SparkExtensionsTest(unittest.TestCase, SQLTestUtils):
+# These tests are separate because it uses 'spark.sql.extensions' which is
+# static and immutable. This can't be set or unset, for example, via `spark.conf`.
+
+@classmethod
+def setUpClass(cls):
+import glob
+from pyspark.find_spark_home import _find_spark_home
+
+SPARK_HOME = _find_spark_home()
+filename_pattern = (
+"sql/core/target/scala-*/test-classes/org/apache/spark/sql/"
+"SparkSessionExtensionSuite.class")
+if not glob.glob(os.path.join(SPARK_HOME, filename_pattern)):
+raise unittest.SkipTest(
+"'org.apache.spark.sql.SparkSessionExtensionSuite.' is not 
"
+"available. Will skip the related tests.")
+
+# Note that 'spark.sql.extensions' is a static immutable 
configuration.
+cls.spark = SparkSession.builder \
+.master("local[4]") \
+.appName(cls.__name__) \
+.config(
+"spark.sql.extensions",
+"org.apache.spark.sql.MyExtensions") \
--- End diff --

I'm not sure what you mean here? This class already exists as part of the 
SparkSessionExtensionSuite and I'm just reusing it with no changes. 


---




[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...

2018-08-15 Thread RussellSpitzer
Github user RussellSpitzer commented on a diff in the pull request:

https://github.com/apache/spark/pull/21990#discussion_r210428804
  
--- Diff: python/pyspark/sql/session.py ---
@@ -218,7 +218,9 @@ def __init__(self, sparkContext, jsparkSession=None):
 .sparkContext().isStopped():
 jsparkSession = self._jvm.SparkSession.getDefaultSession().get()
 else:
-jsparkSession = self._jvm.SparkSession(self._jsc.sc())
+jsparkSession = self._jvm.SparkSession.builder() \
+.sparkContext(self._jsc.sc()) \
+.getOrCreate()
--- End diff --

Yeah, let me add in the test, and then I'll clear out all the Python 
duplication of Scala code. I can make it more of a wrapper and less of a 
reimplementation.


---




[GitHub] spark pull request #21988: [SPARK-25003][PYSPARK][BRANCH-2.2] Use SessionExt...

2018-08-07 Thread RussellSpitzer
Github user RussellSpitzer closed the pull request at:

https://github.com/apache/spark/pull/21988


---




[GitHub] spark pull request #21989: [SPARK-25003][PYSPARK][BRANCH-2.3] Use SessionExt...

2018-08-07 Thread RussellSpitzer
Github user RussellSpitzer closed the pull request at:

https://github.com/apache/spark/pull/21989


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21988: [SPARK-25003][PYSPARK][BRANCH-2.2] Use SessionExtensions...

2018-08-06 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/21988
  
@felixcheung I just didn't know what version to target so I made a PR for 
each one. We can just close the ones that shouldn't be merged.


---




[GitHub] spark issue #21989: [SPARK-25003][PYSPARK][BRANCH-2.3] Use SessionExtensions...

2018-08-06 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/21989
  
@kiszk Sure, it all depends on which branch the merge target should be; I 
wasn't sure which one was being used for changes of this nature. Technically 
it's a bug fix, I believe.


---




[GitHub] spark issue #21988: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark

2018-08-03 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/21988
  
The local PEP8 check didn't seem to mind this code ... Fixed up the indentation, 
so hopefully Jenkins will like it now.


---




[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...

2018-08-03 Thread RussellSpitzer
GitHub user RussellSpitzer opened a pull request:

https://github.com/apache/spark/pull/21990

[SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark

Master

## What changes were proposed in this pull request?

Previously PySpark used the private constructor for SparkSession when
building that object. This resulted in a SparkSession created without checking
the spark.sql.extensions parameter for additional session extensions. To fix
this we instead use the SparkSession.builder() path as SparkR does; this
loads the extensions and allows their use in PySpark.

## How was this patch tested?

This was manually tested by passing a class to spark.sql.extensions and 
making sure its included strategies appeared in the 
spark._jsparkSession.sessionState.planner.strategies list. We could add an 
automatic test, but I'm not very familiar with the PySpark testing framework. 
I would be glad to implement that if requested.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/RussellSpitzer/spark SPARK-25003-master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21990.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21990


commit f790ae000ae1c3f4030162028186448d345e2984
Author: Russell Spitzer 
Date:   2018-08-03T16:04:00Z

[SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark

Previously Pyspark used the private constructor for SparkSession when
building that object. This resulted in a SparkSession without checking
the sql.extensions parameter for additional session extensions. To fix
this we instead use the Session.builder() path as SparkR uses, this
loads the extensions and allows their use in PySpark.




---




[GitHub] spark pull request #21989: [SPARK-25003][PYSPARK] Use SessionExtensions in P...

2018-08-03 Thread RussellSpitzer
GitHub user RussellSpitzer opened a pull request:

https://github.com/apache/spark/pull/21989

[SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark

(Branch-2.3)

## What changes were proposed in this pull request?

Previously PySpark used the private constructor for SparkSession when
building that object. This resulted in a SparkSession created without checking
the spark.sql.extensions parameter for additional session extensions. To fix
this we instead use the SparkSession.builder() path as SparkR does; this
loads the extensions and allows their use in PySpark.

## How was this patch tested?

This was manually tested by passing a class to spark.sql.extensions and 
making sure its included strategies appeared in the 
spark._jsparkSession.sessionState.planner.strategies list. We could add an 
automatic test, but I'm not very familiar with the PySpark testing framework. 
I would be glad to implement that if requested.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/RussellSpitzer/spark SPARK-25003-branch-2.3

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21989.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21989






---




[GitHub] spark pull request #21988: [SPARK-25003][PYSPARK] Use SessionExtensions in P...

2018-08-03 Thread RussellSpitzer
GitHub user RussellSpitzer opened a pull request:

https://github.com/apache/spark/pull/21988

[SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark

## What changes were proposed in this pull request?

Previously PySpark used the private constructor for SparkSession when
building that object. This resulted in a SparkSession created without checking
the spark.sql.extensions parameter for additional session extensions. To fix
this we instead use the SparkSession.builder() path as SparkR does; this
loads the extensions and allows their use in PySpark.

## How was this patch tested?

This was manually tested by passing a class to spark.sql.extensions and 
making sure its included strategies appeared in the 
spark._jsparkSession.sessionState.planner.strategies list. We could add an 
automatic test, but I'm not very familiar with the PySpark testing framework. 
I would be glad to implement that if requested.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/RussellSpitzer/spark SPARK-25003-branch-2.2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21988.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21988


commit 9c6e0671bf7311e2647e774c8d247c43037fc12c
Author: Russell Spitzer 
Date:   2018-08-03T15:48:15Z

[SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark

Previously Pyspark used the private constructor for SparkSession when
building that object. This resulted in a SparkSession without checking
the sql.extensions parameter for additional session extensions. To fix
this we instead use the Session.builder() path as SparkR uses, this
loads the extensions and allows their use in PySpark.




---




[GitHub] spark pull request #21453: Test branch to see how Scala 2.11.12 performs

2018-05-29 Thread RussellSpitzer
GitHub user RussellSpitzer opened a pull request:

https://github.com/apache/spark/pull/21453

Test branch to see how Scala 2.11.12 performs

This may be useful when Java 8 is no longer supported since
Scala 2.11.12 supports later versions of Java

## What changes were proposed in this pull request?

Change Scala Build Version to 2.11.12.

## How was this patch tested?

This PR is made to run 2.11.12 Scala through Jenkins to see whether or not 
it passes cleanly.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/RussellSpitzer/spark Scala2_11_12_Test

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21453.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21453


commit 90d3842616aec94d603a68d44463eb043c5a66f9
Author: Russell Spitzer 
Date:   2018-05-29T18:17:23Z

Test branch to see how Scala 2.11.12 performs

This may be useful when Java 8 is no longer supported since
Scala 2.11.12 supports later versions of Java




---




[GitHub] spark pull request #20190: [SPARK-22976][Core]: Cluster mode driver director...

2018-01-17 Thread RussellSpitzer
Github user RussellSpitzer closed the pull request at:

https://github.com/apache/spark/pull/20190


---




[GitHub] spark issue #20190: [SPARK-22976][Core]: Cluster mode driver directories can...

2018-01-17 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/20190
  
@jerryshao https://github.com/apache/spark/pull/20298


---




[GitHub] spark pull request #20298: [SPARK-22976][Core]: Cluster mode driver dir remo...

2018-01-17 Thread RussellSpitzer
GitHub user RussellSpitzer opened a pull request:

https://github.com/apache/spark/pull/20298

[SPARK-22976][Core]: Cluster mode driver dir removed while running

## What changes were proposed in this pull request?

The cleanup logic on the worker previously determined the liveness of a
particular application based on whether or not it had running executors.
This would fail in the case that a directory was made for a driver
running in cluster mode if that driver had no running executors on the
same machine. To preserve driver directories we consider both executors
and running drivers when checking directory liveness.
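
A simplified, hedged sketch of the liveness check being described (illustrative only, not the exact Worker.scala code): a directory survives cleanup if its name matches either a running executor's application id or a running driver's id.

```scala
// Cluster-mode driver directories are kept because running driver ids are now
// considered alongside executor application ids when deciding liveness.
def isAppDirStillLive(
    dirName: String,
    executorAppIds: Set[String],
    runningDriverIds: Set[String]): Boolean = {
  executorAppIds.contains(dirName) || runningDriverIds.contains(dirName)
}
```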

## How was this patch tested?

Manually started up a two-node cluster with a single core on each node. 
Turned on worker directory cleanup and set the interval to 1 second and 
liveness to one second. Without the patch the driver directory is removed 
immediately after the app is launched; with the patch it is not.


### Without Patch
```
INFO  2018-01-05 23:48:24,693 Logging.scala:54 - Asked to launch driver 
driver-20180105234824-
INFO  2018-01-05 23:48:25,293 Logging.scala:54 - Changing view acls to: 
cassandra
INFO  2018-01-05 23:48:25,293 Logging.scala:54 - Changing modify acls to: 
cassandra
INFO  2018-01-05 23:48:25,294 Logging.scala:54 - Changing view acls groups 
to:
INFO  2018-01-05 23:48:25,294 Logging.scala:54 - Changing modify acls 
groups to:
INFO  2018-01-05 23:48:25,294 Logging.scala:54 - SecurityManager: 
authentication disabled; ui acls disabled; users  with view permissions: 
Set(cassandra); groups with view permissions: Set(); users  with modify 
permissions: Set(cassandra); groups with modify permissions: Set()
INFO  2018-01-05 23:48:25,330 Logging.scala:54 - Copying user jar 
file:/home/automaton/writeRead-0.1.jar to 
/var/lib/spark/worker/driver-20180105234824-/writeRead-0.1.jar
INFO  2018-01-05 23:48:25,332 Logging.scala:54 - Copying 
/home/automaton/writeRead-0.1.jar to 
/var/lib/spark/worker/driver-20180105234824-/writeRead-0.1.jar
INFO  2018-01-05 23:48:25,361 Logging.scala:54 - Launch Command: 
"/usr/lib/jvm/jdk1.8.0_40//bin/java" 

INFO  2018-01-05 23:48:56,577 Logging.scala:54 - Removing directory: 
/var/lib/spark/worker/driver-20180105234824-  ### << Cleaned up

-- 
One minute passes while app runs (app has 1 minute sleep built in)
--

WARN  2018-01-05 23:49:58,080 ShuffleSecretManager.java:73 - Attempted to 
unregister application app-20180105234831- when it is not registered
INFO  2018-01-05 23:49:58,081 ExternalShuffleBlockResolver.java:163 - 
Application app-20180105234831- removed, cleanupLocalDirs = false
INFO  2018-01-05 23:49:58,081 ExternalShuffleBlockResolver.java:163 - 
Application app-20180105234831- removed, cleanupLocalDirs = false
INFO  2018-01-05 23:49:58,082 ExternalShuffleBlockResolver.java:163 - 
Application app-20180105234831- removed, cleanupLocalDirs = true
INFO  2018-01-05 23:50:00,999 Logging.scala:54 - Driver 
driver-20180105234824- exited successfully
```

With Patch
```
INFO  2018-01-08 23:19:54,603 Logging.scala:54 - Asked to launch driver 
driver-20180108231954-0002
INFO  2018-01-08 23:19:54,975 Logging.scala:54 - Changing view acls to: 
automaton
INFO  2018-01-08 23:19:54,976 Logging.scala:54 - Changing modify acls to: 
automaton
INFO  2018-01-08 23:19:54,976 Logging.scala:54 - Changing view acls groups 
to:
INFO  2018-01-08 23:19:54,976 Logging.scala:54 - Changing modify acls 
groups to:
INFO  2018-01-08 23:19:54,976 Logging.scala:54 - SecurityManager: 
authentication disabled; ui acls disabled; users  with view permissions: 
Set(automaton); groups with view permissions: Set(); users  with modify 
permissions: Set(automaton); groups with modify permissions: Set()
INFO  2018-01-08 23:19:55,029 Logging.scala:54 - Copying user jar 
file:/home/automaton/writeRead-0.1.jar to 
/var/lib/spark/worker/driver-20180108231954-0002/writeRead-0.1.jar
INFO  2018-01-08 23:19:55,031 Logging.scala:54 - Copying 
/home/automaton/writeRead-0.1.jar to 
/var/lib/spark/worker/driver-20180108231954-0002/writeRead-0.1.jar
INFO  2018-01-08 23:19:55,038 Logging.scala:54 - Launch Command: ..
INFO  2018-01-08 23:21:28,674 ShuffleSecretManager.java:69 - Unregistered 
shuffle secret for application app-20180108232000-
INFO  2018-01-08 23:21:28,675 ExternalShuffleBlockResolver.java:163 - 
Application app-20180108232000- removed, cleanupLocalDirs = false
INFO  2018-01-08 23:21:28,675 ExternalShuffleBlockResolver.java:163 - 
Application app-20180108232000- removed, cleanupLocalDirs = false
INFO  2018-01-08 23:21:28,681 ExternalShuffleBlockResolver.java:163 - 
Application app-20180108232000- removed, cleanupLocalDirs = true
INFO  2018-01-08 23:21:31,703 

[GitHub] spark issue #20201: [SPARK-22389][SQL] data source v2 partitioning reporting...

2018-01-09 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/20201
  
This looks very exciting to me


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20190: [SPARK-22976][Core]: Cluster mode driver directories can...

2018-01-09 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/20190
  
@zsxwing I think you were the last to touch this code, could you please 
review?


---




[GitHub] spark pull request #20190: [SPARK-22976][Core]: Cluster mode driver director...

2018-01-08 Thread RussellSpitzer
GitHub user RussellSpitzer opened a pull request:

https://github.com/apache/spark/pull/20190

[SPARK-22976][Core]: Cluster mode driver directories can be removed w…

…hile running



## What changes were proposed in this pull request?

The cleanup logic on the worker previously determined the liveness of a
particular application based on whether or not it had running executors.
This would fail in the case that a directory was made for a driver
running in cluster mode if that driver had no running executors on the
same machine. To preserve driver directories we consider both executors
and running drivers when checking directory liveness.

## How was this patch tested?

Manually started up a two-node cluster with a single core on each node. 
Turned on worker directory cleanup and set the interval to 1 second and 
liveness to one second. Without the patch the driver directory is removed 
immediately after the app is launched; with the patch it is not.


### Without Patch
```
INFO  2018-01-05 23:48:24,693 Logging.scala:54 - Asked to launch driver 
driver-20180105234824-
INFO  2018-01-05 23:48:25,293 Logging.scala:54 - Changing view acls to: 
cassandra
INFO  2018-01-05 23:48:25,293 Logging.scala:54 - Changing modify acls to: 
cassandra
INFO  2018-01-05 23:48:25,294 Logging.scala:54 - Changing view acls groups 
to:
INFO  2018-01-05 23:48:25,294 Logging.scala:54 - Changing modify acls 
groups to:
INFO  2018-01-05 23:48:25,294 Logging.scala:54 - SecurityManager: 
authentication disabled; ui acls disabled; users  with view permissions: 
Set(cassandra); groups with view permissions: Set(); users  with modify 
permissions: Set(cassandra); groups with modify permissions: Set()
INFO  2018-01-05 23:48:25,330 Logging.scala:54 - Copying user jar 
file:/home/automaton/writeRead-0.1.jar to 
/var/lib/spark/worker/driver-20180105234824-/writeRead-0.1.jar
INFO  2018-01-05 23:48:25,332 Logging.scala:54 - Copying 
/home/automaton/writeRead-0.1.jar to 
/var/lib/spark/worker/driver-20180105234824-/writeRead-0.1.jar
INFO  2018-01-05 23:48:25,361 Logging.scala:54 - Launch Command: 
"/usr/lib/jvm/jdk1.8.0_40//bin/java" 

INFO  2018-01-05 23:48:56,577 Logging.scala:54 - Removing directory: 
/var/lib/spark/worker/driver-20180105234824-  ### << Cleaned up

-- 
One minute passes while app runs (app has 1 minute sleep built in)
--

WARN  2018-01-05 23:49:58,080 ShuffleSecretManager.java:73 - Attempted to 
unregister application app-20180105234831- when it is not registered
INFO  2018-01-05 23:49:58,081 ExternalShuffleBlockResolver.java:163 - 
Application app-20180105234831- removed, cleanupLocalDirs = false
INFO  2018-01-05 23:49:58,081 ExternalShuffleBlockResolver.java:163 - 
Application app-20180105234831- removed, cleanupLocalDirs = false
INFO  2018-01-05 23:49:58,082 ExternalShuffleBlockResolver.java:163 - 
Application app-20180105234831- removed, cleanupLocalDirs = true
INFO  2018-01-05 23:50:00,999 Logging.scala:54 - Driver 
driver-20180105234824- exited successfully
```

With Patch
```
INFO  2018-01-08 23:19:54,603 Logging.scala:54 - Asked to launch driver 
driver-20180108231954-0002
INFO  2018-01-08 23:19:54,975 Logging.scala:54 - Changing view acls to: 
automaton
INFO  2018-01-08 23:19:54,976 Logging.scala:54 - Changing modify acls to: 
automaton
INFO  2018-01-08 23:19:54,976 Logging.scala:54 - Changing view acls groups 
to:
INFO  2018-01-08 23:19:54,976 Logging.scala:54 - Changing modify acls 
groups to:
INFO  2018-01-08 23:19:54,976 Logging.scala:54 - SecurityManager: 
authentication disabled; ui acls disabled; users  with view permissions: 
Set(automaton); groups with view permissions: Set(); users  with modify 
permissions: Set(automaton); groups with modify permissions: Set()
INFO  2018-01-08 23:19:55,029 Logging.scala:54 - Copying user jar 
file:/home/automaton/writeRead-0.1.jar to 
/var/lib/spark/worker/driver-20180108231954-0002/writeRead-0.1.jar
INFO  2018-01-08 23:19:55,031 Logging.scala:54 - Copying 
/home/automaton/writeRead-0.1.jar to 
/var/lib/spark/worker/driver-20180108231954-0002/writeRead-0.1.jar
INFO  2018-01-08 23:19:55,038 Logging.scala:54 - Launch Command: ..
INFO  2018-01-08 23:21:28,674 ShuffleSecretManager.java:69 - Unregistered 
shuffle secret for application app-20180108232000-
INFO  2018-01-08 23:21:28,675 ExternalShuffleBlockResolver.java:163 - 
Application app-20180108232000- removed, cleanupLocalDirs = false
INFO  2018-01-08 23:21:28,675 ExternalShuffleBlockResolver.java:163 - 
Application app-20180108232000- removed, cleanupLocalDirs = false
INFO  2018-01-08 23:21:28,681 ExternalShuffleBlockResolver.java:163 - 
Application app-20180108232000- removed, cleanupLocalDirs = true
 

[GitHub] spark pull request #19136: [DO NOT MERGE][SPARK-15689][SQL] data source v2

2017-09-07 Thread RussellSpitzer
Github user RussellSpitzer commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r137531270
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import java.io.Serializable;
+
+/**
+ * A read task returned by a data source reader and is responsible to create the data reader.
+ * The relationship between `ReadTask` and `DataReader` is similar to `Iterable` and `Iterator`.
+ *
+ * Note that, the read task will be serialized and sent to executors, then the data reader will be
+ * created on executors and do the actual reading.
+ */
+public interface ReadTask extends Serializable {
+  /**
+   * The preferred locations for this read task to run faster, but Spark can't guarantee that this
+   * task will always run on these locations. Implementations should make sure that it can
+   * be run on any location.
+   */
+  default String[] preferredLocations() {
--- End diff --

These have previously only been IPs/hostnames. To match the RDD definition, I 
think we would have to continue with that.


---




[GitHub] spark pull request #19136: [DO NOT MERGE][SPARK-15689][SQL] data source v2

2017-09-06 Thread RussellSpitzer
Github user RussellSpitzer commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r137333974
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java
 ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.stream.Collectors;
+
+import org.apache.spark.annotation.Experimental;
+import org.apache.spark.annotation.InterfaceStability;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder;
+import org.apache.spark.sql.catalyst.encoders.RowEncoder;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * The main interface and minimal requirement for a data source reader. The implementations should
+ * at least implement the full scan logic, users can mix in more interfaces to implement scan
+ * optimizations like column pruning, filter push down, etc.
+ *
+ * There are mainly 2 kinds of scan optimizations:
+ *   1. push operators downward to the data source, e.g., column pruning, filter push down, etc.
+ *   2. propagate information upward to Spark, e.g., report statistics, report ordering, etc.
+ * Spark first applies all operator push down optimizations which this data source supports. Then
+ * Spark collects information this data source provides for further optimizations. Finally Spark
+ * issues the scan request and does the actual data reading.
--- End diff --

This would be really nice imho.


---




[GitHub] spark pull request #10655: [SPARK-12639][SQL] Improve Explain for Datasource...

2016-09-15 Thread RussellSpitzer
Github user RussellSpitzer closed the pull request at:

https://github.com/apache/spark/pull/10655


---



[GitHub] spark issue #10655: [SPARK-12639][SQL] Improve Explain for Datasources with ...

2016-09-15 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/10655
  
We fixed this on a different pr https://github.com/apache/spark/pull/11317


---



[GitHub] spark pull request #11796: [SPARK-13579][build][test-maven] Stop building th...

2016-08-05 Thread RussellSpitzer
Github user RussellSpitzer commented on a diff in the pull request:

https://github.com/apache/spark/pull/11796#discussion_r73781398
  
--- Diff: assembly/pom.xml ---
@@ -69,6 +68,17 @@
   spark-repl_${scala.binary.version}
   ${project.version}
 
+
+
+
--- End diff --

This is a problem for the Spark Cassandra Connector. The Cassandra Java 
Driver requires Guava 16.0 or greater. This necessarily means we need to shade 
now. This was on our roadmap anyway; just wanted you to be aware.


---



[GitHub] spark issue #11317: [SPARK-12639] [SQL] Mark Filters Fully Handled By Source...

2016-07-08 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/11317
  
Updated


---



[GitHub] spark issue #13652: [SPARK-15613] [SQL] Fix incorrect days to millis convers...

2016-06-15 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/13652
  
I would love to be able to just specify days since epoch rather than using 
java.sql.Date


---



[GitHub] spark issue #13652: [SPARK-] Fix incorrect days to millis conversion

2016-06-14 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/13652
  
I think this is unfortunately the right thing to do. I wish we didn't have 
to use java.sql.Date :(


---



[GitHub] spark pull request: [SPARK-12639] [SQL] Mark Filters Fully Handled...

2016-05-09 Thread RussellSpitzer
Github user RussellSpitzer commented on the pull request:

https://github.com/apache/spark/pull/11317#issuecomment-218052162
  
I don't think this is because of me

```
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (sharedRuntime.cpp:834), pid=27637, tid=14737998592
#  fatal error: exception happened outside interpreter, nmethods and vtable 
stubs at pc 0x7f54dd188231
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/jenkins/workspace/SparkPullRequestBuilder@2/hs_err_pid27637.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
Traceback (most recent call last):
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/SocketServer.py", line 295, in 
_handle_request_noblock
self.process_request(request, client_address)
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/SocketServer.py", line 321, in 
process_request
self.finish_request(request, client_address)
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/SocketServer.py", line 334, in 
finish_request
self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/SocketServer.py", line 655, in 
__init__
self.handle()
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/accumulators.py",
 line 235, in handle
num_updates = read_int(self.rfile)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/serializers.py",
 line 545, in read_int
raise EOFError
EOFError
ERROR:py4j.java_gateway:An error occurred while trying to connect to the 
Java server (127.0.0.1:40970)
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.9.2-src.zip/py4j/java_gateway.py",
 line 713, in start
self.socket.connect((self.address, self.port))
  File "", line 1, in connect
error: [Errno 111] Connection refused
```


---



[GitHub] spark pull request: [SPARK-12639] [SQL] Mark Filters Fully Handled...

2016-05-09 Thread RussellSpitzer
Github user RussellSpitzer commented on the pull request:

https://github.com/apache/spark/pull/11317#issuecomment-218030539
  
@HyukjinKwon + @yhuai  Sorry it took so long! Things have been busy :)


---



[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...

2016-05-06 Thread RussellSpitzer
Github user RussellSpitzer commented on the pull request:

https://github.com/apache/spark/pull/10655#issuecomment-217483240
  
Sorry I forgot about this, I'll clean this up tomorrow and get it ready


---



[GitHub] spark pull request: [SPARK-12639] [SQL] Mark Filters Fully Handled...

2016-02-22 Thread RussellSpitzer
GitHub user RussellSpitzer opened a pull request:

https://github.com/apache/spark/pull/11317

[SPARK-12639] [SQL] Mark Filters Fully Handled By Sources with *

## What changes were proposed in this pull request?

In order to make it clear which filters are fully handled by the
underlying datasource we will mark them with a *. This will give a
clear visual cue to users that the filter is being treated differently
by catalyst than filters which are just presented to the underlying
DataSource.
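
A small sketch of the marking idea (illustrative only, not the patch itself): filters reported as fully handled by the source get a leading "*" when the pushed filters are rendered in the plan.

```scala
// Prefix fully handled filters with "*" so they stand out in the explain output.
def markHandled(pushedFilters: Seq[String], handledFilters: Set[String]): Seq[String] =
  pushedFilters.map(f => if (handledFilters.contains(f)) s"*$f" else f)

// markHandled(Seq("EqualTo(k,1)", "GreaterThan(v,5)"), Set("EqualTo(k,1)"))
// => Seq("*EqualTo(k,1)", "GreaterThan(v,5)")
```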

## How was this patch tested?

Manually tested with the Spark Cassandra Connector, a source which fully 
handles underlying filters. Now fully handled filters appear with an * next to 
their names. I can add an automated test as well if requested.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/RussellSpitzer/spark SPARK-12639-Star

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11317.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11317


commit c33daaee15fea96129ae91a899ec70429e4f66ec
Author: Russell Spitzer <russell.spit...@gmail.com>
Date:   2016-01-26T22:39:38Z

SPARK-12639 SQL Mark Filters Fully Handled By Sources with *

In order to make it clear which filters are fully handled by the
underlying datasource we will mark them with a *. This will give a
clear visual cue to users that the filter is being treated differently
by catalyst than filters which are just presented to the underlying
DataSource.







[GitHub] spark pull request: SPARK-12639 SQL Mark Filters Fully Handled By ...

2016-02-22 Thread RussellSpitzer
Github user RussellSpitzer closed the pull request at:

https://github.com/apache/spark/pull/10929





[GitHub] spark pull request: SPARK-12639 SQL Mark Filters Fully Handled By ...

2016-01-26 Thread RussellSpitzer
GitHub user RussellSpitzer opened a pull request:

https://github.com/apache/spark/pull/10929

SPARK-12639 SQL Mark Filters Fully Handled By Sources with *

In order to make it clear which filters are fully handled by the
underlying datasource, we will mark them with a *. This will give a
clear visual cue to users that the filter is being treated differently
by catalyst than filters which are just presented to the underlying
DataSource.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/RussellSpitzer/spark SPARK-12639-Star

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10929.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10929


commit 40a75a4bd0075a42e4d28c849987c37d017c44c4
Author: Russell Spitzer <russell.spit...@gmail.com>
Date:   2016-01-26T22:39:38Z

SPARK-12639 SQL Mark Filters Fully Handled By Sources with *

In order to make it clear which filters are fully handled by the
underlying datasource, we will mark them with a *. This will give a
clear visual cue to users that the filter is being treated differently
by catalyst than filters which are just presented to the underlying
DataSource.







[GitHub] spark pull request: [SPARK-13021][CORE] Fail fast when custom RDDs...

2016-01-26 Thread RussellSpitzer
Github user RussellSpitzer commented on the pull request:

https://github.com/apache/spark/pull/10932#issuecomment-175294425
  
I'm +1 on this in 2.0 :)





[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...

2016-01-22 Thread RussellSpitzer
Github user RussellSpitzer commented on the pull request:

https://github.com/apache/spark/pull/10655#issuecomment-174113049
  
Haven't forgotten, this will have a new PR soon :)





[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...

2016-01-15 Thread RussellSpitzer
Github user RussellSpitzer commented on the pull request:

https://github.com/apache/spark/pull/10655#issuecomment-172145152
  
I personally think the ambiguous `PUSHED_FILTERS` is more confusing. When 
we see a predicate there we have no idea whether or not it is a valid filter 
for the source at all. Like in the C* case, this could contain clauses which 
have no way of being actually pushed down to the source. 

In my mind there are 3 categories of predicates:
* Those which cannot be pushed to the source at all
* Those which can be pushed to the source but may have false positives 
* Those which can be pushed to the source and filter completely 

Currently we can only tell whether or not a predicate is in one of the first 
two categories or if it is in the third. This leaves us awkwardly stating that 
the source has had a predicate `Pushed` to it even when that is impossible. I 
like just stating the third category because that's the only thing we truly can 
be sure of given the current code. It would be better if the underlying source 
was able to qualify all filters into the above categories; a rough sketch of 
what that could look like from the source side is included below.

So to me it is more confusing to say something is `Pushed` when it 
technically can't be than to say something is not `Pushed` when it might be. 
But ymmv.

If you want to just go with asterisks that's fine with me too, just wanted to 
make my argument :D
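
To make the third category concrete, here is a minimal sketch of a source that
declares one filter as fully handled through `BaseRelation.unhandledFilters`; the
relation name, schema, and data are invented for this example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Toy relation: equality on "id" is evaluated exactly at the source (category 3);
// anything else falls into categories 1 or 2 and must be re-checked by Spark.
class KeyValueRelation(override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(StructField("id", LongType) :: StructField("value", StringType) :: Nil)

  // Filters returned here still need to be evaluated by Spark; omitting a filter
  // is the source's promise that it handles that filter completely.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case EqualTo("id", _) => true
      case _                => false
    }

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // Every translated predicate is offered here; this toy source only acts on EqualTo("id", _).
    val wantedIds = filters.collect { case EqualTo("id", id: Long) => id }.toSet
    val rows = (1L to 10L)
      .map(i => Map[String, Any]("id" -> i, "value" -> s"row-$i"))
      .filter(r => wantedIds.isEmpty || wantedIds(r("id").asInstanceOf[Long]))
      .map(r => Row.fromSeq(requiredColumns.toSeq.map(r)))
    sqlContext.sparkContext.parallelize(rows)
  }
}
```

With something like this the planner knows which of the pushed predicates it no
longer has to re-apply, which is the piece the current explain output does not
surface.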





[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...

2016-01-15 Thread RussellSpitzer
Github user RussellSpitzer commented on the pull request:

https://github.com/apache/spark/pull/10655#issuecomment-172136871
  
@yhuai I removed the PushedFilters and added the other examples. We could 
re-add the "PushedFilters" if you like. I wasn't sure if you still wanted 
that. I'm still not sure if it's very valuable info since everything is 
`Pushed`, if I understood correctly.





[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...

2016-01-11 Thread RussellSpitzer
Github user RussellSpitzer commented on a diff in the pull request:

https://github.com/apache/spark/pull/10655#discussion_r49342098
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala ---
@@ -114,6 +114,7 @@ private[sql] object PhysicalRDD {
   // Metadata keys
   val INPUT_PATHS = "InputPaths"
   val PUSHED_FILTERS = "PushedFilters"
+  val HANDLED_FILTERS = "HandledFilters"
--- End diff --

sgtm





[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...

2016-01-08 Thread RussellSpitzer
Github user RussellSpitzer commented on the pull request:

https://github.com/apache/spark/pull/10655#issuecomment-170064440
  
@rxin Added. Basically I think the current "PushedFilters" list isn't very 
valuable if everything is listed there. So instead we should just list those 
filters which the source can actually do something with. If there is a 
possibility that a source might do something (bloom filter-y) we should have a 
third category, but currently there is no way of knowing (I think).





[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...

2016-01-07 Thread RussellSpitzer
Github user RussellSpitzer commented on a diff in the pull request:

https://github.com/apache/spark/pull/10655#discussion_r49153042
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala ---
@@ -114,6 +114,7 @@ private[sql] object PhysicalRDD {
   // Metadata keys
   val INPUT_PATHS = "InputPaths"
   val PUSHED_FILTERS = "PushedFilters"
+  val HANDLED_FILTERS = "HandledFilters"
--- End diff --

How about `FilteredAtSource`?





[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...

2016-01-07 Thread RussellSpitzer
Github user RussellSpitzer commented on a diff in the pull request:

https://github.com/apache/spark/pull/10655#discussion_r49152876
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
@@ -321,8 +321,8 @@ private[sql] object DataSourceStrategy extends Strategy with Logging {
 val metadata: Map[String, String] = {
   val pairs = ArrayBuffer.empty[(String, String)]
 
-  if (pushedFilters.nonEmpty) {
-pairs += (PUSHED_FILTERS -> pushedFilters.mkString("[", ", ", "]"))
+  if (handledPredicates.nonEmpty) {
+pairs += (HANDLED_FILTERS -> handledPredicates.mkString("[", ", ", "]"))
--- End diff --

I thought 11663 meant all filters are pushed down regardless, so I wondered 
if that was redundant?
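
As a side note, here is a minimal sketch of how the handled subset relates to the
filters that are pushed regardless; the helper name is illustrative and this is not
the actual DataSourceStrategy code.

```scala
import org.apache.spark.sql.sources.{BaseRelation, Filter}

// Everything in `pushed` is offered to the source; `unhandledFilters` reports the
// ones Spark must still re-evaluate, so the difference is the fully handled subset.
def handledSubset(relation: BaseRelation, pushed: Seq[Filter]): Seq[Filter] = {
  val stillNeeded = relation.unhandledFilters(pushed.toArray).toSet
  pushed.filterNot(stillNeeded)
}

// The scan metadata could then report only what the source vouches for, e.g.
//   HANDLED_FILTERS -> handledSubset(relation, pushed).mkString("[", ", ", "]")
```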





[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...

2016-01-07 Thread RussellSpitzer
GitHub user RussellSpitzer opened a pull request:

https://github.com/apache/spark/pull/10655

SPARK-12639 SQL Improve Explain for Datasources with Handled Predicates

SPARK-11661 makes all predicates get pushed down to underlying Datasources
regardless of whether the source can handle them or not. This makes the
explain command slightly confusing as it will always list all filters
whether or not the underlying source can actually use them. Instead
now we should only list those filters which are expressly handled by the
underlying source.

All predicates are pushed down so there really isn't any value in listing 
them.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/RussellSpitzer/spark SPARK-12639

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10655.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10655


commit 42cc0e76148c39d84a5635058698262862a60e6f
Author: Russell Spitzer <russell.spit...@gmail.com>
Date:   2016-01-08T01:33:59Z

SPARK-12639: Improve Explain for Datasources with Handled Predicates

SPARK-11661 makes all predicates get pushed down to underlying Datasources
regardless of whether the source can handle them or not. This makes the
explain command slightly confusing as it will always list all filters
whether or not the underlying source can actually use them. Instead
now we should only list those filters which are expressly handled by the
underlying source.







[GitHub] spark pull request: SPARK-11415: Remove timezone shift of Catalyst...

2015-11-18 Thread RussellSpitzer
Github user RussellSpitzer commented on the pull request:

https://github.com/apache/spark/pull/9369#issuecomment-157886376
  
np





[GitHub] spark pull request: SPARK-11415: Remove timezone shift of Catalyst...

2015-11-18 Thread RussellSpitzer
Github user RussellSpitzer closed the pull request at:

https://github.com/apache/spark/pull/9369

