[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

2017-02-13 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16537
  
I don't think it's a good idea to dismiss this as having little use just because 
passing something that isn't callable is a dumb mistake. In this case, it's 
easy to accidentally reuse a name for both a function and a variable (e.g., 
`format`), especially as scripts change over time and pass from one maintainer 
to another.

Spark should behave reasonably for any error, rather than being harder to work 
with because we assumed the user wasn't likely to hit a particular problem. 
This is very few lines of code, and it makes the user's experience much better 
because it can report exactly what the problem is without running a job.
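
For illustration, here is the kind of accidental reuse I mean (the DataFrame and column names are made up, not from the PR):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2017-02-13",)], ["ts"])

format = "yyyy-MM-dd"            # a later edit shadows an earlier format() helper
fmt = udf(format, StringType())  # without a callable check, this is accepted silently
df.select(fmt("ts")).show()      # the mistake only surfaces once a job actually runs
```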





[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

2017-02-13 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16537
  
Sorry, my example was about validating that the object passed to `udf` is 
callable, not about how the UDF is invoked. I still think it's a good idea not to 
make assumptions about how a user makes a mistake. Error checking should be 
proportional to how difficult the failure is to diagnose, not to how likely it is 
to happen.





[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

2017-02-13 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16537
  
Yeah, I thought this was the other PR, the one that validates the function is 
callable. Still, I don't agree that it's okay for PySpark to be less friendly just 
because we don't think people will hit the problem often, or because they work 
it out themselves before asking the mailing list. There are reasonable ways to hit 
this, and Spark should give a good explanation of what went wrong.

I'm not saying we have to go fix every case like this, but where 
there's a PR ready to go I think it's better to include it than not.





[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

2017-02-13 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16537
  
Maybe we're at an agree-to-disagree point, but I think we may be 
talking about different things. If you're saying we should keep these changes 
together to make reviews easier, I agree. I was under the impression 
that this change might be rejected because the problem isn't important enough, 
which I don't think is the right way to look at it.





[GitHub] spark pull request #16519: SPARK-19138: Don't return SparkSession for stoppe...

2017-01-09 Thread rdblue
GitHub user rdblue opened a pull request:

https://github.com/apache/spark/pull/16519

SPARK-19138: Don't return SparkSession for stopped SparkContext.

## What changes were proposed in this pull request?

* Update SparkSession to always return a session using the current 
SparkContext
* Add SparkContext#isStopped

## How was this patch tested?

Tested that sqlContext.sql works when created after closing a Spark context:

```python
sc.stop()
sc = SparkContext(conf=SparkConf())
sqlContext = HiveContext(sc)
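sqlContext.sql("SELECT 1").collect()  # illustrative check: succeeds once the stale session is no longer reused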
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rdblue/spark SPARK-19138-pyspark-stale-spark-context

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16519.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16519


commit 6d56df1e0d8c95e04eea41cd9776c72b34f9998b
Author: Ryan Blue 
Date:   2017-01-09T21:28:32Z

SPARK-19138: Don't return SparkSession for stopped SparkContext.







[GitHub] spark issue #16519: SPARK-19138: Don't return SparkSession for stopped Spark...

2017-01-09 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16519
  
Yeah, it looks like this is basically the same problem. I'll add some 
review comments to the other issue.





[GitHub] spark issue #16454: [SPARK-19055][SQL][PySpark] Fix SparkSession initializat...

2017-01-09 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16454
  
I just posted a fix for this as well. I'll close that one in favor of this PR and 
add comments here about what it did differently that we should consider.





[GitHub] spark pull request #16454: [SPARK-19055][SQL][PySpark] Fix SparkSession init...

2017-01-09 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16454#discussion_r95258453
  
--- Diff: python/pyspark/sql/session.py ---
@@ -161,8 +161,8 @@ def getOrCreate(self):
 with self._lock:
 from pyspark.context import SparkContext
 from pyspark.conf import SparkConf
-session = SparkSession._instantiatedContext
-if session is None:
+session = SparkSession._instantiatedSession
+if session is None or session._sc._jsc is None:
--- End diff --

#16519 adds `isStopped`, which I think is a better way of checking whether 
the Spark context is valid than checking whether `_jsc` is defined, since 
`_jsc` can be set but already stopped.
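
Roughly what that check could look like in `getOrCreate`, assuming `isStopped` is also exposed to PySpark (a sketch, not the merged code):

```python
session = SparkSession._instantiatedSession
if session is None or session._sc.isStopped():
    sc = SparkContext.getOrCreate(sparkConf)
    session = SparkSession(sc)
```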





[GitHub] spark pull request #16454: [SPARK-19055][SQL][PySpark] Fix SparkSession init...

2017-01-09 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16454#discussion_r95259073
  
--- Diff: python/pyspark/sql/session.py ---
@@ -214,8 +214,12 @@ def __init__(self, sparkContext, jsparkSession=None):
 self._wrapped = SQLContext(self._sc, self, self._jwrapped)
 _monkey_patch_RDD(self)
 install_exception_handler()
-if SparkSession._instantiatedContext is None:
-SparkSession._instantiatedContext = self
+# If we had an instantiated SparkSession attached with a SparkContext
+# which is stopped now, we need to renew the instantiated SparkSession.
+# Otherwise, we will use invalid SparkSession when we call Builder.getOrCreate.
+if SparkSession._instantiatedSession is None \
+or SparkSession._instantiatedSession._sc._jsc is None:
--- End diff --

What about setting `SparkSession._instantiatedSession` to `None` in 
`getOrCreate` instead of duplicating the logic?
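
A minimal sketch of that alternative, reusing the names from the diff above (illustrative only):

```python
# inside Builder.getOrCreate, before deciding whether to reuse the cached session
session = SparkSession._instantiatedSession
if session is not None and session._sc._jsc is None:
    # the cached session is attached to a stopped context; drop it so
    # __init__ doesn't have to repeat the staleness check
    SparkSession._instantiatedSession = None
    session = None
```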





[GitHub] spark pull request #16519: SPARK-19138: Don't return SparkSession for stoppe...

2017-01-09 Thread rdblue
Github user rdblue closed the pull request at:

https://github.com/apache/spark/pull/16519





[GitHub] spark issue #16519: SPARK-19138: Don't return SparkSession for stopped Spark...

2017-01-09 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16519
  
Closing in favor of #16454.





[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...

2017-01-10 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16537#discussion_r95484684
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -429,6 +429,11 @@ def test_udf_with_input_file_name(self):
 row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
 self.assertTrue(row[0].find("people1.json") != -1)
 
+def test_udf_should_validate_input_args(self):
+from pyspark.sql.functions import udf
+
+self.assertRaises(TypeError, udf(lambda x: x), None)
--- End diff --

I think this should have positive tests for a column and a string as well 
as a negative test.
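
For example, something along these lines (the test body is illustrative, not the PR's final version):

```python
def test_udf_should_validate_input_args(self):
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import IntegerType

    plus_one = udf(lambda x: x + 1, IntegerType())
    df = self.spark.range(3)
    # positive cases: a Column and a column name are both accepted
    self.assertEqual([1, 2, 3], [r[0] for r in df.select(plus_one(col("id"))).collect()])
    self.assertEqual([1, 2, 3], [r[0] for r in df.select(plus_one("id")).collect()])
    # negative case: anything else should fail fast, before a job runs
    self.assertRaises(TypeError, plus_one, None)
```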





[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...

2017-01-10 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16537#discussion_r95484969
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1848,6 +1848,12 @@ def __del__(self):
 self._broadcast = None
 
 def __call__(self, *cols):
+for c in cols:
+if not isinstance(c, (Column, str)):
+raise TypeError(
+"All arguments should be Columns or strings 
representing column names. "
--- End diff --

"All arguments" is a little vague, since this is going to be called later 
in code than the UDF definition. What about an error message like "Invalid UDF 
argument, not a String or Column: {0} of type {1}"
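
In code, that would look roughly like this (a sketch, not the PR's exact change):

```python
raise TypeError(
    "Invalid UDF argument, not a string or Column: "
    "{0} of type {1}".format(c, type(c)))
```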





[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...

2017-01-10 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16537#discussion_r95486774
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1848,6 +1848,12 @@ def __del__(self):
 self._broadcast = None
 
 def __call__(self, *cols):
+for c in cols:
+if not isinstance(c, (Column, str)):
+raise TypeError(
+"All arguments should be Columns or strings 
representing column names. "
--- End diff --

A literal ends up being a `Column`, aren't the others as well?





[GitHub] spark pull request #16535: [SPARK-19162][PYTHON][SQL][WIP] UserDefinedFuncti...

2017-01-10 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16535#discussion_r95486972
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -429,6 +429,11 @@ def test_udf_with_input_file_name(self):
 row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
 self.assertTrue(row[0].find("people1.json") != -1)
 
+def test_udf_shouldnt_accept_noncallable_object(self):
+from pyspark.sql.functions import udf
+
+self.assertRaises(TypeError, udf)
--- End diff --

I think this should have positive tests for a function and a class that 
defines `__call__`.
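
For example (an illustrative sketch, not the PR's test):

```python
def test_udf_accepts_callable_objects(self):
    from pyspark.sql.functions import udf

    class PlusOne(object):
        def __call__(self, x):
            return x + 1

    # both should be accepted without raising
    udf(lambda x: x + 1)
    udf(PlusOne())
```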





[GitHub] spark pull request #16535: [SPARK-19162][PYTHON][SQL][WIP] UserDefinedFuncti...

2017-01-10 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16535#discussion_r95487983
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1824,6 +1824,12 @@ class UserDefinedFunction(object):
 .. versionadded:: 1.3
 """
 def __init__(self, func, returnType, name=None):
+if not callable(func):
+raise TypeError(
+"func should be a callable object "
--- End diff --

I think the error message could be clearer: "Not a function or callable 
(__call__ is not defined): {}"

I don't think it's necessary to include the argument name, but it isn't 
a problem if you think that's better.





[GitHub] spark pull request #16535: [SPARK-19162][PYTHON][SQL][WIP] UserDefinedFuncti...

2017-01-10 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16535#discussion_r95488127
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -429,6 +429,11 @@ def test_udf_with_input_file_name(self):
 row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
 self.assertTrue(row[0].find("people1.json") != -1)
 
+def test_udf_shouldnt_accept_noncallable_object(self):
+from pyspark.sql.functions import udf
+
+self.assertRaises(TypeError, udf)
--- End diff --

Why doesn't this test call udf?





[GitHub] spark pull request #16533: [SPARK-19160][PYTHON][SQL][WIP] Add udf decorator

2017-01-10 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16533#discussion_r95490867
  
--- Diff: python/pyspark/sql/decorators.py ---
@@ -0,0 +1,25 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.sql.types import StringType
+from pyspark.sql.functions import udf as _udf
+
+
+def udf(returnType=StringType()):
--- End diff --

I don't think this should conflict with `pyspark.sql.functions.udf`. As 
defined, it would break existing code that uses `functions.udf`. It may 
be possible to write a decorator definition that keeps that API and also acts as 
a decorator, but otherwise this should be renamed to something like 
`register_udf`.





[GitHub] spark pull request #16533: [SPARK-19160][PYTHON][SQL][WIP] Add udf decorator

2017-01-10 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16533#discussion_r95491058
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -429,6 +429,17 @@ def test_udf_with_input_file_name(self):
 row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
 self.assertTrue(row[0].find("people1.json") != -1)
 
+def test_udf_with_decorator(self):
+from pyspark.sql.decorators import udf
+from pyspark.sql.types import IntegerType
+
+@udf(IntegerType())
--- End diff --

Could you add a test case for the default return type? I think it currently 
requires empty-parens:

```
@udf()
def truncate(s):
return s[:10]
```
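
For example, a test along these lines (illustrative; it assumes the decorator module from this PR):

```python
def test_udf_decorator_with_default_return_type(self):
    from pyspark.sql.decorators import udf

    @udf()
    def truncate(s):
        return s[:10]

    df = self.spark.createDataFrame([("0123456789abcdef",)], ["s"])
    self.assertEqual("0123456789", df.select(truncate("s")).first()[0])
```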





[GitHub] spark pull request #16533: [SPARK-19160][PYTHON][SQL][WIP] Add udf decorator

2017-01-10 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16533#discussion_r95491193
  
--- Diff: python/pyspark/sql/decorators.py ---
@@ -0,0 +1,25 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.sql.types import StringType
+from pyspark.sql.functions import udf as _udf
+
+
+def udf(returnType=StringType()):
--- End diff --

Can you also add docstrings for this?





[GitHub] spark issue #16533: [SPARK-19160][PYTHON][SQL][WIP] Add udf decorator

2017-01-10 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16533
  
I think if the `udf` method takes as its first arg `func=None`, then you 
can match the old API and take care of the empty-paren problem. If a callable 
isn't passed as the first arg, then return a decorator function. If a callable 
is passed in, then register it. You can do something similar with `*args` if 
you don't want to require passing `returnType` as a named arg.
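
A rough sketch of that approach (names and defaults are illustrative; the real `functions.udf` may differ):

```python
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

def udf(func=None, returnType=StringType()):
    def wrap(f):
        return UserDefinedFunction(f, returnType)
    if func is not None:
        # old API and bare decorator: udf(f), udf(f, IntegerType()), @udf
        return wrap(func)
    # decorator with arguments: @udf() or @udf(returnType=IntegerType())
    return wrap
```

Passing `returnType` positionally in the decorator form, as in `@udf(IntegerType())`, would still need the `*args` handling mentioned above.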





[GitHub] spark pull request #16535: [SPARK-19162][PYTHON][SQL][WIP] UserDefinedFuncti...

2017-01-11 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16535#discussion_r95627424
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -429,6 +429,11 @@ def test_udf_with_input_file_name(self):
 row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
 self.assertTrue(row[0].find("people1.json") != -1)
 
+def test_udf_shouldnt_accept_noncallable_object(self):
+from pyspark.sql.functions import udf
+
+self.assertRaises(TypeError, udf)
--- End diff --

Sorry if my wording was confusing. My point was that the assertion called `udf` 
with no arguments at all, which isn't the call you wanted to test. The 
update that passes `non_callable` is what I was looking for.
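
Spelled out (illustrative):

```python
self.assertRaises(TypeError, udf)                # calls udf() with no arguments at all
self.assertRaises(TypeError, udf, non_callable)  # calls udf(non_callable), the case to test
```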





[GitHub] spark issue #16535: [SPARK-19162][PYTHON][SQL][WIP] UserDefinedFunction shou...

2017-01-11 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16535
  
+1. Thanks for working on this @zero323!





[GitHub] spark pull request #16533: [SPARK-19160][PYTHON][SQL][WIP] Add udf decorator

2017-01-11 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16533#discussion_r95654895
  
--- Diff: python/pyspark/sql/decorators.py ---
@@ -0,0 +1,25 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.sql.types import StringType
+from pyspark.sql.functions import udf as _udf
+
+
+def udf(returnType=StringType()):
--- End diff --

I think the two should be merged. If they have the same name, then 
what works in one place breaks in the other, and the two become mutually exclusive. 
Our users share code quite a bit, and it's common to copy and paste UDFs. That 
should either work or fail with "NameError: name 'udf' is not defined", 
rather than silently behaving differently.

With the changes to the decorator to support using it without parens, it's 
already close to being compatible with the current `functions.udf`.





[GitHub] spark pull request #16454: [SPARK-19055][SQL][PySpark] Fix SparkSession init...

2017-01-12 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16454#discussion_r95846835
  
--- Diff: python/pyspark/sql/session.py ---
@@ -214,8 +214,12 @@ def __init__(self, sparkContext, jsparkSession=None):
 self._wrapped = SQLContext(self._sc, self, self._jwrapped)
 _monkey_patch_RDD(self)
 install_exception_handler()
-if SparkSession._instantiatedContext is None:
-SparkSession._instantiatedContext = self
+# If we had an instantiated SparkSession attached with a SparkContext
+# which is stopped now, we need to renew the instantiated SparkSession.
+# Otherwise, we will use invalid SparkSession when we call Builder.getOrCreate.
+if SparkSession._instantiatedSession is None \
+or SparkSession._instantiatedSession._sc._jsc is None:
--- End diff --

If this can't tell whether the java context is stopped, then it should be 
fine to use this logic. +1.





[GitHub] spark pull request #16454: [SPARK-19055][SQL][PySpark] Fix SparkSession init...

2017-01-12 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16454#discussion_r95846995
  
--- Diff: python/pyspark/sql/session.py ---
@@ -161,8 +161,8 @@ def getOrCreate(self):
 with self._lock:
 from pyspark.context import SparkContext
 from pyspark.conf import SparkConf
-session = SparkSession._instantiatedContext
-if session is None:
+session = SparkSession._instantiatedSession
+if session is None or session._sc._jsc is None:
--- End diff --

Yeah, I had to change our branch to just check whether `_jsc` is defined, 
for the same reason. I'd prefer to add `isStopped` to both the Java and Python 
APIs eventually, but that doesn't need to block this commit.





[GitHub] spark issue #15088: SPARK-17532: Add lock debugging info to thread dumps.

2016-10-05 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/15088
  
@srowen, could you take another look at this? It has been useful for us to 
catch locking issues and I'd like to get it in for 2.1.0. Thanks!





[GitHub] spark pull request #17412: [SPARK-20084][Core] Remove internal.metrics.updat...

2017-03-24 Thread rdblue
GitHub user rdblue opened a pull request:

https://github.com/apache/spark/pull/17412

[SPARK-20084][Core] Remove internal.metrics.updatedBlockStatuses from 
history files.

## What changes were proposed in this pull request?

Remove accumulator updates for internal.metrics.updatedBlockStatuses from 
SparkListenerTaskEnd entries in the history file. These can cause history files 
to grow to hundreds of GB because the value of the accumulator contains all 
tracked blocks.

## How was this patch tested?

Current History UI tests cover use of the history file.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rdblue/spark SPARK-20084-remove-block-accumulator-info

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17412.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17412


commit 386846c224842ba299a198ca296cfc2b72b188cc
Author: Ryan Blue 
Date:   2017-03-24T18:21:19Z

SPARK-20084: Remove internal.metrics.updatedBlockStatuses from history 
files.







[GitHub] spark issue #17412: [SPARK-20084][Core] Remove internal.metrics.updatedBlock...

2017-03-27 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/17412
  
@vanzin, I looked through the listeners in the Spark UI for uses of 
`updatedBlockStatuses` and didn't find any, and I haven't seen anything in the UI 
that is obviously based on it. As I said on the issue, the updates duplicate data 
that is already in the task metrics, so it would be odd for a component to go to 
the accumulators to get this info.

I'm deploying this internally today; maybe it makes sense to give it a 
couple of days to see if we get complaints about something not working.





[GitHub] spark issue #17412: [SPARK-20084][Core] Remove internal.metrics.updatedBlock...

2017-03-30 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/17412
  
@vanzin, we've been running this in production for a few days now and 
haven't had any problems. I think it should be safe to merge. Could you take 
another look? Thanks!





[GitHub] spark pull request #17540: [SPARK-20213][SQL][UI] Fix DataFrameWriter operat...

2017-04-05 Thread rdblue
GitHub user rdblue opened a pull request:

https://github.com/apache/spark/pull/17540

[SPARK-20213][SQL][UI] Fix DataFrameWriter operations in SQL UI tab.

## What changes were proposed in this pull request?

Wraps `DataFrameWriter` operations in `SQLExecution.withNewExecutionId` so 
that `SparkListenerSQLExecutionStart` and `SparkListenerSQLExecutionEnd` are 
sent and the query shows up in the SQL tab of the UI.

## How was this patch tested?

Tested by hand that `insertInto` results in queries in the SQL tab.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rdblue/spark SPARK-20213-fix-sql-tab

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17540.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17540


commit f9342b5fea4fd3dbe32c12930f96aa9af38c386b
Author: Ryan Blue 
Date:   2017-04-05T17:28:19Z

SPARK-20213: Fix DataFrameWriter operations in SQL UI tab.

Not wrapping the execution started by DataFrameWriter in
SQLExecution.withNewExecutionId causes those executions to not show up
in the SQL UI tab because the SparkListenerSQLExecutionStart and
SparkListenerSQLExecutionEnd messages are never sent.







[GitHub] spark issue #17540: [SPARK-20213][SQL][UI] Fix DataFrameWriter operations in...

2017-04-05 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/17540
  
@srowen, agreed. Closely related but not the same code paths. The question 
is: when should `withNewExecutionId` get called?

I'm running the test suite now, and this patch causes test failures when 
`withNewExecutionId` is called twice: once in `DataFrameWriter` and once in 
`InsertIntoHadoopFsRelationCommand`. It looks like the call has been 
scattered around the codebase (e.g., in `InsertIntoHadoopFsRelationCommand` and 
other execution nodes) to fix this problem for certain operations, so we should 
decide where it belongs and fix the tests around that.

The reason why I added it to `DataFrameWriter` is that it is called in 
`Dataset` actions, and it makes sense to call it once from where an action is 
started. I think it makes the most sense for action methods, like 
`Dataset#collect` or `DataFrameWriter#insertInto` to minimize the number of 
places we need to add it. I don't think this is a concern that should be 
addressed by the execution plan.





[GitHub] spark issue #17540: [SPARK-20213][SQL][UI] Fix DataFrameWriter operations in...

2017-04-06 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/17540
  
@cloud-fan, can you look at this? What do you think about the question 
above: when should `withNewExecutionId` get called?





[GitHub] spark issue #17540: [SPARK-20213][SQL][UI] Fix DataFrameWriter operations in...

2017-04-06 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/17540
  
The new test failures are caused by a check I inserted. Moving where 
`withNewExecutionId` gets called could result in SQL queries missing from the UI, 
so wherever I replaced `withNewExecutionId` with `checkSQLExecutionId`, a missing 
execution ID will cause tests (and only tests) to fail. That way, we can catch all 
of the call stacks that should set it.

This caught problems in SQL command execution, and I've added a patch to fix 
them.





[GitHub] spark issue #17106: [SPARK-19775][SQL] Remove an obsolete `partitionBy().ins...

2017-02-28 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/17106
  
+1

Looks fine to me. `partitionBy` support with `insertInto` was removed, and the 
test is probably still passing because it expects an analysis exception that is 
now thrown for a different reason. We should update assertions like this to verify 
the error message, to avoid that in the future.





[GitHub] spark pull request #15816: SPARK-18368: Fix regexp_replace with task seriali...

2016-11-08 Thread rdblue
GitHub user rdblue opened a pull request:

https://github.com/apache/spark/pull/15816

SPARK-18368: Fix regexp_replace with task serialization.

## What changes were proposed in this pull request?

This makes the result value both transient and lazy, so that if a 
RegExpReplace object is serialized after being initialized, `result: StringBuffer` 
is re-created correctly on deserialization.

## How was this patch tested?

Verified that this patch fixed the query that found the bug.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rdblue/spark SPARK-18368-fix-regexp-replace

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15816.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15816


commit 8740a2c3b918796e1ac87c3e178f9f9f4651cb75
Author: Ryan Blue 
Date:   2016-10-26T16:42:43Z

SPARK-18368: Fix regexp_replace with task serialization.







[GitHub] spark issue #15816: SPARK-18368: Fix regexp_replace with task serialization.

2016-11-08 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/15816
  
@hvanhovell, I updated the test so that all expressions passed through 
`checkEvaluation` are serialized and deserialized before the checks are run. 
I'll open an issue if it catches anything else. I'd rather not hold up this PR 
if it finds lots of other bugs.





[GitHub] spark issue #15816: SPARK-18368: Fix regexp_replace with task serialization.

2016-11-09 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/15816
  
Thanks @rxin and @hvanhovell! I appreciate the quick reviews.





[GitHub] spark pull request #15834: [SPARK-18368] [SQL] Fix regexp replace when seria...

2016-11-09 Thread rdblue
GitHub user rdblue opened a pull request:

https://github.com/apache/spark/pull/15834

[SPARK-18368] [SQL] Fix regexp replace when serialized

## What changes were proposed in this pull request?

This makes the result value both transient and lazy, so that if a 
RegExpReplace object is serialized after being initialized, `result: StringBuffer` 
is re-created correctly on deserialization.

## How was this patch tested?

* Verified that this patch fixed the query that found the bug.
* Added a test case that fails without the fix.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rdblue/spark SPARK-18368-fix-regexp-replace

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15834.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15834


commit 8740a2c3b918796e1ac87c3e178f9f9f4651cb75
Author: Ryan Blue 
Date:   2016-10-26T16:42:43Z

SPARK-18368: Fix regexp_replace with task serialization.

commit 3536f6a44f15e2997a1061a6e7ec061f98c3ef9f
Author: Ryan Blue 
Date:   2016-11-08T23:40:25Z

SPARK-18368: Add test for regexp_replace serialization.







[GitHub] spark issue #15834: [SPARK-18368] [SQL] Fix regexp replace when serialized

2016-11-09 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/15834
  
@yhuai, this replaces #15816. The ref, 3536f6a, already passed tests on that PR, 
so it should be safe to merge this.





[GitHub] spark issue #15834: [SPARK-18368] [SQL] Fix regexp replace when serialized

2016-11-09 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/15834
  
[SPARK-18387](https://issues.apache.org/jira/browse/SPARK-18387) tracks the 
other bugs.





[GitHub] spark pull request #15847: SPARK-18368: Add serialization to checkEvaluation...

2016-11-10 Thread rdblue
GitHub user rdblue opened a pull request:

https://github.com/apache/spark/pull/15847

SPARK-18368: Add serialization to checkEvaluation.

## What changes were proposed in this pull request?

This removes the serialization test from RegexpExpressionsSuite and
replaces it by serializing all expressions in checkEvaluation.

This also fixes math constant expressions by making LeafMathExpression
Serializable and fixes NumberFormat values that are null or invalid
after serialization.

## How was this patch tested?

This patch only changes tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rdblue/spark SPARK-18387-fix-serializable-expressions

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15847.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15847


commit 8e829ae87b197de2ff4b8777202a47d5f1204c56
Author: Ryan Blue 
Date:   2016-11-09T00:16:11Z

SPARK-18368: Add serialization to checkEvaluation.

This removes the serialization test from RegexpExpressionsSuite and
replaces it by serializing all expressions in checkEvaluation.

This also fixes math constant expressions by making LeafMathExpression
Serializable and fixes NumberFormat values that are null or invalid
after serialization.







[GitHub] spark pull request #15847: [SPARK-18387] [SQL] Add serialization to checkEva...

2016-11-11 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/15847#discussion_r87619721
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala ---
@@ -36,7 +36,7 @@ import org.apache.spark.unsafe.types.UTF8String
  * @param name The short name of the function
  */
 abstract class LeafMathExpression(c: Double, name: String)
-  extends LeafExpression with CodegenFallback {
+  extends LeafExpression with CodegenFallback with Serializable {
--- End diff --

The case classes are serializable, but the superclass must be as well. If it 
isn't, deserialization requires the non-serializable superclass to have a no-arg 
constructor, which this abstract class doesn't have, so its children fail when 
they are deserialized.





[GitHub] spark pull request #15847: [SPARK-18387] [SQL] Add serialization to checkEva...

2016-11-11 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/15847#discussion_r87620051
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala ---
@@ -1431,43 +1431,49 @@ case class FormatNumber(x: Expression, d: Expression)
 
   // Associated with the pattern, for the last d value, and we will update the
   // pattern (DecimalFormat) once the new coming d value differ with the last one.
+  // This is an Option to distinguish between 0 (numberFormat is valid) and uninitialized after
+  // serialization (numberFormat has not been updated for dValue = 0).
   @transient
-  private var lastDValue: Int = -100
+  private var lastDValue: Option[Int] = None
--- End diff --

This is a variable so it can't be lazily initialized.





[GitHub] spark issue #15847: [SPARK-18387] [SQL] Add serialization to checkEvaluation...

2016-11-11 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/15847
  
Thanks! I just went to cherry-pick the commit, but it was already there.





[GitHub] spark pull request #14720: SPARK-12868: Allow Add jar to add jars from hdfs/...

2016-11-14 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/14720#discussion_r87908473
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala ---
@@ -856,6 +856,17 @@ class HiveQuerySuite extends HiveComparisonTest with 
BeforeAndAfter {
 sql("DROP TABLE alter1")
   }
 
+  test("SPARK-12868 ADD JAR FROM HDFS") {
+val testJar = "hdfs://nn:8020/foo.jar"
+// This should fail with unknown host, as its just testing the URL parsing
+// before SPARK-12868 it was failing with Malformed URI
+val e = intercept[RuntimeException] {
--- End diff --

I think this test should be improved before merging this. Looking for a 
RuntimeException to validate that the Jar was registered is brittle and can 
easily pass when the registration doesn't actually work.





[GitHub] spark issue #15062: SPARK-17424: Fix unsound substitution bug in ScalaReflec...

2016-11-21 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/15062
  
I tried to reproduce this with #10970 reverted, but I didn't hit the issue in 
testing. I still think it's fine to move forward on this, even if it is hard to 
reproduce, because we know the code is wrong and this fixes it.





[GitHub] spark issue #15062: SPARK-17424: Fix unsound substitution bug in ScalaReflec...

2016-11-21 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/15062
  
We were seeing the problem when using the datasets API in our 1.6.1 build, 
which is based on Scala 2.10. I recently tried to reproduce this on master with 
2.11 and #10970 reverted, but I didn't get a case that failed. Either way, I 
think the fix here makes sense: if there are no types to substitute, don't do 
it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15062: SPARK-17424: Fix unsound substitution bug in ScalaReflec...

2016-11-21 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/15062
  
This does look the same as SPARK-17109. Does this fix that issue?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15088: SPARK-17532: Add lock debugging info to thread dumps.

2016-11-01 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/15088
  
@srowen, if you have a chance, could you look at this again? I think it 
will be helpful for tracking down live-lock issues. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13880: SPARK-16178: Remove unnecessary Hive partition ch...

2016-11-01 Thread rdblue
Github user rdblue closed the pull request at:

https://github.com/apache/spark/pull/13880


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15088: SPARK-17532: Add lock debugging info to thread dumps.

2016-11-02 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/15088
  
Thanks @rxin!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15738: SPARK-18086: Add support for Hive session vars.

2016-11-02 Thread rdblue
GitHub user rdblue opened a pull request:

https://github.com/apache/spark/pull/15738

SPARK-18086: Add support for Hive session vars.

## What changes were proposed in this pull request?

This adds support for Hive variables:

* Makes values set via `spark-sql --hivevar name=value` accessible
* Adds `getHiveVar` and `setHiveVar` to the `HiveClient` interface
* Adds a SessionVariables trait for sessions like Hive that support 
variables (including Hive vars)
* Adds SessionVariables support to variable substitution
* Adds SessionVariables support to the SET command

## How was this patch tested?

* Adds a test to all supported Hive versions for accessing Hive variables
* Adds HiveVariableSubstitutionSuite

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rdblue/spark SPARK-18086-add-hivevar-support

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15738.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15738






---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15738: SPARK-18086: Add support for Hive session vars.

2016-11-02 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/15738
  
The purpose of this is to ensure the same behavior as Hive so users can 
move to SparkSQL without rewriting queries. That's why I'm trying to keep 
behavior as close as possible to the behavior of Hive (and Spark 1.6.1).

Session vars are kept in a different namespace, so collisions should not overwrite configs, and session vars take precedence in Hive. Because we still use a Hive SessionState and pass some commands to Hive, I thought it was also important to store the Hive variables in the session state the way Hive would. That way, any commands that are passed to Hive and contain variable references work as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15738: SPARK-18086: Add support for Hive session vars.

2016-11-08 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/15738
  
Thanks for reviewing @rxin!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16533: [SPARK-19160][PYTHON][SQL][WIP] Add udf decorator

2017-01-19 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16533
  
@zero323, I think that the decorator and existing UDF factory method should 
be the same, but that we can't break existing code. Can you explain why this 
necessarily breaks code that relies on returnType as a positional arg? What 
about something like this:

```python
@functools.wraps(_udf)
def udf(f=None, returnType=StringType()):
    """A decorator version of pyspark.sql.functions.udf
    """
    if f is None:
        return functools.partial(_udf, returnType=returnType)
    else:
        return _udf(f=f, returnType=returnType)
```
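
For illustration, here is a hedged sketch of how both call forms would behave with a decorator like that. The function and column names are hypothetical, and it assumes the `_udf` factory referenced above:

```python
from pyspark.sql.types import IntegerType

@udf  # bare form: f is the decorated function, returnType stays StringType()
def to_upper(s):
    return s.upper() if s is not None else None

@udf(returnType=IntegerType())  # parameterized form: f is None, so a partial is returned
def str_length(s):
    return len(s) if s is not None else 0

# df.select(to_upper("name"), str_length("name"))  # applied to hypothetical columns
```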



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16533: [SPARK-19160][PYTHON][SQL][WIP] Add udf decorator

2017-01-19 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16533
  
Yeah, I added the `isinstance(f, DataType)` trick to my local tests. I'd 
add it if it were up to me, but I'm fine requiring `returnType` to be a keyword 
arg if you feel strongly about it.
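
For context, a minimal sketch of what that check could look like (not the exact code from my local tests; it assumes a `_udf` factory like the one quoted earlier in this thread):

```python
from functools import partial
from pyspark.sql.types import DataType, StringType

def udf(f=None, returnType=StringType()):
    # Sketch only: accept either a function or a DataType as the first positional argument.
    if f is None or isinstance(f, DataType):
        # Called as udf(IntegerType()) or udf(returnType=...): defer wrapping.
        return partial(_udf, returnType=f if isinstance(f, DataType) else returnType)
    # Called as udf(my_func) or bare @udf: wrap immediately.
    return _udf(f=f, returnType=returnType)
```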

Looks like tests are broken, maybe rebase?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16533: [SPARK-19160][PYTHON][SQL] Add udf decorator

2017-01-20 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16533
  
+1

Looks good to me. Thanks, @zero323!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

2017-01-20 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16536
  
+1

Looks good to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-26 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r98043331
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
     def __init__(self, func, returnType, name=None):
         self.func = func
         self.returnType = returnType
-        self._judf = self._create_judf(name)
-
-    def _create_judf(self, name):
+        self._judf_placeholder = None
+        self._name = name or (
+            func.__name__ if hasattr(func, '__name__')
+            else func.__class__.__name__)
+
+    @property
+    def _judf(self):
+        if self._judf_placeholder is None:
--- End diff --

I think @holdenk's concern is that this would allow concurrent calls to 
`_create_udf`. That would create two `UserDefinedPythonFunction` objects, but I 
don't see anything on the Scala side that is concerning about that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-26 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r98049505
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
     def __init__(self, func, returnType, name=None):
         self.func = func
         self.returnType = returnType
-        self._judf = self._create_judf(name)
-
-    def _create_judf(self, name):
+        self._judf_placeholder = None
+        self._name = name or (
+            func.__name__ if hasattr(func, '__name__')
+            else func.__class__.__name__)
+
+    @property
+    def _judf(self):
+        if self._judf_placeholder is None:
--- End diff --

Yeah, I don't think it should require a lock. I think concurrent calls are 
very unlikely and safe.
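
For reference, this is roughly what guarding the initialization would look like if we did want it. It is only a sketch, not a suggestion to add it, and it assumes a no-argument `_create_judf()` helper that builds the Java UDF:

```python
import threading

class UserDefinedFunction(object):
    def __init__(self, func, returnType, name=None):
        self.func = func
        self.returnType = returnType
        self._judf_placeholder = None
        self._judf_lock = threading.Lock()

    @property
    def _judf(self):
        # Double-checked locking so only one thread builds the Java UDF.
        if self._judf_placeholder is None:
            with self._judf_lock:
                if self._judf_placeholder is None:
                    self._judf_placeholder = self._create_judf()
        return self._judf_placeholder
```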


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16533: [SPARK-19160][PYTHON][SQL] Add udf decorator

2017-01-30 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16533
  
I agree with @zero323. I don't think this method should change how users 
interact with DataTypes. There is a precedent for using type strings instead, 
as @rxin suggested, and that's a good idea that fits the need better than 
allowing users to pass the type class instead of an instance.
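
To make that concrete, the suggestion is roughly the following style (illustrative only; which strings are accepted depends on what the implementation ends up supporting):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Return type given as a type string...
slen = udf(lambda s: len(s), "int")
# ...instead of as a DataType instance.
slen = udf(lambda s: len(s), IntegerType())
```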


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...

2017-02-01 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16537#discussion_r98955007
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -429,6 +429,11 @@ def test_udf_with_input_file_name(self):
         row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
         self.assertTrue(row[0].find("people1.json") != -1)
 
+    def test_udf_should_validate_input_args(self):
+        from pyspark.sql.functions import udf
+
+        self.assertRaises(TypeError, udf(lambda x: x), None)
--- End diff --

If this is covered by existing tests, then that's fine. Good point.

To validate the number of args, I think it is a good idea, as long as we know that it won't fail for C extensions (though it may be inconclusive for them).
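
As a rough illustration (not part of this PR, and the helper name is hypothetical), such a check could use Python 3's `inspect.signature` and simply give up when introspection fails, which is the inconclusive case for builtins and C-extension callables:

```python
import inspect

def expected_arg_count(f):
    """Return the number of positional parameters of f, or None if unknown."""
    try:
        sig = inspect.signature(f)
    except (TypeError, ValueError):
        # Builtins and C-extension callables often can't be introspected.
        return None
    params = list(sig.parameters.values())
    if any(p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD) for p in params):
        return None  # *args / **kwargs: no fixed arity
    return sum(p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD) for p in params)
```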


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...

2017-02-01 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16537#discussion_r98970775
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -429,6 +429,11 @@ def test_udf_with_input_file_name(self):
         row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
         self.assertTrue(row[0].find("people1.json") != -1)
 
+    def test_udf_should_validate_input_args(self):
+        from pyspark.sql.functions import udf
+
+        self.assertRaises(TypeError, udf(lambda x: x), None)
--- End diff --

Sounds good. It's probably worth exploring eventually, but there's no need 
to hold up this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

2017-02-01 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16537
  
+1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16107: SPARK-18677: Fix parsing ['key'] in JSON path exp...

2016-12-01 Thread rdblue
GitHub user rdblue opened a pull request:

https://github.com/apache/spark/pull/16107

SPARK-18677: Fix parsing ['key'] in JSON path expressions.

## What changes were proposed in this pull request?

This fixes the parser rule to match named expressions, which doesn't work 
for two reasons:
1. The name match is not coerced to a regular expression (missing .r)
2. The surrounding literals are incorrect and attempt to escape a single 
quote, which is unnecessary

## How was this patch tested?

This adds test cases for named expressions using the bracket syntax, 
including one with quoted spaces.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rdblue/spark SPARK-18677-fix-json-path

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16107.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16107


commit 1c0cc097c5ee6f35da6fab732d7e7408d6306d1e
Author: Ryan Blue 
Date:   2016-12-01T20:24:58Z

SPARK-18677: Fix parsing ['key'] in JSON path expressions.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16107: SPARK-18677: Fix parsing ['key'] in JSON path expression...

2016-12-02 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16107
  
Thanks for the quick review!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16281
  
I don't think a fork is a good idea, nor do I think there is a reasonable 
need for one.

@gatorsmile brought up that the Parquet community refused to build a patch 
release: "The problem is the Parquet community will not create a branch 1.8.2+ 
for us." I don't remember this happening, and as far as I can tell from both 
google and my inbox, the Parquet community never rejected the idea of patch 
releases. If there was a conversation that I don't know about, then I apologize 
that you were given the impression that patch releases aren't possible. That 
isn't the case.

I'm happy to work with the community to put out patch releases, especially 
if that's needed for Spark. To demonstrate, look at PARQUET-389 and 
PARQUET-654. @rxin asked the Parquet dev list about predicate push-down 
features and within a week and a half, both of those issues were resolved. 
(PARQUET-389 is the fix for SPARK-18539, cited as motivation to fork.)

As for the other motivating issue, PARQUET-686, a fork can't help solve 
this problem. This is an issue that requires updating the Parquet format spec 
so you couldn't simply fix your own fork without abandoning compatibility. The 
Parquet community put out a release that gives the user a choice between 
correctness and performance, which is a good compromise until this can be fixed.

It is fair to point out that Parquet has not had a regular release cadence 
for minor releases (1.8.1 to 1.9.0), which is something that the Parquet 
community knows about and has discussed. We have recently committed to 
quarterly releases to fix this, with patch releases whenever they are needed. 
I'd encourage anyone interested to get involved.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16281
  
Great! I'm glad it was just confusion. I completely agree with @srowen that 
forking should be a last resort.

In the future, please reach out to the community, whether it's Parquet or another project, to address concerns before it gets this far. It's better for everyone if we can use the feedback to improve and continue to work together.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16281
  
I think we should move to a 1.8.2 patch release. The reason is that 1.9.0 moved to ByteBuffer-based reads, and we've found at least one problem with them. ByteBuffer-based reads also change an internal API that Spark uses for its vectorized reader. The changes here (now that I've looked at the actual PR) wouldn't work because Parquet is going to call the ByteBuffer method rather than the byte array method. I'm cleaning up the ByteBuffer code quite a bit right now for better performance with G1GC (see https://github.com/apache/parquet-mr/pull/390), so I think Spark should move to ByteBuffer reads after that makes it in.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16281
  
Thanks @dongjoon-hyun! Let's get Parquet 1.8.2 out in January.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/16281
  
The improvement is in how row groups are garbage collected. G1GC puts 
humongous allocations directly into the old generation, so you end up needing a 
full GC to reclaim the space. That just increases memory pressure, so you run 
out of memory and run full GCs and/or spill to disk. We don't have data yet 
because I haven't pushed the feature or metrics collection for it.
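
As a tangential illustration (not part of this change, and the values are only examples): one way to see the humongous-allocation effect is to raise the G1 region size, since allocations larger than half a region go straight to the old generation:

```python
from pyspark.sql import SparkSession

# Illustrative executor JVM options: a 32m region makes the humongous
# threshold 16m, so fewer large Parquet buffers are treated as humongous.
spark = (
    SparkSession.builder
    .config("spark.executor.extraJavaOptions",
            "-XX:+UseG1GC -XX:G1HeapRegionSize=32m")
    .getOrCreate()
)
```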


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-04-18 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21070#discussion_r182563063
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
 ---
@@ -63,115 +59,159 @@ public final void readBooleans(int total, 
WritableColumnVector c, int rowId) {
 }
   }
 
+  private ByteBuffer getBuffer(int length) {
+try {
+  return in.slice(length).order(ByteOrder.LITTLE_ENDIAN);
+} catch (IOException e) {
+  throw new ParquetDecodingException("Failed to read " + length + " 
bytes", e);
+}
+  }
+
   @Override
   public final void readIntegers(int total, WritableColumnVector c, int 
rowId) {
-c.putIntsLittleEndian(rowId, total, buffer, offset - 
Platform.BYTE_ARRAY_OFFSET);
-offset += 4 * total;
+int requiredBytes = total * 4;
+ByteBuffer buffer = getBuffer(requiredBytes);
+
+if (buffer.hasArray()) {
+  int offset = buffer.arrayOffset() + buffer.position();
+  c.putIntsLittleEndian(rowId, total, buffer.array(), offset - 
Platform.BYTE_ARRAY_OFFSET);
+} else {
+  for (int i = 0; i < total; i += 1) {
+c.putInt(rowId + i, buffer.getInt());
+  }
+}
   }
 
   @Override
   public final void readLongs(int total, WritableColumnVector c, int 
rowId) {
-c.putLongsLittleEndian(rowId, total, buffer, offset - 
Platform.BYTE_ARRAY_OFFSET);
-offset += 8 * total;
+int requiredBytes = total * 8;
+ByteBuffer buffer = getBuffer(requiredBytes);
+
+if (buffer.hasArray()) {
+  int offset = buffer.arrayOffset() + buffer.position();
+  c.putLongsLittleEndian(rowId, total, buffer.array(), offset - 
Platform.BYTE_ARRAY_OFFSET);
+} else {
+  for (int i = 0; i < total; i += 1) {
+c.putLong(rowId + i, buffer.getLong());
+  }
+}
   }
 
   @Override
   public final void readFloats(int total, WritableColumnVector c, int 
rowId) {
-c.putFloats(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET);
-offset += 4 * total;
+int requiredBytes = total * 4;
+ByteBuffer buffer = getBuffer(requiredBytes);
+
+if (buffer.hasArray()) {
+  int offset = buffer.arrayOffset() + buffer.position();
+  c.putFloats(rowId, total, buffer.array(), offset - 
Platform.BYTE_ARRAY_OFFSET);
+} else {
+  for (int i = 0; i < total; i += 1) {
+c.putFloat(rowId + i, buffer.getFloat());
+  }
+}
   }
 
   @Override
   public final void readDoubles(int total, WritableColumnVector c, int 
rowId) {
-c.putDoubles(rowId, total, buffer, offset - 
Platform.BYTE_ARRAY_OFFSET);
-offset += 8 * total;
+int requiredBytes = total * 8;
+ByteBuffer buffer = getBuffer(requiredBytes);
+
+if (buffer.hasArray()) {
+  int offset = buffer.arrayOffset() + buffer.position();
+  c.putDoubles(rowId, total, buffer.array(), offset - 
Platform.BYTE_ARRAY_OFFSET);
+} else {
+  for (int i = 0; i < total; i += 1) {
+c.putDouble(rowId + i, buffer.getDouble());
+  }
+}
+  }
+
+  private byte getByte() {
+try {
+  return (byte) in.read();
+} catch (IOException e) {
+  throw new ParquetDecodingException("Failed to read a byte", e);
+}
   }
 
   @Override
   public final void readBytes(int total, WritableColumnVector c, int 
rowId) {
-for (int i = 0; i < total; i++) {
-  // Bytes are stored as a 4-byte little endian int. Just read the 
first byte.
-  // TODO: consider pushing this in ColumnVector by adding a readBytes 
with a stride.
-  c.putByte(rowId + i, Platform.getByte(buffer, offset));
-  offset += 4;
+int requiredBytes = total * 4;
+ByteBuffer buffer = getBuffer(requiredBytes);
+
+for (int i = 0; i < total; i += 1) {
+  c.putByte(rowId + i, buffer.get());
+  // skip the next 3 bytes
+  buffer.position(buffer.position() + 3);
 }
   }
 
   @Override
   public final boolean readBoolean() {
-byte b = Platform.getByte(buffer, offset);
-boolean v = (b & (1 << bitOffset)) != 0;
+// TODO: vectorize decoding and keep boolean[] instead of currentByte
+if (bitOffset == 0) {
+  currentByte = getByte();
+}
+
+boolean v = (currentByte & (1 << bitOffset)) != 0;
 bitOffset += 1;
 if (bitOffset == 8) {
   bitOffset = 0;
-  offset++;
 }
 return v;
   }
 
   @Override
   public final int readInteg

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
@scottcarey, Parquet will use the compressors if they are available. You 
can add them from an external Jar and it will work. LZ4 should also work out of 
the box because it is included in Hadoop 2.7.

I agree that it would be nice if Parquet didn't rely on Hadoop for 
compression libraries, but that's how it is at the moment. Feel free to fix it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20988: [SPARK-23877][SQL]: Use filter predicates to prun...

2018-04-18 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/20988#discussion_r182579387
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala
 ---
@@ -129,35 +151,41 @@ case class OptimizeMetadataOnlyQuery(catalog: 
SessionCatalog) extends Rule[Logic
 
   /**
* A pattern that finds the partitioned table relation node inside the 
given plan, and returns a
-   * pair of the partition attributes and the table relation node.
+   * pair of the partition attributes, partition filters, and the table 
relation node.
*
* It keeps traversing down the given plan tree if there is a 
[[Project]] or [[Filter]] with
* deterministic expressions, and returns result after reaching the 
partitioned table relation
* node.
*/
-  object PartitionedRelation {
-
-def unapply(plan: LogicalPlan): Option[(AttributeSet, LogicalPlan)] = 
plan match {
-  case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _, _)
-if fsRelation.partitionSchema.nonEmpty =>
-val partAttrs = 
getPartitionAttrs(fsRelation.partitionSchema.map(_.name), l)
-Some((AttributeSet(partAttrs), l))
-
-  case relation: HiveTableRelation if 
relation.tableMeta.partitionColumnNames.nonEmpty =>
-val partAttrs = 
getPartitionAttrs(relation.tableMeta.partitionColumnNames, relation)
-Some((AttributeSet(partAttrs), relation))
-
-  case p @ Project(projectList, child) if 
projectList.forall(_.deterministic) =>
-unapply(child).flatMap { case (partAttrs, relation) =>
-  if (p.references.subsetOf(partAttrs)) Some((p.outputSet, 
relation)) else None
-}
+  object PartitionedRelation extends PredicateHelper {
+
+def unapply(plan: LogicalPlan): Option[(AttributeSet, Seq[Expression], 
LogicalPlan)] = {
+  plan match {
+case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _, _)
+  if fsRelation.partitionSchema.nonEmpty =>
+  val partAttrs = 
getPartitionAttrs(fsRelation.partitionSchema.map(_.name), l)
+  Some((AttributeSet(partAttrs), Nil, l))
+
+case relation: HiveTableRelation if 
relation.tableMeta.partitionColumnNames.nonEmpty =>
+  val partAttrs = 
getPartitionAttrs(relation.tableMeta.partitionColumnNames, relation)
+  Some((AttributeSet(partAttrs), Nil, relation))
+
+case p @ Project(projectList, child) if 
projectList.forall(_.deterministic) =>
+  unapply(child).flatMap { case (partAttrs, filters, relation) =>
+if (p.references.subsetOf(partAttrs)) Some((p.outputSet, 
filters, relation)) else None
--- End diff --

@cloud-fan, that is basically how this works already. Each matched node 
calls `unapply(child)` to get the result from the child node, then it adds the 
current node's conditions to that result. Using `unapply` instead of 
`getPartitionedRelation` makes this work in the matching rule:

```scala
  case a @ Aggregate(_, aggExprs, child @ PartitionedRelation(partAttrs, filters, relation)) =>
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
I backported the Hadoop zstd codec to 2.7.3 without much trouble. But 
either way, I think that's a separate concern. This is about getting Parquet 
updated, not about worrying whether users can easily add compression 
implementations to their classpath.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20988: [SPARK-23877][SQL]: Use filter predicates to prune parti...

2018-04-18 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/20988
  
@cloud-fan, I've added the test. Thanks for letting me know about HiveCatalogMetrics; that's exactly what I needed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21029: [SPARK-23952] remove type parameter in DataReader...

2018-04-18 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21029#discussion_r182596153
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataReaderFactory.java
 ---
@@ -52,10 +56,46 @@
   }
 
   /**
-   * Returns a data reader to do the actual reading work.
+   * The output data format of this factory's data reader. Spark will 
invoke the corresponding
+   * create data reader method w.r.t. the return value of this method:
+   * 
+   *   {@link DataFormat#ROW}: {@link #createRowDataReader()}
+   *   {@link DataFormat#UNSAFE_ROW}: {@link 
#createUnsafeRowDataReader()}
+   *   @{@link DataFormat#COLUMNAR_BATCH}: {@link 
#createColumnarBatchDataReader()}
+   * 
+   */
+  DataFormat dataFormat();
--- End diff --

If the data format is determined when the factory is created, then I don't 
see why it is necessary to change the API. This just makes it more confusing.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21029: [SPARK-23952] remove type parameter in DataReader...

2018-04-18 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21029#discussion_r182596274
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/MicroBatchReader.java
 ---
@@ -17,12 +17,12 @@
 
 package org.apache.spark.sql.sources.v2.reader.streaming;
 
+import java.util.Optional;
--- End diff --

Nit: this is a cosmetic change that should be reverted before committing.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21029: [SPARK-23952] remove type parameter in DataReader...

2018-04-18 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21029#discussion_r182596392
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
 ---
@@ -17,52 +17,85 @@
 
 package org.apache.spark.sql.execution.datasources.v2
 
-import scala.collection.JavaConverters._
-import scala.reflect.ClassTag
-
 import org.apache.spark.{InterruptibleIterator, Partition, SparkContext, 
TaskContext}
 import org.apache.spark.rdd.RDD
-import org.apache.spark.sql.sources.v2.reader.DataReaderFactory
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, 
RowEncoder}
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow
+import org.apache.spark.sql.sources.v2.DataFormat
+import org.apache.spark.sql.sources.v2.reader.{DataReader, 
DataReaderFactory}
+import org.apache.spark.sql.types.StructType
 
-class DataSourceRDDPartition[T : ClassTag](val index: Int, val 
readerFactory: DataReaderFactory[T])
+class DataSourceRDDPartition(val index: Int, val factory: 
DataReaderFactory)
   extends Partition with Serializable
 
-class DataSourceRDD[T: ClassTag](
+class DataSourceRDD(
 sc: SparkContext,
-@transient private val readerFactories: Seq[DataReaderFactory[T]])
-  extends RDD[T](sc, Nil) {
+@transient private val readerFactories: Seq[DataReaderFactory],
+schema: StructType)
+  extends RDD[InternalRow](sc, Nil) {
 
   override protected def getPartitions: Array[Partition] = {
 readerFactories.zipWithIndex.map {
   case (readerFactory, index) => new DataSourceRDDPartition(index, 
readerFactory)
 }.toArray
   }
 
-  override def compute(split: Partition, context: TaskContext): 
Iterator[T] = {
-val reader = 
split.asInstanceOf[DataSourceRDDPartition[T]].readerFactory.createDataReader()
-context.addTaskCompletionListener(_ => reader.close())
-val iter = new Iterator[T] {
-  private[this] var valuePrepared = false
-
-  override def hasNext: Boolean = {
-if (!valuePrepared) {
-  valuePrepared = reader.next()
-}
-valuePrepared
-  }
-
-  override def next(): T = {
-if (!hasNext) {
-  throw new java.util.NoSuchElementException("End of stream")
-}
-valuePrepared = false
-reader.get()
-  }
+  override def compute(split: Partition, context: TaskContext): 
Iterator[InternalRow] = {
+val factory = split.asInstanceOf[DataSourceRDDPartition].factory
+val iter: DataReaderIterator[UnsafeRow] = factory.dataFormat() match {
+  case DataFormat.ROW =>
+val reader = new RowToUnsafeDataReader(
+  factory.createRowDataReader(), 
RowEncoder.apply(schema).resolveAndBind())
+new DataReaderIterator(reader)
+
+  case DataFormat.UNSAFE_ROW =>
+new DataReaderIterator(factory.createUnsafeRowDataReader())
+
+  case DataFormat.COLUMNAR_BATCH =>
+new DataReaderIterator(factory.createColumnarBatchDataReader())
+  // TODO: remove this type erase hack.
+  .asInstanceOf[DataReaderIterator[UnsafeRow]]
--- End diff --

Isn't this change intended to avoid these casts?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21029: [SPARK-23952] remove type parameter in DataReader...

2018-04-18 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21029#discussion_r182596471
  
--- Diff: 
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchReader.scala
 ---
@@ -146,7 +146,7 @@ private[kafka010] class KafkaMicroBatchReader(
   new KafkaMicroBatchDataReaderFactory(
 range, executorKafkaParams, pollTimeoutMs, failOnDataLoss, 
reuseKafkaConsumer)
 }
-factories.map(_.asInstanceOf[DataReaderFactory[UnsafeRow]]).asJava
+factories.map(_.asInstanceOf[DataReaderFactory]).asJava
--- End diff --

Why is this cast necessary?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20988: [SPARK-23877][SQL]: Use filter predicates to prun...

2018-04-19 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/20988#discussion_r182858087
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala
 ---
@@ -129,35 +151,41 @@ case class OptimizeMetadataOnlyQuery(catalog: 
SessionCatalog) extends Rule[Logic
 
   /**
* A pattern that finds the partitioned table relation node inside the 
given plan, and returns a
-   * pair of the partition attributes and the table relation node.
+   * pair of the partition attributes, partition filters, and the table 
relation node.
*
* It keeps traversing down the given plan tree if there is a 
[[Project]] or [[Filter]] with
* deterministic expressions, and returns result after reaching the 
partitioned table relation
* node.
*/
-  object PartitionedRelation {
-
-def unapply(plan: LogicalPlan): Option[(AttributeSet, LogicalPlan)] = 
plan match {
-  case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _, _)
-if fsRelation.partitionSchema.nonEmpty =>
-val partAttrs = 
getPartitionAttrs(fsRelation.partitionSchema.map(_.name), l)
-Some((AttributeSet(partAttrs), l))
-
-  case relation: HiveTableRelation if 
relation.tableMeta.partitionColumnNames.nonEmpty =>
-val partAttrs = 
getPartitionAttrs(relation.tableMeta.partitionColumnNames, relation)
-Some((AttributeSet(partAttrs), relation))
-
-  case p @ Project(projectList, child) if 
projectList.forall(_.deterministic) =>
-unapply(child).flatMap { case (partAttrs, relation) =>
-  if (p.references.subsetOf(partAttrs)) Some((p.outputSet, 
relation)) else None
-}
+  object PartitionedRelation extends PredicateHelper {
+
+def unapply(plan: LogicalPlan): Option[(AttributeSet, Seq[Expression], 
LogicalPlan)] = {
+  plan match {
+case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _, _)
+  if fsRelation.partitionSchema.nonEmpty =>
+  val partAttrs = 
getPartitionAttrs(fsRelation.partitionSchema.map(_.name), l)
+  Some((AttributeSet(partAttrs), Nil, l))
+
+case relation: HiveTableRelation if 
relation.tableMeta.partitionColumnNames.nonEmpty =>
+  val partAttrs = 
getPartitionAttrs(relation.tableMeta.partitionColumnNames, relation)
+  Some((AttributeSet(partAttrs), Nil, relation))
+
+case p @ Project(projectList, child) if 
projectList.forall(_.deterministic) =>
+  unapply(child).flatMap { case (partAttrs, filters, relation) =>
+if (p.references.subsetOf(partAttrs)) Some((p.outputSet, 
filters, relation)) else None
+  }
 
-  case f @ Filter(condition, child) if condition.deterministic =>
-unapply(child).flatMap { case (partAttrs, relation) =>
-  if (f.references.subsetOf(partAttrs)) Some((partAttrs, 
relation)) else None
-}
+case f @ Filter(condition, child) if condition.deterministic =>
+  unapply(child).flatMap { case (partAttrs, filters, relation) =>
+if (f.references.subsetOf(partAttrs)) {
+  Some((partAttrs, splitConjunctivePredicates(condition) ++ 
filters, relation))
--- End diff --

Good catch. I've added a test case and updated the `PartitionedRelation` code to keep track of both the original partition attributes (which the filter needs) and the top-most node's output that is used by the rule.

As for using `PhysicalOperation` instead of `PartitionedRelation`, I don't see a compelling reason for such an invasive change. This currently adds a couple of results to `unapply` and keeps mostly the same logic. `PhysicalOperation` would lose the check that the references are a subset of the partition attributes and would be a much larger change. If you think this should be refactored, let's talk about that separately to understand the motivation for the change.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21111: [SPARK-23877][SQL][followup] use PhysicalOperatio...

2018-04-20 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/2#discussion_r183101013
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala
 ---
@@ -114,11 +119,8 @@ case class OptimizeMetadataOnlyQuery(catalog: 
SessionCatalog) extends Rule[Logic
 relation match {
   case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _, 
isStreaming) =>
 val partAttrs = 
getPartitionAttrs(fsRelation.partitionSchema.map(_.name), l)
-val partitionData = fsRelation.location.listFiles(relFilters, 
Nil)
-// partition data may be a stream, which can cause 
serialization to hit stack level too
-// deep exceptions because it is a recursive structure in 
memory. converting to array
-// avoids the problem.
--- End diff --

Yes, that does fix it, but in a non-obvious way. What isn't clear is
what guarantees that the rows used to construct the LocalRelation will never 
need to be serialized. Would it be reasonable for a future commit to remove the 
`@transient` modifier and re-introduce the problem?

I would rather this return the data in a non-recursive structure, but it's 
a minor point.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21111: [SPARK-23877][SQL][followup] use PhysicalOperation to si...

2018-04-20 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/2
  
Thanks for doing this as a follow-up. I had one minor comment, but 
otherwise I'm +1. I see what you mean about using `PhysicalOperation` now. It 
is slightly cleaner and I guess `PhysicalOperation` is the right way to 
accumulate the filters and projection from a sub-tree.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-20 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
Retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21118: SPARK-23325: Use InternalRow when reading with Da...

2018-04-20 Thread rdblue
GitHub user rdblue opened a pull request:

https://github.com/apache/spark/pull/21118

SPARK-23325: Use InternalRow when reading with DataSourceV2.

## What changes were proposed in this pull request?

This updates the DataSourceV2 API to use InternalRow instead of Row for the 
default case with no scan mix-ins. Because the API is changing significantly in 
the same places, this also renames ReaderFactory back to ReadTask.

Support for readers that produce Row is added through 
SupportsDeprecatedScanRow, which matches the previous API. Readers that used 
Row now implement this class and should be migrated to InternalRow.

Readers that previously implemented SupportsScanUnsafeRow have been 
migrated to use no SupportsScan mix-ins and produce InternalRow.

## How was this patch tested?

This uses existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rdblue/spark 
SPARK-23325-datasource-v2-internal-row

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21118.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21118


commit eddd049ed78970dda0796a37462e52f53bfeacc4
Author: Ryan Blue 
Date:   2018-04-20T20:15:58Z

SPARK-23325: Use InternalRow when reading with DataSourceV2.

This updates the DataSourceV2 API to use InternalRow instead of Row for
the default case with no scan mix-ins. Because the API is changing
significantly in the same places, this also renames ReaderFactory back
to ReadTask.

Support for readers that produce Row is added through
SupportsDeprecatedScanRow, which matches the previous API. Readers that
used Row now implement this class and should be migrated to InternalRow.

Readers that previously implemented SupportsScanUnsafeRow have been
migrated to use no SupportsScan mix-ins and produce InternalRow.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21118: SPARK-23325: Use InternalRow when reading with DataSourc...

2018-04-20 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21118
  
@jose-torres, @cloud-fan, can you take a look at this? It updates the v2 
API to use InternalRow by default.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21118: SPARK-23325: Use InternalRow when reading with DataSourc...

2018-04-20 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21118
  
Yeah, we should probably add a projection. It's probably only working 
because the InternalRows that are produced are all UnsafeRow.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
@cloud-fan, is there a Jenkins job to run it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
Okay, I don't have the time to set up and run benchmarks without a stronger 
case for this causing a regression (given the Parquet testing), but other 
people should feel free to pick this up.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
> Based on the previous upgrade (e.g., #13280 (comment)), we hit the 
performance regressions when we upgrade Parquet and we did the revert at the 
end.

I should point out that the regression wasn't reproducible, so we aren't 
sure what the cause was. We also didn't have performance numbers on the Parquet 
side or a case of anyone running it in production (we have been for a couple of months). But I can understand wanting to be thorough.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21145: SPARK-24073: Rename DataReaderFactory to ReadTask...

2018-04-24 Thread rdblue
GitHub user rdblue opened a pull request:

https://github.com/apache/spark/pull/21145

SPARK-24073: Rename DataReaderFactory to ReadTask.


## What changes were proposed in this pull request?

This reverses the changes in SPARK-23219, which renamed ReadTask to
DataReaderFactory. The intent of that change was to make the read and
write API match (write side uses DataWriterFactory), but the underlying
problem is that the two classes are not equivalent.

ReadTask/DataReader function as Iterable/Iterator. One ReadTask is a
specific read task for a partition of the data to be read, in contrast
to DataWriterFactory where the same factory instance is used in all
write tasks. ReadTask's purpose is to manage the lifecycle of DataReader
with an explicit create operation to mirror the close operation. This is
no longer clear from the API, where DataReaderFactory appears to be more
generic than it is and it isn't clear why a set of them is produced for
a read.

## How was this patch tested?

Existing tests, which have been updated to use the new name.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rdblue/spark 
SPARK-24073-revert-data-reader-factory-rename

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21145.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21145


commit c364c05d3141bbe0ed29a2b02cecfa541d9c8212
Author: Ryan Blue 
Date:   2018-04-24T19:55:25Z

SPARK-24073: Rename DataReaderFactory to ReadTask.

This reverses the changes in SPARK-23219, which renamed ReadTask to
DataReaderFactory. The intent of that change was to make the read and
write API match (write side uses DataWriterFactory), but the underlying
problem is that the two classes are not equivalent.

ReadTask/DataReader function as Iterable/Iterator. One ReadTask is a
specific read task for a partition of the data to be read, in contrast
to DataWriterFactory where the same factory instance is used in all
write tasks. ReadTask's purpose is to manage the lifecycle of DataReader
with an explicit create operation to mirror the close operation. This is
no longer clear from the API, where DataReaderFactory appears to be more
generic than it is and it isn't clear why a set of them is produced for
a read.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-04-25 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21070#discussion_r184139736
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala
 ---
@@ -503,7 +503,7 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext {
 case plan: InMemoryRelation => plan
   }.head
      // InMemoryRelation's stats is file size before the underlying RDD is materialized
-  assert(inMemoryRelation.computeStats().sizeInBytes === 740)
+  assert(inMemoryRelation.computeStats().sizeInBytes === 800)
--- End diff --

Parquet fixed a problem with value ordering in statistics, which required 
adding new metadata min and max fields. For older readers, Parquet also writes 
the old values when it makes sense to. This is a slight increase in overhead, 
which is more noticeable when files contain just a few records.

Don't be alarmed by the percentage difference here; it is just a small file. Parquet isn't increasing file sizes by 8%; that would be silly.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
There are two main reasons to update. First, the problem behind SPARK-17213 
is fixed, hence the new min and max fields. Second, this updates the internal 
byte array management, which is needed for page skipping in the next few 
versions.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-04-25 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21070#discussion_r184156731
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala ---
@@ -503,7 +503,7 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext {
 case plan: InMemoryRelation => plan
   }.head
   // InMemoryRelation's stats is file size before the underlying RDD is materialized
-  assert(inMemoryRelation.computeStats().sizeInBytes === 740)
+  assert(inMemoryRelation.computeStats().sizeInBytes === 800)
--- End diff --

This is data dependent, so it is hard to estimate. We write the stats for 
older readers only when the type uses a signed sort order, so it is limited to 
mostly primitive types and won't be written for byte arrays or UTF-8 data. That 
limits the size to 16 bytes plus thrift overhead per page, and you might have about 
100 pages per row group. That works out to roughly 1.5 KB per 128 MB row group, or about 0.001%.
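
A back-of-the-envelope version of that estimate (all figures are the rough assumptions above, not measured values):

```scala
// Rough estimate of the duplicated-stats overhead per 128 MB row group.
object StatsEstimate extends App {
  val bytesPerPage     = 16L          // duplicated min/max values, plus a little thrift overhead
  val pagesPerRowGroup = 100L         // rough guess for a typical row group
  val rowGroupBytes    = 128L << 20   // 128 MB row group
  val extraBytes       = bytesPerPage * pagesPerRowGroup   // ~1.6 KB per row group
  println(f"$extraBytes bytes per row group, about ${extraBytes * 100.0 / rowGroupBytes}%.4f%% of the data")
}
```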


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
The fix for PARQUET-686 was to suppress min/max stats. It is safe to push 
filters, but those filters can't be used without the stats. 1.10.0 has the 
correct stats and can use those filters.
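
For example (an illustrative snippet with a made-up path and column, assuming a spark-shell session): with correct min/max statistics in the footer, a pushed filter like this can skip whole row groups; with the stats suppressed, the filter is still pushed but every row group has to be read and filtered row by row.

```scala
import spark.implicits._   // assumes a SparkSession named `spark`, as in spark-shell

// Hypothetical table and column, just to show a predicate that benefits from
// row-group min/max statistics once Parquet 1.10.0 writes them correctly.
val events = spark.read.parquet("/path/to/events")
val recent = events.filter($"event_time" >= "2018-04-01")
recent.count()
```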


---




[GitHub] spark pull request #21118: SPARK-23325: Use InternalRow when reading with Da...

2018-04-25 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21118#discussion_r184216025
  
--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala ---
@@ -86,7 +87,7 @@ class KafkaContinuousReader(
 KafkaSourceOffset(JsonUtils.partitionOffsets(json))
   }
 
-  override def createUnsafeRowReaderFactories(): ju.List[DataReaderFactory[UnsafeRow]] = {
+  override def createReadTasks(): ju.List[ReadTask[InternalRow]] = {
--- End diff --

I've moved this to #21145. I'll rebase this PR on that one, so let's try to 
get that in first.


---




[GitHub] spark pull request #21145: [SPARK-24073][SQL]: Rename DataReaderFactory to R...

2018-04-25 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21145#discussion_r184235329
  
--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchReader.scala ---
@@ -299,13 +299,13 @@ private[kafka010] class KafkaMicroBatchReader(
   }
 }
 
-/** A [[DataReaderFactory]] for reading Kafka data in a micro-batch streaming query. */
+/** A [[ReadTask]] for reading Kafka data in a micro-batch streaming query. */
 private[kafka010] case class KafkaMicroBatchDataReaderFactory(
--- End diff --

This fixes the API, not implementations, and it already touches 30+ files.

I'd rather not fix the downstream classes for two reasons. First, to avoid 
this becoming really large. Second, we need to be able to evolve these APIs 
without requiring changes to all implementations. This API is still evolving, and if 
we need to update 20+ implementations to make simple changes, then I think 
that's a problem.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
Thank you @maropu!

What resources does the run require? Is it something we could create a 
Jenkins job to run?


---




[GitHub] spark issue #21143: [SPARK-24072][SQL] clearly define pushed filters

2018-04-25 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21143
  
Thanks for working on this, @cloud-fan! I was thinking about needing it 
just recently so that data sources can delegate to Spark when needed.

I'll have a thorough look at it tomorrow, but one quick high-level 
question: should we make these residuals based on the input split instead?

Input splits might have different residual filters that need to be applied. 
For example, if you have a time range query, `ts > X`, and are storing data by 
day, then you know that when `day(ts) > day(X)`, `ts > X` *must* be true, 
but when `day(ts) = day(X)`, `ts > X` only *might* be true. So the original filter 
needs to run only for the splits that scan the boundary day, not for any other splits.
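
A rough sketch of that per-split decision (hypothetical types and helper names, not a proposal for this PR's API), assuming day-partitioned splits and a `ts > X` filter:

```scala
import java.time.{Instant, LocalDate, ZoneOffset}

// What the residual for a `ts > X` filter could look like for one day's split.
sealed trait Residual
case object NoResidual extends Residual                          // split is fully covered by the filter
case class ReapplyFilter(lowerBound: Instant) extends Residual   // re-run `ts > X` on this split

def residualFor(splitDay: LocalDate, x: Instant): Option[Residual] = {
  val boundaryDay = x.atZone(ZoneOffset.UTC).toLocalDate
  if (splitDay.isAfter(boundaryDay)) Some(NoResidual)             // day(ts) > day(X) implies ts > X
  else if (splitDay.isEqual(boundaryDay)) Some(ReapplyFilter(x))  // boundary day: must re-check ts > X
  else None                                                       // day(ts) < day(X): split can be skipped entirely
}
```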

Another use case for a per-split residual is when splits are in different 
file formats. Parquet allows pushing down filters, but Avro doesn't. In a table 
that mixes formats, it would be great for Avro splits to return the entire expression 
as a residual, while Parquet splits do the filtering.

What do you think?


---



