[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16537 I don't think it is a good idea to think that this has little use because it is a dumb mistake to pass something that isn't callable. In this case, it's easy to accidentally reuse a name for a function and a variable (e.g., `format`), especially as scripts change over time and pass from one maintainer to another. Spark should have reasonable behavior for any error, as opposed to being harder to work with because we thought the user wasn't likely to hit a particular problem. This is very few lines of code that will make a user's experience much better because it can catch exactly what the problem is, without running a job. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16537 Sorry, my example was for validating that the object passed to `udf` was callable, not for the use of the UDF. I still think it's a good idea not to make assumptions about how a user makes a mistake. Error checking should be proportional to how difficult it is to find what happened, not how likely it is to happen.
[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16537 Yeah, I thought this was the other PR that validates the function is callable. Still, I don't agree that it's okay for Python to be less friendly as long as we don't think people will hit the problem too much, or because they solve the problem before asking the list. There are reasonable ways to hit this and Spark should give a good explanation about what went wrong. I'm not saying we have to go fix all of the cases like this, but where there's a PR ready to go I think it's better to include it than not.
[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16537 Maybe we're at an agree-to-disagree situation, but I think we may be talking about different things. If you're saying that we should try to keep these together to make reviews easier, I'd agree. I was under the impression that this change may be rejected because it isn't important enough of a problem, which I think isn't a good way of looking at it.
[GitHub] spark pull request #16519: SPARK-19138: Don't return SparkSession for stoppe...
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/16519

SPARK-19138: Don't return SparkSession for stopped SparkContext.

## What changes were proposed in this pull request?

* Update SparkSession to always return a session using the current SparkContext
* Add SparkContext#isStopped

## How was this patch tested?

Tested that sqlContext.sql works when created after closing a Spark context:

```python
sc.stop()
sc = SparkContext(conf=SparkConf())
sqlContext = HiveContext(sc)
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rdblue/spark SPARK-19138-pyspark-stale-spark-context

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16519.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16519

commit 6d56df1e0d8c95e04eea41cd9776c72b34f9998b
Author: Ryan Blue
Date: 2017-01-09T21:28:32Z

    SPARK-19138: Don't return SparkSession for stopped SparkContext.
[GitHub] spark issue #16519: SPARK-19138: Don't return SparkSession for stopped Spark...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16519 Yeah, it looks like this is basically the same problem. I'll add some review comments to the other issue.
[GitHub] spark issue #16454: [SPARK-19055][SQL][PySpark] Fix SparkSession initializat...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16454 I just posted a fix to this also. I'll close that one in favor of this and add comments here for what it did differently that we should consider.
[GitHub] spark pull request #16454: [SPARK-19055][SQL][PySpark] Fix SparkSession init...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/16454#discussion_r95258453

```diff
--- python/pyspark/sql/session.py
@@ -161,8 +161,8 @@ def getOrCreate(self):
             with self._lock:
                 from pyspark.context import SparkContext
                 from pyspark.conf import SparkConf
-                session = SparkSession._instantiatedContext
-                if session is None:
+                session = SparkSession._instantiatedSession
+                if session is None or session._sc._jsc is None:
```

#16519 adds `isStopped`, which I think is a better way of checking whether the Spark context is valid than just checking whether `_jsc` is defined, because `_jsc` could be stopped.
[GitHub] spark pull request #16454: [SPARK-19055][SQL][PySpark] Fix SparkSession init...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/16454#discussion_r95259073

```diff
--- python/pyspark/sql/session.py
@@ -214,8 +214,12 @@ def __init__(self, sparkContext, jsparkSession=None):
         self._wrapped = SQLContext(self._sc, self, self._jwrapped)
         _monkey_patch_RDD(self)
         install_exception_handler()
-        if SparkSession._instantiatedContext is None:
-            SparkSession._instantiatedContext = self
+        # If we had an instantiated SparkSession attached with a SparkContext
+        # which is stopped now, we need to renew the instantiated SparkSession.
+        # Otherwise, we will use invalid SparkSession when we call Builder.getOrCreate.
+        if SparkSession._instantiatedSession is None \
+                or SparkSession._instantiatedSession._sc._jsc is None:
```

What about setting `SparkSession._instantiatedSession` to `None` in `getOrCreate` instead of duplicating the logic?
[GitHub] spark pull request #16519: SPARK-19138: Don't return SparkSession for stoppe...
Github user rdblue closed the pull request at: https://github.com/apache/spark/pull/16519
[GitHub] spark issue #16519: SPARK-19138: Don't return SparkSession for stopped Spark...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16519 Closing in favor of #16454.
[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/16537#discussion_r95484684

```diff
--- python/pyspark/sql/tests.py
@@ -429,6 +429,11 @@ def test_udf_with_input_file_name(self):
         row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
         self.assertTrue(row[0].find("people1.json") != -1)
 
+    def test_udf_should_validate_input_args(self):
+        from pyspark.sql.functions import udf
+
+        self.assertRaises(TypeError, udf(lambda x: x), None)
```

I think this should have positive tests for a column and a string as well as a negative test.
[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/16537#discussion_r95484969

```diff
--- python/pyspark/sql/functions.py
@@ -1848,6 +1848,12 @@ def __del__(self):
             self._broadcast = None
 
     def __call__(self, *cols):
+        for c in cols:
+            if not isinstance(c, (Column, str)):
+                raise TypeError(
+                    "All arguments should be Columns or strings representing column names. "
```

"All arguments" is a little vague, since this is going to be called later in code than the UDF definition. What about an error message like "Invalid UDF argument, not a String or Column: {0} of type {1}"
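The argument check discussed above can be sketched in isolation. This is a hypothetical stand-alone version, not Spark's actual code: `Column` here is a stand-in class so the snippet runs without PySpark, and `check_udf_args` is an illustrative name.

```python
class Column(object):
    """Stand-in for pyspark.sql.column.Column, so this sketch is self-contained."""


def check_udf_args(*cols):
    # Validate each argument eagerly, so a bad argument fails at call time
    # with a clear message instead of failing later inside a Spark job.
    for c in cols:
        if not isinstance(c, (Column, str)):
            raise TypeError(
                "Invalid UDF argument, not a string or Column: "
                "{0} of type {1}".format(c, type(c)))


check_udf_args("name", Column())   # strings and Columns pass silently
try:
    check_udf_args(1.5)            # anything else raises a TypeError
except TypeError as e:
    print(e)
```

The point of the message format is that it names the offending value and its type, which is the information a user needs when the call site is far from the UDF definition.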
[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/16537#discussion_r95486774

```diff
--- python/pyspark/sql/functions.py
@@ -1848,6 +1848,12 @@ def __del__(self):
             self._broadcast = None
 
     def __call__(self, *cols):
+        for c in cols:
+            if not isinstance(c, (Column, str)):
+                raise TypeError(
+                    "All arguments should be Columns or strings representing column names. "
```

A literal ends up being a `Column`; aren't the others as well?
[GitHub] spark pull request #16535: [SPARK-19162][PYTHON][SQL][WIP] UserDefinedFuncti...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/16535#discussion_r95486972

```diff
--- python/pyspark/sql/tests.py
@@ -429,6 +429,11 @@ def test_udf_with_input_file_name(self):
         row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
         self.assertTrue(row[0].find("people1.json") != -1)
 
+    def test_udf_shouldnt_accept_noncallable_object(self):
+        from pyspark.sql.functions import udf
+
+        self.assertRaises(TypeError, udf)
```

I think this should have positive tests for a function and a class that defines `__call__`.
[GitHub] spark pull request #16535: [SPARK-19162][PYTHON][SQL][WIP] UserDefinedFuncti...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/16535#discussion_r95487983

```diff
--- python/pyspark/sql/functions.py
@@ -1824,6 +1824,12 @@ class UserDefinedFunction(object):
 
     .. versionadded:: 1.3
     """
     def __init__(self, func, returnType, name=None):
+        if not callable(func):
+            raise TypeError(
+                "func should be a callable object "
```

I think the error message could be clearer: "Not a function or callable (`__call__` is not defined): {}" I don't think it's necessary to use the name of the argument, but it isn't a problem if you think it is better.
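The callable check above is small enough to sketch on its own. This is a hypothetical, simplified version: `make_udf` and its dict return value stand in for `UserDefinedFunction.__init__`, and the message follows the wording suggested in this thread.

```python
def make_udf(func, return_type="string"):
    # Reject non-callables up front, before any Spark job runs; `callable`
    # accepts plain functions, lambdas, and instances defining __call__.
    if not callable(func):
        raise TypeError(
            "Not a function or callable (__call__ is not defined): "
            "{0}".format(type(func)))
    return {"func": func, "returnType": return_type}


class Truncate(object):
    """A class defining __call__ is a valid UDF body."""
    def __call__(self, s):
        return s[:10]


make_udf(lambda x: x)   # plain functions are accepted
make_udf(Truncate())    # so are callable objects
try:
    make_udf(42)        # a non-callable fails immediately
except TypeError as e:
    print(e)
```

This mirrors the positive tests suggested in the review: a function and a class that defines `__call__`, plus the negative case.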
[GitHub] spark pull request #16535: [SPARK-19162][PYTHON][SQL][WIP] UserDefinedFuncti...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/16535#discussion_r95488127

```diff
--- python/pyspark/sql/tests.py
@@ -429,6 +429,11 @@ def test_udf_with_input_file_name(self):
         row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
         self.assertTrue(row[0].find("people1.json") != -1)
 
+    def test_udf_shouldnt_accept_noncallable_object(self):
+        from pyspark.sql.functions import udf
+
+        self.assertRaises(TypeError, udf)
```

Why doesn't this test call udf?
[GitHub] spark pull request #16533: [SPARK-19160][PYTHON][SQL][WIP] Add udf decorator
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/16533#discussion_r95490867

```diff
--- python/pyspark/sql/decorators.py (new file)
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.sql.types import StringType
+from pyspark.sql.functions import udf as _udf
+
+
+def udf(returnType=StringType()):
```

I don't think this should conflict with pyspark.sql.functions.udf. If this were defined, then it would break existing code that uses functions.udf. It may be possible to have a decorator definition that maintains that API and acts as a decorator, but otherwise this should be renamed to something like `register_udf`.
[GitHub] spark pull request #16533: [SPARK-19160][PYTHON][SQL][WIP] Add udf decorator
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/16533#discussion_r95491058

```diff
--- python/pyspark/sql/tests.py
@@ -429,6 +429,17 @@ def test_udf_with_input_file_name(self):
         row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
         self.assertTrue(row[0].find("people1.json") != -1)
 
+    def test_udf_with_decorator(self):
+        from pyspark.sql.decorators import udf
+        from pyspark.sql.types import IntegerType
+
+        @udf(IntegerType())
```

Could you add a test case for the default return type? I think it currently requires empty parens:

```python
@udf()
def truncate(s):
    return s[:10]
```
[GitHub] spark pull request #16533: [SPARK-19160][PYTHON][SQL][WIP] Add udf decorator
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/16533#discussion_r95491193

```diff
--- python/pyspark/sql/decorators.py (new file, license header omitted)
+from pyspark.sql.types import StringType
+from pyspark.sql.functions import udf as _udf
+
+
+def udf(returnType=StringType()):
```

Can you also add docstrings for this?
[GitHub] spark issue #16533: [SPARK-19160][PYTHON][SQL][WIP] Add udf decorator
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16533 I think if the `udf` method takes as its first arg `func=None`, then you can match the old API and take care of the empty-paren problem. If a callable isn't passed as the first arg, then return a decorator function. If a callable is passed in, then register it. You can do something similar with `*args` if you don't want to require passing `returnType` as a named arg.
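The dual-use decorator described above can be sketched as follows. This is a hypothetical stand-in, not Spark's implementation: the wrapper just tags the function with a `returnType` attribute instead of building a real `UserDefinedFunction`, and the string return types are placeholders.

```python
import functools


def udf(f=None, returnType="string"):
    def _wrap(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        # Record the declared return type on the wrapper for illustration.
        wrapper.returnType = returnType
        return wrapper

    if f is None:
        # Called as @udf() or @udf(returnType=...): return the decorator.
        return _wrap
    if callable(f):
        # Called as bare @udf, or udf(my_func) matching the old API:
        # wrap immediately.
        return _wrap(f)
    raise TypeError("first argument must be a callable or None")


@udf
def upper(s):
    return s.upper()


@udf(returnType="int")
def length(s):
    return len(s)
```

With `f=None` as the first argument, `@udf`, `@udf()`, `@udf(returnType=...)`, and the existing `udf(my_func)` call style all work through one entry point, which is what lets the decorator share a name with the old API instead of conflicting with it.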
[GitHub] spark pull request #16535: [SPARK-19162][PYTHON][SQL][WIP] UserDefinedFuncti...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/16535#discussion_r95627424

```diff
--- python/pyspark/sql/tests.py
@@ -429,6 +429,11 @@ def test_udf_with_input_file_name(self):
         row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
         self.assertTrue(row[0].find("people1.json") != -1)
 
+    def test_udf_shouldnt_accept_noncallable_object(self):
+        from pyspark.sql.functions import udf
+
+        self.assertRaises(TypeError, udf)
```

Sorry if my wording was confusing. My point was that it called `udf` as a test method without arguments, which isn't the call you wanted to test. The update that passes `non_callable` is what I was looking for.
[GitHub] spark issue #16535: [SPARK-19162][PYTHON][SQL][WIP] UserDefinedFunction shou...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16535 +1. Thanks for working on this @zero323!
[GitHub] spark pull request #16533: [SPARK-19160][PYTHON][SQL][WIP] Add udf decorator
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/16533#discussion_r95654895

```diff
--- python/pyspark/sql/decorators.py (new file, license header omitted)
+from pyspark.sql.types import StringType
+from pyspark.sql.functions import udf as _udf
+
+
+def udf(returnType=StringType()):
```

I think that the two should be merged. If they have the same name, then what works in one place breaks in others and the two are mutually exclusive. Our users share code quite a bit and it is common to copy & paste UDFs. That should be fine, or should result in "NameError: name 'udf' is not defined", rather than having different behavior. With the changes to the decorator to support using it without parens, it's already close to being compatible with the current `functions.udf`.
[GitHub] spark pull request #16454: [SPARK-19055][SQL][PySpark] Fix SparkSession init...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/16454#discussion_r95846835

```diff
--- python/pyspark/sql/session.py
@@ -214,8 +214,12 @@ def __init__(self, sparkContext, jsparkSession=None):
         self._wrapped = SQLContext(self._sc, self, self._jwrapped)
         _monkey_patch_RDD(self)
         install_exception_handler()
-        if SparkSession._instantiatedContext is None:
-            SparkSession._instantiatedContext = self
+        # If we had an instantiated SparkSession attached with a SparkContext
+        # which is stopped now, we need to renew the instantiated SparkSession.
+        # Otherwise, we will use invalid SparkSession when we call Builder.getOrCreate.
+        if SparkSession._instantiatedSession is None \
+                or SparkSession._instantiatedSession._sc._jsc is None:
```

If this can't tell whether the Java context is stopped, then it should be fine to use this logic. +1.
[GitHub] spark pull request #16454: [SPARK-19055][SQL][PySpark] Fix SparkSession init...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/16454#discussion_r95846995

```diff
--- python/pyspark/sql/session.py
@@ -161,8 +161,8 @@ def getOrCreate(self):
             with self._lock:
                 from pyspark.context import SparkContext
                 from pyspark.conf import SparkConf
-                session = SparkSession._instantiatedContext
-                if session is None:
+                session = SparkSession._instantiatedSession
+                if session is None or session._sc._jsc is None:
```

Yeah, I had to change to just checking whether `_jsc` was defined in our branch for the same reason. I'd prefer to add `isStopped` to Java and Python eventually, but that doesn't need to block this commit.
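The stale-session guard discussed in this thread can be illustrated without Spark. In this hypothetical sketch, `FakeContext` stands in for `SparkContext` and `Session` for `SparkSession`; clearing `_jsc` on `stop()` mirrors how PySpark drops its handle to the JVM context.

```python
class FakeContext(object):
    """Stand-in for SparkContext; _jsc mimics the JVM-side handle."""
    def __init__(self):
        self._jsc = object()

    def stop(self):
        self._jsc = None


class Session(object):
    _instantiated = None  # stand-in for SparkSession._instantiatedSession

    def __init__(self, sc):
        self._sc = sc

    @classmethod
    def get_or_create(cls):
        session = cls._instantiated
        # Create a new session if none exists or if the cached session's
        # context was stopped (its _jsc handle was cleared).
        if session is None or session._sc._jsc is None:
            session = cls(FakeContext())
            cls._instantiated = session
        return session


s1 = Session.get_or_create()
assert Session.get_or_create() is s1   # live context: reuse the session
s1._sc.stop()
s2 = Session.get_or_create()
assert s2 is not s1                    # stopped context: renew the session
```

This is the behavior both #16454 and the closed #16519 were after: `getOrCreate` must never hand back a session bound to a stopped context.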
[GitHub] spark issue #15088: SPARK-17532: Add lock debugging info to thread dumps.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/15088 @srowen, could you take another look at this? It has been useful for us to catch locking issues and I'd like to get it in for 2.1.0. Thanks!
[GitHub] spark pull request #17412: [SPARK-20084][Core] Remove internal.metrics.updat...
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/17412

[SPARK-20084][Core] Remove internal.metrics.updatedBlockStatuses from history files.

## What changes were proposed in this pull request?

Remove accumulator updates for internal.metrics.updatedBlockStatuses from SparkListenerTaskEnd entries in the history file. These can cause history files to grow to hundreds of GB because the value of the accumulator contains all tracked blocks.

## How was this patch tested?

Current History UI tests cover use of the history file.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rdblue/spark SPARK-20084-remove-block-accumulator-info

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17412.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17412

commit 386846c224842ba299a198ca296cfc2b72b188cc
Author: Ryan Blue
Date: 2017-03-24T18:21:19Z

    SPARK-20084: Remove internal.metrics.updatedBlockStatuses from history files.
[GitHub] spark issue #17412: [SPARK-20084][Core] Remove internal.metrics.updatedBlock...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/17412 @vanzin, I looked through the listeners in the Spark UI for uses of it and didn't find any, and I haven't seen anything in the UI that is obviously based on it. As I said on the issue, the updates duplicate data in metrics, so it would be odd for a component to go to accumulators to get this info. I'm deploying this internally today; maybe it makes sense to give it a couple of days to see if we get complaints about something not working.
[GitHub] spark issue #17412: [SPARK-20084][Core] Remove internal.metrics.updatedBlock...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/17412 @vanzin, we've been running this in production for a few days now and haven't had any problems. I think it should be safe to merge. Could you take another look? Thanks!
[GitHub] spark pull request #17540: [SPARK-20213][SQL][UI] Fix DataFrameWriter operat...
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/17540

[SPARK-20213][SQL][UI] Fix DataFrameWriter operations in SQL UI tab.

## What changes were proposed in this pull request?

Wraps `DataFrameWriter` operations in `SQLExecution.withNewExecutionId` so that `SparkListenerSQLExecutionStart` and `SparkListenerSQLExecutionEnd` are sent and the query shows up in the SQL tab of the UI.

## How was this patch tested?

Tested by hand that `insertInto` results in queries in the SQL tab.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rdblue/spark SPARK-20213-fix-sql-tab

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17540.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17540

commit f9342b5fea4fd3dbe32c12930f96aa9af38c386b
Author: Ryan Blue
Date: 2017-04-05T17:28:19Z

SPARK-20213: Fix DataFrameWriter operations in SQL UI tab.

Not wrapping the execution started by DataFrameWriter in SQLExecution.withNewExecutionId causes those executions to not show up in the SQL UI tab because the SparkListenerSQLExecutionStart and SparkListenerSQLExecutionEnd messages are never sent.
[GitHub] spark issue #17540: [SPARK-20213][SQL][UI] Fix DataFrameWriter operations in...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/17540 @srowen, agreed. Closely related, but not the same code paths. The question is: when should `withNewExecutionId` get called? I'm running the test suite now, and this patch causes test failures when `withNewExecutionId` is called twice: once in `DataFrameWriter` and once in `InsertIntoHadoopFsRelationCommand`. It looks like the call has been littered about the codebase (e.g. in `InsertIntoHadoopFsRelationCommand` and other execution nodes) to fix this problem for certain operations, so we should decide where it should be used and fix tests around that. The reason I added it to `DataFrameWriter` is that it is called in `Dataset` actions, and it makes sense to call it once from where an action is started. I think it makes the most sense for action methods, like `Dataset#collect` or `DataFrameWriter#insertInto`, to minimize the number of places we need to add it. I don't think this is a concern that should be addressed by the execution plan.
[GitHub] spark issue #17540: [SPARK-20213][SQL][UI] Fix DataFrameWriter operations in...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/17540 @cloud-fan, can you look at this? What do you think about the question above: when should `withNewExecutionId` get called?
[GitHub] spark issue #17540: [SPARK-20213][SQL][UI] Fix DataFrameWriter operations in...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/17540 The new test failures are caused by a check I inserted. Moving where `withNewExecutionId` gets called could result in missing SQL queries in the UI, so anywhere I'm replacing `withNewExecutionId` with `checkSQLExecutionId`, a missing execution ID will cause tests (and only tests) to fail. That way, we can catch all of the call stacks that should have it. This caught problems in SQL command execution, and I've added a patch to fix it.
[GitHub] spark issue #17106: [SPARK-19775][SQL] Remove an obsolete `partitionBy().ins...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/17106 +1 Looks fine to me. `partitionBy` support with `insertInto` was removed. This is probably still passing because it expects an analysis exception that is thrown for a different reason. We should update assertions like this to verify the error message to avoid that in the future.
[GitHub] spark pull request #15816: SPARK-18368: Fix regexp_replace with task seriali...
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/15816

SPARK-18368: Fix regexp_replace with task serialization.

## What changes were proposed in this pull request?

This makes the result value both transient and lazy, so that if the RegExpReplace object is initialized then serialized, `result: StringBuffer` will be correctly initialized.

## How was this patch tested?

Verified that this patch fixed the query that found the bug.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rdblue/spark SPARK-18368-fix-regexp-replace

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15816.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15816

commit 8740a2c3b918796e1ac87c3e178f9f9f4651cb75
Author: Ryan Blue
Date: 2016-10-26T16:42:43Z

SPARK-18368: Fix regexp_replace with task serialization.
[GitHub] spark issue #15816: SPARK-18368: Fix regexp_replace with task serialization.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/15816 @hvanhovell, I updated the test so that all expressions passed through `checkEvaluation` are serialized and deserialized before the checks are run. I'll open an issue if it catches anything else. I'd rather not hold up this PR if it finds lots of other bugs.
[GitHub] spark issue #15816: SPARK-18368: Fix regexp_replace with task serialization.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/15816 Thanks @rxin and @hvanhovell! I appreciate the quick reviews.
[GitHub] spark pull request #15834: [SPARK-18368] [SQL] Fix regexp replace when seria...
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/15834

[SPARK-18368] [SQL] Fix regexp replace when serialized

## What changes were proposed in this pull request?

This makes the result value both transient and lazy, so that if the RegExpReplace object is initialized then serialized, `result: StringBuffer` will be correctly initialized.

## How was this patch tested?

* Verified that this patch fixed the query that found the bug.
* Added a test case that fails without the fix.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rdblue/spark SPARK-18368-fix-regexp-replace

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15834.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15834

commit 8740a2c3b918796e1ac87c3e178f9f9f4651cb75
Author: Ryan Blue
Date: 2016-10-26T16:42:43Z

SPARK-18368: Fix regexp_replace with task serialization.

commit 3536f6a44f15e2997a1061a6e7ec061f98c3ef9f
Author: Ryan Blue
Date: 2016-11-08T23:40:25Z

SPARK-18368: Add test for regexp_replace serialization.
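The Scala fix marks the cached buffer as `@transient lazy val`, so it is excluded from serialization and rebuilt on first use after deserialization. The same pattern can be sketched in Python terms (a hypothetical illustration, not Spark code): the class name and fields below are invented for the example.

```python
import pickle


class RegexReplacer:
    """Illustrative stand-in for an expression that caches a mutable buffer."""

    def __init__(self, pattern):
        self.pattern = pattern
        self._buffer = None  # "transient": never serialized, rebuilt lazily

    @property
    def buffer(self):
        # lazy initialization: works both on first use and after deserialization
        if self._buffer is None:
            self._buffer = []
        return self._buffer

    def __getstate__(self):
        # drop the transient field from the pickled state
        state = self.__dict__.copy()
        state["_buffer"] = None
        return state


r = RegexReplacer("foo")
r.buffer.append("x")
clone = pickle.loads(pickle.dumps(r))
# clone.pattern survives the round trip; clone.buffer is freshly re-initialized
```

Without the lazy re-initialization, the deserialized copy would carry a null buffer and fail on first use, which is exactly the bug described above.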
[GitHub] spark issue #15834: [SPARK-18368] [SQL] Fix regexp replace when serialized
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/15834 @yhuai, this replaces #15816. The ref, 3536f6a, has already passed tests on that PR, so it should be safe to merge this.
[GitHub] spark issue #15834: [SPARK-18368] [SQL] Fix regexp replace when serialized
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/15834 [SPARK-18387](https://issues.apache.org/jira/browse/SPARK-18387) tracks the other bugs.
[GitHub] spark pull request #15847: SPARK-18368: Add serialization to checkEvaluation...
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/15847

SPARK-18368: Add serialization to checkEvaluation.

## What changes were proposed in this pull request?

This removes the serialization test from RegexpExpressionsSuite and replaces it by serializing all expressions in checkEvaluation. This also fixes math constant expressions by making LeafMathExpression Serializable, and fixes NumberFormat values that are null or invalid after serialization.

## How was this patch tested?

This patch only modifies tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rdblue/spark SPARK-18387-fix-serializable-expressions

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15847.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15847

commit 8e829ae87b197de2ff4b8777202a47d5f1204c56
Author: Ryan Blue
Date: 2016-11-09T00:16:11Z

SPARK-18368: Add serialization to checkEvaluation.

This removes the serialization test from RegexpExpressionsSuite and replaces it by serializing all expressions in checkEvaluation. This also fixes math constant expressions by making LeafMathExpression Serializable and fixes NumberFormat values that are null or invalid after serialization.
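The core idea of the change — round-trip every expression through serialization before checking its result, so serialization bugs surface in every existing test — can be sketched in Python (hypothetical names; the real code is Scala in Catalyst's ExpressionEvalHelper):

```python
import pickle


class Literal:
    """Minimal stand-in for a Catalyst expression."""

    def __init__(self, value):
        self.value = value

    def eval(self):
        return self.value


def check_evaluation(expr, expected):
    # Serialize and deserialize the expression first, so any state that is
    # lost in serialization fails the same check that plain evaluation would.
    roundtripped = pickle.loads(pickle.dumps(expr))
    assert roundtripped.eval() == expected, (
        "expression did not survive serialization")


check_evaluation(Literal(42), 42)
```

The advantage of folding the round trip into the shared helper is coverage: every expression already exercised by `checkEvaluation` gets a serialization test for free, which is how this patch caught the LeafMathExpression and FormatNumber bugs.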
[GitHub] spark pull request #15847: [SPARK-18387] [SQL] Add serialization to checkEva...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/15847#discussion_r87619721

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala

```diff
@@ -36,7 +36,7 @@ import org.apache.spark.unsafe.types.UTF8String
  * @param name The short name of the function
  */
 abstract class LeafMathExpression(c: Double, name: String)
-  extends LeafExpression with CodegenFallback {
+  extends LeafExpression with CodegenFallback with Serializable {
```

The case classes are serializable, but the superclass must be as well. If this isn't present, then no default constructor is created for this abstract class and its children fail when they are deserialized.
[GitHub] spark pull request #15847: [SPARK-18387] [SQL] Add serialization to checkEva...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/15847#discussion_r87620051

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala

```diff
@@ -1431,43 +1431,49 @@ case class FormatNumber(x: Expression, d: Expression)
   // Associated with the pattern, for the last d value, and we will update the
   // pattern (DecimalFormat) once the new coming d value differ with the last one.
+  // This is an Option to distinguish between 0 (numberFormat is valid) and uninitialized after
+  // serialization (numberFormat has not been updated for dValue = 0).
   @transient
-  private var lastDValue: Int = -100
+  private var lastDValue: Option[Int] = None
```

This is a variable, so it can't be lazily initialized.
[GitHub] spark issue #15847: [SPARK-18387] [SQL] Add serialization to checkEvaluation...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/15847 Thanks! I just went to cherry-pick the commit, but it was already there.
[GitHub] spark pull request #14720: SPARK-12868: Allow Add jar to add jars from hdfs/...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/14720#discussion_r87908473

Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala

```diff
@@ -856,6 +856,17 @@ class HiveQuerySuite extends HiveComparisonTest with BeforeAndAfter {
     sql("DROP TABLE alter1")
   }

+  test("SPARK-12868 ADD JAR FROM HDFS") {
+    val testJar = "hdfs://nn:8020/foo.jar"
+    // This should fail with unknown host, as its just testing the URL parsing
+    // before SPARK-12868 it was failing with Malformed URI
+    val e = intercept[RuntimeException] {
```

I think this test should be improved before merging this. Looking for a RuntimeException to validate that the jar was registered is brittle and can easily pass when the registration doesn't actually work.
[GitHub] spark issue #15062: SPARK-17424: Fix unsound substitution bug in ScalaReflec...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/15062 I tried to reproduce with #10970 reverted, but I didn't hit the issue in testing. I still think it's fine to move forward on this, even if it is hard to reproduce, because we know the code is wrong and this fixes it.
[GitHub] spark issue #15062: SPARK-17424: Fix unsound substitution bug in ScalaReflec...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/15062 We were seeing the problem when using the datasets API in our 1.6.1 build, which is based on Scala 2.10. I recently tried to reproduce this on master with 2.11 and #10970 reverted, but I didn't get a case that failed. Either way, I think the fix here makes sense: if there are no types to substitute, don't do it.
[GitHub] spark issue #15062: SPARK-17424: Fix unsound substitution bug in ScalaReflec...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/15062 This does look the same as SPARK-17109. Does this fix that issue?
[GitHub] spark issue #15088: SPARK-17532: Add lock debugging info to thread dumps.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/15088 @srowen, if you have a chance, could you look at this again? I think it will be helpful for tracking down live-lock issues. Thanks!
[GitHub] spark pull request #13880: SPARK-16178: Remove unnecessary Hive partition ch...
Github user rdblue closed the pull request at: https://github.com/apache/spark/pull/13880
[GitHub] spark issue #15088: SPARK-17532: Add lock debugging info to thread dumps.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/15088 Thanks @rxin!
[GitHub] spark pull request #15738: SPARK-18086: Add support for Hive session vars.
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/15738

SPARK-18086: Add support for Hive session vars.

## What changes were proposed in this pull request?

This adds support for Hive variables:

* Makes values set via `spark-sql --hivevar name=value` accessible
* Adds `getHiveVar` and `setHiveVar` to the `HiveClient` interface
* Adds a SessionVariables trait for sessions like Hive that support variables (including Hive vars)
* Adds SessionVariables support to variable substitution
* Adds SessionVariables support to the SET command

## How was this patch tested?

* Adds a test to all supported Hive versions for accessing Hive variables
* Adds HiveVariableSubstitutionSuite

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rdblue/spark SPARK-18086-add-hivevar-support

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15738.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15738
[GitHub] spark issue #15738: SPARK-18086: Add support for Hive session vars.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/15738 The purpose of this is to ensure the same behavior as Hive so users can move to SparkSQL without rewriting queries. That's why I'm trying to keep behavior as close as possible to the behavior of Hive (and Spark 1.6.1). Session vars are kept in a different namespace, so collisions should not overwrite configs, and session vars take precedence in Hive. Because we still use a Hive SessionState and pass some commands to Hive, I thought it was also important to store the Hive variables in the session state like Hive would. That way, any commands passed to Hive that contain variable references work as well.
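The namespace-and-precedence behavior described above can be sketched as follows. This is a hypothetical illustration of the substitution rules (separate namespace, session vars win on collision), not the actual Spark or Hive implementation; all names here are invented.

```python
import re


def substitute(text, hivevars, conf):
    """Replace ${name} references, giving session vars precedence over conf."""
    def repl(match):
        name = match.group(1)
        # an explicit namespace prefix always resolves against session vars
        if name.startswith("hivevar:"):
            return hivevars[name[len("hivevar:"):]]
        # unqualified names: session vars take precedence over config values
        if name in hivevars:
            return hivevars[name]
        return conf[name]

    return re.sub(r"\$\{([^}]+)\}", repl, text)


# a collision: the session var, not the config, is substituted
query = substitute("SELECT * FROM ${db}.t", {"db": "prod"}, {"db": "dev"})
```

The key point is that a colliding name never overwrites the config entry; the two values coexist in separate namespaces and only resolution order decides which one a query sees.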
[GitHub] spark issue #15738: SPARK-18086: Add support for Hive session vars.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/15738 Thanks for reviewing @rxin!
[GitHub] spark issue #16533: [SPARK-19160][PYTHON][SQL][WIP] Add udf decorator
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16533 @zero323, I think that the decorator and existing UDF factory method should be the same, but that we can't break existing code. Can you explain why this necessarily breaks code that relies on returnType as a positional arg? What about something like this:

```python
@functools.wraps(_udf)
def udf(f=None, returnType=StringType()):
    """A decorator version of pyspark.sql.functions.udf
    """
    if f is None:
        return functools.partial(_udf, returnType=returnType)
    else:
        return _udf(f=f, returnType=returnType)
```
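The pattern proposed above supports both bare (`@udf`) and parameterized (`@udf(returnType=...)`) use. A self-contained sketch, with a dummy `_udf` standing in for the real pyspark factory (the dummy simply tags the function, so the example runs without Spark):

```python
import functools


def _udf(f, returnType="string"):
    # dummy stand-in for the real UDF factory; just records the return type
    f.returnType = returnType
    return f


@functools.wraps(_udf)
def udf(f=None, returnType="string"):
    if f is None:
        # called as @udf(returnType=...): return a decorator to apply later
        return functools.partial(_udf, returnType=returnType)
    # called as @udf (or udf(some_function)): wrap directly
    return _udf(f=f, returnType=returnType)


@udf
def upper(s):
    return s.upper()


@udf(returnType="int")
def length(s):
    return len(s)
```

When Python sees `@udf(returnType="int")`, it first calls `udf` with no function, so `f is None` and a `functools.partial` is returned; that partial is then applied to `length`. With bare `@udf`, the function itself arrives as `f`, so it is wrapped immediately with the default return type.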
[GitHub] spark issue #16533: [SPARK-19160][PYTHON][SQL][WIP] Add udf decorator
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16533 Yeah, I added the `isinstance(f, DataType)` trick to my local tests. I'd add it if it were up to me, but I'm fine requiring `returnType` to be a keyword arg if you feel strongly about it. Looks like tests are broken, maybe rebase?
[GitHub] spark issue #16533: [SPARK-19160][PYTHON][SQL] Add udf decorator
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16533 +1 Looks good to me. Thanks, @zero323!
[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16536 +1 Looks good to me.
[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/16536#discussion_r98043331

--- Diff: python/pyspark/sql/functions.py ---
```
@@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
     def __init__(self, func, returnType, name=None):
         self.func = func
         self.returnType = returnType
-        self._judf = self._create_judf(name)
-
-    def _create_judf(self, name):
+        self._judf_placeholder = None
+        self._name = name or (
+            func.__name__ if hasattr(func, '__name__')
+            else func.__class__.__name__)
+
+    @property
+    def _judf(self):
+        if self._judf_placeholder is None:
```
--- End diff --

I think @holdenk's concern is that this would allow concurrent calls to `_create_udf`. That would create two `UserDefinedPythonFunction` objects, but I don't see anything on the Scala side that would make that a problem.
[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/16536#discussion_r98049505

--- Diff: python/pyspark/sql/functions.py ---
```
@@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
     def __init__(self, func, returnType, name=None):
         self.func = func
         self.returnType = returnType
-        self._judf = self._create_judf(name)
-
-    def _create_judf(self, name):
+        self._judf_placeholder = None
+        self._name = name or (
+            func.__name__ if hasattr(func, '__name__')
+            else func.__class__.__name__)
+
+    @property
+    def _judf(self):
+        if self._judf_placeholder is None:
```
--- End diff --

Yeah, I don't think it should require a lock. I think concurrent calls are very unlikely and safe.
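As a rough illustration of the lazy-initialization pattern under discussion — a simplified sketch, not the actual PySpark implementation; `_create_judf` here is just a stand-in for the JVM-side call the real class makes:

```python
class UserDefinedFunction:
    """Sketch of a lazily initialized UDF wrapper (hypothetical, simplified)."""

    def __init__(self, func, return_type, name=None):
        self.func = func
        self.return_type = return_type
        # Defer expensive initialization until the UDF is first used.
        self._judf_placeholder = None
        self._name = name or getattr(func, "__name__", func.__class__.__name__)

    def _create_judf(self):
        # Stand-in for creating the JVM-side UserDefinedPythonFunction.
        return ("judf", self._name, self.return_type)

    @property
    def _judf(self):
        # Benign race: two threads may both build a judf; the last write wins,
        # which matches the "no lock needed" conclusion above.
        if self._judf_placeholder is None:
            self._judf_placeholder = self._create_judf()
        return self._judf_placeholder


udf = UserDefinedFunction(lambda x: x + 1, "int")
assert udf._judf_placeholder is None   # nothing created at construction time
first = udf._judf
assert udf._judf is first              # cached after the first access
```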
[GitHub] spark issue #16533: [SPARK-19160][PYTHON][SQL] Add udf decorator
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16533 I agree with @zero323. I don't think this method should change how users interact with DataTypes. There is a precedent for using type strings instead, as @rxin suggested, and that's a good idea that fits the need better than allowing users to pass the type class instead of an instance.
[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/16537#discussion_r98955007

--- Diff: python/pyspark/sql/tests.py ---
```
@@ -429,6 +429,11 @@ def test_udf_with_input_file_name(self):
         row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
         self.assertTrue(row[0].find("people1.json") != -1)

+    def test_udf_should_validate_input_args(self):
+        from pyspark.sql.functions import udf
+
+        self.assertRaises(TypeError, udf(lambda x: x), None)
```
--- End diff --

If this is covered by existing tests, then that's fine. Good point. To validate the number of args, I think it is a good idea, as long as we know that it won't fail for C extensions (though the check may be inconclusive for them).
[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/16537#discussion_r98970775

--- Diff: python/pyspark/sql/tests.py ---
```
@@ -429,6 +429,11 @@ def test_udf_with_input_file_name(self):
         row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
         self.assertTrue(row[0].find("people1.json") != -1)

+    def test_udf_should_validate_input_args(self):
+        from pyspark.sql.functions import udf
+
+        self.assertRaises(TypeError, udf(lambda x: x), None)
```
--- End diff --

Sounds good. It's probably worth exploring eventually, but there's no need to hold up this PR.
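A sketch of the kind of validation being discussed — a hypothetical helper, not Spark's actual API. The arity check uses `inspect.signature`, which can be inconclusive for some C-extension callables; in that case the helper skips the check rather than failing:

```python
import inspect


def validate_udf_func(f):
    """Validate that f is callable; return its arity, or None if unknown.

    Hypothetical helper illustrating the review discussion, not PySpark code.
    """
    if not callable(f):
        raise TypeError("func should be callable, got %s" % type(f).__name__)
    try:
        sig = inspect.signature(f)
    except (TypeError, ValueError):
        # Some C-extension callables don't expose a signature; be lenient.
        return None
    return len(sig.parameters)


assert validate_udf_func(lambda x: x) == 1
try:
    validate_udf_func(None)  # not callable -> caught eagerly, without running a job
except TypeError:
    pass
else:
    raise AssertionError("expected TypeError for a non-callable")
```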
[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16537 +1
[GitHub] spark pull request #16107: SPARK-18677: Fix parsing ['key'] in JSON path exp...
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/16107 SPARK-18677: Fix parsing ['key'] in JSON path expressions.

## What changes were proposed in this pull request?

This fixes the parser rule that matches named expressions, which doesn't work for two reasons:

1. The name match is not coerced to a regular expression (missing `.r`)
2. The surrounding literals are incorrect and attempt to escape a single quote, which is unnecessary

## How was this patch tested?

This adds test cases for named expressions using the bracket syntax, including one with quoted spaces.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rdblue/spark SPARK-18677-fix-json-path

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16107.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16107

commit 1c0cc097c5ee6f35da6fab732d7e7408d6306d1e
Author: Ryan Blue
Date: 2016-12-01T20:24:58Z

    SPARK-18677: Fix parsing ['key'] in JSON path expressions.
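The bracket syntax being fixed can be illustrated with a small sketch — in Python regular expressions rather than the Scala parser combinators the patch actually touches, so the pattern below is an illustration of the intended matching behavior, not the rule from the patch:

```python
import re

# Match JSON path segments like ['key'], including quoted names with spaces,
# e.g. $['key'] or $['first name']['last'].
NAMED_SEGMENT = re.compile(r"\['([^']+)'\]")


def bracket_keys(path):
    """Return the names captured from bracketed segments of a JSON path."""
    return NAMED_SEGMENT.findall(path)


assert bracket_keys("$['key']") == ["key"]
assert bracket_keys("$['a b']['c']") == ["a b", "c"]  # quoted spaces work
```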
[GitHub] spark issue #16107: SPARK-18677: Fix parsing ['key'] in JSON path expression...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16107 Thanks for the quick review!
[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16281 I don't think a fork is a good idea, nor do I think there is a reasonable need for one.

@gatorsmile brought up that the Parquet community refused to build a patch release: "The problem is the Parquet community will not create a branch 1.8.2+ for us." I don't remember this happening, and as far as I can tell from both Google and my inbox, the Parquet community never rejected the idea of patch releases. If there was a conversation that I don't know about, then I apologize that you were given the impression that patch releases aren't possible. That isn't the case. I'm happy to work with the community to put out patch releases, especially if that's needed for Spark.

To demonstrate, look at PARQUET-389 and PARQUET-654. @rxin asked the Parquet dev list about predicate push-down features and within a week and a half, both of those issues were resolved. (PARQUET-389 is the fix for SPARK-18539, cited as motivation to fork.)

As for the other motivating issue, PARQUET-686, a fork can't help solve this problem. This is an issue that requires updating the Parquet format spec, so you couldn't simply fix your own fork without abandoning compatibility. The Parquet community put out a release that gives the user a choice between correctness and performance, which is a good compromise until this can be fixed.

It is fair to point out that Parquet has not had a regular release cadence for minor releases (1.8.1 to 1.9.0), which is something that the Parquet community knows about and has discussed. We have recently committed to quarterly releases to fix this, with patch releases whenever they are needed. I'd encourage anyone interested to get involved.
[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16281 Great! I'm glad it was just confusion. I completely agree with @srowen that forking should be a last resort. In the future, please reach out to the community, whether it's Parquet or another, to address concerns before it gets this far. It's better for everyone if we can use the feedback to improve and continue to work together.
[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16281 I think we should move to a 1.8.2 patch release. The reason is that 1.9.0 moved to ByteBuffer-based reads and we've found at least one problem with it. ByteBuffer-based reads also change an internal API that Spark uses for its vectorized reader. The changes here (now that I've looked at the actual PR) wouldn't work because Parquet is going to call the ByteBuffer method rather than the byte array method. I'm cleaning up the ByteBuffer code quite a bit right now for better performance with G1GC (see https://github.com/apache/parquet-mr/pull/390), so I think Spark should move to ByteBuffer reads after that makes it in.
[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16281 Thanks @dongjoon-hyun! Let's get a Parquet 1.8.2 out in January.
[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/16281 The improvement is in how row groups are garbage collected. G1GC puts humongous allocations directly into the old generation, so you end up needing a full GC to reclaim the space. That just increases memory pressure, so you run out of memory and run full GCs and/or spill to disk. We don't have data yet because I haven't pushed the feature or metrics collection for it.
[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r182563063

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java ---
```
@@ -63,115 +59,159 @@ public final void readBooleans(int total, WritableColumnVector c, int rowId) {
     }
   }

+  private ByteBuffer getBuffer(int length) {
+    try {
+      return in.slice(length).order(ByteOrder.LITTLE_ENDIAN);
+    } catch (IOException e) {
+      throw new ParquetDecodingException("Failed to read " + length + " bytes", e);
+    }
+  }
+
   @Override
   public final void readIntegers(int total, WritableColumnVector c, int rowId) {
-    c.putIntsLittleEndian(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET);
-    offset += 4 * total;
+    int requiredBytes = total * 4;
+    ByteBuffer buffer = getBuffer(requiredBytes);
+
+    if (buffer.hasArray()) {
+      int offset = buffer.arrayOffset() + buffer.position();
+      c.putIntsLittleEndian(rowId, total, buffer.array(), offset - Platform.BYTE_ARRAY_OFFSET);
+    } else {
+      for (int i = 0; i < total; i += 1) {
+        c.putInt(rowId + i, buffer.getInt());
+      }
+    }
   }

   @Override
   public final void readLongs(int total, WritableColumnVector c, int rowId) {
-    c.putLongsLittleEndian(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET);
-    offset += 8 * total;
+    int requiredBytes = total * 8;
+    ByteBuffer buffer = getBuffer(requiredBytes);
+
+    if (buffer.hasArray()) {
+      int offset = buffer.arrayOffset() + buffer.position();
+      c.putLongsLittleEndian(rowId, total, buffer.array(), offset - Platform.BYTE_ARRAY_OFFSET);
+    } else {
+      for (int i = 0; i < total; i += 1) {
+        c.putLong(rowId + i, buffer.getLong());
+      }
+    }
   }

   @Override
   public final void readFloats(int total, WritableColumnVector c, int rowId) {
-    c.putFloats(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET);
-    offset += 4 * total;
+    int requiredBytes = total * 4;
+    ByteBuffer buffer = getBuffer(requiredBytes);
+
+    if (buffer.hasArray()) {
+      int offset = buffer.arrayOffset() + buffer.position();
+      c.putFloats(rowId, total, buffer.array(), offset - Platform.BYTE_ARRAY_OFFSET);
+    } else {
+      for (int i = 0; i < total; i += 1) {
+        c.putFloat(rowId + i, buffer.getFloat());
+      }
+    }
   }

   @Override
   public final void readDoubles(int total, WritableColumnVector c, int rowId) {
-    c.putDoubles(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET);
-    offset += 8 * total;
+    int requiredBytes = total * 8;
+    ByteBuffer buffer = getBuffer(requiredBytes);
+
+    if (buffer.hasArray()) {
+      int offset = buffer.arrayOffset() + buffer.position();
+      c.putDoubles(rowId, total, buffer.array(), offset - Platform.BYTE_ARRAY_OFFSET);
+    } else {
+      for (int i = 0; i < total; i += 1) {
+        c.putDouble(rowId + i, buffer.getDouble());
+      }
+    }
+  }
+
+  private byte getByte() {
+    try {
+      return (byte) in.read();
+    } catch (IOException e) {
+      throw new ParquetDecodingException("Failed to read a byte", e);
+    }
   }

   @Override
   public final void readBytes(int total, WritableColumnVector c, int rowId) {
-    for (int i = 0; i < total; i++) {
-      // Bytes are stored as a 4-byte little endian int. Just read the first byte.
-      // TODO: consider pushing this in ColumnVector by adding a readBytes with a stride.
-      c.putByte(rowId + i, Platform.getByte(buffer, offset));
-      offset += 4;
+    int requiredBytes = total * 4;
+    ByteBuffer buffer = getBuffer(requiredBytes);
+
+    for (int i = 0; i < total; i += 1) {
+      c.putByte(rowId + i, buffer.get());
+      // skip the next 3 bytes
+      buffer.position(buffer.position() + 3);
     }
   }

   @Override
   public final boolean readBoolean() {
-    byte b = Platform.getByte(buffer, offset);
-    boolean v = (b & (1 << bitOffset)) != 0;
+    // TODO: vectorize decoding and keep boolean[] instead of currentByte
+    if (bitOffset == 0) {
+      currentByte = getByte();
+    }
+
+    boolean v = (currentByte & (1 << bitOffset)) != 0;
     bitOffset += 1;
     if (bitOffset == 8) {
       bitOffset = 0;
-      offset++;
     }
     return v;
   }

   @Override
   public final int readInteg
```
[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 @scottcarey, Parquet will use the compressors if they are available. You can add them from an external Jar and it will work. LZ4 should also work out of the box because it is included in Hadoop 2.7. I agree that it would be nice if Parquet didn't rely on Hadoop for compression libraries, but that's how it is at the moment. Feel free to fix it.
[GitHub] spark pull request #20988: [SPARK-23877][SQL]: Use filter predicates to prun...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20988#discussion_r182579387

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala ---
```
@@ -129,35 +151,41 @@ case class OptimizeMetadataOnlyQuery(catalog: SessionCatalog) extends Rule[Logic
   /**
    * A pattern that finds the partitioned table relation node inside the given plan, and returns a
-   * pair of the partition attributes and the table relation node.
+   * pair of the partition attributes, partition filters, and the table relation node.
    *
    * It keeps traversing down the given plan tree if there is a [[Project]] or [[Filter]] with
    * deterministic expressions, and returns result after reaching the partitioned table relation
    * node.
    */
-  object PartitionedRelation {
-
-    def unapply(plan: LogicalPlan): Option[(AttributeSet, LogicalPlan)] = plan match {
-      case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _, _)
-        if fsRelation.partitionSchema.nonEmpty =>
-        val partAttrs = getPartitionAttrs(fsRelation.partitionSchema.map(_.name), l)
-        Some((AttributeSet(partAttrs), l))
-
-      case relation: HiveTableRelation if relation.tableMeta.partitionColumnNames.nonEmpty =>
-        val partAttrs = getPartitionAttrs(relation.tableMeta.partitionColumnNames, relation)
-        Some((AttributeSet(partAttrs), relation))
-
-      case p @ Project(projectList, child) if projectList.forall(_.deterministic) =>
-        unapply(child).flatMap { case (partAttrs, relation) =>
-          if (p.references.subsetOf(partAttrs)) Some((p.outputSet, relation)) else None
-        }
+  object PartitionedRelation extends PredicateHelper {
+
+    def unapply(plan: LogicalPlan): Option[(AttributeSet, Seq[Expression], LogicalPlan)] = {
+      plan match {
+        case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _, _)
+          if fsRelation.partitionSchema.nonEmpty =>
+          val partAttrs = getPartitionAttrs(fsRelation.partitionSchema.map(_.name), l)
+          Some((AttributeSet(partAttrs), Nil, l))
+
+        case relation: HiveTableRelation if relation.tableMeta.partitionColumnNames.nonEmpty =>
+          val partAttrs = getPartitionAttrs(relation.tableMeta.partitionColumnNames, relation)
+          Some((AttributeSet(partAttrs), Nil, relation))
+
+        case p @ Project(projectList, child) if projectList.forall(_.deterministic) =>
+          unapply(child).flatMap { case (partAttrs, filters, relation) =>
+            if (p.references.subsetOf(partAttrs)) Some((p.outputSet, filters, relation)) else None
```
--- End diff --

@cloud-fan, that is basically how this works already. Each matched node calls `unapply(child)` to get the result from the child node, then it adds the current node's conditions to that result. Using `unapply` instead of `getPartitionedRelation` makes this work in the matching rule:
```scala
case a @ Aggregate(_, aggExprs, child @ PartitionedRelation(partAttrs, filters, relation)) =>
```
[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 I backported the Hadoop zstd codec to 2.7.3 without much trouble. But either way, I think that's a separate concern. This is about getting Parquet updated, not about worrying whether users can easily add compression implementations to their classpath.
[GitHub] spark issue #20988: [SPARK-23877][SQL]: Use filter predicates to prune parti...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20988 @cloud-fan, I've added the test. Thanks for letting me know about HiveCatalogMetrics, that's exactly what I needed.
[GitHub] spark pull request #21029: [SPARK-23952] remove type parameter in DataReader...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21029#discussion_r182596153

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataReaderFactory.java ---
```
@@ -52,10 +56,46 @@
   }

   /**
-   * Returns a data reader to do the actual reading work.
+   * The output data format of this factory's data reader. Spark will invoke the corresponding
+   * create data reader method w.r.t. the return value of this method:
+   *
+   * {@link DataFormat#ROW}: {@link #createRowDataReader()}
+   * {@link DataFormat#UNSAFE_ROW}: {@link #createUnsafeRowDataReader()}
+   * {@link DataFormat#COLUMNAR_BATCH}: {@link #createColumnarBatchDataReader()}
+   */
+  DataFormat dataFormat();
```
--- End diff --

If the data format is determined when the factory is created, then I don't see why it is necessary to change the API. This just makes it more confusing.
[GitHub] spark pull request #21029: [SPARK-23952] remove type parameter in DataReader...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21029#discussion_r182596274

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/MicroBatchReader.java ---
```
@@ -17,12 +17,12 @@
 package org.apache.spark.sql.sources.v2.reader.streaming;

+import java.util.Optional;
```
--- End diff --

Nit: this is a cosmetic change that should be reverted before committing.
[GitHub] spark pull request #21029: [SPARK-23952] remove type parameter in DataReader...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21029#discussion_r182596392

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala ---
```
@@ -17,52 +17,85 @@
 package org.apache.spark.sql.execution.datasources.v2

-import scala.collection.JavaConverters._
-import scala.reflect.ClassTag
-
 import org.apache.spark.{InterruptibleIterator, Partition, SparkContext, TaskContext}
 import org.apache.spark.rdd.RDD
-import org.apache.spark.sql.sources.v2.reader.DataReaderFactory
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow
+import org.apache.spark.sql.sources.v2.DataFormat
+import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory}
+import org.apache.spark.sql.types.StructType

-class DataSourceRDDPartition[T : ClassTag](val index: Int, val readerFactory: DataReaderFactory[T])
+class DataSourceRDDPartition(val index: Int, val factory: DataReaderFactory)
   extends Partition with Serializable

-class DataSourceRDD[T: ClassTag](
+class DataSourceRDD(
     sc: SparkContext,
-    @transient private val readerFactories: Seq[DataReaderFactory[T]])
-  extends RDD[T](sc, Nil) {
+    @transient private val readerFactories: Seq[DataReaderFactory],
+    schema: StructType)
+  extends RDD[InternalRow](sc, Nil) {

   override protected def getPartitions: Array[Partition] = {
     readerFactories.zipWithIndex.map { case (readerFactory, index) =>
       new DataSourceRDDPartition(index, readerFactory)
     }.toArray
   }

-  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
-    val reader = split.asInstanceOf[DataSourceRDDPartition[T]].readerFactory.createDataReader()
-    context.addTaskCompletionListener(_ => reader.close())
-    val iter = new Iterator[T] {
-      private[this] var valuePrepared = false
-
-      override def hasNext: Boolean = {
-        if (!valuePrepared) {
-          valuePrepared = reader.next()
-        }
-        valuePrepared
-      }
-
-      override def next(): T = {
-        if (!hasNext) {
-          throw new java.util.NoSuchElementException("End of stream")
-        }
-        valuePrepared = false
-        reader.get()
-      }
+  override def compute(split: Partition, context: TaskContext): Iterator[InternalRow] = {
+    val factory = split.asInstanceOf[DataSourceRDDPartition].factory
+    val iter: DataReaderIterator[UnsafeRow] = factory.dataFormat() match {
+      case DataFormat.ROW =>
+        val reader = new RowToUnsafeDataReader(
+          factory.createRowDataReader(), RowEncoder.apply(schema).resolveAndBind())
+        new DataReaderIterator(reader)
+
+      case DataFormat.UNSAFE_ROW =>
+        new DataReaderIterator(factory.createUnsafeRowDataReader())
+
+      case DataFormat.COLUMNAR_BATCH =>
+        new DataReaderIterator(factory.createColumnarBatchDataReader())
+          // TODO: remove this type erase hack.
+          .asInstanceOf[DataReaderIterator[UnsafeRow]]
```
--- End diff --

Isn't this change intended to avoid these casts?
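The `compute` method in the diff adapts a reader's `next()`/`get()` protocol into an iterator. A minimal Python sketch of that adapter pattern — the `Reader` class here is hypothetical, not Spark's `DataReader`:

```python
class Reader:
    """Hypothetical reader with a next()/get() protocol: next() advances and
    reports whether a value is available, get() returns the current value."""

    def __init__(self, values):
        self._values = iter(values)
        self._current = None

    def next(self):
        try:
            self._current = next(self._values)
            return True
        except StopIteration:
            return False

    def get(self):
        return self._current


def reader_iterator(reader):
    """Adapt a next()/get() reader into a plain iterator, mirroring the
    hasNext/next wrapper in the Scala code above."""
    while reader.next():
        yield reader.get()


assert list(reader_iterator(Reader([1, 2, 3]))) == [1, 2, 3]
```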
[GitHub] spark pull request #21029: [SPARK-23952] remove type parameter in DataReader...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21029#discussion_r182596471

--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchReader.scala ---
```
@@ -146,7 +146,7 @@ private[kafka010] class KafkaMicroBatchReader(
       new KafkaMicroBatchDataReaderFactory(
         range, executorKafkaParams, pollTimeoutMs, failOnDataLoss, reuseKafkaConsumer)
     }
-    factories.map(_.asInstanceOf[DataReaderFactory[UnsafeRow]]).asJava
+    factories.map(_.asInstanceOf[DataReaderFactory]).asJava
```
--- End diff --

Why is this cast necessary?
[GitHub] spark pull request #20988: [SPARK-23877][SQL]: Use filter predicates to prun...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20988#discussion_r182858087

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala ---
```
@@ -129,35 +151,41 @@ case class OptimizeMetadataOnlyQuery(catalog: SessionCatalog) extends Rule[Logic
   /**
    * A pattern that finds the partitioned table relation node inside the given plan, and returns a
-   * pair of the partition attributes and the table relation node.
+   * pair of the partition attributes, partition filters, and the table relation node.
    *
    * It keeps traversing down the given plan tree if there is a [[Project]] or [[Filter]] with
    * deterministic expressions, and returns result after reaching the partitioned table relation
    * node.
    */
-  object PartitionedRelation {
-
-    def unapply(plan: LogicalPlan): Option[(AttributeSet, LogicalPlan)] = plan match {
-      case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _, _)
-        if fsRelation.partitionSchema.nonEmpty =>
-        val partAttrs = getPartitionAttrs(fsRelation.partitionSchema.map(_.name), l)
-        Some((AttributeSet(partAttrs), l))
-
-      case relation: HiveTableRelation if relation.tableMeta.partitionColumnNames.nonEmpty =>
-        val partAttrs = getPartitionAttrs(relation.tableMeta.partitionColumnNames, relation)
-        Some((AttributeSet(partAttrs), relation))
-
-      case p @ Project(projectList, child) if projectList.forall(_.deterministic) =>
-        unapply(child).flatMap { case (partAttrs, relation) =>
-          if (p.references.subsetOf(partAttrs)) Some((p.outputSet, relation)) else None
-        }
+  object PartitionedRelation extends PredicateHelper {
+
+    def unapply(plan: LogicalPlan): Option[(AttributeSet, Seq[Expression], LogicalPlan)] = {
+      plan match {
+        case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _, _)
+          if fsRelation.partitionSchema.nonEmpty =>
+          val partAttrs = getPartitionAttrs(fsRelation.partitionSchema.map(_.name), l)
+          Some((AttributeSet(partAttrs), Nil, l))
+
+        case relation: HiveTableRelation if relation.tableMeta.partitionColumnNames.nonEmpty =>
+          val partAttrs = getPartitionAttrs(relation.tableMeta.partitionColumnNames, relation)
+          Some((AttributeSet(partAttrs), Nil, relation))
+
+        case p @ Project(projectList, child) if projectList.forall(_.deterministic) =>
+          unapply(child).flatMap { case (partAttrs, filters, relation) =>
+            if (p.references.subsetOf(partAttrs)) Some((p.outputSet, filters, relation)) else None
+          }

-      case f @ Filter(condition, child) if condition.deterministic =>
-        unapply(child).flatMap { case (partAttrs, relation) =>
-          if (f.references.subsetOf(partAttrs)) Some((partAttrs, relation)) else None
-        }
+        case f @ Filter(condition, child) if condition.deterministic =>
+          unapply(child).flatMap { case (partAttrs, filters, relation) =>
+            if (f.references.subsetOf(partAttrs)) {
+              Some((partAttrs, splitConjunctivePredicates(condition) ++ filters, relation))
```
--- End diff --

Good catch. I've added a test case and updated the `PartitionedRelation` code to keep track of both original partition attributes -- that the filter needs -- and the top-most node's output that is used by the rule.

For using `PhysicalOperation` instead of `PartitionedRelation`, I don't see a compelling reason for such an invasive change. This currently adds a couple of results to unapply and keeps mostly the same logic. `PhysicalOperation` would lose the check that the references are a subset of the partition attributes and be a lot larger change. If you think this should be refactored, let's talk about that separately to understand the motivation for the change.
[GitHub] spark pull request #21111: [SPARK-23877][SQL][followup] use PhysicalOperatio...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/2#discussion_r183101013

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala ---
@@ -114,11 +119,8 @@ case class OptimizeMetadataOnlyQuery(catalog: SessionCatalog) extends Rule[Logic
       relation match {
         case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _, isStreaming) =>
           val partAttrs = getPartitionAttrs(fsRelation.partitionSchema.map(_.name), l)
-          val partitionData = fsRelation.location.listFiles(relFilters, Nil)
-          // partition data may be a stream, which can cause serialization to hit stack level too
-          // deep exceptions because it is a recursive structure in memory. converting to array
-          // avoids the problem.
--- End diff --

Yes, that does fix it, but in a non-obvious way. What isn't clear is what guarantees that the rows used to construct the LocalRelation will never need to be serialized. Would it be reasonable for a future commit to remove the `@transient` modifier and re-introduce the problem? I would rather this return the data in a non-recursive structure, but it's a minor point.
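The hazard described in the removed comment, a lazily-built recursive sequence blowing the stack during serialization while a flat array does not, can be reproduced in miniature. This is a generic illustration using Python's pickle, not Spark code:

```python
import pickle

# Build a cons-cell linked list: the in-memory shape of a lazy stream,
# where each element holds a reference to the rest of the sequence.
def cons_list(n):
    node = None
    for i in range(n):
        node = (i, node)
    return node

deep = cons_list(10000)

# Serializing the recursive structure walks it by recursion and blows the stack.
try:
    pickle.dumps(deep)
    overflowed = False
except RecursionError:
    overflowed = True

# Flattening to an array first keeps the serialization depth constant.
flat = []
node = deep
while node is not None:
    flat.append(node[0])
    node = node[1]
blob = pickle.dumps(flat)  # succeeds
```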
[GitHub] spark issue #21111: [SPARK-23877][SQL][followup] use PhysicalOperation to si...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/2 Thanks for doing this as a follow-up. I had one minor comment, but otherwise I'm +1. I see what you mean about using `PhysicalOperation` now. It is slightly cleaner, and I guess `PhysicalOperation` is the right way to accumulate the filters and projection from a sub-tree.
[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 Retest this please.
[GitHub] spark pull request #21118: SPARK-23325: Use InternalRow when reading with Da...
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/21118

SPARK-23325: Use InternalRow when reading with DataSourceV2.

## What changes were proposed in this pull request?

This updates the DataSourceV2 API to use InternalRow instead of Row for the default case with no scan mix-ins. Because the API is changing significantly in the same places, this also renames ReaderFactory back to ReadTask.

Support for readers that produce Row is added through SupportsDeprecatedScanRow, which matches the previous API. Readers that used Row now implement this class and should be migrated to InternalRow. Readers that previously implemented SupportsScanUnsafeRow have been migrated to use no SupportsScan mix-ins and produce InternalRow.

## How was this patch tested?

This uses existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rdblue/spark SPARK-23325-datasource-v2-internal-row

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21118.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21118

commit eddd049ed78970dda0796a37462e52f53bfeacc4
Author: Ryan Blue
Date: 2018-04-20T20:15:58Z

    SPARK-23325: Use InternalRow when reading with DataSourceV2.

    This updates the DataSourceV2 API to use InternalRow instead of Row for the default case with no scan mix-ins. Because the API is changing significantly in the same places, this also renames ReaderFactory back to ReadTask. Support for readers that produce Row is added through SupportsDeprecatedScanRow, which matches the previous API. Readers that used Row now implement this class and should be migrated to InternalRow. Readers that previously implemented SupportsScanUnsafeRow have been migrated to use no SupportsScan mix-ins and produce InternalRow.
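The SupportsDeprecatedScanRow idea, wrapping an old Row-producing reader so the engine only ever sees the internal representation, can be sketched like this (invented toy classes, not the actual DataSourceV2 interfaces):

```python
class Row:
    """Hypothetical external row with named fields, like the public Row type."""
    def __init__(self, **fields):
        self.fields = fields

def to_internal(row, schema):
    # internal representation here: a positional tuple ordered by the schema
    return tuple(row.fields[name] for name in schema)

class DeprecatedRowReader:
    """An 'old-style' reader that still yields external Row objects."""
    def __init__(self, rows):
        self.rows = rows
    def __iter__(self):
        return iter(self.rows)

def wrap_deprecated(reader, schema):
    # what a SupportsDeprecatedScanRow-style shim does conceptually:
    # convert each external row before it reaches the engine
    for row in reader:
        yield to_internal(row, schema)

schema = ["id", "name"]
reader = DeprecatedRowReader([Row(id=1, name="a"), Row(id=2, name="b")])
internal = list(wrap_deprecated(reader, schema))
```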
[GitHub] spark issue #21118: SPARK-23325: Use InternalRow when reading with DataSourc...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21118 @jose-torres, @cloud-fan, can you take a look at this? It updates the v2 API to use InternalRow by default.
[GitHub] spark issue #21118: SPARK-23325: Use InternalRow when reading with DataSourc...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21118 Yeah, we should probably add a projection. It's probably only working because the InternalRows that are produced are all UnsafeRow.
[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 @cloud-fan, is there a Jenkins job to run it?
[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 Okay, I don't have the time to set up and run benchmarks without a stronger case for this causing a regression (given the Parquet testing), but other people should feel free to pick this up.
[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070

> Based on the previous upgrade (e.g., #13280 (comment)), we hit the performance regressions when we upgrade Parquet and we did the revert at the end.

I should point out that the regression wasn't reproducible, so we aren't sure what the cause was. We also didn't have performance numbers on the Parquet side, or a case of anyone running it in production (we have been for a couple of months). But I can understand wanting to be thorough.
[GitHub] spark pull request #21145: SPARK-24073: Rename DataReaderFactory to ReadTask...
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/21145

SPARK-24073: Rename DataReaderFactory to ReadTask.

## What changes were proposed in this pull request?

This reverses the changes in SPARK-23219, which renamed ReadTask to DataReaderFactory. The intent of that change was to make the read and write API match (the write side uses DataWriterFactory), but the underlying problem is that the two classes are not equivalent. ReadTask/DataReader function as Iterable/Iterator. One ReadTask is a specific read task for a partition of the data to be read, in contrast to DataWriterFactory, where the same factory instance is used in all write tasks. ReadTask's purpose is to manage the lifecycle of DataReader with an explicit create operation to mirror the close operation. This is no longer clear from the API, where DataReaderFactory appears to be more generic than it is, and it isn't clear why a set of them is produced for a read.

## How was this patch tested?

Existing tests, which have been updated to use the new name.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rdblue/spark SPARK-24073-revert-data-reader-factory-rename

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21145.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21145

commit c364c05d3141bbe0ed29a2b02cecfa541d9c8212
Author: Ryan Blue
Date: 2018-04-24T19:55:25Z

    SPARK-24073: Rename DataReaderFactory to ReadTask.

    This reverses the changes in SPARK-23219, which renamed ReadTask to DataReaderFactory. The intent of that change was to make the read and write API match (write side uses DataWriterFactory), but the underlying problem is that the two classes are not equivalent. ReadTask/DataReader function as Iterable/Iterator. One ReadTask is a specific read task for a partition of the data to be read, in contrast to DataWriterFactory where the same factory instance is used in all write tasks. ReadTask's purpose is to manage the lifecycle of DataReader with an explicit create operation to mirror the close operation. This is no longer clear from the API, where DataReaderFactory appears to be more generic than it is and it isn't clear why a set of them is produced for a read.
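The ReadTask/DataReader relationship described above, Iterable/Iterator with an explicit create operation to mirror close, can be sketched with toy classes; names are invented for illustration and are not the actual v2 interfaces:

```python
class DataReader:
    """Iterator-like, but with an explicit close to release resources."""
    def __init__(self, rows):
        self._it = iter(rows)
        self.closed = False
    def next(self):
        return next(self._it, None)   # None marks exhaustion in this toy
    def close(self):
        self.closed = True

class ReadTask:
    """Iterable-like: one task per partition of the data to be read."""
    def __init__(self, partition_rows):
        self.partition_rows = partition_rows
    def create_reader(self):
        # explicit create mirrors the reader's explicit close
        return DataReader(self.partition_rows)

# By contrast, a writer factory is one shared instance used for all tasks,
# which is why the two sides are not symmetric.
tasks = [ReadTask([1, 2]), ReadTask([3])]
out = []
for task in tasks:
    reader = task.create_reader()
    while (row := reader.next()) is not None:
        out.append(row)
    reader.close()
```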
[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r184139736

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala ---
@@ -503,7 +503,7 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext {
       case plan: InMemoryRelation => plan
     }.head
     // InMemoryRelation's stats is file size before the underlying RDD is materialized
-    assert(inMemoryRelation.computeStats().sizeInBytes === 740)
+    assert(inMemoryRelation.computeStats().sizeInBytes === 800)
--- End diff --

Parquet fixed a problem with value ordering in statistics, which required adding new metadata min and max fields. For older readers, Parquet also writes the old values when it makes sense to. This is a slight increase in overhead, which is more noticeable when files contain just a few records. Don't be alarmed by the percentage difference here; it is just a small file. Parquet isn't increasing file sizes by 8%; that would be silly.
[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 There are two main reasons to update. First, the problem behind SPARK-17213 is fixed, hence the new min and max fields. Second, this updates the internal byte array management, which is needed for page skipping in the next few versions.
[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r184156731

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala ---
@@ -503,7 +503,7 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext {
       case plan: InMemoryRelation => plan
     }.head
     // InMemoryRelation's stats is file size before the underlying RDD is materialized
-    assert(inMemoryRelation.computeStats().sizeInBytes === 740)
+    assert(inMemoryRelation.computeStats().sizeInBytes === 800)
--- End diff --

This is data-dependent, so it is hard to estimate. We write the stats for older readers when the type uses a signed sort order, so it is limited to mostly primitive types and won't be written for byte arrays or utf8 data. That limits the size to 16 bytes plus thrift overhead per page, and you might have about 100 pages per row group. So that's roughly 1.5k per 128MB row group, which is about 0.001%.
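A quick back-of-the-envelope check of those numbers (all sizes are the assumptions stated above; thrift framing is ignored):

```python
# Assumed extra cost of duplicated old-style stats per data page:
# min + max for an 8-byte primitive type.
bytes_per_page = 16
pages_per_row_group = 100           # rough figure from the discussion
row_group_bytes = 128 * 1024 * 1024 # a 128 MB row group

overhead_bytes = bytes_per_page * pages_per_row_group  # 1600 bytes, ~1.5k
fraction = overhead_bytes / row_group_bytes            # ~0.001%
```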
[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 The fix for PARQUET-686 was to suppress min/max stats. It is safe to push filters, but those filters can't be used without the stats. 1.10.0 has the correct stats and can use those filters.
[GitHub] spark pull request #21118: SPARK-23325: Use InternalRow when reading with Da...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21118#discussion_r184216025

--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala ---
@@ -86,7 +87,7 @@ class KafkaContinuousReader(
     KafkaSourceOffset(JsonUtils.partitionOffsets(json))
   }
-  override def createUnsafeRowReaderFactories(): ju.List[DataReaderFactory[UnsafeRow]] = {
+  override def createReadTasks(): ju.List[ReadTask[InternalRow]] = {
--- End diff --

I've moved this to #21145. I'll rebase this PR on that one, so let's try to get that in first.
[GitHub] spark pull request #21145: [SPARK-24073][SQL]: Rename DataReaderFactory to R...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21145#discussion_r184235329

--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchReader.scala ---
@@ -299,13 +299,13 @@ private[kafka010] class KafkaMicroBatchReader(
     }
   }
-/** A [[DataReaderFactory]] for reading Kafka data in a micro-batch streaming query. */
+/** A [[ReadTask]] for reading Kafka data in a micro-batch streaming query. */
 private[kafka010] case class KafkaMicroBatchDataReaderFactory(
--- End diff --

This fixes the API, not implementations, and it already touches 30+ files. I'd rather not fix the downstream classes, for two reasons. First, to avoid this becoming really large. Second, we need to be able to evolve these APIs without requiring changes to all implementations. The API is still evolving, and if we need to update 20+ implementations to make simple changes, then I think that's a problem.
[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 Thank you @maropu! What resources does the run require? Is it something we could create a Jenkins job to run?
[GitHub] spark issue #21143: [SPARK-24072][SQL] clearly define pushed filters
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21143

Thanks for working on this, @cloud-fan! I was thinking about needing it just recently, so that data sources can delegate to Spark when needed. I'll have a thorough look at it tomorrow, but one quick high-level question: should we make these residuals based on the input split instead? Input splits might have different residual filters that need to be applied.

For example, if you have a time range query, `ts > X`, and are storing data by day, then you know that when `day(ts) > day(X)`, `ts > X` *must* be true, but when `day(ts) = day(X)`, `ts > X` *might* be true. So you need to run the original filter only for the splits that scan the boundary day, and not for any other splits.

Another use case for a per-split residual is when splits might be different file formats. Parquet allows pushing down filters, but Avro doesn't. In a mixed-format table, it would be great for Avro splits to return the entire expression as a residual, while Parquet splits do the filtering.

What do you think?
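The per-split residual for the `ts > X` example can be sketched as follows; the helper and its return values are invented for illustration, not part of any proposed API:

```python
from datetime import date, datetime

def residual_for_split(split_day, cutoff):
    """Residual of `ts > cutoff` for a split holding only rows with day(ts) == split_day."""
    cutoff_day = cutoff.date()
    if split_day > cutoff_day:
        return None           # partitioning guarantees the predicate: no residual
    if split_day < cutoff_day:
        return "prune"        # no row can match: the split can be skipped entirely
    return "ts > cutoff"      # boundary day: this split must re-run the original filter

cutoff = datetime(2018, 4, 24, 12, 30)
later_day = residual_for_split(date(2018, 4, 25), cutoff)
boundary = residual_for_split(date(2018, 4, 24), cutoff)
earlier = residual_for_split(date(2018, 4, 23), cutoff)
```

Only the boundary-day split carries a residual; every other split either needs no filtering or can be pruned outright.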