[spark] branch master updated (7cfd589 -> 1e2d76e)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 7cfd589  [SPARK-28893][SQL] Support MERGE INTO in the parser and add the corresponding logical plan
 add 1e2d76e  [HOT-FIX] Fix the SQLBase.g4

No new revisions were added by this update.

Summary of changes:
 .../src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 | 4 ----
 1 file changed, 4 deletions(-)
[spark] branch master updated (8152a87 -> 7cfd589)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 8152a87  [SPARK-28978][ ] Support > 256 args to python udf
 add 7cfd589  [SPARK-28893][SQL] Support MERGE INTO in the parser and add the corresponding logical plan

No new revisions were added by this update.

Summary of changes:
 docs/sql-keywords.md                               |   2 +
 .../apache/spark/sql/catalyst/parser/SqlBase.g4    |  43 +++-
 .../spark/sql/catalyst/analysis/Analyzer.scala     |  45
 .../spark/sql/catalyst/parser/AstBuilder.scala     | 129 ---
 .../sql/catalyst/plans/logical/v2Commands.scala    |  45 +++-
 .../spark/sql/catalyst/parser/DDLParserSuite.scala | 179 +++-
 .../spark/sql/execution/SparkStrategies.scala      |   2 +
 .../sql-tests/results/postgreSQL/comments.sql.out  |  18 +-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala |  85
 .../execution/command/PlanResolutionSuite.scala    | 235 -
 10 files changed, 742 insertions(+), 41 deletions(-)
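For reference, the statement shape the new grammar rule accepts looks roughly like the sketch below. This is a hedged illustration only: `target` and `source` are placeholder tables, `spark` is an active SparkSession, and at this commit the statement is parsed and resolved into a logical plan, with execution left to v2 data sources.

```python
# Hedged sketch of the MERGE INTO shape added by SPARK-28893.
# `target` and `source` are hypothetical tables in a catalog that
# supports the command; this only shows the accepted syntax.
spark.sql("""
    MERGE INTO target t
    USING source s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET t.value = s.value
    WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value)
""")
```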
[spark] branch master updated: [SPARK-28978][ ] Support > 256 args to python udf
This is an automated email from the ASF dual-hosted git repository.

meng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 8152a87  [SPARK-28978][ ] Support > 256 args to python udf

8152a87 is described below

commit 8152a87235a63a13969f7c1ff5ed038956e8ed76
Author: Bago Amirbekian
AuthorDate: Fri Nov 8 19:19:14 2019 -0800

    [SPARK-28978][ ] Support > 256 args to python udf

    ### What changes were proposed in this pull request?

    On the worker we express lambda functions as strings and then eval them to create a "mapper" function. This makes the code hard to read and limits the number of arguments a udf can support to 256 for python <= 3.6. This PR rewrites the mapper functions as nested functions instead of "lambda strings" and allows passing in more than 255 args.

    ### Why are the changes needed?

    The jira ticket associated with this issue describes how MLflow uses udfs to consume columns as features. This pattern isn't unique, and a limit of 255 features is quite low.

    ### Does this PR introduce any user-facing change?

    Users can now pass more than 255 cols to a udf function.

    ### How was this patch tested?

    Added a unit test for passing in > 255 args to udf.

    Closes #26442 from MrBago/replace-lambdas-on-worker.

    Authored-by: Bago Amirbekian
    Signed-off-by: Xiangrui Meng
---
 python/pyspark/sql/tests/test_udf.py | 13
 python/pyspark/worker.py             | 62 +---
 2 files changed, 42 insertions(+), 33 deletions(-)

diff --git a/python/pyspark/sql/tests/test_udf.py b/python/pyspark/sql/tests/test_udf.py
index c274dc7..3b9f12f 100644
--- a/python/pyspark/sql/tests/test_udf.py
+++ b/python/pyspark/sql/tests/test_udf.py
@@ -629,6 +629,19 @@ class UDFTests(ReusedSQLTestCase):
         self.sc.parallelize(range(1), 1).mapPartitions(task).count()

+    def test_udf_with_256_args(self):
+        N = 256
+        data = [["data-%d" % i for i in range(N)]] * 5
+        df = self.spark.createDataFrame(data)
+
+        def f(*a):
+            return "success"
+
+        fUdf = udf(f, StringType())
+
+        r = df.select(fUdf(*df.columns))
+        self.assertEqual(r.first()[0], "success")
+
 class UDFInitializationTests(unittest.TestCase):

     def tearDown(self):

diff --git a/python/pyspark/worker.py b/python/pyspark/worker.py
index 3a1200e..bfa8d97 100644
--- a/python/pyspark/worker.py
+++ b/python/pyspark/worker.py
@@ -403,54 +403,50 @@ def read_udfs(pickleSer, infile, eval_type):
             idx += offsets_len
         return parsed

-    udfs = {}
-    call_udf = []
-    mapper_str = ""
     if eval_type == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
-        # Create function like this:
-        #   lambda a: f([a[0]], [a[0], a[1]])
-
         # We assume there is only one UDF here because grouped map doesn't
         # support combining multiple UDFs.
         assert num_udfs == 1

         # See FlatMapGroupsInPandasExec for how arg_offsets are used to
         # distinguish between grouping attributes and data attributes
-        arg_offsets, udf = read_single_udf(
-            pickleSer, infile, eval_type, runner_conf, udf_index=0)
-        udfs['f'] = udf
+        arg_offsets, f = read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=0)
         parsed_offsets = extract_key_value_indexes(arg_offsets)
-        keys = ["a[%d]" % (o,) for o in parsed_offsets[0][0]]
-        vals = ["a[%d]" % (o, ) for o in parsed_offsets[0][1]]
-        mapper_str = "lambda a: f([%s], [%s])" % (", ".join(keys), ", ".join(vals))
+
+        # Create function like this:
+        #   mapper a: f([a[0]], [a[0], a[1]])
+        def mapper(a):
+            keys = [a[o] for o in parsed_offsets[0][0]]
+            vals = [a[o] for o in parsed_offsets[0][1]]
+            return f(keys, vals)
     elif eval_type == PythonEvalType.SQL_COGROUPED_MAP_PANDAS_UDF:
         # We assume there is only one UDF here because cogrouped map doesn't
         # support combining multiple UDFs.
         assert num_udfs == 1
-        arg_offsets, udf = read_single_udf(
-            pickleSer, infile, eval_type, runner_conf, udf_index=0)
-        udfs['f'] = udf
+        arg_offsets, f = read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=0)
+
         parsed_offsets = extract_key_value_indexes(arg_offsets)
-        df1_keys = ["a[0][%d]" % (o, ) for o in parsed_offsets[0][0]]
-        df1_vals = ["a[0][%d]" % (o, ) for o in parsed_offsets[0][1]]
-        df2_keys = ["a[1][%d]" % (o, ) for o in parsed_offsets[1][0]]
-        df2_vals = ["a[1][%d]" % (o, ) for o in parsed_offsets[1][1]]
-        mapper_str = "lambda a: f([%s], [%s], [%s], [%s])" % (
-            ", ".join(df1_keys), ", ".join(df1_vals), ", ".join(df2_keys), ", ".join(df2_vals))
+
+        def
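For context on why the nested-function rewrite lifts the limit: CPython 3.6 and earlier reject any call expression with more than 255 explicit arguments at compile time, so an eval'd string like `lambda a: f(a[0], ..., a[299])` cannot even be compiled, while a real closure that unpacks a list with `f(*args)` resolves the argument count at runtime. A minimal, self-contained sketch of the two styles; the `make_mapper_*` helpers and `f` are illustrative names, not Spark's.

```python
# Hedged sketch contrasting the old eval'd-lambda mapper with the new
# nested-function mapper. All names here are illustrative, not Spark's.

def f(*cols):
    # Stand-in for a user-defined function taking one argument per column.
    return len(cols)

def make_mapper_eval(offsets):
    # Old style: build "lambda a: f(a[0], a[1], ...)" as a string, then eval.
    # With more than 255 offsets, CPython <= 3.6 fails to compile the call
    # ("SyntaxError: more than 255 arguments"); 3.7+ removed that limit.
    args = ", ".join("a[%d]" % o for o in offsets)
    return eval("lambda a: f(%s)" % args, {"f": f})

def make_mapper_closure(offsets):
    # New style: a real nested function; f(*args) unpacks at runtime,
    # so no compile-time argument-count limit applies on any version.
    def mapper(a):
        return f(*[a[o] for o in offsets])
    return mapper

print(make_mapper_eval([0, 1])([10, 20]))        # 2, fine on any version
print(make_mapper_closure(range(300))(list(range(300))))  # 300, even on <= 3.6
```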
[spark] branch master updated (7fc9db0 -> 70987d8)
This is an automated email from the ASF dual-hosted git repository.

viirya pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 7fc9db0  [SPARK-29798][PYTHON][SQL] Infers bytes as binary type in createDataFrame in Python 3 at PySpark
 add 70987d8  [SPARK-29680][SQL][FOLLOWUP] Replace qualifiedName with multipartIdentifier

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/catalyst/parser/SqlBase.g4    | 30 ++---
 .../spark/sql/catalyst/parser/AstBuilder.scala     | 38 +++---
 .../sql/catalyst/parser/ErrorParserSuite.scala     | 26 +++
 .../spark/sql/execution/SparkSqlParser.scala       | 14
 4 files changed, 77 insertions(+), 31 deletions(-)
[spark] branch master updated (0bdadba -> 7fc9db0)
This is an automated email from the ASF dual-hosted git repository.

cutlerb pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 0bdadba  [SPARK-29790][DOC] Note required port for Kube API
 add 7fc9db0  [SPARK-29798][PYTHON][SQL] Infers bytes as binary type in createDataFrame in Python 3 at PySpark

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/tests/test_types.py | 16
 python/pyspark/sql/types.py            |  5 +
 2 files changed, 21 insertions(+)
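Illustratively, the behavior this adds looks like the sketch below (hedged; assumes an active SparkSession bound to `spark`): a Python 3 `bytes` value is now inferred as Spark's binary type during schema inference.

```python
# Hedged sketch (assumes an active SparkSession `spark`): with this change,
# schema inference maps Python 3 `bytes` to BinaryType instead of failing.
df = spark.createDataFrame([(b"abc",), (b"\x00\x01",)], ["payload"])
df.printSchema()
# expected:
# root
#  |-- payload: binary (nullable = true)
```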
[spark] branch branch-2.4 updated: [SPARK-29790][DOC] Note required port for Kube API
This is an automated email from the ASF dual-hosted git repository.

vanzin pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new bef7c0f  [SPARK-29790][DOC] Note required port for Kube API

bef7c0f is described below

commit bef7c0fddbddf1cc9d22b3ea60153b7e1bf8809d
Author: Emil Sandstø
AuthorDate: Fri Nov 8 09:32:29 2019 -0800

    [SPARK-29790][DOC] Note required port for Kube API

    It adds a note about the required port in a Kubernetes master URL. Currently a port must be specified for the Kubernetes API, even when the API is hosted on the standard HTTPS port; otherwise the driver may fail with the error described at https://medium.com/kidane.weldemariam_75349/thanks-james-on-issuing-spark-submit-i-run-into-this-error-cc507d4f8f0d

    Yes, a change to the "Running on Kubernetes" guide.

    None - Documentation change

    Closes #26426 from Tapped/patch-1.

    Authored-by: Emil Sandstø
    Signed-off-by: Marcelo Vanzin
    (cherry picked from commit 0bdadba5e3810f8e3f5da13e2a598071cbadab94)
    Signed-off-by: Marcelo Vanzin
---
 docs/running-on-kubernetes.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md
index 0277043..037e1d5 100644
--- a/docs/running-on-kubernetes.md
+++ b/docs/running-on-kubernetes.md
@@ -103,7 +103,7 @@ $ bin/spark-submit \
 ```
 The Spark master, specified either via passing the `--master` command line argument to `spark-submit` or by setting
-`spark.master` in the application's configuration, must be a URL with the format `k8s://<api_server_url>`. Prefixing the
+`spark.master` in the application's configuration, must be a URL with the format `k8s://<api_server_url>:<port>`. The port must always be specified, even if it's the HTTPS port 443. Prefixing the
 master string with `k8s://` will cause the Spark application to launch on the Kubernetes cluster, with the API server
 being contacted at `api_server_url`. If no HTTP protocol is specified in the URL, it defaults to `https`. For example,
 setting the master to `k8s://example.com:443` is equivalent to setting it to `k8s://https://example.com:443`, but to
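As a hedged illustration of the documented requirement (`example.com` is a placeholder API server, and actually connecting requires a reachable Kubernetes cluster):

```python
# Hedged sketch: the k8s master URL must carry an explicit port, even 443.
# "example.com" is a placeholder; this only shows the required URL shape.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # equivalent to "k8s://example.com:443", since the scheme defaults to https
    .master("k8s://https://example.com:443")
    .appName("k8s-master-url-demo")
    .getOrCreate()
)
```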
[spark] branch master updated (e026412 -> 0bdadba)
This is an automated email from the ASF dual-hosted git repository.

vanzin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from e026412  [SPARK-29679][SQL] Make interval type comparable and orderable
 add 0bdadba  [SPARK-29790][DOC] Note required port for Kube API

No new revisions were added by this update.

Summary of changes:
 docs/running-on-kubernetes.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated (e7f7990 -> e026412)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from e7f7990  [SPARK-29688][SQL] Support average for interval type values
 add e026412  [SPARK-29679][SQL] Make interval type comparable and orderable

No new revisions were added by this update.

Summary of changes:
 .../spark/unsafe/types/CalendarInterval.java       |  25 ++-
 .../spark/sql/catalyst/analysis/TypeCoercion.scala |   5 +
 .../expressions/codegen/CodeGenerator.scala        |   1 +
 .../spark/sql/catalyst/expressions/ordering.scala  |   1 +
 .../apache/spark/sql/catalyst/util/TypeUtils.scala |   1 +
 .../spark/sql/types/CalendarIntervalType.scala     |   3 +
 .../test/resources/sql-tests/inputs/interval.sql   |  43 +
 .../resources/sql-tests/results/interval.sql.out   | 180 +
 8 files changed, 258 insertions(+), 1 deletion(-)
 create mode 100644 sql/core/src/test/resources/sql-tests/inputs/interval.sql
 create mode 100644 sql/core/src/test/resources/sql-tests/results/interval.sql.out
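A hedged sketch of what the change enables (assumes an active SparkSession `spark`; the interval literals are arbitrary):

```python
# Hedged sketch (assumes SparkSession `spark`): with SPARK-29679, interval
# values support comparison operators and ORDER BY.
spark.sql("SELECT interval 1 day > interval 23 hours").show()  # expected: true
spark.sql(
    "SELECT v FROM VALUES (interval 2 hours), (interval 1 day) AS t(v) ORDER BY v"
).show()  # expected order: 2 hours first, then 1 day
```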
[spark] branch master updated (afc943f -> e7f7990)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from afc943f  [SPARK-28477][SQL] Rewrite CaseWhen with single branch to If
 add e7f7990  [SPARK-29688][SQL] Support average for interval type values

No new revisions were added by this update.

Summary of changes:
 .../catalyst/expressions/aggregate/Average.scala   | 15 ++--
 .../analysis/ExpressionTypeCheckingSuite.scala     |  2 +-
 .../test/resources/sql-tests/inputs/group-by.sql   | 34 -
 .../resources/sql-tests/results/group-by.sql.out   | 89 +-
 4 files changed, 131 insertions(+), 9 deletions(-)
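Correspondingly, a hedged sketch of the new capability (assumes an active SparkSession `spark`):

```python
# Hedged sketch (assumes SparkSession `spark`): avg() now accepts interval
# values, dividing the summed interval by the row count.
spark.sql(
    "SELECT avg(v) FROM VALUES (interval 1 day), (interval 3 days) AS t(v)"
).show()  # expected: an interval equal to 2 days
```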
[spark] branch master updated (d1cb98d -> afc943f)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from d1cb98d  [SPARK-29756][ML] CountVectorizer forget to unpersist intermediate rdd
 add afc943f  [SPARK-28477][SQL] Rewrite CaseWhen with single branch to If

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/expressions/conditionalExpressions.scala  | 14 +-
 .../catalyst/expressions/ConditionalExpressionSuite.scala  |  3 ++-
 2 files changed, 15 insertions(+), 2 deletions(-)
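The rewrite itself is a small semantic identity: `CASE WHEN c THEN a ELSE b END` with exactly one branch means the same thing as `IF(c, a, b)`, so the optimizer can substitute the simpler `If` expression. A hedged sketch of how one might observe it (assumes an active SparkSession `spark`):

```python
# Hedged sketch (assumes SparkSession `spark`): after SPARK-28477, a
# single-branch CASE WHEN should optimize to the same If expression as
# an explicit IF, which the optimized/physical plan makes visible.
spark.sql(
    "SELECT CASE WHEN id > 1 THEN 'big' ELSE 'small' END FROM range(3)"
).explain()
spark.sql("SELECT IF(id > 1, 'big', 'small') FROM range(3)").explain()
# expected: both plans contain the same `if (...) ... else ...` expression
```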
[spark] branch master updated (7759f71 -> d1cb98d)
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 7759f71  [SPARK-29772][TESTS][SQL] Add withNamespace in SQLTestUtils
 add d1cb98d  [SPARK-29756][ML] CountVectorizer forget to unpersist intermediate rdd

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/ml/feature/CountVectorizer.scala | 26 ++
 1 file changed, 17 insertions(+), 9 deletions(-)
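The fix follows the usual cache-hygiene pattern for intermediate datasets; a generic hedged sketch of that pattern (not CountVectorizer's actual Scala code):

```python
# Generic hedged sketch of the pattern the fix applies (not the actual
# CountVectorizer code): persist an intermediate RDD reused by multiple
# actions, then unpersist it once the final result is materialized.
from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()
docs = sc.parallelize([["a", "b"], ["a", "c"]])
wordCounts = docs.flatMap(lambda ws: ws).map(lambda w: (w, 1))
wordCounts.persist(StorageLevel.MEMORY_AND_DISK)
try:
    n_tokens = wordCounts.count()                                 # first action
    vocab = wordCounts.reduceByKey(lambda x, y: x + y).collect()  # second action
finally:
    wordCounts.unpersist()  # release the intermediate RDD once consumed
print(n_tokens, vocab)
```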