[spark] branch master updated (7cfd589 -> 1e2d76e)

2019-11-08 Thread lixiao
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 7cfd589  [SPARK-28893][SQL] Support MERGE INTO in the parser and add 
the corresponding logical plan
 add 1e2d76e  [HOT-FIX] Fix the SQLBase.g4

No new revisions were added by this update.

Summary of changes:
 .../src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 | 4 ----
 1 file changed, 4 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (8152a87 -> 7cfd589)

2019-11-08 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 8152a87  [SPARK-28978][ ] Support > 256 args to python udf
 add 7cfd589  [SPARK-28893][SQL] Support MERGE INTO in the parser and add 
the corresponding logical plan

No new revisions were added by this update.

Summary of changes:
 docs/sql-keywords.md   |   2 +
 .../apache/spark/sql/catalyst/parser/SqlBase.g4|  43 +++-
 .../spark/sql/catalyst/analysis/Analyzer.scala |  45 
 .../spark/sql/catalyst/parser/AstBuilder.scala | 129 ---
 .../sql/catalyst/plans/logical/v2Commands.scala|  45 +++-
 .../spark/sql/catalyst/parser/DDLParserSuite.scala | 179 +++-
 .../spark/sql/execution/SparkStrategies.scala  |   2 +
 .../sql-tests/results/postgreSQL/comments.sql.out  |  18 +-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala |  85 
 .../execution/command/PlanResolutionSuite.scala| 235 -
 10 files changed, 742 insertions(+), 41 deletions(-)
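For reference, the MERGE INTO statement this commit teaches the parser follows the common SQL shape; the sketch below is illustrative only (table and column names are hypothetical, not taken from the commit):

```python
# Hedged sketch of the MERGE INTO syntax added by SPARK-28893
# (target/source/column names here are made up for illustration).
merge_sql = """
MERGE INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED AND s.deleted THEN DELETE
WHEN MATCHED THEN UPDATE SET t.value = s.value
WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value)
"""

# Every MERGE names a target, a source, and at least one WHEN clause;
# a quick structural check of the sketch above:
clauses = [line for line in merge_sql.splitlines() if line.startswith("WHEN")]
print(len(clauses))  # 3
```

Each WHEN clause may carry an extra condition (the `AND s.deleted` above), which is what the added logical plan has to model.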





[spark] branch master updated: [SPARK-28978][ ] Support > 256 args to python udf

2019-11-08 Thread meng
This is an automated email from the ASF dual-hosted git repository.

meng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8152a87  [SPARK-28978][ ] Support > 256 args to python udf
8152a87 is described below

commit 8152a87235a63a13969f7c1ff5ed038956e8ed76
Author: Bago Amirbekian 
AuthorDate: Fri Nov 8 19:19:14 2019 -0800

[SPARK-28978][ ] Support > 256 args to python udf

### What changes were proposed in this pull request?

On the worker we express lambda functions as strings and then eval them to
create a "mapper" function. This makes the code hard to read and limits the
number of arguments a UDF can support to 255 for Python <= 3.6.

This PR rewrites the mapper functions as nested functions instead of
"lambda strings" and allows passing in more than 255 args.

### Why are the changes needed?
The JIRA ticket associated with this issue describes how MLflow uses UDFs
to consume columns as features. This pattern isn't unique, and a limit of
255 features is quite low.

### Does this PR introduce any user-facing change?
Users can now pass more than 255 columns to a UDF.

### How was this patch tested?
Added a unit test for passing in > 255 args to udf.

Closes #26442 from MrBago/replace-lambdas-on-worker.

Authored-by: Bago Amirbekian 
Signed-off-by: Xiangrui Meng 
---
 python/pyspark/sql/tests/test_udf.py | 13 
 python/pyspark/worker.py | 62 +---
 2 files changed, 42 insertions(+), 33 deletions(-)
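The rewrite described above can be sketched as follows; `make_mapper` is an illustrative helper, not PySpark's actual code:

```python
# Sketch of the approach: build the mapper as a nested function over a
# single runtime argument instead of eval'ing a string like
# "lambda a: f(a[0], ..., a[299])", so the 255-argument ceiling on call
# expressions in Python <= 3.6 never applies.
def make_mapper(f, offsets):
    def mapper(a):
        # Index into the single row argument; the number of values passed
        # to f is bounded only by the length of `offsets`.
        return f(*[a[o] for o in offsets])
    return mapper

mapper = make_mapper(lambda *cols: len(cols), list(range(300)))
print(mapper(list(range(300))))  # 300
```

Because `mapper` closes over `f` and `offsets`, no source string ever has to spell out hundreds of positional arguments.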

diff --git a/python/pyspark/sql/tests/test_udf.py 
b/python/pyspark/sql/tests/test_udf.py
index c274dc7..3b9f12f 100644
--- a/python/pyspark/sql/tests/test_udf.py
+++ b/python/pyspark/sql/tests/test_udf.py
@@ -629,6 +629,19 @@ class UDFTests(ReusedSQLTestCase):
 
         self.sc.parallelize(range(1), 1).mapPartitions(task).count()
 
+    def test_udf_with_256_args(self):
+        N = 256
+        data = [["data-%d" % i for i in range(N)]] * 5
+        df = self.spark.createDataFrame(data)
+
+        def f(*a):
+            return "success"
+
+        fUdf = udf(f, StringType())
+
+        r = df.select(fUdf(*df.columns))
+        self.assertEqual(r.first()[0], "success")
+
 
 class UDFInitializationTests(unittest.TestCase):
     def tearDown(self):
diff --git a/python/pyspark/worker.py b/python/pyspark/worker.py
index 3a1200e..bfa8d97 100644
--- a/python/pyspark/worker.py
+++ b/python/pyspark/worker.py
@@ -403,54 +403,50 @@ def read_udfs(pickleSer, infile, eval_type):
             idx += offsets_len
         return parsed
 
-    udfs = {}
-    call_udf = []
-    mapper_str = ""
     if eval_type == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
-        # Create function like this:
-        #   lambda a: f([a[0]], [a[0], a[1]])
-
         # We assume there is only one UDF here because grouped map doesn't
         # support combining multiple UDFs.
         assert num_udfs == 1
 
         # See FlatMapGroupsInPandasExec for how arg_offsets are used to
         # distinguish between grouping attributes and data attributes
-        arg_offsets, udf = read_single_udf(
-            pickleSer, infile, eval_type, runner_conf, udf_index=0)
-        udfs['f'] = udf
+        arg_offsets, f = read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=0)
         parsed_offsets = extract_key_value_indexes(arg_offsets)
-        keys = ["a[%d]" % (o,) for o in parsed_offsets[0][0]]
-        vals = ["a[%d]" % (o, ) for o in parsed_offsets[0][1]]
-        mapper_str = "lambda a: f([%s], [%s])" % (", ".join(keys), ", ".join(vals))
+
+        # Create function like this:
+        #   mapper a: f([a[0]], [a[0], a[1]])
+        def mapper(a):
+            keys = [a[o] for o in parsed_offsets[0][0]]
+            vals = [a[o] for o in parsed_offsets[0][1]]
+            return f(keys, vals)
     elif eval_type == PythonEvalType.SQL_COGROUPED_MAP_PANDAS_UDF:
         # We assume there is only one UDF here because cogrouped map doesn't
         # support combining multiple UDFs.
         assert num_udfs == 1
-        arg_offsets, udf = read_single_udf(
-            pickleSer, infile, eval_type, runner_conf, udf_index=0)
-        udfs['f'] = udf
+        arg_offsets, f = read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=0)
+
         parsed_offsets = extract_key_value_indexes(arg_offsets)
-        df1_keys = ["a[0][%d]" % (o, ) for o in parsed_offsets[0][0]]
-        df1_vals = ["a[0][%d]" % (o, ) for o in parsed_offsets[0][1]]
-        df2_keys = ["a[1][%d]" % (o, ) for o in parsed_offsets[1][0]]
-        df2_vals = ["a[1][%d]" % (o, ) for o in parsed_offsets[1][1]]
-        mapper_str = "lambda a: f([%s], [%s], [%s], [%s])" % (
-            ", ".join(df1_keys), ", ".join(df1_vals), ", ".join(df2_keys), ", ".join(df2_vals))
+
+        def 

[spark] branch master updated (7fc9db0 -> 70987d8)

2019-11-08 Thread viirya
This is an automated email from the ASF dual-hosted git repository.

viirya pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 7fc9db0  [SPARK-29798][PYTHON][SQL] Infers bytes as binary type in 
createDataFrame in Python 3 at PySpark
 add 70987d8  [SPARK-29680][SQL][FOLLOWUP] Replace qualifiedName with 
multipartIdentifier

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/catalyst/parser/SqlBase.g4| 30 ++---
 .../spark/sql/catalyst/parser/AstBuilder.scala | 38 +++---
 .../sql/catalyst/parser/ErrorParserSuite.scala | 26 +++
 .../spark/sql/execution/SparkSqlParser.scala   | 14 
 4 files changed, 77 insertions(+), 31 deletions(-)





[spark] branch master updated (0bdadba -> 7fc9db0)

2019-11-08 Thread cutlerb
This is an automated email from the ASF dual-hosted git repository.

cutlerb pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 0bdadba  [SPARK-29790][DOC] Note required port for Kube API
 add 7fc9db0  [SPARK-29798][PYTHON][SQL] Infers bytes as binary type in 
createDataFrame in Python 3 at PySpark

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/tests/test_types.py | 16 
 python/pyspark/sql/types.py|  5 +
 2 files changed, 21 insertions(+)
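The inference rule this commit adds can be sketched as a simple type mapping; the table below is illustrative only, not PySpark's actual `types.py` inference code:

```python
# Illustrative mapping: in Python 3, bytes (and bytearray) values in input
# rows are now inferred as Spark's BinaryType by createDataFrame
# (SPARK-29798 adds the bytes case). Type names are strings here purely
# for demonstration.
_ILLUSTRATIVE_TYPE_MAP = {
    bool: "BooleanType",
    int: "LongType",
    float: "DoubleType",
    str: "StringType",
    bytes: "BinaryType",
    bytearray: "BinaryType",
}

def infer_spark_type(value):
    return _ILLUSTRATIVE_TYPE_MAP[type(value)]

print(infer_spark_type(b"\x00\x01"))  # BinaryType
```

Before this change, passing raw `bytes` rows in Python 3 had no inference rule, so the mapping above fills the gap alongside the existing scalar types.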





[spark] branch branch-2.4 updated: [SPARK-29790][DOC] Note required port for Kube API

2019-11-08 Thread vanzin
This is an automated email from the ASF dual-hosted git repository.

vanzin pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new bef7c0f  [SPARK-29790][DOC] Note required port for Kube API
bef7c0f is described below

commit bef7c0fddbddf1cc9d22b3ea60153b7e1bf8809d
Author: Emil Sandstø 
AuthorDate: Fri Nov 8 09:32:29 2019 -0800

[SPARK-29790][DOC] Note required port for Kube API

It adds a note about the required port of a master URL in Kubernetes.

Currently a port needs to be specified for the Kubernetes API, even when 
the API is hosted on the default HTTPS port 443; otherwise the driver 
might fail as described in 
https://medium.com/kidane.weldemariam_75349/thanks-james-on-issuing-spark-submit-i-run-into-this-error-cc507d4f8f0d

Yes, a change to the "Running on Kubernetes" guide.

None - Documentation change

Closes #26426 from Tapped/patch-1.

Authored-by: Emil Sandstø 
Signed-off-by: Marcelo Vanzin 
(cherry picked from commit 0bdadba5e3810f8e3f5da13e2a598071cbadab94)
Signed-off-by: Marcelo Vanzin 
---
 docs/running-on-kubernetes.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md
index 0277043..037e1d5 100644
--- a/docs/running-on-kubernetes.md
+++ b/docs/running-on-kubernetes.md
@@ -103,7 +103,7 @@ $ bin/spark-submit \
 ```
 
 The Spark master, specified either via passing the `--master` command line argument to `spark-submit` or by setting
-`spark.master` in the application's configuration, must be a URL with the format `k8s://<api_server_url>`. Prefixing the
+`spark.master` in the application's configuration, must be a URL with the format `k8s://<api_server_host>:<k8s-apiserver-port>`. The port must always be specified, even if it's the HTTPS port 443. Prefixing the
 master string with `k8s://` will cause the Spark application to launch on the Kubernetes cluster, with the API server
 being contacted at `api_server_url`. If no HTTP protocol is specified in the URL, it defaults to `https`. For example,
 setting the master to `k8s://example.com:443` is equivalent to setting it to `k8s://https://example.com:443`, but to





[spark] branch master updated (e026412 -> 0bdadba)

2019-11-08 Thread vanzin
This is an automated email from the ASF dual-hosted git repository.

vanzin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from e026412  [SPARK-29679][SQL] Make interval type comparable and orderable
 add 0bdadba  [SPARK-29790][DOC] Note required port for Kube API

No new revisions were added by this update.

Summary of changes:
 docs/running-on-kubernetes.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
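The URL rule this doc change documents (the `k8s://` prefix, scheme defaulting, and the always-explicit port) can be sketched in Python; `normalize_k8s_master` is our illustrative name, not Spark's:

```python
# Hedged sketch of the master-URL handling described in the docs: strip
# the k8s:// prefix and default the scheme to https when none is given.
# The port itself is never defaulted, which is why the doc now says it
# must always be written out, even for HTTPS's 443.
def normalize_k8s_master(master):
    assert master.startswith("k8s://"), "master must use the k8s:// prefix"
    url = master[len("k8s://"):]
    if not url.startswith(("http://", "https://")):
        url = "https://" + url
    return url

print(normalize_k8s_master("k8s://example.com:443"))
# https://example.com:443  (same as k8s://https://example.com:443)
```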





[spark] branch master updated (e7f7990 -> e026412)

2019-11-08 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from e7f7990  [SPARK-29688][SQL] Support average for interval type values
 add e026412  [SPARK-29679][SQL] Make interval type comparable and orderable

No new revisions were added by this update.

Summary of changes:
 .../spark/unsafe/types/CalendarInterval.java   |  25 ++-
 .../spark/sql/catalyst/analysis/TypeCoercion.scala |   5 +
 .../expressions/codegen/CodeGenerator.scala|   1 +
 .../spark/sql/catalyst/expressions/ordering.scala  |   1 +
 .../apache/spark/sql/catalyst/util/TypeUtils.scala |   1 +
 .../spark/sql/types/CalendarIntervalType.scala |   3 +
 .../test/resources/sql-tests/inputs/interval.sql   |  43 +
 .../resources/sql-tests/results/interval.sql.out   | 180 +
 8 files changed, 258 insertions(+), 1 deletion(-)
 create mode 100644 sql/core/src/test/resources/sql-tests/inputs/interval.sql
 create mode 100644 
sql/core/src/test/resources/sql-tests/results/interval.sql.out





[spark] branch master updated (afc943f -> e7f7990)

2019-11-08 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from afc943f  [SPARK-28477][SQL] Rewrite CaseWhen with single branch to If
 add e7f7990  [SPARK-29688][SQL] Support average for interval type values

No new revisions were added by this update.

Summary of changes:
 .../catalyst/expressions/aggregate/Average.scala   | 15 ++--
 .../analysis/ExpressionTypeCheckingSuite.scala |  2 +-
 .../test/resources/sql-tests/inputs/group-by.sql   | 34 -
 .../resources/sql-tests/results/group-by.sql.out   | 89 +-
 4 files changed, 131 insertions(+), 9 deletions(-)





[spark] branch master updated (d1cb98d -> afc943f)

2019-11-08 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from d1cb98d  [SPARK-29756][ML] CountVectorizer forget to unpersist 
intermediate rdd
 add afc943f  [SPARK-28477][SQL] Rewrite CaseWhen with single branch to If

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/expressions/conditionalExpressions.scala  | 14 +-
 .../catalyst/expressions/ConditionalExpressionSuite.scala  |  3 ++-
 2 files changed, 15 insertions(+), 2 deletions(-)
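The optimization in SPARK-28477 rewrites a CASE WHEN with exactly one branch into an IF expression. A tiny expression-tree sketch of the rule (these classes are illustrative, not Catalyst's):

```python
from collections import namedtuple

# branches is a list of (condition, value) pairs; else_value may be None,
# which models a missing ELSE (i.e. NULL).
CaseWhen = namedtuple("CaseWhen", ["branches", "else_value"])
If = namedtuple("If", ["predicate", "true_value", "false_value"])

def rewrite(expr):
    # Single-branch CASE WHEN is equivalent to IF(cond, value, else).
    if isinstance(expr, CaseWhen) and len(expr.branches) == 1:
        (cond, value), = expr.branches
        return If(cond, value, expr.else_value)
    return expr

e = rewrite(CaseWhen(branches=[("x > 0", "pos")], else_value="non-pos"))
print(e)  # If(predicate='x > 0', true_value='pos', false_value='non-pos')
```

Multi-branch CASE WHEN expressions are left untouched; only the degenerate single-branch form collapses to the simpler IF node.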





[spark] branch master updated (7759f71 -> d1cb98d)

2019-11-08 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 7759f71  [SPARK-29772][TESTS][SQL] Add withNamespace in SQLTestUtils
 add d1cb98d  [SPARK-29756][ML] CountVectorizer forget to unpersist 
intermediate rdd

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/ml/feature/CountVectorizer.scala  | 26 ++
 1 file changed, 17 insertions(+), 9 deletions(-)
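The bug fixed here is a forgotten `unpersist` on a cached intermediate RDD. The discipline the fix enforces can be sketched generically; `FakeRDD` is a minimal stand-in for illustration, not Spark's API:

```python
class FakeRDD:
    # Minimal stand-in that only tracks persist state.
    def __init__(self, data):
        self.data = data
        self.persisted = False

    def persist(self):
        self.persisted = True
        return self

    def unpersist(self):
        self.persisted = False
        return self

def count_terms(rdd):
    rdd.persist()  # cache the intermediate result while it is reused
    try:
        return len(set(rdd.data))
    finally:
        # Release the cache once the derived result is computed; the bug
        # was that this step was missing, leaking cached blocks.
        rdd.unpersist()

rdd = FakeRDD(["a", "b", "a"])
print(count_terms(rdd))  # 2
print(rdd.persisted)     # False
```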




