[GitHub] spark pull request #21369: [SPARK-22713][CORE] ExternalAppendOnlyMap leaks w...
Github user eyalfa commented on a diff in the pull request: https://github.com/apache/spark/pull/21369#discussion_r192631230 --- Diff: core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala --- @@ -414,7 +415,106 @@ class ExternalAppendOnlyMapSuite extends SparkFunSuite with LocalSparkContext { sc.stop() } - test("external aggregation updates peak execution memory") { + test("SPARK-22713 spill during iteration leaks internal map") { +val size = 1000 +val conf = createSparkConf(loadDefaults = true) +sc = new SparkContext("local-cluster[1,1,1024]", "test", conf) +val map = createExternalMap[Int] --- End diff -- @cloud-fan , can we move on with this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13599 Merged build finished. Test PASSed.
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13599 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3789/ Test PASSed.
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13599 **[Test build #91437 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91437/testReport)** for PR 13599 at commit [`9f32a2f`](https://github.com/apache/spark/commit/9f32a2f3c1277652856e59997af83a2bbe91ce74).
[GitHub] spark issue #21156: [SPARK-24087][SQL] Avoid shuffle when join keys are a su...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21156 **[Test build #91436 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91436/testReport)** for PR 21156 at commit [`946688a`](https://github.com/apache/spark/commit/946688aee3d03d37a57270e654e00bb9236f21c4).
[GitHub] spark issue #21156: [SPARK-24087][SQL] Avoid shuffle when join keys are a su...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21156 **[Test build #91435 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91435/testReport)** for PR 21156 at commit [`fa76a78`](https://github.com/apache/spark/commit/fa76a7823baf4e6eb05f33bc746ade7f65f44372).
[GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15297 Can one of the admins verify this patch?
[GitHub] spark pull request #21313: [SPARK-24187][R][SQL]Add array_join function to S...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21313#discussion_r192627641 --- Diff: R/pkg/R/functions.R --- @@ -3006,6 +3008,27 @@ setMethod("array_contains", column(jc) }) +#' @details +#' \code{array_join}: Concatenates the elements of column using the delimiter. +#' Null values are replaced with nullReplacement if set, otherwise they are ignored. +#' +#' @param delimiter a character string that is used to concatenate the elements of column. +#' @param nullReplacement a character string that is used to replace the Null values. +#' @rdname column_collection_functions +#' @aliases array_join array_join,Column-method +#' @note array_join since 2.4.0 +setMethod("array_join", + signature(x = "Column", delimiter = "character"), + function(x, delimiter, nullReplacement = NULL) { + jc <- if (is.null(nullReplacement)) { + callJStatic("org.apache.spark.sql.functions", "array_join", x@jc, delimiter) + } else { + callJStatic("org.apache.spark.sql.functions", "array_join", x@jc, delimiter, + nullReplacement) --- End diff -- `as.character(nullReplacement)` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21313: [SPARK-24187][R][SQL]Add array_join function to S...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21313#discussion_r192627948 --- Diff: R/pkg/R/functions.R --- @@ -3006,6 +3008,27 @@ setMethod("array_contains", column(jc) }) +#' @details +#' \code{array_join}: Concatenates the elements of column using the delimiter. +#' Null values are replaced with nullReplacement if set, otherwise they are ignored. +#' +#' @param delimiter a character string that is used to concatenate the elements of column. +#' @param nullReplacement a character string that is used to replace the Null values. +#' @rdname column_collection_functions +#' @aliases array_join array_join,Column-method +#' @note array_join since 2.4.0 +setMethod("array_join", + signature(x = "Column", delimiter = "character"), + function(x, delimiter, nullReplacement = NULL) { + jc <- if (is.null(nullReplacement)) { + callJStatic("org.apache.spark.sql.functions", "array_join", x@jc, delimiter) + } else { + callJStatic("org.apache.spark.sql.functions", "array_join", x@jc, delimiter, + nullReplacement) --- End diff -- re https://github.com/apache/spark/pull/21313#discussion_r192578750 so what's the behavior if delimiter is more than one character? like `array_join(df$a, "#Foo", "Beautiful")`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
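For reference while this question is open, here is a plain-Python model of the semantics the roxygen text in the diff describes (nulls replaced when `nullReplacement` is set, otherwise ignored). It is only a sketch: `array_join_model` is a hypothetical name, it does not call Spark, and it does not confirm what the JVM function actually does with a multi-character delimiter. In this model the delimiter is simply used as a whole string.

```python
from typing import List, Optional


def array_join_model(items: List[Optional[str]], delimiter: str,
                     null_replacement: Optional[str] = None) -> str:
    """Model of the documented array_join semantics (hypothetical, not Spark code)."""
    if null_replacement is None:
        # No replacement given: null elements are dropped.
        kept = [x for x in items if x is not None]
    else:
        # Replacement given: null elements are substituted.
        kept = [null_replacement if x is None else x for x in items]
    # str.join inserts the whole delimiter string between elements,
    # regardless of how many characters it has.
    return delimiter.join(kept)


print(array_join_model(["a", None, "b"], "#Foo", "Beautiful"))
print(array_join_model(["a", None, "b"], "#Foo"))
```

If the SparkR wrapper behaves like this model, a test with a delimiter such as `"#Foo"` would be a cheap way to pin the behavior down.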
[GitHub] spark pull request #20894: [SPARK-23786][SQL] Checking column names of csv h...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20894
[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20894 LGTM Thanks! Merged to master
[GitHub] spark pull request #21487: [SPARK-24369][SQL] Correct handling for multiple ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21487
[GitHub] spark issue #21487: [SPARK-24369][SQL] Correct handling for multiple distinc...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21487 Thanks! Merged to master and 2.3
[GitHub] spark issue #21156: [SPARK-24087][SQL] Avoid shuffle when join keys are a su...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21156 **[Test build #91434 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91434/testReport)** for PR 21156 at commit [`4e026e5`](https://github.com/apache/spark/commit/4e026e5e437dc7f578434244b55bb1ebe189bace).
[GitHub] spark pull request #21486: [SPARK-24387][Core] Heartbeat-timeout executor is...
Github user lirui-apache commented on a diff in the pull request: https://github.com/apache/spark/pull/21486#discussion_r192614888 --- Diff: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala --- @@ -197,14 +197,14 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock) if (now - lastSeenMs > executorTimeoutMs) { logWarning(s"Removing executor $executorId with no recent heartbeats: " + s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms") -scheduler.executorLost(executorId, SlaveLost("Executor heartbeat " + - s"timed out after ${now - lastSeenMs} ms")) // Asynchronously kill the executor to avoid blocking the current thread killExecutorThread.submit(new Runnable { override def run(): Unit = Utils.tryLogNonFatalError { // Note: we want to get an executor back after expiring this one, // so do not simply call `sc.killExecutor` here (SPARK-8119) sc.killAndReplaceExecutor(executorId) --- End diff --
Yes:
```
private[spark] def killAndReplaceExecutor(executorId: String): Boolean = {
  schedulerBackend match {
    case b: ExecutorAllocationClient =>
      b.killExecutors(Seq(executorId), adjustTargetNumExecutors = false, countFailures = true,
        force = true).nonEmpty
    case _ =>
      logWarning("Killing executors is not supported by current scheduler.")
      false
  }
}
```
[GitHub] spark pull request #21486: [SPARK-24387][Core] Heartbeat-timeout executor is...
Github user lirui-apache commented on a diff in the pull request: https://github.com/apache/spark/pull/21486#discussion_r192614874 --- Diff: core/src/test/scala/org/apache/spark/HeartbeatReceiverSuite.scala --- @@ -207,6 +210,55 @@ class HeartbeatReceiverSuite assert(fakeClusterManager.getExecutorIdsToKill === Set(executorId1, executorId2)) } + test("expired host should not be offered again") { +scheduler = spy(new TaskSchedulerImpl(sc)) +scheduler.setDAGScheduler(sc.dagScheduler) +when(sc.taskScheduler).thenReturn(scheduler) +doReturn(true).when(scheduler).executorHeartbeatReceived(any(), any(), any()) + +// Set up a fake backend and cluster manager to simulate killing executors +val rpcEnv = sc.env.rpcEnv +val fakeClusterManager = new FakeClusterManager(rpcEnv) +val fakeClusterManagerRef = rpcEnv.setupEndpoint("fake-cm", fakeClusterManager) +val fakeSchedulerBackend = new FakeSchedulerBackend(scheduler, rpcEnv, fakeClusterManagerRef) +when(sc.schedulerBackend).thenReturn(fakeSchedulerBackend) + +fakeSchedulerBackend.start() +val dummyExecutorEndpoint1 = new FakeExecutorEndpoint(rpcEnv) +val dummyExecutorEndpointRef1 = rpcEnv.setupEndpoint("fake-executor-1", dummyExecutorEndpoint1) +fakeSchedulerBackend.driverEndpoint.askSync[Boolean]( + RegisterExecutor(executorId1, dummyExecutorEndpointRef1, "1.2.3.4", 2, Map.empty)) +heartbeatReceiverRef.askSync[Boolean](TaskSchedulerIsSet) +addExecutorAndVerify(executorId1) +triggerHeartbeat(executorId1, executorShouldReregister = false) + +scheduler.initialize(fakeSchedulerBackend) +sc.requestTotalExecutors(0, 0, Map.empty) --- End diff -- Just thought we don't need a replacement executor for this test case. Will update total num to 1 to be closer to real use case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21486: [SPARK-24387][Core] Heartbeat-timeout executor is...
Github user lirui-apache commented on a diff in the pull request: https://github.com/apache/spark/pull/21486#discussion_r192614878 --- Diff: core/src/test/scala/org/apache/spark/HeartbeatReceiverSuite.scala --- @@ -207,6 +210,55 @@ class HeartbeatReceiverSuite assert(fakeClusterManager.getExecutorIdsToKill === Set(executorId1, executorId2)) } + test("expired host should not be offered again") { --- End diff -- Should be `executor`. Thanks for catching, will update
[GitHub] spark issue #21467: [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21467 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91433/ Test FAILed.
[GitHub] spark issue #21467: [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21467 Merged build finished. Test FAILed.
[GitHub] spark issue #21467: [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21467 **[Test build #91433 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91433/testReport)** for PR 21467 at commit [`8505de2`](https://github.com/apache/spark/commit/8505de28231e99b63371d0798545b693692cbce4).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21467: [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21467 **[Test build #91433 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91433/testReport)** for PR 21467 at commit [`8505de2`](https://github.com/apache/spark/commit/8505de28231e99b63371d0798545b693692cbce4).
[GitHub] spark issue #21467: [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21467 retest this please
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r192610559 --- Diff: python/pyspark/sql/dataframe.py --- @@ -351,8 +354,70 @@ def show(self, n=20, truncate=True, vertical=False): else: print(self._jdf.showString(n, int(truncate), vertical)) +@property +def _eager_eval(self): +"""Returns true if the eager evaluation enabled. +""" +return self.sql_ctx.getConf( +"spark.sql.repl.eagerEval.enabled", "false").lower() == "true" + +@property +def _max_num_rows(self): +"""Returns the max row number for eager evaluation. +""" +return int(self.sql_ctx.getConf( +"spark.sql.repl.eagerEval.maxNumRows", "20")) + +@property +def _truncate(self): +"""Returns the truncate length for eager evaluation. +""" +return int(self.sql_ctx.getConf( +"spark.sql.repl.eagerEval.truncate", "20")) + def __repr__(self): -return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes)) +if not self._support_repr_html and self._eager_eval: +vertical = False +return self._jdf.showString( +self._max_num_rows, self._truncate, vertical) +else: +return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes)) + +def _repr_html_(self): +"""Returns a dataframe with html code when you enabled eager evaluation +by 'spark.sql.repl.eagerEval.enabled', this only called by REPL you are +using support eager evaluation with HTML. 
+""" +import cgi +if not self._support_repr_html: +self._support_repr_html = True +if self._eager_eval: +max_num_rows = max(self._max_num_rows, 0) +with SCCallSiteSync(self._sc) as css: +vertical = False +sock_info = self._jdf.getRowsToPython( +max_num_rows, self._truncate, vertical) +rows = list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer( +head = rows[0] +row_data = rows[1:] +has_more_data = len(row_data) > max_num_rows +row_data = row_data[0:max_num_rows] + +html = "\n" +# generate table head +html += "".join(map(lambda x: cgi.escape(x), head)) + "\n" +# generate table rows +for row in row_data: +data = "" + "".join(map(lambda x: cgi.escape(x), row)) + \ +"\n" --- End diff -- ditto: ``` "%s\n" % "".join(map(lambda x: cgi.escape(x), row)) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r192610390 --- Diff: python/pyspark/sql/dataframe.py --- @@ -351,8 +354,70 @@ def show(self, n=20, truncate=True, vertical=False): else: print(self._jdf.showString(n, int(truncate), vertical)) +@property +def _eager_eval(self): +"""Returns true if the eager evaluation enabled. +""" +return self.sql_ctx.getConf( +"spark.sql.repl.eagerEval.enabled", "false").lower() == "true" + +@property +def _max_num_rows(self): +"""Returns the max row number for eager evaluation. +""" +return int(self.sql_ctx.getConf( +"spark.sql.repl.eagerEval.maxNumRows", "20")) + +@property +def _truncate(self): +"""Returns the truncate length for eager evaluation. +""" +return int(self.sql_ctx.getConf( +"spark.sql.repl.eagerEval.truncate", "20")) + def __repr__(self): -return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes)) +if not self._support_repr_html and self._eager_eval: +vertical = False +return self._jdf.showString( +self._max_num_rows, self._truncate, vertical) +else: +return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes)) + +def _repr_html_(self): +"""Returns a dataframe with html code when you enabled eager evaluation +by 'spark.sql.repl.eagerEval.enabled', this only called by REPL you are +using support eager evaluation with HTML. 
+""" +import cgi +if not self._support_repr_html: +self._support_repr_html = True +if self._eager_eval: +max_num_rows = max(self._max_num_rows, 0) +with SCCallSiteSync(self._sc) as css: +vertical = False +sock_info = self._jdf.getRowsToPython( +max_num_rows, self._truncate, vertical) +rows = list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer( +head = rows[0] +row_data = rows[1:] +has_more_data = len(row_data) > max_num_rows +row_data = row_data[0:max_num_rows] --- End diff -- tiny nit: `row_data[:max_num_rows]` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
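A small sketch of the row handling quoted in this diff, with the suggested `[:max_num_rows]` slice. It assumes, as the diff implies, that the first element of `rows` is the header. `split_rows` is a hypothetical helper name used only for illustration and is not part of PySpark.

```python
def split_rows(rows, max_num_rows):
    """Separate the header from the data rows and cap the data at max_num_rows."""
    head = rows[0]
    row_data = rows[1:]
    # More rows arrived than will be displayed.
    has_more_data = len(row_data) > max_num_rows
    # row_data[:n] is equivalent to row_data[0:n]; the start index defaults to 0.
    return head, row_data[:max_num_rows], has_more_data


rows = [["key", "value"], ["1", "1"], ["2", "2"], ["3", "3"]]
head, shown, has_more = split_rows(rows, 2)
print(head, shown, has_more)
```

Both slice spellings return the same list, so the change is purely stylistic.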
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r192610512 --- Diff: python/pyspark/sql/dataframe.py --- @@ -351,8 +354,70 @@ def show(self, n=20, truncate=True, vertical=False): else: print(self._jdf.showString(n, int(truncate), vertical)) +@property +def _eager_eval(self): +"""Returns true if the eager evaluation enabled. +""" +return self.sql_ctx.getConf( +"spark.sql.repl.eagerEval.enabled", "false").lower() == "true" + +@property +def _max_num_rows(self): +"""Returns the max row number for eager evaluation. +""" +return int(self.sql_ctx.getConf( +"spark.sql.repl.eagerEval.maxNumRows", "20")) + +@property +def _truncate(self): +"""Returns the truncate length for eager evaluation. +""" +return int(self.sql_ctx.getConf( +"spark.sql.repl.eagerEval.truncate", "20")) + def __repr__(self): -return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes)) +if not self._support_repr_html and self._eager_eval: +vertical = False +return self._jdf.showString( +self._max_num_rows, self._truncate, vertical) +else: +return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes)) + +def _repr_html_(self): +"""Returns a dataframe with html code when you enabled eager evaluation +by 'spark.sql.repl.eagerEval.enabled', this only called by REPL you are +using support eager evaluation with HTML. 
+""" +import cgi +if not self._support_repr_html: +self._support_repr_html = True +if self._eager_eval: +max_num_rows = max(self._max_num_rows, 0) +with SCCallSiteSync(self._sc) as css: +vertical = False +sock_info = self._jdf.getRowsToPython( +max_num_rows, self._truncate, vertical) +rows = list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer( +head = rows[0] +row_data = rows[1:] +has_more_data = len(row_data) > max_num_rows +row_data = row_data[0:max_num_rows] + +html = "\n" +# generate table head +html += "".join(map(lambda x: cgi.escape(x), head)) + "\n" --- End diff -- maybe: ``` "%s\n" % "".join(map(lambda x: cgi.escape(x), head)) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
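To make the `%`-formatting suggestion concrete, here is a minimal sketch of building one header row. Two things in it are assumptions: the `<tr>`/`<th>` markup (the tags in the quoted diff did not survive the mail archive) and the helper name `header_row`. It also uses `html.escape` instead of `cgi.escape`, because `cgi.escape` was removed in Python 3.8.

```python
from html import escape  # stand-in for cgi.escape, which was removed in Python 3.8


def header_row(head):
    """Format one HTML table header row with %-formatting instead of '+' chains."""
    # The <tr>/<th> tags are assumed; the markup in the quoted diff was lost.
    return "<tr><th>%s</th></tr>\n" % "</th><th>".join(map(escape, head))


print(header_row(["key", "a<b"]))
```

Passing `escape` directly to `map` also avoids the `lambda x: ...` wrapper in the quoted code.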
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r192610308 --- Diff: python/pyspark/sql/dataframe.py --- @@ -351,8 +354,70 @@ def show(self, n=20, truncate=True, vertical=False): else: print(self._jdf.showString(n, int(truncate), vertical)) +@property +def _eager_eval(self): +"""Returns true if the eager evaluation enabled. +""" +return self.sql_ctx.getConf( +"spark.sql.repl.eagerEval.enabled", "false").lower() == "true" + +@property +def _max_num_rows(self): +"""Returns the max row number for eager evaluation. +""" +return int(self.sql_ctx.getConf( +"spark.sql.repl.eagerEval.maxNumRows", "20")) + +@property +def _truncate(self): +"""Returns the truncate length for eager evaluation. +""" +return int(self.sql_ctx.getConf( +"spark.sql.repl.eagerEval.truncate", "20")) + def __repr__(self): -return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes)) +if not self._support_repr_html and self._eager_eval: +vertical = False +return self._jdf.showString( +self._max_num_rows, self._truncate, vertical) +else: +return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes)) + +def _repr_html_(self): +"""Returns a dataframe with html code when you enabled eager evaluation +by 'spark.sql.repl.eagerEval.enabled', this only called by REPL you are +using support eager evaluation with HTML. +""" +import cgi +if not self._support_repr_html: +self._support_repr_html = True +if self._eager_eval: +max_num_rows = max(self._max_num_rows, 0) +with SCCallSiteSync(self._sc) as css: --- End diff -- `css` seems not used. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r192610839 --- Diff: python/pyspark/sql/tests.py --- @@ -3040,6 +3040,36 @@ def test_csv_sampling_ratio(self): .csv(rdd, samplingRatio=0.5).schema self.assertEquals(schema, StructType([StructField("_c0", IntegerType(), True)])) +def test_repr_html(self): +import re +pattern = re.compile(r'^ *\|', re.MULTILINE) +df = self.spark.createDataFrame([(1, "1"), (2, "2")], ("key", "value")) +self.assertEquals(None, df._repr_html_()) +self.spark.conf.set("spark.sql.repl.eagerEval.enabled", "true") --- End diff -- Can we use `with self.sql_conf(...)`?
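The point of a `with ...` block here is that the setting is restored even if an assertion fails, so it cannot leak into later tests. The sketch below shows that pattern with a plain dict standing in for the Spark conf. It is an illustration under that assumption, not the actual `sql_conf` helper from the PySpark test utilities, whose signature is not shown in this thread.

```python
from contextlib import contextmanager


@contextmanager
def sql_conf(conf, pairs):
    """Temporarily set keys in a dict-like conf, restoring old values on exit.

    Hypothetical stand-in: `conf` is a plain dict, not a Spark RuntimeConfig.
    """
    missing = object()  # sentinel for keys that were not set before
    old = {k: conf.get(k, missing) for k in pairs}
    conf.update(pairs)
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions alike.
        for k, v in old.items():
            if v is missing:
                conf.pop(k, None)
            else:
                conf[k] = v


conf = {}
with sql_conf(conf, {"spark.sql.repl.eagerEval.enabled": "true"}):
    print(conf)
print(conf)
```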
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r192610620 --- Diff: python/pyspark/sql/dataframe.py --- @@ -351,8 +354,70 @@ def show(self, n=20, truncate=True, vertical=False): else: print(self._jdf.showString(n, int(truncate), vertical)) +@property +def _eager_eval(self): +"""Returns true if the eager evaluation enabled. +""" +return self.sql_ctx.getConf( +"spark.sql.repl.eagerEval.enabled", "false").lower() == "true" + +@property +def _max_num_rows(self): +"""Returns the max row number for eager evaluation. +""" +return int(self.sql_ctx.getConf( +"spark.sql.repl.eagerEval.maxNumRows", "20")) + +@property +def _truncate(self): +"""Returns the truncate length for eager evaluation. +""" +return int(self.sql_ctx.getConf( +"spark.sql.repl.eagerEval.truncate", "20")) + def __repr__(self): -return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes)) +if not self._support_repr_html and self._eager_eval: +vertical = False +return self._jdf.showString( +self._max_num_rows, self._truncate, vertical) +else: +return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes)) + +def _repr_html_(self): +"""Returns a dataframe with html code when you enabled eager evaluation +by 'spark.sql.repl.eagerEval.enabled', this only called by REPL you are +using support eager evaluation with HTML. 
+""" +import cgi +if not self._support_repr_html: +self._support_repr_html = True +if self._eager_eval: +max_num_rows = max(self._max_num_rows, 0) +with SCCallSiteSync(self._sc) as css: +vertical = False +sock_info = self._jdf.getRowsToPython( +max_num_rows, self._truncate, vertical) +rows = list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer( +head = rows[0] +row_data = rows[1:] +has_more_data = len(row_data) > max_num_rows +row_data = row_data[0:max_num_rows] + +html = "\n" +# generate table head +html += "".join(map(lambda x: cgi.escape(x), head)) + "\n" +# generate table rows +for row in row_data: +data = "" + "".join(map(lambda x: cgi.escape(x), row)) + \ +"\n" +html += data +html += "\n" +if has_more_data: +html += "only showing top %d %s\n" % ( --- End diff -- I'd just say `row(s)`. There's no need to be super clever about this.
[GitHub] spark pull request #21485: [SPARK-24455][CORE] fix typo in TaskSchedulerImpl...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21485
[GitHub] spark issue #21485: [SPARK-24455][CORE] fix typo in TaskSchedulerImpl commen...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21485 Merged to master and branch-2.3.
[GitHub] spark issue #21487: [SPARK-24369][SQL] Correct handling for multiple distinc...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21487 LGTM
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21488 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3788/ Test PASSed.
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21488 Merged build finished. Test PASSed.
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21488 Merged build finished. Test FAILed.
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21488 **[Test build #91432 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91432/testReport)** for PR 21488 at commit [`062c6d0`](https://github.com/apache/spark/commit/062c6d00d9a7625a97d85f677ea03cc695d9c0dc).
* This patch **fails build dependency tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21488 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91432/ Test FAILed.
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21488 **[Test build #91432 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91432/testReport)** for PR 21488 at commit [`062c6d0`](https://github.com/apache/spark/commit/062c6d00d9a7625a97d85f677ea03cc695d9c0dc).
[GitHub] spark issue #21313: [SPARK-24187][R][SQL]Add array_join function to SparkR
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21313 **[Test build #91430 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91430/testReport)** for PR 21313 at commit [`901ff32`](https://github.com/apache/spark/commit/901ff32a03c6ec0c16a0ff7c625781ccf2355a54).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21313: [SPARK-24187][R][SQL]Add array_join function to SparkR
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21313 Merged build finished. Test PASSed.
[GitHub] spark issue #21313: [SPARK-24187][R][SQL]Add array_join function to SparkR
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21313 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91430/ Test PASSed.
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21488 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21488 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3787/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21488 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21488 **[Test build #91431 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91431/testReport)** for PR 21488 at commit [`f76da89`](https://github.com/apache/spark/commit/f76da891e6eb004f92b0c423ba371ec971da7584). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21488 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91431/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21488 **[Test build #91431 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91431/testReport)** for PR 21488 at commit [`f76da89`](https://github.com/apache/spark/commit/f76da891e6eb004f92b0c423ba371ec971da7584). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21313: [SPARK-24187][R][SQL]Add array_join function to SparkR
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21313 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3786/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21313: [SPARK-24187][R][SQL]Add array_join function to SparkR
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21313 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21485: [SPARK-24455][CORE] fix typo in TaskSchedulerImpl commen...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21485 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91428/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21485: [SPARK-24455][CORE] fix typo in TaskSchedulerImpl commen...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21485 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21313: [SPARK-24187][R][SQL]Add array_join function to SparkR
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21313 **[Test build #91430 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91430/testReport)** for PR 21313 at commit [`901ff32`](https://github.com/apache/spark/commit/901ff32a03c6ec0c16a0ff7c625781ccf2355a54). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21485: [SPARK-24455][CORE] fix typo in TaskSchedulerImpl commen...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21485 **[Test build #91428 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91428/testReport)** for PR 21485 at commit [`9ca9e4b`](https://github.com/apache/spark/commit/9ca9e4bf55c94ed1a2281b0beb5b1e4df648f48f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21488: SPARK-18057 Update structured streaming kafka fro...
Github user ijuma commented on a diff in the pull request: https://github.com/apache/spark/pull/21488#discussion_r192602632 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala --- @@ -96,10 +101,13 @@ class KafkaTestUtils(withBrokerProps: Map[String, Object] = Map.empty) extends L // Set up the Embedded Zookeeper server and get the proper Zookeeper port private def setupEmbeddedZookeeper(): Unit = { // Zookeeper server startup -zookeeper = new EmbeddedZookeeper(s"$zkHost:$zkPort") +val zkSvr = s"$zkHost:$zkPort"; +zookeeper = new EmbeddedZookeeper(zkSvr) // Get the actual zookeeper binding port zkPort = zookeeper.actualPort -zkUtils = ZkUtils(s"$zkHost:$zkPort", zkSessionTimeout, zkConnectionTimeout, false) +zkUtils = ZkUtils(zkSvr, zkSessionTimeout, zkConnectionTimeout, false) +zkClient = KafkaZkClient(zkSvr, false, 6000, 1, Int.MaxValue, Time.SYSTEM) +adminZkClient = new AdminZkClient(zkClient) --- End diff -- AdminClient.create gives you a concrete instance. createPartitions is the method you're looking for. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21488: SPARK-18057 Update structured streaming kafka fro...
Github user tedyu commented on a diff in the pull request: https://github.com/apache/spark/pull/21488#discussion_r192601997 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala --- @@ -96,10 +101,13 @@ class KafkaTestUtils(withBrokerProps: Map[String, Object] = Map.empty) extends L // Set up the Embedded Zookeeper server and get the proper Zookeeper port private def setupEmbeddedZookeeper(): Unit = { // Zookeeper server startup -zookeeper = new EmbeddedZookeeper(s"$zkHost:$zkPort") +val zkSvr = s"$zkHost:$zkPort"; +zookeeper = new EmbeddedZookeeper(zkSvr) // Get the actual zookeeper binding port zkPort = zookeeper.actualPort -zkUtils = ZkUtils(s"$zkHost:$zkPort", zkSessionTimeout, zkConnectionTimeout, false) +zkUtils = ZkUtils(zkSvr, zkSessionTimeout, zkConnectionTimeout, false) +zkClient = KafkaZkClient(zkSvr, false, 6000, 1, Int.MaxValue, Time.SYSTEM) +adminZkClient = new AdminZkClient(zkClient) --- End diff -- AdminClient is abstract. KafkaAdminClient doesn't provide addPartitions. Mind giving a pointer? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21488: SPARK-18057 Update structured streaming kafka fro...
Github user ijuma commented on a diff in the pull request: https://github.com/apache/spark/pull/21488#discussion_r192601219 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala --- @@ -96,10 +101,13 @@ class KafkaTestUtils(withBrokerProps: Map[String, Object] = Map.empty) extends L // Set up the Embedded Zookeeper server and get the proper Zookeeper port private def setupEmbeddedZookeeper(): Unit = { // Zookeeper server startup -zookeeper = new EmbeddedZookeeper(s"$zkHost:$zkPort") +val zkSvr = s"$zkHost:$zkPort"; +zookeeper = new EmbeddedZookeeper(zkSvr) // Get the actual zookeeper binding port zkPort = zookeeper.actualPort -zkUtils = ZkUtils(s"$zkHost:$zkPort", zkSessionTimeout, zkConnectionTimeout, false) +zkUtils = ZkUtils(zkSvr, zkSessionTimeout, zkConnectionTimeout, false) +zkClient = KafkaZkClient(zkSvr, false, 6000, 1, Int.MaxValue, Time.SYSTEM) +adminZkClient = new AdminZkClient(zkClient) --- End diff -- Can we use the Java AdminClient instead of these internal classes? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21488 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3785/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21488 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21488 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21488 **[Test build #91429 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91429/testReport)** for PR 21488 at commit [`0a22686`](https://github.com/apache/spark/commit/0a22686d9a388a21d5dd38513854341d3f37f738). * This patch **fails build dependency tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21488 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91429/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21488 **[Test build #91429 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91429/testReport)** for PR 21488 at commit [`0a22686`](https://github.com/apache/spark/commit/0a22686d9a388a21d5dd38513854341d3f37f738). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21488: SPARK-18057 Update structured streaming kafka fro...
GitHub user tedyu opened a pull request: https://github.com/apache/spark/pull/21488 SPARK-18057 Update structured streaming kafka from 0.10.0.1 to 2.0.0 ## What changes were proposed in this pull request? This PR upgrades to the Kafka 2.0.0 release where KIP-266 is integrated. ## How was this patch tested? This PR uses existing Kafka-related unit tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tedyu/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21488.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21488 commit 0a22686d9a388a21d5dd38513854341d3f37f738 Author: tedyu Date: 2018-06-03T19:54:22Z SPARK-18057 Update structured streaming kafka from 0.10.0.1 to 2.0.0 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21398: [SPARK-24338][SQL] Fixed Hive CREATETABLE error in Sentr...
Github user skonto commented on the issue: https://github.com/apache/spark/pull/21398 @vanzin should we merge this or add a comment in the docs? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21485: [SPARK-24455][CORE] fix typo in TaskSchedulerImpl commen...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21485 **[Test build #91428 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91428/testReport)** for PR 21485 at commit [`9ca9e4b`](https://github.com/apache/spark/commit/9ca9e4bf55c94ed1a2281b0beb5b1e4df648f48f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21485: [SPARK-24455][CORE] fix typo in TaskSchedulerImpl commen...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21485 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21485: [SPARK-24455][CORE] fix typo in TaskSchedulerImpl commen...
Github user xueyumusic commented on the issue: https://github.com/apache/spark/pull/21485 I took another look and found some more typos; please review them. @HyukjinKwon, thank you for the reminder. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21486: [SPARK-24387][Core] Heartbeat-timeout executor is...
Github user Ngone51 commented on a diff in the pull request: https://github.com/apache/spark/pull/21486#discussion_r192592142 --- Diff: core/src/test/scala/org/apache/spark/HeartbeatReceiverSuite.scala --- @@ -207,6 +210,55 @@ class HeartbeatReceiverSuite assert(fakeClusterManager.getExecutorIdsToKill === Set(executorId1, executorId2)) } + test("expired host should not be offered again") { +scheduler = spy(new TaskSchedulerImpl(sc)) +scheduler.setDAGScheduler(sc.dagScheduler) +when(sc.taskScheduler).thenReturn(scheduler) +doReturn(true).when(scheduler).executorHeartbeatReceived(any(), any(), any()) + +// Set up a fake backend and cluster manager to simulate killing executors +val rpcEnv = sc.env.rpcEnv +val fakeClusterManager = new FakeClusterManager(rpcEnv) +val fakeClusterManagerRef = rpcEnv.setupEndpoint("fake-cm", fakeClusterManager) +val fakeSchedulerBackend = new FakeSchedulerBackend(scheduler, rpcEnv, fakeClusterManagerRef) +when(sc.schedulerBackend).thenReturn(fakeSchedulerBackend) + +fakeSchedulerBackend.start() +val dummyExecutorEndpoint1 = new FakeExecutorEndpoint(rpcEnv) +val dummyExecutorEndpointRef1 = rpcEnv.setupEndpoint("fake-executor-1", dummyExecutorEndpoint1) +fakeSchedulerBackend.driverEndpoint.askSync[Boolean]( + RegisterExecutor(executorId1, dummyExecutorEndpointRef1, "1.2.3.4", 2, Map.empty)) +heartbeatReceiverRef.askSync[Boolean](TaskSchedulerIsSet) +addExecutorAndVerify(executorId1) +triggerHeartbeat(executorId1, executorShouldReregister = false) + +scheduler.initialize(fakeSchedulerBackend) +sc.requestTotalExecutors(0, 0, Map.empty) --- End diff -- why request 0 ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21486: [SPARK-24387][Core] Heartbeat-timeout executor is...
Github user Ngone51 commented on a diff in the pull request: https://github.com/apache/spark/pull/21486#discussion_r192592170 --- Diff: core/src/test/scala/org/apache/spark/HeartbeatReceiverSuite.scala --- @@ -207,6 +210,55 @@ class HeartbeatReceiverSuite assert(fakeClusterManager.getExecutorIdsToKill === Set(executorId1, executorId2)) } + test("expired host should not be offered again") { --- End diff -- Also, better to attach JIRA number. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21486: [SPARK-24387][Core] Heartbeat-timeout executor is...
Github user Ngone51 commented on a diff in the pull request: https://github.com/apache/spark/pull/21486#discussion_r192591845 --- Diff: core/src/test/scala/org/apache/spark/HeartbeatReceiverSuite.scala --- @@ -207,6 +210,55 @@ class HeartbeatReceiverSuite assert(fakeClusterManager.getExecutorIdsToKill === Set(executorId1, executorId2)) } + test("expired host should not be offered again") { --- End diff -- `host` or `executor` ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21449 This will definitely not go into 2.3.1, so we have plenty of time. I'll think deeper into it after the spark summit. IMO `df.join(df, df("id") >= df("id"))` is ambiguous, especially when it's not an inner join. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21487: [SPARK-24369][SQL] Correct handling for multiple distinc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21487 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91427/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21487: [SPARK-24369][SQL] Correct handling for multiple distinc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21487 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21487: [SPARK-24369][SQL] Correct handling for multiple distinc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21487 **[Test build #91427 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91427/testReport)** for PR 21487 at commit [`8386b42`](https://github.com/apache/spark/commit/8386b4250d90eb369c85f02de7bbabe7a2ebbdaa). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21231: [SPARK-24119][SQL]Add interpreted execution to SortPrefi...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21231 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21487: [SPARK-24369][SQL] Correct handling for multiple distinc...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21487 LGTM (I checked that the related tests pass locally). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21472: [SPARK-24445][SQL] Schema in json format for from...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21472#discussion_r192584018 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -747,8 +748,13 @@ case class StructsToJson( object JsonExprUtils { - def validateSchemaLiteral(exp: Expression): StructType = exp match { -case Literal(s, StringType) => CatalystSqlParser.parseTableSchema(s.toString) + def validateSchemaLiteral(exp: Expression): DataType = exp match { +case Literal(s, StringType) => + try { +DataType.fromJson(s.toString) --- End diff -- If possible, I prefer @HyukjinKwon's approach. If I remember correctly, we keep the JSON schema format only for backward compatibility. In future major releases, I think we could possibly drop that support. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21481: [SPARK-24452][SQL][Core] Avoid possible overflow in int ...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21481 Good questions. For 2, I first found one of these issues while reading a file. Then I ran `grep` with the patterns `long .*=.*\*` and `long .*=.*\+` over the `.java` files and picked out the real problems manually, which was labor-intensive. For 3, here is my thought: [`SpotBugs`](https://spotbugs.github.io/) may be a good candidate for checking this. SpotBugs is the successor of [`FindBugs`](https://findbugs.sourceforge.net/). When I ran `FindBugs` before, I found some problems regarding possible overflow and made a PR that was merged. On the other hand, these issues may not have been detected at that time. I will look at SpotBugs after my presentation at SAIS is finished :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
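For readers following the overflow discussion, here is a minimal Java sketch of the kind of expression the `long .*=.*\*` pattern is meant to catch. It is not taken from the Spark code base: the class name `OverflowSketch`, its method names, and the sample values are hypothetical and chosen only for illustration. It shows that multiplying two `int` operands wraps around in 32-bit arithmetic before the result is widened to `long`, and that casting one operand first avoids the problem.

```java
final class OverflowSketch {
    // Buggy shape: both operands are int, so the product is computed in
    // 32-bit arithmetic and wraps around before being widened to long.
    static long overflowingProduct(int a, int b) {
        return a * b;
    }

    // Fixed shape: casting one operand to long first makes the whole
    // multiplication 64-bit.
    static long widenedProduct(int a, int b) {
        return (long) a * b;
    }

    // Alternative check: Math.multiplyExact(int, int) throws
    // ArithmeticException instead of silently wrapping.
    static boolean intProductOverflows(int a, int b) {
        try {
            Math.multiplyExact(a, b);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int rows = 65536;        // hypothetical sample values
        int bytesPerRow = 65536;
        System.out.println(overflowingProduct(rows, bytesPerRow));   // wrapped result
        System.out.println(widenedProduct(rows, bytesPerRow));       // full 64-bit result
        System.out.println(intProductOverflows(rows, bytesPerRow));  // whether int math overflows
    }
}
```

The same reasoning applies to the `long .*=.*\+` pattern: an `int + int` sum assigned to a `long` is also computed in 32 bits first. A text search like this only narrows the candidates, which is presumably why the matches still had to be reviewed by hand.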
[GitHub] spark pull request #21472: [SPARK-24445][SQL] Schema in json format for from...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21472#discussion_r192583925 --- Diff: sql/core/src/test/resources/sql-tests/inputs/json-functions.sql --- @@ -25,6 +25,10 @@ select from_json('{"a":1}', 'a InvalidType'); select from_json('{"a":1}', 'a INT', named_struct('mode', 'PERMISSIVE')); select from_json('{"a":1}', 'a INT', map('mode', 1)); select from_json(); +-- from_json - schema in json format +select from_json('{"a":1}', '{"type":"struct","fields":[{"name":"a","type":"integer", "nullable":true}]}'); +select from_json('{"a":1}', '{"type":"map", "keyType":"string", "valueType":"integer","valueContainsNull":false}'); + --- End diff -- To keep the changes to the output file small, can you add the new tests at the end of the file? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21439 > What kind of tests would you expect in json-functions.sql. Probably you would expect tests that are different from added to JsonExpressionsSuite.scala. IIUC there is no strict rule there, and we tend to put SQL-related tests in `SQLQueryTestSuite`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21449 I see what you mean. Honestly, I have not thought through a full design for this problem (so I can't state what we should support and what not), but focusing on this specific case I think that: - at the moment we do support self-joins (at least in the case `df.join(df, df("id") >= df("id"))`), so considering this invalid would be a big behavior change (potentially breaking user workflows). - even though we might consider such a change acceptable in a major release, I think that we should support with the DataFrame API what we support in the SQL API, and the SQL standard supports self-joins (using aliases for the tables). So I do believe we should support this use case. - the case presented by @daniel-shields in https://github.com/apache/spark/pull/21449#issuecomment-392947474 is, I think, a valid one without any doubt. As of now we do not support it, though. So I think that in the holistic approach we shouldn't change the current behavior, which will be (IMHO) improved by this patch. What I do think we have to discuss, so that we don't have to change it once we want to solve the more generic issue, is the way to track the dataset an attribute is coming from. Here I decided to use the metadata, since I thought this was the cleanest approach. Another approach might be to add to `AttributeReference` a new `Option` holding a reference to the dataset it is coming from. For the generic solution, this might have the advantage of giving us a reference to the provenance dataset, where we might want to store some kind of DAG of the datasets it is derived from, in order to take more complex decisions about the validity of the syntax and/or the resolution of the attribute. What do you think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21487: [SPARK-24369][SQL] Correct handling for multiple distinc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21487 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3784/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21487: [SPARK-24369][SQL] Correct handling for multiple distinc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21487 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21487: [SPARK-24369][SQL] Correct handling for multiple distinc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21487 **[Test build #91427 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91427/testReport)** for PR 21487 at commit [`8386b42`](https://github.com/apache/spark/commit/8386b4250d90eb369c85f02de7bbabe7a2ebbdaa). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21487: [SPARK-24369][SQL] Correct handling for multiple distinc...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21487 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21487: [SPARK-24369][SQL] Correct handling for multiple distinc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21487 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21487: [SPARK-24369][SQL] Correct handling for multiple distinc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21487 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91425/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21487: [SPARK-24369][SQL] Correct handling for multiple distinc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21487 **[Test build #91425 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91425/testReport)** for PR 21487 at commit [`8386b42`](https://github.com/apache/spark/commit/8386b4250d90eb369c85f02de7bbabe7a2ebbdaa). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21313: [SPARK-24187][R][SQL]Add array_join function to S...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21313#discussion_r192578796 --- Diff: R/pkg/R/functions.R --- @@ -3006,6 +3008,27 @@ setMethod("array_contains", column(jc) }) +#' @details +#' \code{array_join}: Concatenates the elements of column using the delimiter. +#' Null values are replaced with nullReplacement if set, otherwise they are ignored. +#' +#' @param delimiter character(s) to use to concatenate the elements of column. +#' @param nullReplacement character(s) to use to replace the Null values. +#' @rdname column_collection_functions +#' @aliases array_join array_join,Column-method +#' @note array_join since 2.4.0 +setMethod("array_join", + signature(x = "Column"), --- End diff -- @felixcheung I will add the type so it will be ``` signature(x = "Column", delimiter = "character"), ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21313: [SPARK-24187][R][SQL]Add array_join function to S...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21313#discussion_r192578774 --- Diff: R/pkg/R/functions.R --- @@ -3006,6 +3008,27 @@ setMethod("array_contains", column(jc) }) +#' @details +#' \code{array_join}: Concatenates the elements of column using the delimiter. +#' Null values are replaced with nullReplacement if set, otherwise they are ignored. +#' +#' @param delimiter character(s) to use to concatenate the elements of column. +#' @param nullReplacement character(s) to use to replace the Null values. +#' @rdname column_collection_functions +#' @aliases array_join array_join,Column-method +#' @note array_join since 2.4.0 +setMethod("array_join", + signature(x = "Column"), + function(x, delimiter, nullReplacement = NA) { --- End diff -- @felixcheung I will change the default to NULL. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21313: [SPARK-24187][R][SQL]Add array_join function to S...
Github user huaxingao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21313#discussion_r192578750

    --- Diff: R/pkg/R/functions.R ---
    @@ -3006,6 +3008,27 @@ setMethod("array_contains",
                 column(jc)
               })
    +#' @details
    +#' \code{array_join}: Concatenates the elements of column using the delimiter.
    +#' Null values are replaced with nullReplacement if set, otherwise they are ignored.
    +#'
    +#' @param delimiter character(s) to use to concatenate the elements of column.
    --- End diff --

    @felixcheung Scala doesn't have a doc for the delimiter param; I added this one myself. What I am trying to say is "one or more characters". I will change it to "a character string", so it will be
    ```
    @param delimiter a character string to use to concatenate the elements of column.
    ```
    Does this look OK to you?
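For readers following the thread, the behaviour described in the roxygen text above ("Null values are replaced with nullReplacement if set, otherwise they are ignored") can be sketched in a few lines. This is only an illustration in plain Python, not the SparkR or Spark SQL implementation: the function below and its `null_replacement` parameter are hypothetical stand-ins that model a SQL NULL element as `None` and operate on an ordinary list rather than a Column.

```python
from typing import Optional, Sequence


def array_join(elements: Sequence[Optional[str]],
               delimiter: str,
               null_replacement: Optional[str] = None) -> str:
    """Join string elements with a delimiter.

    Hypothetical model of the documented semantics: None stands in for
    a NULL element, and null_replacement=None means "not set".
    """
    if null_replacement is None:
        # No replacement given: NULL elements are ignored (dropped).
        kept = [e for e in elements if e is not None]
    else:
        # Replacement given: each NULL element becomes the replacement.
        kept = [null_replacement if e is None else e for e in elements]
    return delimiter.join(kept)


print(array_join(["a", None, "b"], ","))          # prints "a,b"
print(array_join(["a", None, "b"], ",", "NULL"))  # prints "a,NULL,b"
```

Using "not set" as the default mirrors the change to a NULL default discussed in this thread: a missing replacement is treated as the absence of an argument and is never joined into the output.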
[GitHub] spark issue #19602: [SPARK-22384][SQL] Refine partition pruning when attribu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19602 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91426/ Test FAILed.
[GitHub] spark issue #19602: [SPARK-22384][SQL] Refine partition pruning when attribu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19602 Merged build finished. Test FAILed.
[GitHub] spark issue #19602: [SPARK-22384][SQL] Refine partition pruning when attribu...
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19602

    **[Test build #91426 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91426/testReport)** for PR 19602 at commit [`e4c6e1f`](https://github.com/apache/spark/commit/e4c6e1ff713a7033b0a60dabaca5071b480d7600).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.