[GitHub] spark pull request #19509: [SPARK-22290][core] Avoid creating Hive delegatio...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19509#discussion_r145329972 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala --- @@ -347,6 +347,10 @@ package object config { .timeConf(TimeUnit.MILLISECONDS) .createWithDefault(Long.MaxValue) + private[spark] val KERBEROS_RELOGIN_PERIOD = ConfigBuilder("spark.yarn.kerberos.relogin.period") +.timeConf(TimeUnit.SECONDS) +.createWithDefaultString("1m") --- End diff -- I think we should put this into doc. Also is it too frequent to call? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19525: [SPARK-22289] [ML] Add JSON support for Matrix pa...
GitHub user hhbyyh opened a pull request: https://github.com/apache/spark/pull/19525 [SPARK-22289] [ML] Add JSON support for Matrix parameters (LR with coefficients bound) ## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-22289 Add JSON encode/decode for Param[Matrix]. The issue was reported by Nic Eggert while saving an LR model with LowerBoundsOnCoefficients. There are two ways I see to resolve this: 1. Support save/load on LogisticRegressionParams, and also adjust the save/load in LogisticRegression and LogisticRegressionModel. 2. Directly support Matrix in Param.jsonEncode, similar to what we have done for Vector. After some discussion in the jira, we prefer the fix that supports Matrix as a valid Param type, for simplicity and convenience for other classes. ## How was this patch tested? A new unit test covers the LR case and JsonMatrixConverter. You can merge this pull request into a Git repository by running: $ git pull https://github.com/hhbyyh/spark lrsave Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19525.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19525 commit 92f599d3fb7159cfc1694d2c65321dd2abdc4ccc Author: Yuhao Yang Date: 2017-10-18T06:49:01Z add json for Matrix
[GitHub] spark issue #19525: [SPARK-22289] [ML] Add JSON support for Matrix parameter...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19525 **[Test build #82875 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82875/testReport)** for PR 19525 at commit [`92f599d`](https://github.com/apache/spark/commit/92f599d3fb7159cfc1694d2c65321dd2abdc4ccc).
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19488 **[Test build #82874 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82874/testReport)** for PR 19488 at commit [`506c410`](https://github.com/apache/spark/commit/506c410eefb5131c6cf50e947ff01e0bda9c28fe). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19525: [SPARK-22289] [ML] Add JSON support for Matrix pa...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/19525#discussion_r145330685 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala --- @@ -122,17 +124,33 @@ private[ml] object Param { /** Decodes a param value from JSON. */ def jsonDecode[T](json: String): T = { -parse(json) match { +val jValue = parse(json) +jValue match { case JString(x) => x.asInstanceOf[T] case JObject(v) => val keys = v.map(_._1) -assert(keys.contains("type") && keys.contains("values"), - s"Expect a JSON serialized vector but cannot find fields 'type' and 'values' in $json.") -JsonVectorConverter.fromJson(json).asInstanceOf[T] +if (keys.contains("class")) { + implicit val formats = DefaultFormats + val className = (jValue \ "class").extract[String] + className match { +case JsonMatrixConverter.className => + val checkFields = Array("numRows", "numCols", "values", "isTransposed") + require(checkFields.forall(keys.contains), s"Expect a JSON serialized Matrix" + +s" but cannot find fields ${checkFields.mkString(", ")} in $json.") + JsonMatrixConverter.fromJson(json).asInstanceOf[T] + +case s => throw new SparkException(s"unrecognized class $s in $json") + } +} else { // Vector does not have class info in json + require(keys.contains("type") && keys.contains("values"), s"Expect a JSON serialized" + +s" vector/matrix but cannot find fields 'type' and 'values' in $json.") --- End diff -- /matrix should be removed here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
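The dispatch logic in the diff above — branch on an optional "class" key, fall back to the legacy class-less vector format — can be sketched in plain Python. This is an illustrative re-implementation for discussion, not Spark's actual code; `decode_param` and `MATRIX_CLASS` are names chosen here, with the field names taken from the diff.

```python
import json

# Identifier mirroring JsonMatrixConverter.className in the patch
MATRIX_CLASS = "org.apache.spark.ml.linalg.Matrix"
MATRIX_FIELDS = ("numRows", "numCols", "values", "isTransposed")

def decode_param(s):
    """Dispatch a serialized param the way the patched jsonDecode does."""
    obj = json.loads(s)
    if not isinstance(obj, dict):
        return obj  # plain JSON scalar, e.g. a string param
    if "class" in obj:
        # New-style objects carry a class identifier
        if obj["class"] != MATRIX_CLASS:
            raise ValueError("unrecognized class %s" % obj["class"])
        missing = [f for f in MATRIX_FIELDS if f not in obj]
        if missing:
            raise ValueError("missing Matrix fields: %s" % missing)
        return obj  # Spark would call JsonMatrixConverter.fromJson here
    # Vectors carry no class info in JSON, to avoid breaking old models
    if "type" not in obj or "values" not in obj:
        raise ValueError("expect a JSON serialized vector")
    return obj  # Spark would call JsonVectorConverter.fromJson here
```

Keeping vectors class-less while tagging matrices is what preserves backward compatibility with models saved before this change.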
[GitHub] spark issue #19522: [SPARK-22249][FOLLOWUP][SQL] Check if list of value for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19522 **[Test build #82873 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82873/testReport)** for PR 19522 at commit [`e95bc7b`](https://github.com/apache/spark/commit/e95bc7b395e027aa3d1e719d987b4f5a4461c34b). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19522: [SPARK-22249][FOLLOWUP][SQL] Check if list of value for ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19522 Merged build finished. Test FAILed.
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19488 Merged build finished. Test FAILed.
[GitHub] spark issue #19522: [SPARK-22249][FOLLOWUP][SQL] Check if list of value for ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19522 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82873/ Test FAILed.
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19488 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82874/ Test FAILed.
[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19509 LGTM, just one minor comment.
[GitHub] spark issue #19522: [SPARK-22249][FOLLOWUP][SQL] Check if list of value for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19522 **[Test build #3952 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3952/testReport)** for PR 19522 at commit [`e95bc7b`](https://github.com/apache/spark/commit/e95bc7b395e027aa3d1e719d987b4f5a4461c34b).
[GitHub] spark pull request #19360: [SPARK-22139][CORE]Remove the variable which is n...
GitHub user guoxiaolongzte reopened a pull request: https://github.com/apache/spark/pull/19360 [SPARK-22139][CORE]Remove the variable which is never used in SparkConf.scala ## What changes were proposed in this pull request? Remove the variables which are never used in SparkConf.scala. val executorClasspathKey = "spark.executor.extraClassPath" val driverOptsKey = "spark.driver.extraJavaOptions" val driverClassPathKey = "spark.driver.extraClassPath" val sparkExecutorInstances = "spark.executor.instances" These variables are never used, because the implementation code for the validation rule was removed in SPARK-17979. ## How was this patch tested? manual tests Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/guoxiaolongzte/spark SPARK-22139 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19360.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19360 commit f345aa8487a64a0256c6965bc198ba8842cd0a51 Author: guoxiaolong Date: 2017-09-27T06:58:37Z [SPARK-22139] Remove the variable which is never used in SparkConf.scala
[GitHub] spark issue #19469: [SPARK-22243][DStreams]spark.yarn.jars reload from confi...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19469 @ssaavedra , yes I think so. With the pull-in of k8s support, I would guess more configurations will need to be added to the exclusion rule. With the current solution, one-by-one PRs don't make much sense. We should either figure out a general solution or refactor this part. Besides, as we have moved to Structured Streaming, do we need to spend more effort on these issues? @zsxwing
[GitHub] spark issue #19360: [SPARK-22139][CORE]Remove the variable which is never us...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19360 Can one of the admins verify this patch?
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145334576 --- Diff: python/pyspark/sql/session.py --- @@ -414,6 +415,39 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _createFromPandasWithArrow(self, pdf, schema): +""" +Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting +to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the +data types will be used to coerce the data in Pandas to Arrow conversion. +""" +from pyspark.serializers import ArrowSerializer +from pyspark.sql.types import from_arrow_schema, to_arrow_schema +import pyarrow as pa + +# Slice the DataFrame into batches +step = -(-len(pdf) // self.sparkContext.defaultParallelism) # round int up +pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step)) +arrow_schema = to_arrow_schema(schema) if schema is not None else None +batches = [pa.RecordBatch.from_pandas(pdf_slice, schema=arrow_schema, preserve_index=False) + for pdf_slice in pdf_slices] + +# Verify schema, there will be at least 1 batch from pandas.DataFrame +schema_from_arrow = from_arrow_schema(batches[0].schema) +if schema is not None and schema != schema_from_arrow: +raise ValueError("Supplied schema does not match result from Arrow\nsupplied: " + + "%s\n!=\nfrom Arrow: %s" % (str(schema), str(schema_from_arrow))) --- End diff -- It's okay to fallback with warnings, but I think we should try to adjust types specified by users before that. Otherwise, users can never get the benefit from Arrow when users don't know how to adjust types especially integral types including NaN values. We can split pandas DataFrame into Series once and adjust types during building RecordBatches. I guess we should modify the timestamp values to have timezone for each Series when we support timestamp type anyway. 
---
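The slicing step in the diff above uses the `-(-n // d)` ceiling-division idiom to split a pandas DataFrame into at most `defaultParallelism` batches. A minimal sketch of just that partitioning logic, using a plain sequence in place of a DataFrame (`slice_into_batches` is a name chosen here for illustration):

```python
def slice_into_batches(rows, parallelism):
    """Split a sequence into at most `parallelism` contiguous slices.

    step = ceil(len(rows) / parallelism), computed with the
    -(-n // d) trick so no math.ceil import is needed.
    """
    step = -(-len(rows) // parallelism)  # round int up, as in the diff
    return [rows[start:start + step] for start in range(0, len(rows), step)]
```

Because the step is rounded up, every element lands in some slice and the last slice may be shorter; when there are fewer rows than `parallelism`, each slice holds a single row.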
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19419 **[Test build #82876 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82876/testReport)** for PR 19419 at commit [`de54313`](https://github.com/apache/spark/commit/de54313479383be54de6bb075afe228617c244f2).
[GitHub] spark pull request #19525: [SPARK-22289] [ML] Add JSON support for Matrix pa...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/19525#discussion_r145331849 --- Diff: mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.linalg + +import org.json4s.DefaultFormats +import org.json4s.JsonDSL._ +import org.json4s.jackson.JsonMethods.{compact, parse => parseJson, render} + +private[ml] object JsonMatrixConverter { + + /** Unique class name for identifying JSON object encoded by this class. */ + val className = "org.apache.spark.ml.linalg.Matrix" --- End diff -- I added this as an identifier, so during loading we know which JSON converter to invoke. Yet I didn't add it for JsonVectorConverter, to avoid breaking old models. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19525: [SPARK-22289] [ML] Add JSON support for Matrix pa...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/19525#discussion_r145333064 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala --- @@ -122,17 +124,33 @@ private[ml] object Param { /** Decodes a param value from JSON. */ def jsonDecode[T](json: String): T = { -parse(json) match { +val jValue = parse(json) +jValue match { case JString(x) => x.asInstanceOf[T] case JObject(v) => val keys = v.map(_._1) -assert(keys.contains("type") && keys.contains("values"), - s"Expect a JSON serialized vector but cannot find fields 'type' and 'values' in $json.") -JsonVectorConverter.fromJson(json).asInstanceOf[T] +if (keys.contains("class")) { + implicit val formats = DefaultFormats + val className = (jValue \ "class").extract[String] + className match { +case JsonMatrixConverter.className => + val checkFields = Array("numRows", "numCols", "values", "isTransposed") + require(checkFields.forall(keys.contains), s"Expect a JSON serialized Matrix" + +s" but cannot find fields ${checkFields.mkString(", ")} in $json.") + JsonMatrixConverter.fromJson(json).asInstanceOf[T] + +case s => throw new SparkException(s"unrecognized class $s in $json") + } +} else { // Vector does not have class info in json + require(keys.contains("type") && keys.contains("values"), s"Expect a JSON serialized" + +s" vector/matrix but cannot find fields 'type' and 'values' in $json.") --- End diff -- /matrix should be removed here --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145336060 --- Diff: python/pyspark/sql/session.py --- @@ -414,6 +415,39 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _createFromPandasWithArrow(self, pdf, schema): +""" +Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting +to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the +data types will be used to coerce the data in Pandas to Arrow conversion. +""" +from pyspark.serializers import ArrowSerializer +from pyspark.sql.types import from_arrow_schema, to_arrow_schema +import pyarrow as pa + +# Slice the DataFrame into batches +step = -(-len(pdf) // self.sparkContext.defaultParallelism) # round int up +pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step)) +arrow_schema = to_arrow_schema(schema) if schema is not None else None +batches = [pa.RecordBatch.from_pandas(pdf_slice, schema=arrow_schema, preserve_index=False) + for pdf_slice in pdf_slices] + +# Verify schema, there will be at least 1 batch from pandas.DataFrame +schema_from_arrow = from_arrow_schema(batches[0].schema) +if schema is not None and schema != schema_from_arrow: +raise ValueError("Supplied schema does not match result from Arrow\nsupplied: " + + "%s\n!=\nfrom Arrow: %s" % (str(schema), str(schema_from_arrow))) --- End diff -- Sure, I couldn't agree more with ^. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19488 retest this please
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19488 **[Test build #82877 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82877/testReport)** for PR 19488 at commit [`506c410`](https://github.com/apache/spark/commit/506c410eefb5131c6cf50e947ff01e0bda9c28fe).
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19419 **[Test build #82878 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82878/testReport)** for PR 19419 at commit [`b6d4885`](https://github.com/apache/spark/commit/b6d4885e9ad9a03a40b3c28df41d7b263b89369f).
[GitHub] spark pull request #19523: [SPARK-22301][SQL] Add rule to Optimizer for In w...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19523#discussion_r145340100 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala --- @@ -204,6 +204,7 @@ case class In(value: Expression, list: Seq[Expression]) extends Predicate { override def children: Seq[Expression] = value +: list lazy val inSetConvertible = list.forall(_.isInstanceOf[Literal]) + lazy val isListEmpty = list.isEmpty --- End diff -- I am using it to be consistent with the current implementation (see the line above).
[GitHub] spark issue #19360: [SPARK-22139][CORE]Remove the variable which is never us...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19360 Let's leave this closed.
[GitHub] spark pull request #19360: [SPARK-22139][CORE]Remove the variable which is n...
Github user guoxiaolongzte closed the pull request at: https://github.com/apache/spark/pull/19360
[GitHub] spark issue #19525: [SPARK-22289] [ML] Add JSON support for Matrix parameter...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19525 **[Test build #82875 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82875/testReport)** for PR 19525 at commit [`92f599d`](https://github.com/apache/spark/commit/92f599d3fb7159cfc1694d2c65321dd2abdc4ccc). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case s => throw new SparkException(s\"unrecognized class $s in $json\")` * ` else `
[GitHub] spark issue #19525: [SPARK-22289] [ML] Add JSON support for Matrix parameter...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19525 Merged build finished. Test PASSed.
[GitHub] spark issue #19242: [CORE][DOC]Add event log conf.
Github user guoxiaolongzte commented on the issue: https://github.com/apache/spark/pull/19242 @srowen Could you help review the code? Thanks.
[GitHub] spark issue #19525: [SPARK-22289] [ML] Add JSON support for Matrix parameter...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19525 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82875/ Test PASSed.
[GitHub] spark pull request #19242: [CORE][DOC]Add event log conf.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19242#discussion_r145343263 --- Diff: docs/configuration.md --- @@ -740,6 +740,20 @@ Apart from these, the following properties are also available, and may be useful + spark.eventLog.overwrite + false + +Whether to overwrite any existing files. + + + + spark.eventLog.buffer.kb + 100 + +Buffer size to use when writing to output streams.Buffer size in KB. --- End diff -- Might want to proofread these before pinging for review. You're missing a space before the second sentence, but why not just: "Buffer size in KB to use when writing to output streams"
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r145343991 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,124 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import pyspark +from pyspark import SparkContext +from pyspark.sql.types import * +from pyspark.sql.types import Row, _create_row +from pyspark.sql import DataFrame +from pyspark.ml.param.shared import * +import numpy as np + +undefinedImageType = "Undefined" + +ImageFields = ["origin", "height", "width", "nChannels", "mode", "data"] + +ocvTypes = { +undefinedImageType: -1, +"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24, +"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25, +"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, "CV_16UC4": 26, +"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, "CV_16SC4": 27, +"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, "CV_32SC4": 28, +"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, "CV_32FC4": 29, +"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, "CV_64FC4": 30 +} + +# DataFrame with a single column of images named "image" (nullable) +imageSchema = 
StructType(StructField("image", StructType([ +StructField(ImageFields[0], StringType(), True), +StructField(ImageFields[1], IntegerType(), False), +StructField(ImageFields[2], IntegerType(), False), +StructField(ImageFields[3], IntegerType(), False), +# OpenCV-compatible type: CV_8UC3 in most cases +StructField(ImageFields[4], StringType(), False), +# bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(ImageFields[5], BinaryType(), False)]), True)) + + +# TODO: generalize to other datatypes and number of channels +def toNDArray(image): +""" +Converts an image to a one-dimensional array + +:param image (object): The image to be converted +:rtype array: The image as a one-dimensional array + +.. versionadded:: 2.3.0 +""" +height = image.height +width = image.width +nChannels = image.nChannels +return np.ndarray( +shape=(height, width, nChannels), +dtype=np.uint8, +buffer=image.data, +strides=(width * nChannels, nChannels, 1)) + + +# TODO: generalize to other datatypes and number of channels +def toImage(array, origin="", mode=ocvTypes["CV_8UC3"]): +""" + --- End diff -- Not a big deal but I'd remove this extra newline. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
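The `toNDArray` helper quoted above reinterprets the packed image bytes as a `(height, width, nChannels)` array by constructing an `np.ndarray` over the existing buffer with explicit strides, so no copy is made. The core of that conversion can be exercised standalone (`bytes_to_array` is a name chosen here; the shape, dtype, and strides are exactly those in the diff):

```python
import numpy as np

def bytes_to_array(data, height, width, n_channels):
    """View row-wise packed uint8 image bytes as a (H, W, C) array.

    strides = (width * n_channels, n_channels, 1): moving one row down
    skips a full row of pixels, one column right skips one pixel,
    and one channel over skips a single byte.
    """
    return np.ndarray(
        shape=(height, width, n_channels),
        dtype=np.uint8,
        buffer=data,
        strides=(width * n_channels, n_channels, 1))
```

Note these strides are the default C-contiguous layout for a uint8 array of that shape; spelling them out makes the assumed byte order (row-wise, channels interleaved, BGR in the OpenCV case) explicit.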
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r145342026 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,258 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC2" -> 8, "CV_8UC3" -> 16, "CV_8UC4" -> 24, +"CV_8S" -> 1, "CV_8SC1" -> 1, "CV_8SC2" -> 9, "CV_8SC3" -> 17, "CV_8SC4" -> 25, +"CV_16U" -> 2, "CV_16UC1" -> 2, "CV_16UC2" -> 10, "CV_16UC3" -> 18, "CV_16UC4" -> 26, +"CV_16S" -> 3, "CV_16SC1" -> 3, "CV_16SC2" -> 11, "CV_16SC3" -> 19, "CV_16SC4" -> 27, +"CV_32S" -> 4, "CV_32SC1" -> 4, "CV_32SC2" -> 12, "CV_32SC3" -> 20, "CV_32SC4" -> 28, +"CV_32F" -> 5, "CV_32FC1" -> 5, "CV_32FC2" -> 13, "CV_32FC3" -> 21, "CV_32FC4" -> 29, +"CV_64F" -> 6, "CV_64FC1" -> 6, "CV_64FC2" -> 14, "CV_64FC3" -> 22, "CV_64FC4" -> 30 + ) + + /** + * Schema for the image column: Row(String, Int, Int, Int, Array[Byte]) + */ + private val columnSchema = StructType( +StructField(imageFields(0), StringType, true) :: +StructField(imageFields(1), IntegerType, false) :: +StructField(imageFields(2), IntegerType, false) :: +StructField(imageFields(3), IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields(4), IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields(5), BinaryType, false) :: Nil) + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * :: Experimental :: + * Gets the origin of the image + * + * 
@return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * :: Experimental :: + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * :: Experimental :: + * Gets the width of the image + * + * @return The width of the image + */ + def getWidth(row: Row): Int = row.getInt(2) + + /** + * :: Experimental :: + * Gets the number of channels in the image + * + * @return The number of channels in the image + */ + def getNChannels(row: Row): Int = row.getInt(3) + + /** + * :: Experimental :: + * Gets the OpenCV representation as an int + * + * @return The OpenCV representation as an int + */ + def getMode(row: Row): Int = row.getInt(4) + + /** + * :: Experimental :: + * Gets the image data + * + * @return The image data + */ + def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5) + + /** + * :: Experimental :: + * Check if the DataFrame column contains images (i.e. has ImageSchema) + * + * @param df DataFrame + * @param column Column name + * @return True if the given column matches the image schema + */ + def isImageColumn(df: DataFrame, column: String): Boolean = +df.schema(column).dataType == columnSchema + + /*
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r145342355 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala --- @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import scala.language.existentials +import scala.util.Random + +import org.apache.commons.io.FilenameUtils +import org.apache.hadoop.conf.{Configuration, Configured} +import org.apache.hadoop.fs.{Path, PathFilter} +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat + +import org.apache.spark.sql.SparkSession + +private object RecursiveFlag { + + /** + * Sets a value of spark recursive flag. + * If value is a None, it unsets the flag. 
+ * + * @param value value to set + * @param spark existing spark session + * @return previous value of this flag + */ + def setRecursiveFlag(value: Option[String], spark: SparkSession): Option[String] = { +val flagName = FileInputFormat.INPUT_DIR_RECURSIVE +val hadoopConf = spark.sparkContext.hadoopConfiguration +val old = Option(hadoopConf.get(flagName)) + +value match { + case Some(v) => hadoopConf.set(flagName, v) + case None => hadoopConf.unset(flagName) +} + +old + } +} + +/** + * Filter that allows loading a fraction of HDFS files. + */ +private class SamplePathFilter extends Configured with PathFilter { + val random = new Random() + + // Ratio of files to be read from disk + var sampleRatio: Double = 1 + + override def setConf(conf: Configuration): Unit = { +if (conf != null) { + sampleRatio = conf.getDouble(SamplePathFilter.ratioParam, 1) +} + } + + override def accept(path: Path): Boolean = { +// Note: checking fileSystem.isDirectory is very slow here, so we use basic rules instead +!SamplePathFilter.isFile(path) || random.nextDouble() < sampleRatio + } +} + +private object SamplePathFilter { + val ratioParam = "sampleRatio" + + def isFile(path: Path): Boolean = FilenameUtils.getExtension(path.toString) != "" + + /** + * Sets HDFS PathFilter + * + * @param value Filter class that is passed to HDFS + * @param sampleRatio Fraction of the files that the filter picks + * @param spark Existing Spark session + * @return Returns the previous HDFS path filter + */ + def setPathFilter(value: Option[Class[_]], +sampleRatio: Double, +spark: SparkSession): Option[Class[_]] = { --- End diff -- nit: ```scala def setPathFilter( value: Option[Class[_]], sampleRatio: Double, spark: SparkSession): Option[Class[_]] = { ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r145341631 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,124 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import pyspark +from pyspark import SparkContext +from pyspark.sql.types import * +from pyspark.sql.types import Row, _create_row +from pyspark.sql import DataFrame +from pyspark.ml.param.shared import * +import numpy as np + +undefinedImageType = "Undefined" + +ImageFields = ["origin", "height", "width", "nChannels", "mode", "data"] --- End diff -- `ImageFields ` -> `imageFields` (to match with Scala one) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r145344124 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,124 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import pyspark +from pyspark import SparkContext +from pyspark.sql.types import * +from pyspark.sql.types import Row, _create_row +from pyspark.sql import DataFrame +from pyspark.ml.param.shared import * +import numpy as np + +undefinedImageType = "Undefined" + +ImageFields = ["origin", "height", "width", "nChannels", "mode", "data"] + +ocvTypes = { +undefinedImageType: -1, +"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24, +"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25, +"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, "CV_16UC4": 26, +"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, "CV_16SC4": 27, +"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, "CV_32SC4": 28, +"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, "CV_32FC4": 29, +"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, "CV_64FC4": 30 +} + +# DataFrame with a single column of images named "image" (nullable) +imageSchema = 
StructType(StructField("image", StructType([ +StructField(ImageFields[0], StringType(), True), +StructField(ImageFields[1], IntegerType(), False), +StructField(ImageFields[2], IntegerType(), False), +StructField(ImageFields[3], IntegerType(), False), +# OpenCV-compatible type: CV_8UC3 in most cases +StructField(ImageFields[4], StringType(), False), +# bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(ImageFields[5], BinaryType(), False)]), True)) + + +# TODO: generalize to other datatypes and number of channels +def toNDArray(image): +""" +Converts an image to a one-dimensional array + +:param image (object): The image to be converted +:rtype array: The image as a one-dimensional array + +.. versionadded:: 2.3.0 +""" +height = image.height +width = image.width +nChannels = image.nChannels +return np.ndarray( +shape=(height, width, nChannels), +dtype=np.uint8, +buffer=image.data, +strides=(width * nChannels, nChannels, 1)) + + +# TODO: generalize to other datatypes and number of channels +def toImage(array, origin="", mode=ocvTypes["CV_8UC3"]): +""" + +Converts a one-dimensional array to a two-dimensional image + +:param array (array): The array to convert to image +:param origin (str): Path to the image +:param mode (str): OpenCV compatible type + +:rtype object: Two dimensional image + +.. versionadded:: 2.3.0 +""" +data = bytearray(array.astype(dtype=np.uint8).ravel()) +height = array.shape[0] +width = array.shape[1] +nChannels = array.shape[2] +# Creating new Row with _create_row(), because Row(name = value, ... ) +# orders fields by name, which conflicts with expected schema order +# when the new DataFrame is created by UDF +return _create_row(ImageFields, + [origin, height, width, nChannels, mode, data]) + + +def readImages(path, recursive=False, numPartitions=0, + dropImageFailures=False, sampleRatio=1.0): +""" +Reads the directory of images from the local or remote source. 
--- End diff -- tiny nit: I'd add a newline between this description and `:param` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
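The `toNDArray` helper in the diff above reinterprets the flat `data` bytes as a height × width × nChannels array using strides `(width * nChannels, nChannels, 1)`. The equivalent index arithmetic in pure Python (no NumPy; `pixel` is an illustrative name):

```python
def pixel(data, width, n_channels, y, x, c):
    """Byte value of channel c of pixel (y, x) in row-wise interleaved data."""
    return data[y * width * n_channels + x * n_channels + c]

# A 2x2 BGR image: each pixel stores its (b, g, r) bytes consecutively,
# rows laid out one after another.
data = bytes([1, 2, 3,   4, 5, 6,
              7, 8, 9,   10, 11, 12])
assert pixel(data, width=2, n_channels=3, y=0, x=1, c=0) == 4   # blue of (0,1)
assert pixel(data, width=2, n_channels=3, y=1, x=0, c=2) == 9   # red of (1,0)
```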
[GitHub] spark pull request #19242: [CORE][DOC]Add event log conf.
Github user guoxiaolongzte commented on a diff in the pull request: https://github.com/apache/spark/pull/19242#discussion_r145346521 --- Diff: docs/configuration.md --- @@ -740,6 +740,20 @@ Apart from these, the following properties are also available, and may be useful + spark.eventLog.overwrite + false + +Whether to overwrite any existing files. + + + + spark.eventLog.buffer.kb + 100 + +Buffer size to use when writing to output streams.Buffer size in KB. --- End diff -- I have fixed the description and correction unit. Please check org.apache.spark.internal.config#EVENT_LOG_OUTPUT_BUFFER_SIZE private[spark] val EVENT_LOG_OUTPUT_BUFFER_SIZE = ConfigBuilder("spark.eventLog.buffer.kb") .bytesConf(ByteUnit.KiB) .createWithDefaultString("100k") --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19526: [SPARK-22014][SQL] removed TypeCheckFailure: slid...
GitHub user SimonUzL opened a pull request: https://github.com/apache/spark/pull/19526 [SPARK-22014][SQL] removed TypeCheckFailure: slide duration <= windowDuration It is possible to create sampling windows in Spark Streaming, where the duration of the window is smaller than the slide, but it throws a TypeCheckFailure in Spark SQL. I think there should be no difference (duration and slide) in a "Spark Streaming window" and a "Spark SQL window" function. @brkyvz You can merge this pull request into a Git repository by running: $ git pull https://github.com/SimonUzL/spark sampling_window Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19526.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19526 commit 295872b23af94f514263ec7512a572c88eb172db Author: Simon Schiff Date: 2017-10-17T08:52:42Z removed TypeCheckFailure: slide duration must be less than or equal to the windowDuration --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19526: [SPARK-22014][SQL] removed TypeCheckFailure: slide durat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19526 Can one of the admins verify this patch?

[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user krishna-pandey commented on the issue: https://github.com/apache/spark/pull/19419 @srowen @rxin Made changes to enable the X-Content-Type-Options and X-XSS-Protection values by default. Please review.
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/19527 [SPARK-13030][ML] Create OneHotEncoderEstimator for OneHotEncoder as Estimator ## What changes were proposed in this pull request? This patch adds a new class `OneHotEncoderEstimator` which extends `Estimator`. The `fit` method returns `OneHotEncoderModel`. Common methods between existing `OneHotEncoder` and new `OneHotEncoderEstimator`, such as transforming schema, are extracted and put into `OneHotEncoderCommon`. ### Multi-column support `OneHotEncoderEstimator` adds simpler multi-column support because it is new API and can be free from backward compatibility. ### handleInvalid Param support `OneHotEncoderEstimator` supports `handleInvalid` Param. It supports `error` and `skip`. Note that `skip` can't be used at the same time with `dropLast` as true. Because they will conflict in encoded vector. ## How was this patch tested? Added new test suite `OneHotEncoderEstimatorSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 SPARK-13030 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19527.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19527 commit 8fd4677fd0e729d99d8777010e78bb5cfea3cf86 Author: Liang-Chi Hsieh Date: 2017-10-18T07:31:32Z Add OneHotEncoderEstimator and related tests. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
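The `dropLast`/`handleInvalid` interaction described in this PR can be seen in a toy pure-Python encoder (illustrative only — Spark's estimator works on DataFrame columns): with `dropLast=True` the last category maps to the all-zeros vector, while `handleInvalid="skip"` drops the offending row entirely.

```python
def one_hot(values, size, drop_last=True, handle_invalid="error"):
    """Encode integer categories in [0, size) as 0/1 vectors."""
    width = size - 1 if drop_last else size
    rows = []
    for v in values:
        if not 0 <= v < size:
            if handle_invalid == "skip":
                continue                      # drop the whole row
            raise ValueError("unseen category: %r" % v)
        vec = [0] * width
        if v < width:                         # with drop_last, the last
            vec[v] = 1                        # category stays all zeros
        rows.append(vec)
    return rows

assert one_hot([0, 2], size=3) == [[1, 0], [0, 0]]             # 2 -> all zeros
assert one_hot([0, 5], size=3, handle_invalid="skip") == [[1, 0]]
```

The conflict the PR notes falls out of this: under `dropLast=True` an all-zeros vector already means "last category", so it cannot double as a marker for skipped/invalid input.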
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19527 **[Test build #82879 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82879/testReport)** for PR 19527 at commit [`8fd4677`](https://github.com/apache/spark/commit/8fd4677fd0e729d99d8777010e78bb5cfea3cf86).
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 @jkbradley, no problem. The test build seems to be aborted. What's wrong?
[GitHub] spark issue #19522: [SPARK-22249][FOLLOWUP][SQL] Check if list of value for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19522 **[Test build #3952 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3952/testReport)** for PR 19522 at commit [`e95bc7b`](https://github.com/apache/spark/commit/e95bc7b395e027aa3d1e719d987b4f5a4461c34b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 retest this please
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18924 **[Test build #82880 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82880/testReport)** for PR 18924 at commit [`a81dae5`](https://github.com/apache/spark/commit/a81dae574f2085ec390effd1b9b1962970f00239).
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19527 **[Test build #82879 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82879/testReport)** for PR 19527 at commit [`8fd4677`](https://github.com/apache/spark/commit/8fd4677fd0e729d99d8777010e78bb5cfea3cf86). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class OneHotEncoderEstimator @Since(\"2.3.0\") (@Since(\"2.3.0\") override val uid: String)` * ` class OneHotEncoderModelWriter(instance: OneHotEncoderModel) extends MLWriter `
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19527 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82879/ Test PASSed.
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19527 Merged build finished. Test PASSed.
[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r145371704 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -42,7 +44,26 @@ import org.apache.spark.sql.functions.{col, lit} /** Params for linear SVM Classifier. */ private[classification] trait LinearSVCParams extends ClassifierParams with HasRegParam with HasMaxIter with HasFitIntercept with HasTol with HasStandardization with HasWeightCol - with HasAggregationDepth with HasThreshold { + with HasAggregationDepth with HasThreshold with HasSolver { + + /** + * Specifies the loss function. Currently "hinge" and "squared_hinge" are supported. + * "hinge" is the standard SVM loss (a.k.a. L1 loss) while "squared_hinge" is the square of + * the hinge loss (a.k.a. L2 loss). + * + * @see https://en.wikipedia.org/wiki/Hinge_loss";>Hinge loss (Wikipedia) + * + * @group param + */ + @Since("2.3.0") + final val loss: Param[String] = new Param(this, "loss", "Specifies the loss " + +"function. hinge is the standard SVM loss while squared_hinge is the square of the hinge loss.", +(s: String) => LinearSVC.supportedLoss.contains(s.toLowerCase(Locale.ROOT))) --- End diff -- The `isValid` function you can use `ParamValidators.inArray[String](supportedLosses))` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r145369694 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -282,8 +348,27 @@ class LinearSVC @Since("2.2.0") ( @Since("2.2.0") object LinearSVC extends DefaultParamsReadable[LinearSVC] { + /** String name for Limited-memory BFGS. */ + private[classification] val LBFGS: String = "l-bfgs".toLowerCase(Locale.ROOT) + + /** String name for Orthant-Wise Limited-memory Quasi-Newton. */ + private[classification] val OWLQN: String = "owlqn".toLowerCase(Locale.ROOT) + + /* Set of optimizers that LinearSVC supports */ + private[classification] val supportedSolvers = Array(LBFGS, OWLQN) + + /** String name for Hinge Loss. */ + private[classification] val HINGE: String = "hinge".toLowerCase(Locale.ROOT) + + /** String name for Squared Hinge Loss. */ + private[classification] val SQUARED_HINGE: String = "squared_hinge".toLowerCase(Locale.ROOT) --- End diff -- Why need `.toLowerCase(Locale.ROOT)` here ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r145371903 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -282,8 +348,27 @@ class LinearSVC @Since("2.2.0") ( @Since("2.2.0") object LinearSVC extends DefaultParamsReadable[LinearSVC] { + /** String name for Limited-memory BFGS. */ + private[classification] val LBFGS: String = "l-bfgs".toLowerCase(Locale.ROOT) + + /** String name for Orthant-Wise Limited-memory Quasi-Newton. */ + private[classification] val OWLQN: String = "owlqn".toLowerCase(Locale.ROOT) + + /* Set of optimizers that LinearSVC supports */ + private[classification] val supportedSolvers = Array(LBFGS, OWLQN) + + /** String name for Hinge Loss. */ + private[classification] val HINGE: String = "hinge".toLowerCase(Locale.ROOT) + + /** String name for Squared Hinge Loss. */ + private[classification] val SQUARED_HINGE: String = "squared_hinge".toLowerCase(Locale.ROOT) + + /* Set of loss function that LinearSVC supports */ + private[classification] val supportedLoss = Array(HINGE, SQUARED_HINGE) --- End diff -- supportedLoss ==> supportedLosses --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
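For reference, the two losses named in these review comments, written on the margin m = y·(w·x + b) (plain Python; the squared hinge is simply the square of the hinge):

```python
def hinge(margin):
    """Standard SVM loss (a.k.a. L1 loss): max(0, 1 - margin)."""
    return max(0.0, 1.0 - margin)

def squared_hinge(margin):
    """Square of the hinge loss (a.k.a. L2 loss); smooth at margin = 1."""
    return hinge(margin) ** 2

assert hinge(2.0) == 0.0           # correctly classified with margin >= 1
assert hinge(0.5) == 0.5
assert squared_hinge(0.5) == 0.25
assert squared_hinge(-1.0) == 4.0  # the squared variant penalizes errors harder
```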
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19488 **[Test build #82877 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82877/testReport)** for PR 19488 at commit [`506c410`](https://github.com/apache/spark/commit/506c410eefb5131c6cf50e947ff01e0bda9c28fe). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19488 Merged build finished. Test PASSed.
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19488 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82877/ Test PASSed.
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19419 **[Test build #82876 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82876/testReport)** for PR 19419 at commit [`de54313`](https://github.com/apache/spark/commit/de54313479383be54de6bb075afe228617c244f2). * This patch passes all tests. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19419 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82876/ Test PASSed.
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19419 Build finished. Test PASSed.
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19419 **[Test build #82878 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82878/testReport)** for PR 19419 at commit [`b6d4885`](https://github.com/apache/spark/commit/b6d4885e9ad9a03a40b3c28df41d7b263b89369f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19419 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82878/ Test PASSed.
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19419 Merged build finished. Test PASSed.
[GitHub] spark pull request #19510: [SPARK-22292][Mesos] Added spark.mem.max support ...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/19510#discussion_r145383146 --- Diff: resource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackendSuite.scala --- @@ -152,6 +152,23 @@ class MesosCoarseGrainedSchedulerBackendSuite extends SparkFunSuite assert(cpus == maxCores) } + test("mesos does not acquire more than spark.mem.max") { +setBackend(Map("spark.mem.max" -> "2g", + "spark.executor.memory" -> "1g", + "spark.executor.cores" -> "1")) + +val executorMemory = backend.executorMemory(sc) + --- End diff -- remove space --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18924 **[Test build #82880 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82880/testReport)** for PR 18924 at commit [`a81dae5`](https://github.com/apache/spark/commit/a81dae574f2085ec390effd1b9b1962970f00239). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18924 Merged build finished. Test PASSed.
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18924 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82880/ Test PASSed.
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 ping @jkbradley. Anyway, tests are passed now.
[GitHub] spark issue #19510: [SPARK-22292][Mesos] Added spark.mem.max support for Mes...
Github user skonto commented on the issue: https://github.com/apache/spark/pull/19510 @windkit there is an open issue here: https://issues.apache.org/jira/browse/SPARK-22133 Could you please add documentation of the new property along with the old ones.
[GitHub] spark pull request #19510: [SPARK-22292][Mesos] Added spark.mem.max support ...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/19510#discussion_r145389586 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala --- @@ -64,6 +64,7 @@ private[spark] class MesosCoarseGrainedSchedulerBackend( private val MAX_SLAVE_FAILURES = 2 private val maxCoresOption = conf.getOption("spark.cores.max").map(_.toInt) + private val maxMemOption = conf.getOption("spark.mem.max").map(Utils.memoryStringToMb) --- End diff -- Can we defend against minimum values? For example default executor memory is 1.4MB. We could calculate the value returned by MesosSchedulerUtils.executorMemory. I don't think these values calculated in canLaunchTask ever change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
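The defensive check suggested above amounts to: never accept an offer if launching one more executor would push total acquired memory past `spark.mem.max`. A schematic version (illustrative names; not the actual `canLaunchTask` code, and real executor memory includes overhead computed by `MesosSchedulerUtils.executorMemory`):

```python
def can_launch(executor_mem_mb, launched_mem_mb, max_mem_mb=None):
    """True if one more executor fits under the optional spark.mem.max budget."""
    if max_mem_mb is None:
        return True                      # no cap configured
    return launched_mem_mb + executor_mem_mb <= max_mem_mb

# spark.mem.max=2g, spark.executor.memory=1g (overhead ignored for the sketch):
assert can_launch(1024, 0, max_mem_mb=2048)
assert can_launch(1024, 1024, max_mem_mb=2048)
assert not can_launch(1024, 2048, max_mem_mb=2048)
```

Validating at configuration time that `spark.mem.max` is at least one executor's memory (the minimum-value concern raised in the comment) would keep the scheduler from accepting a cap it can never use.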
[GitHub] spark pull request #19374: [SPARK-22145][MESOS] fix supervise with checkpoin...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/19374#discussion_r145391350 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala --- @@ -135,22 +135,24 @@ private[spark] class MesosClusterScheduler( private val useFetchCache = conf.getBoolean("spark.mesos.fetchCache.enable", false) private val schedulerState = engineFactory.createEngine("scheduler") private val stateLock = new Object() + // Keyed by submission id private val finishedDrivers = new mutable.ArrayBuffer[MesosClusterSubmissionState](retainedDrivers) private var frameworkId: String = null - // Holds all the launched drivers and current launch state, keyed by driver id. + // Holds all the launched drivers and current launch state, keyed by submission id. private val launchedDrivers = new mutable.HashMap[String, MesosClusterSubmissionState]() // Holds a map of driver id to expected slave id that is passed to Mesos for reconciliation. // All drivers that are loaded after failover are added here, as we need get the latest - // state of the tasks from Mesos. + // state of the tasks from Mesos. Keyed by task Id. private val pendingRecover = new mutable.HashMap[String, SlaveID]() - // Stores all the submitted drivers that hasn't been launched. + // Stores all the submitted drivers that hasn't been launched, keyed by submission id private val queuedDrivers = new ArrayBuffer[MesosDriverDescription]() - // All supervised drivers that are waiting to retry after termination. 
+ // All supervised drivers that are waiting to retry after termination, keyed by submission id private val pendingRetryDrivers = new ArrayBuffer[MesosDriverDescription]() private val queuedDriversState = engineFactory.createEngine("driverQueue") private val launchedDriversState = engineFactory.createEngine("launchedDrivers") private val pendingRetryDriversState = engineFactory.createEngine("retryList") + private final val RETRY_ID = "-retry-" --- End diff -- np will update. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
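The `RETRY_ID = "-retry-"` constant in the diff suggests an id scheme where a retried driver's task id embeds the original submission id plus a retry marker, so the submission id can be recovered from a task id. The exact encoding below (marker plus attempt counter) is an assumption for illustration, not Spark's actual format.

```python
# Hypothetical sketch of composing and parsing task ids around a retry
# marker, mirroring the RETRY_ID constant from the diff above.

RETRY_ID = "-retry-"

def retry_task_id(submission_id, attempt):
    """Build a task id for a supervised driver's Nth retry (assumed format)."""
    return f"{submission_id}{RETRY_ID}{attempt}"

def submission_id_of(task_id):
    """Recover the submission id, whether or not the task is a retry."""
    return task_id.split(RETRY_ID)[0] if RETRY_ID in task_id else task_id
```

Keying all of the maps by submission id, as the updated comments say, makes this recovery step the single place where task ids and submission ids are reconciled.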
[GitHub] spark issue #19374: [SPARK-22145][MESOS] fix supervise with checkpointing on...
Github user skonto commented on the issue: https://github.com/apache/spark/pull/19374 @srowen Could I get a merge or a review pls?
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19527 cc @MLnick @WeichenXu123 @jkbradley This adds a new class `OneHotEncoderEstimator` which extends `Estimator`. Please review this when you can. Thanks.
[GitHub] spark issue #19374: [SPARK-22145][MESOS] fix supervise with checkpointing on...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19374 **[Test build #82881 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82881/testReport)** for PR 19374 at commit [`0699917`](https://github.com/apache/spark/commit/06999177331b7323813eca3c06d6a9a55e054f7d).
[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/19390#discussion_r145397272 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala --- @@ -380,7 +389,8 @@ private[spark] class MesosCoarseGrainedSchedulerBackend( } else { declineOffer( driver, - offer) + offer, --- End diff -- There is a debug message if canLaunchTask fails. I will create a jira and improve logging across the module.
[GitHub] spark issue #19390: [SPARK-18935][MESOS] Fix dynamic reservations on mesos
Github user skonto commented on the issue: https://github.com/apache/spark/pull/19390 @srowen could I get a review or a merge pls?
[GitHub] spark issue #19390: [SPARK-18935][MESOS] Fix dynamic reservations on mesos
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19390 **[Test build #82882 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82882/testReport)** for PR 19390 at commit [`4c51a1f`](https://github.com/apache/spark/commit/4c51a1fdb5638adba3b9bb9506ba3478cea9385f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19390: [SPARK-18935][MESOS] Fix dynamic reservations on mesos
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19390 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82882/
[GitHub] spark issue #19390: [SPARK-18935][MESOS] Fix dynamic reservations on mesos
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19390 Merged build finished. Test PASSed.
[GitHub] spark issue #19390: [SPARK-18935][MESOS] Fix dynamic reservations on mesos
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19390 **[Test build #82882 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82882/testReport)** for PR 19390 at commit [`4c51a1f`](https://github.com/apache/spark/commit/4c51a1fdb5638adba3b9bb9506ba3478cea9385f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19521 Oh, I confused it with what I've been watching these days. For your example, Parquet doesn't support it either. We may create an issue for both Parquet and ORC on empty schemas.
```scala
scala> val rddNoCols = sparkContext.parallelize(1 to 10).map(_ => Row.empty)
scala> val dfNoCols = spark.createDataFrame(rddNoCols, StructType(Seq.empty))
scala> dfNoCols.write.format("parquet").saveAsTable("px")
17/10/18 05:46:17 ERROR Utils: Aborting task
org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message spark_schema { }
```
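Since both writers reject a zero-column schema, a caller can fail fast before handing the data to the file format. The helper below is a hypothetical stdlib-Python sketch of such a guard, modeled on the `InvalidSchemaException` shown above; it is not actual Spark validation.

```python
# Pre-write guard for the empty-schema failure: raise a clear error
# before the format-specific writer aborts the task. Hypothetical helper.

def assert_writable_schema(field_names):
    """Reject a schema with no columns before attempting a write."""
    if not field_names:
        raise ValueError("Cannot write a schema with an empty group (no columns)")
    return list(field_names)
```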
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19521 LGTM too BTW.
[GitHub] spark pull request #19488: [SPARK-22266][SQL] The same aggregate function wa...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19488
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19521 The empty-schema path is probably related to this, IIRC (not double-checked): https://github.com/apache/spark/blob/cca945b6aa679e61864c1cabae91e6ae7703362e/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L52-L58
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19488 thanks, merging to master!
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19524 @holdenk, hm, should I maybe add this to `run-tests.py` and `run-tests-jenkins.py`? I wasn't sure where I should put this.
[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18747#discussion_r145410797 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala --- @@ -23,21 +23,72 @@ import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.QueryPlan import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning} -import org.apache.spark.sql.execution.LeafExecNode -import org.apache.spark.sql.execution.metric.SQLMetrics -import org.apache.spark.sql.types.UserDefinedType +import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode, WholeStageCodegenExec} +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ case class InMemoryTableScanExec( attributes: Seq[Attribute], predicates: Seq[Expression], @transient relation: InMemoryRelation) - extends LeafExecNode { + extends LeafExecNode with ColumnarBatchScan { override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) ++ super.innerChildren - override lazy val metrics = Map( -"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows")) + override def vectorTypes: Option[Seq[String]] = + Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName)) + + /** + * If true, get data from ColumnVector in ColumnarBatch, which are generally faster. 
+ * If false, get data from UnsafeRow build from ColumnVector + */ + override val supportCodegen: Boolean = { +// In the initial implementation, for ease of review +// support only primitive data types and # of fields is less than wholeStageMaxNumFields +val schema = StructType.fromAttributes(relation.output) +schema.fields.find(f => f.dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | + FloatType | DoubleType => false + case _ => true +}).isEmpty && + !WholeStageCodegenExec.isTooManyFields(conf, relation.schema) && + children.find(p => WholeStageCodegenExec.isTooManyFields(conf, p.schema)).isEmpty --- End diff -- this check is unnecessary: this is a `LeafExecNode`, so it has no children.
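cloud-fan's point is that the `children.find(...).isEmpty` clause is vacuous on a leaf node: `find` over an empty collection always returns `None`/`None`-like, so the predicate can never fire. A stdlib-Python sketch of the same vacuity (names are illustrative, not Spark's):

```python
# On a LeafExecNode, `children` is empty, so any "no child violates the
# predicate" check passes unconditionally -- the clause adds nothing.

def no_child_has_too_many_fields(children, max_fields):
    # mirrors: children.find(p => isTooManyFields(conf, p.schema)).isEmpty
    return next((c for c in children if len(c) > max_fields), None) is None

leaf_children = []  # a leaf node has no children
result = no_child_has_too_many_fields(leaf_children, max_fields=100)
```

`result` is always `True` for a leaf, regardless of `max_fields`, which is why the check can simply be dropped.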
[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18747#discussion_r145411743 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala --- @@ -23,21 +23,72 @@ import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.QueryPlan import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning} -import org.apache.spark.sql.execution.LeafExecNode -import org.apache.spark.sql.execution.metric.SQLMetrics -import org.apache.spark.sql.types.UserDefinedType +import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode, WholeStageCodegenExec} +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ case class InMemoryTableScanExec( attributes: Seq[Attribute], predicates: Seq[Expression], @transient relation: InMemoryRelation) - extends LeafExecNode { + extends LeafExecNode with ColumnarBatchScan { override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) ++ super.innerChildren - override lazy val metrics = Map( -"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows")) + override def vectorTypes: Option[Seq[String]] = + Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName)) + + /** + * If true, get data from ColumnVector in ColumnarBatch, which are generally faster. 
+ * If false, get data from UnsafeRow build from ColumnVector + */ + override val supportCodegen: Boolean = { +// In the initial implementation, for ease of review +// support only primitive data types and # of fields is less than wholeStageMaxNumFields +val schema = StructType.fromAttributes(relation.output) +schema.fields.find(f => f.dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | + FloatType | DoubleType => false + case _ => true +}).isEmpty && + !WholeStageCodegenExec.isTooManyFields(conf, relation.schema) && + children.find(p => WholeStageCodegenExec.isTooManyFields(conf, p.schema)).isEmpty + } + + private val columnIndices = +attributes.map(a => relation.output.map(o => o.exprId).indexOf(a.exprId)).toArray + + private val relationSchema = relation.schema.toArray + + private lazy val columnarBatchSchema = new StructType(columnIndices.map(i => relationSchema(i))) + + private def createAndDecompressColumn(cachedColumnarBatch: CachedBatch): ColumnarBatch = { +val rowCount = cachedColumnarBatch.numRows +val columnVectors = OnHeapColumnVector.allocateColumns(rowCount, columnarBatchSchema) --- End diff -- Can we reuse the `OnHeapColumnVector` for the cached batches? It's a little inefficient to create one column vector for each cached batch. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19513: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19513 retest this please
[GitHub] spark issue #19514: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19514 retest this please
[GitHub] spark issue #19513: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19513 **[Test build #82884 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82884/testReport)** for PR 19513 at commit [`96b465b`](https://github.com/apache/spark/commit/96b465b33fe8490d53d7a45f48d658958a6b43cb).
[GitHub] spark issue #19514: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19514 **[Test build #82883 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82883/consoleFull)** for PR 19514 at commit [`7ec2dc7`](https://github.com/apache/spark/commit/7ec2dc70ce26ff754c0ea38f3cf9964d67bc62f7).
[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestam...
Github user wesm commented on a diff in the pull request: https://github.com/apache/spark/pull/18664#discussion_r145415011 --- Diff: python/pyspark/sql/types.py --- @@ -1619,11 +1619,39 @@ def to_arrow_type(dt): arrow_type = pa.decimal(dt.precision, dt.scale) elif type(dt) == StringType: arrow_type = pa.string() +elif type(dt) == DateType: +arrow_type = pa.date32() +elif type(dt) == TimestampType: +# Timestamps should be in UTC, JVM Arrow timestamps require a timezone to be read +arrow_type = pa.timestamp('us', tz='UTC') else: raise TypeError("Unsupported type in conversion to Arrow: " + str(dt)) return arrow_type +def _check_dataframe_localize_timestamps(df): +""" Convert timezone aware timestamps to timezone-naive in local time +""" +from pandas.types.common import is_datetime64tz_dtype --- End diff -- I am not sure this is the right public API, @jreback could you advise? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestam...
Github user wesm commented on a diff in the pull request: https://github.com/apache/spark/pull/18664#discussion_r145415575 --- Diff: python/pyspark/sql/types.py --- @@ -1619,11 +1619,39 @@ def to_arrow_type(dt): arrow_type = pa.decimal(dt.precision, dt.scale) elif type(dt) == StringType: arrow_type = pa.string() +elif type(dt) == DateType: +arrow_type = pa.date32() +elif type(dt) == TimestampType: +# Timestamps should be in UTC, JVM Arrow timestamps require a timezone to be read +arrow_type = pa.timestamp('us', tz='UTC') else: raise TypeError("Unsupported type in conversion to Arrow: " + str(dt)) return arrow_type +def _check_dataframe_localize_timestamps(df): +""" Convert timezone aware timestamps to timezone-naive in local time +""" +from pandas.types.common import is_datetime64tz_dtype +for column, series in df.iteritems(): +# TODO: handle nested timestamps? +if is_datetime64tz_dtype(series.dtype): +df[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None) +return df + + +def _check_series_convert_timestamps_internal(s): +""" Convert a tz-naive timestamp in local tz to UTC normalized for Spark internal storage +""" +from pandas.types.common import is_datetime64_dtype --- End diff -- API --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestam...
Github user wesm commented on a diff in the pull request: https://github.com/apache/spark/pull/18664#discussion_r145415564 --- Diff: python/pyspark/sql/types.py --- @@ -1619,11 +1619,39 @@ def to_arrow_type(dt): arrow_type = pa.decimal(dt.precision, dt.scale) elif type(dt) == StringType: arrow_type = pa.string() +elif type(dt) == DateType: +arrow_type = pa.date32() +elif type(dt) == TimestampType: +# Timestamps should be in UTC, JVM Arrow timestamps require a timezone to be read +arrow_type = pa.timestamp('us', tz='UTC') else: raise TypeError("Unsupported type in conversion to Arrow: " + str(dt)) return arrow_type +def _check_dataframe_localize_timestamps(df): +""" Convert timezone aware timestamps to timezone-naive in local time +""" +from pandas.types.common import is_datetime64tz_dtype +for column, series in df.iteritems(): +# TODO: handle nested timestamps? +if is_datetime64tz_dtype(series.dtype): +df[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None) +return df + + +def _check_series_convert_timestamps_internal(s): +""" Convert a tz-naive timestamp in local tz to UTC normalized for Spark internal storage +""" +from pandas.types.common import is_datetime64_dtype +# TODO: handle nested timestamps? +if is_datetime64_dtype(s.dtype): +# NOTE: convert to 'us' with astype here, unit is ignored in `from_pandas` see ARROW-1680 +return s.dt.tz_localize('tzlocal()').dt.tz_convert('UTC').values.astype('datetime64[us]') --- End diff -- Can you create a follow up JIRA to fix this as soon as Arrow 0.8.0 lands? I will make sure that this gets fixed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestam...
Github user wesm commented on a diff in the pull request: https://github.com/apache/spark/pull/18664#discussion_r145414587 --- Diff: python/pyspark/serializers.py --- @@ -223,12 +224,13 @@ def _create_batch(series): # If a nullable integer series has been promoted to floating point with NaNs, need to cast # NOTE: this is not necessary with Arrow >= 0.7 def cast_series(s, t): -if t is None or s.dtype == t.to_pandas_dtype(): +if t is None or s.dtype == t.to_pandas_dtype() or type(t) == pa.TimestampType: --- End diff -- Here `TimestampType` was removed from the pyarrow namespace in 0.7.0, but I opened a JIRA to add it back: https://issues.apache.org/jira/browse/ARROW-1683 We created a new `pyarrow.types` API which should replace these checks with `pa.types.is_timestamp(t)`, but that requires Arrow 0.8.0. I would recommend making this transition before Spark 2.3.0. The timeline for Arrow 0.8.0 is early November.
[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestam...
Github user wesm commented on a diff in the pull request: https://github.com/apache/spark/pull/18664#discussion_r145415365 --- Diff: python/pyspark/sql/types.py --- @@ -1619,11 +1619,39 @@ def to_arrow_type(dt): arrow_type = pa.decimal(dt.precision, dt.scale) elif type(dt) == StringType: arrow_type = pa.string() +elif type(dt) == DateType: +arrow_type = pa.date32() +elif type(dt) == TimestampType: +# Timestamps should be in UTC, JVM Arrow timestamps require a timezone to be read +arrow_type = pa.timestamp('us', tz='UTC') else: raise TypeError("Unsupported type in conversion to Arrow: " + str(dt)) return arrow_type +def _check_dataframe_localize_timestamps(df): +""" Convert timezone aware timestamps to timezone-naive in local time +""" +from pandas.types.common import is_datetime64tz_dtype +for column, series in df.iteritems(): +# TODO: handle nested timestamps? +if is_datetime64tz_dtype(series.dtype): +df[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None) --- End diff -- @jreback is this the best route to obtain tz-naive datetimes in localtime? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
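The two chained pandas calls under review have simple semantics: `tz_convert` changes the wall clock to the target zone, then `tz_localize(None)` drops the zone while keeping that wall clock. The same two steps can be sketched with stdlib `datetime`, using a fixed offset in place of `tzlocal()` so the result is deterministic; pandas applies the equivalent operations column-wise.

```python
from datetime import datetime, timezone, timedelta

# tz-aware UTC timestamp, as produced by the Arrow path discussed above
utc_aware = datetime(2017, 10, 18, 12, 0, tzinfo=timezone.utc)

local_zone = timezone(timedelta(hours=-7))    # illustrative stand-in for tzlocal()

converted = utc_aware.astimezone(local_zone)  # ~ series.dt.tz_convert(...)
naive_local = converted.replace(tzinfo=None)  # ~ .dt.tz_localize(None)

assert naive_local == datetime(2017, 10, 18, 5, 0)  # 12:00 UTC at UTC-7
```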
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19521 Thank you for the review, @HyukjinKwon.
[GitHub] spark issue #19505: [SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().apply()...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19505 So, looks like we are good to go?
[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18747#discussion_r145424106 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala --- @@ -23,21 +23,72 @@ import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.QueryPlan import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning} -import org.apache.spark.sql.execution.LeafExecNode -import org.apache.spark.sql.execution.metric.SQLMetrics -import org.apache.spark.sql.types.UserDefinedType +import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode, WholeStageCodegenExec} +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ case class InMemoryTableScanExec( attributes: Seq[Attribute], predicates: Seq[Expression], @transient relation: InMemoryRelation) - extends LeafExecNode { + extends LeafExecNode with ColumnarBatchScan { override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) ++ super.innerChildren - override lazy val metrics = Map( -"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows")) + override def vectorTypes: Option[Seq[String]] = + Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName)) + + /** + * If true, get data from ColumnVector in ColumnarBatch, which are generally faster. 
+ * If false, get data from UnsafeRow build from ColumnVector + */ + override val supportCodegen: Boolean = { +// In the initial implementation, for ease of review +// support only primitive data types and # of fields is less than wholeStageMaxNumFields +val schema = StructType.fromAttributes(relation.output) +schema.fields.find(f => f.dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | + FloatType | DoubleType => false + case _ => true +}).isEmpty && + !WholeStageCodegenExec.isTooManyFields(conf, relation.schema) && + children.find(p => WholeStageCodegenExec.isTooManyFields(conf, p.schema)).isEmpty --- End diff -- Sure --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18747#discussion_r145424996 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala --- @@ -23,21 +23,72 @@ import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.QueryPlan import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning} -import org.apache.spark.sql.execution.LeafExecNode -import org.apache.spark.sql.execution.metric.SQLMetrics -import org.apache.spark.sql.types.UserDefinedType +import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode, WholeStageCodegenExec} +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ case class InMemoryTableScanExec( attributes: Seq[Attribute], predicates: Seq[Expression], @transient relation: InMemoryRelation) - extends LeafExecNode { + extends LeafExecNode with ColumnarBatchScan { override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) ++ super.innerChildren - override lazy val metrics = Map( -"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows")) + override def vectorTypes: Option[Seq[String]] = + Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName)) + + /** + * If true, get data from ColumnVector in ColumnarBatch, which are generally faster. 
+ * If false, get data from UnsafeRow build from ColumnVector + */ + override val supportCodegen: Boolean = { +// In the initial implementation, for ease of review +// support only primitive data types and # of fields is less than wholeStageMaxNumFields +val schema = StructType.fromAttributes(relation.output) +schema.fields.find(f => f.dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | + FloatType | DoubleType => false + case _ => true +}).isEmpty && + !WholeStageCodegenExec.isTooManyFields(conf, relation.schema) && + children.find(p => WholeStageCodegenExec.isTooManyFields(conf, p.schema)).isEmpty + } + + private val columnIndices = +attributes.map(a => relation.output.map(o => o.exprId).indexOf(a.exprId)).toArray + + private val relationSchema = relation.schema.toArray + + private lazy val columnarBatchSchema = new StructType(columnIndices.map(i => relationSchema(i))) + + private def createAndDecompressColumn(cachedColumnarBatch: CachedBatch): ColumnarBatch = { +val rowCount = cachedColumnarBatch.numRows +val columnVectors = OnHeapColumnVector.allocateColumns(rowCount, columnarBatchSchema) --- End diff -- I agree that we can improve efficiency if we can reuse the `OnHeapColumnVector`. I think that it is not easy to reuse the `OnHeapColumnVector` between different cached batches. IIUC there is no point to know a cached batch will not be referenced. We rely the management of the lifetime on GC by creating `OnHeapColumnVector` every time. If we reuse the `OnHeapColumnVector` (i.e. keep a reference to `OnHeapColumnVector`), GC will not dispose `OnHeapColumnVector` even if the generated code will not use the `OnHeapColumnVector`. It means that uncompressed (huge) data lives for a long time. If we know the point where a cache batch will not be referenced, we could set null to `data` in `OnHeapColumnVector`. Thus, I currently create `OnHeapColumnVector`. What do you think? 
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestamp suppo...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18664 Sorry for joining this discussion so late. My 2 cents: Spark SQL only has timestamp without timezone, and internally we use a Java long to store the value of a timestamp. This is OK, as a timestamp is the milliseconds since the Unix epoch, which is timezone independent. However, the timezone information is needed when doing calculations like adding a day to a timestamp or converting a timestamp to a string. At that point, Spark SQL will pick SESSION_LOCAL_TIMEZONE instead of the Java local timezone (`TimeZone.getDefault()`). BTW, if Spark supported timestamp with timezone, then the timezone carried by the timestamp value should be picked. One special case is `Dataset.collect`: Spark converts internal rows to external rows, and thus converts the long value of a timestamp column to `java.sql.Timestamp`. At this point, timezone information is not needed, as `java.sql.Timestamp` is also timezone independent. Ideally users should use `SimpleDateFormat` to print a `java.sql.Timestamp` with a specific timezone, but if they just call `java.sql.Timestamp.toString`, the Java local timezone is picked, and they will get inconsistent results if the Java local timezone differs from Spark's SESSION_LOCAL_TIMEZONE. For PySpark's `DataFrame.collect`, we have a similar issue. When we convert the long value of a timestamp column to a Python timestamp, we use `datetime.datetime.fromtimestamp`, which respects the Python local timezone. So users will also get inconsistent results if the Python local timezone differs from Spark's SESSION_LOCAL_TIMEZONE. `DataFrame.toPandas` is a little different: I think we are able to set the timezone of the pandas DataFrame to Spark's SESSION_LOCAL_TIMEZONE instead of sticking with the Python local timezone. We can create a new JIRA to resolve it. Looking back at this PR, I think we should follow the existing rule: we should write the tz-naive timestamp (the long value in Spark SQL) to Arrow.
Then on the Python side we create a pandas DataFrame with this tz-naive timestamp. This behavior is exactly the same as the non-arrow-optimized `toPandas`.
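The inconsistency cloud-fan describes can be shown in miniature with stdlib `datetime`: a long epoch value is one timezone-independent instant, but it renders as different wall-clock times depending on which timezone interprets it. The two fixed offsets below are illustrative stand-ins for the Python local timezone and SESSION_LOCAL_TIMEZONE, not values Spark actually uses.

```python
from datetime import datetime, timezone, timedelta

epoch_micros = 1_508_306_400_000_000       # 2017-10-18 06:00:00 UTC, tz-independent
seconds = epoch_micros / 1_000_000

session_tz = timezone(timedelta(hours=8))  # stand-in for SESSION_LOCAL_TIMEZONE
python_tz = timezone(timedelta(hours=-5))  # stand-in for the Python local timezone

as_session = datetime.fromtimestamp(seconds, tz=session_tz)
as_local = datetime.fromtimestamp(seconds, tz=python_tz)

assert as_session == as_local              # same instant...
assert as_session.hour != as_local.hour    # ...different wall-clock rendering
```

This is why `collect` results look "wrong" whenever the interpreting timezone differs from the session's, even though the stored long never changes.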