[GitHub] spark pull request #22518: [SPARK-25482][SQL] ReuseSubquery can be useless w...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22518#discussion_r232558384 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala --- @@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext { assert(getNumSortsInQuery(query5) == 1) } } + + test("SPARK-25482: Reuse same Subquery in order to execute it only once") { +withTempView("t1", "t2") { + sql("create temporary view t1(a int) using parquet") + sql("create temporary view t2(b int) using parquet") + val plan = sql("select * from t2 where b > (select max(a) from t1)") --- End diff -- sorry it has been a long time and I don't quite remember the context. What was the problem we are trying to fix? This test looks nothing related to subquery reuse. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22887: [SPARK-25880][CORE] user set's hadoop conf should not ov...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22887 looks reasonable, cc @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22693: [SPARK-25701][SQL] Supports calculation of table ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22693#discussion_r232556859 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -115,26 +116,45 @@ class ResolveHiveSerdeTable(session: SparkSession) extends Rule[LogicalPlan] { class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] { override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case filterPlan @ Filter(_, SubqueryAlias(_, relation: HiveTableRelation)) => + val predicates = PhysicalOperation.unapply(filterPlan).map(_._2).getOrElse(Nil) + computeTableStats(relation, predicates) case relation: HiveTableRelation if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty => - val table = relation.tableMeta - val sizeInBytes = if (session.sessionState.conf.fallBackToHdfsForStatsEnabled) { -try { - val hadoopConf = session.sessionState.newHadoopConf() - val tablePath = new Path(table.location) - val fs: FileSystem = tablePath.getFileSystem(hadoopConf) - fs.getContentSummary(tablePath).getLength -} catch { - case e: IOException => -logWarning("Failed to get table size from hdfs.", e) -session.sessionState.conf.defaultSizeInBytes -} - } else { -session.sessionState.conf.defaultSizeInBytes + computeTableStats(relation) + } + + private def computeTableStats( + relation: HiveTableRelation, + predicates: Seq[Expression] = Nil): LogicalPlan = { +val table = relation.tableMeta +val sizeInBytes = if (session.sessionState.conf.fallBackToHdfsForStatsEnabled) { + try { +val hadoopConf = session.sessionState.newHadoopConf() +val tablePath = new Path(table.location) +val fs: FileSystem = tablePath.getFileSystem(hadoopConf) +BigInt(fs.getContentSummary(tablePath).getLength) + } catch { +case e: IOException => + logWarning("Failed to get table size from hdfs.", e) + getSizeInBytesFromTablePartitions(table.identifier, predicates) } +} else { + 
getSizeInBytesFromTablePartitions(table.identifier, predicates) +} +val withStats = table.copy(stats = Some(CatalogStatistics(sizeInBytes = sizeInBytes))) +relation.copy(tableMeta = withStats) + } - val withStats = table.copy(stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes - relation.copy(tableMeta = withStats) + private def getSizeInBytesFromTablePartitions( + tableIdentifier: TableIdentifier, + predicates: Seq[Expression] = Nil): BigInt = { +session.sessionState.catalog.listPartitionsByFilter(tableIdentifier, predicates) match { --- End diff -- How come https://github.com/apache/spark/pull/22743 solves this problem? That PR targets to invalidate cache when configurations are changed. This PR targets to compute stats from HDFS when they are not available. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22944: [SPARK-25942][SQL] Aggregate expressions shouldn'...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22944#discussion_r232556359 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala --- @@ -1556,6 +1556,20 @@ class DatasetSuite extends QueryTest with SharedSQLContext { df.where($"city".contains(new java.lang.Character('A'))), Seq(Row("Amsterdam"))) } + + test("SPARK-25942: typed aggregation on primitive type") { +val ds = Seq(1, 2, 3).toDS() + +val agg = ds.groupByKey(_ >= 2) + .agg(sum("value").as[Long], sum($"value" + 1).as[Long]) --- End diff -- I think we should not make decisions for users. With the untyped APIs, users can refer to the grouping columns in aggregate expressions, and I think the typed APIs should behave the same way. In this particular case, Spark currently allows grouping columns inside aggregate functions, so the `value` here is indeed ambiguous. There is nothing we can do but fail and ask users to add an alias. BTW, we should check other databases to see whether "grouping columns inside aggregate functions" should be allowed.
[GitHub] spark issue #23005: [SPARK-26005] [SQL] Upgrade ANTRL from 4.7 to 4.7.1
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/23005 Thanks! Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23005: [SPARK-26005] [SQL] Upgrade ANTRL from 4.7 to 4.7...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23005 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22429: [SPARK-25440][SQL] Dumping query execution info to a fil...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22429 This is hard to review, do you mean we should add `maxFields: Option[Int]` to all the string related methods? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22976: [SPARK-25974][SQL]Optimizes Generates bytecode for order...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22976 LGTM except one comment, cc @rednaxelafx --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22955: [SPARK-25949][SQL] Add test for PullOutPythonUDFI...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22955 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22976: [SPARK-25974][SQL]Optimizes Generates bytecode fo...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22976#discussion_r232552336 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala --- @@ -68,62 +68,55 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR genComparisons(ctx, ordering) } + /** + * Creates the variables for ordering based on the given order. + */ + private def createOrderKeys( +ctx: CodegenContext, --- End diff -- 4-space indentation
[GitHub] spark issue #22955: [SPARK-25949][SQL] Add test for PullOutPythonUDFInJoinCo...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22955 thanks, merging to master! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22954 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22954 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98713/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22954 **[Test build #98713 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98713/testReport)** for PR 22954 at commit [`d9d9f98`](https://github.com/apache/spark/commit/d9d9f982d26a5dd2141515e0c9089243b7b93554). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22938#discussion_r232550860 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -1813,6 +1817,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { val path = dir.getCanonicalPath primitiveFieldAndType .toDF("value") +.repartition(1) --- End diff -- why is the `repartition` required? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22938#discussion_r232550733 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -1115,6 +1115,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { Row(null, null, null), Row(null, null, null), Row(null, null, null), +Row(null, null, null), --- End diff -- So for the JSON data source, the previous behavior was that we would skip the row even in PERMISSIVE mode. Shall we clearly mention it in the migration guide?
[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22938#discussion_r232550502 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -550,15 +550,23 @@ case class JsonToStructs( s"Input schema ${nullableSchema.catalogString} must be a struct, an array or a map.") } - // This converts parsed rows to the desired output by the given schema. @transient - lazy val converter = nullableSchema match { -case _: StructType => - (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null -case _: ArrayType => - (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getArray(0) else null -case _: MapType => - (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) else null + private lazy val castRow = nullableSchema match { +case _: StructType => (row: InternalRow) => row +case _: ArrayType => (row: InternalRow) => row.getArray(0) +case _: MapType => (row: InternalRow) => row.getMap(0) + } + + // This converts parsed rows to the desired output by the given schema. + private def convertRow(rows: Iterator[InternalRow]) = { +if (rows.hasNext) { + val result = rows.next() + // JSON's parser produces one record only. + assert(!rows.hasNext) + castRow(result) +} else { + throw new IllegalArgumentException("Expected one row from JSON parser.") --- End diff -- This can only happen when we have a bug, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
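The question above is whether the "no row" branch can ever be reached without a bug in the parser. A minimal sketch of the same "exactly one element" contract in plain Scala, outside of Spark, is shown here; `SingleRow` and `exactlyOne` are hypothetical names for illustration, and using `IllegalStateException` for the empty case is an assumption about how one might flag an internal bug, not what the PR does.

```scala
// Hypothetical helper, not part of Spark: converts an iterator that is
// expected to yield exactly one element into that element.
object SingleRow {
  def exactlyOne[A](rows: Iterator[A]): A = {
    if (!rows.hasNext) {
      // Reaching this branch would indicate a bug in the producer,
      // so it is reported as an illegal state rather than bad user input.
      throw new IllegalStateException("Expected one row from parser, got none.")
    }
    val result = rows.next()
    // The producer is assumed to emit one record only; require throws
    // IllegalArgumentException if a second element is present.
    require(!rows.hasNext, "Expected one row from parser, got more.")
    result
  }
}
```

For example, `SingleRow.exactlyOne(Iterator(42))` returns the single element, while an empty or two-element iterator raises an exception.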
[GitHub] spark pull request #22966: [SPARK-25965][SQL][TEST] Add avro read benchmark
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22966#discussion_r232550388 --- Diff: external/avro/src/test/scala/org/apache/spark/sql/execution/benchmark/AvroReadBenchmark.scala --- @@ -0,0 +1,226 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.benchmark + +import java.io.File + +import scala.util.Random + +import org.apache.spark.SparkConf +import org.apache.spark.benchmark.{Benchmark, BenchmarkBase} +import org.apache.spark.sql.{DataFrame, SparkSession} +import org.apache.spark.sql.catalyst.plans.SQLHelper +import org.apache.spark.sql.types._ + +/** + * Benchmark to measure Avro read performance. + * {{{ + * To run this benchmark: + * 1. without sbt: bin/spark-submit --class + *--jars , + * 2. build/sbt "avro/test:runMain " + * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "avro/test:runMain " + * Results will be written to "benchmarks/AvroReadBenchmark-results.txt". + * }}} + */ +object AvroReadBenchmark extends BenchmarkBase with SQLHelper { --- End diff -- @dongjoon-hyun OK, then I think this one is ready. 
[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22938#discussion_r232550186 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -15,6 +15,8 @@ displayTitle: Spark SQL Upgrading Guide - Since Spark 3.0, the `from_json` functions supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions but Spark 3.0 converts it to `Row(null)`. + - In Spark version 2.4 and earlier, JSON data source and the `from_json` function produced `null`s if there is no valid root JSON token in its input (` ` for example). Since Spark 3.0, such input is treated as a bad record and handled according to specified mode. For example, in the `PERMISSIVE` mode the ` ` input is converted to `Row(null, null)` if specified schema is `key STRING, value INT`. --- End diff -- > In Spark version 2.4 and earlier, JSON data source and the `from_json` function produced `null`s Shall we update this? According to what you said, JSON data source can't produce null. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22998: [SPARK-26001][SQL]Reduce memory copy when writing decima...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22998 I think this is wrong. We have to zero out the bytes even when writing a null decimal, so that two unsafe rows with the same values (including null values) are exactly the same in binary format.
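The invariant described above can be illustrated without Spark. The following is a minimal sketch, assuming a hypothetical fixed-width 16-byte slot in a reused buffer; `SlotDemo`, `writeValue`, and `writeNull` are illustrative names and not Spark APIs, and the null flag itself (which a real row format tracks separately) is left out.

```scala
import java.util.Arrays

// Hypothetical stand-in for a fixed-width field slot in a binary row format.
// None of these names are Spark APIs; they only illustrate the invariant.
object SlotDemo {
  val SlotSize = 16

  // Writes a value into the slot, zeroing the whole slot first.
  def writeValue(slot: Array[Byte], bytes: Array[Byte]): Unit = {
    require(bytes.length <= SlotSize, "value does not fit in the slot")
    Arrays.fill(slot, 0.toByte)
    System.arraycopy(bytes, 0, slot, 0, bytes.length)
  }

  // Marks the field as null. When zeroOut is false, stale bytes stay behind.
  def writeNull(slot: Array[Byte], zeroOut: Boolean): Unit = {
    if (zeroOut) Arrays.fill(slot, 0.toByte)
  }

  // A reused buffer that previously held a value and then receives a null.
  def reusedThenNull(zeroOut: Boolean): Array[Byte] = {
    val slot = new Array[Byte](SlotSize)
    writeValue(slot, Array[Byte](1, 2, 3, 4))
    writeNull(slot, zeroOut)
    slot
  }

  // A fresh buffer that only ever held a null.
  def freshNull(): Array[Byte] = new Array[Byte](SlotSize)
}
```

With `zeroOut = true` the reused slot is byte-for-byte equal to a fresh null slot; with `zeroOut = false` the two logically equal nulls differ in binary form, which is the situation the comment warns about.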
[GitHub] spark pull request #23012: [SPARK-26014][R] Deprecate R prior to version 3.4...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23012#discussion_r232549234 --- Diff: docs/index.md --- @@ -31,7 +31,8 @@ Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy locally on one machine --- all you need is to have `java` installed on your system `PATH`, or the `JAVA_HOME` environment variable pointing to a Java installation. -Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark {{site.SPARK_VERSION}} +Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. R prior to version 3.4 is deprecated as of Spark 3.0. --- End diff -- Ah, yea, I switched this to deprecate it for now. I was a bit curious about that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23012: [SPARK-26014][R] Deprecate R prior to version 3.4...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23012#discussion_r232549062 --- Diff: docs/index.md --- @@ -31,7 +31,8 @@ Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy locally on one machine --- all you need is to have `java` installed on your system `PATH`, or the `JAVA_HOME` environment variable pointing to a Java installation. -Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark {{site.SPARK_VERSION}} +Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. R prior to version 3.4 is deprecated as of Spark 3.0. --- End diff -- hmm, so R prior to version 3.4 is just deprecated, not dropped, in Spark 3.0?
[GitHub] spark pull request #23012: [SPARK-26014][R] Deprecate R prior to version 3.4...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23012#discussion_r232549053 --- Diff: docs/index.md --- @@ -31,7 +31,8 @@ Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy locally on one machine --- all you need is to have `java` installed on your system `PATH`, or the `JAVA_HOME` environment variable pointing to a Java installation. -Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark {{site.SPARK_VERSION}} +Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. R prior to version 3.4 is deprecated as of Spark 3.0. --- End diff -- Hm .. I was thinking we could change them when we actually drop the support. Technically it still supports 3.1+, although 3.1, 3.2, and 3.3 are deprecated.
[GitHub] spark pull request #23012: [SPARK-26014][R] Deprecate R prior to version 3.4...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23012#discussion_r232548211 --- Diff: docs/index.md --- @@ -31,7 +31,8 @@ Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy locally on one machine --- all you need is to have `java` installed on your system `PATH`, or the `JAVA_HOME` environment variable pointing to a Java installation. -Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark {{site.SPARK_VERSION}} +Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. R prior to version 3.4 is deprecated as of Spark 3.0. --- End diff -- 3.1+ -> 3.4+? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22974: [SPARK-22450][WIP][Core][MLLib][FollowUp] Safely registe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22974 **[Test build #98719 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98719/testReport)** for PR 22974 at commit [`0c529fb`](https://github.com/apache/spark/commit/0c529fb7830b78c45b3f2a98046da9fa3061185f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22974: [SPARK-22450][WIP][Core][MLLib][FollowUp] Safely registe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22974 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22974: [SPARK-22450][WIP][Core][MLLib][FollowUp] Safely registe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22974 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4945/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23005: [SPARK-26005] [SQL] Upgrade ANTRL from 4.7 to 4.7.1
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23005 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23005: [SPARK-26005] [SQL] Upgrade ANTRL from 4.7 to 4.7.1
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23005 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98711/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23005: [SPARK-26005] [SQL] Upgrade ANTRL from 4.7 to 4.7.1
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23005 **[Test build #98711 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98711/testReport)** for PR 23005 at commit [`4545977`](https://github.com/apache/spark/commit/45459776f2dd08f8180e152aae2702dfed190ed9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23012 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23012 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98718/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23012 **[Test build #98718 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98718/testReport)** for PR 23012 at commit [`dc2dbd9`](https://github.com/apache/spark/commit/dc2dbd923a1396ca5a7a950df35da57cc70c2ab8). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23009: SPARK-26011: pyspark app with "spark.jars.packages" conf...
Github user imatiach-msft commented on the issue: https://github.com/apache/spark/pull/23009 @shanyu can you update the name as [SPARK-26011][CORE][PYSPARK] according to the guidelines? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22974: [SPARK-22450][WIP][Core][MLLib][FollowUp] Safely registe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22974 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98712/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22974: [SPARK-22450][WIP][Core][MLLib][FollowUp] Safely registe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22974 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22974: [SPARK-22450][WIP][Core][MLLib][FollowUp] Safely registe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22974 **[Test build #98712 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98712/testReport)** for PR 22974 at commit [`7e97e45`](https://github.com/apache/spark/commit/7e97e450e110b9cdbe3610ee03e1ea65d5575d63). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22939#discussion_r232540931 --- Diff: R/pkg/R/functions.R --- @@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", schema = "characterOrstructType") column(jc) }) +#' @details +#' \code{schema_of_json}: Parses a JSON string and infers its schema in DDL format. +#' +#' @rdname column_collection_functions +#' @aliases schema_of_json schema_of_json,characterOrColumn-method +#' @examples +#' +#' \dontrun{ +#' json <- '{"name":"Bob"}' +#' df <- sql("SELECT * FROM range(1)") +#' head(select(df, schema_of_json(json)))} +#' @note schema_of_json since 3.0.0 +setMethod("schema_of_json", signature(x = "characterOrColumn"), + function(x, ...) { +if (class(x) == "character") { + col <- callJStatic("org.apache.spark.sql.functions", "lit", x) +} else { + col <- x@jc --- End diff -- Hmm .. do you mind if we go ahead for this one and talk later within 3.0? I think we're going to deal with this problem within 3.0 if I am not mistaken. I need to make one followup after this anyway. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23012 **[Test build #98718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98718/testReport)** for PR 23012 at commit [`dc2dbd9`](https://github.com/apache/spark/commit/dc2dbd923a1396ca5a7a950df35da57cc70c2ab8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23012 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22998: [SPARK-26001][SQL]Reduce memory copy when writing decima...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/22998 @kiszk thank you for reviewing it.

- when writing null decimals:
```
OpenJDK 64-Bit Server VM 1.8.0_163-b01 on Windows 7 6.1
Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
iter length 1048576:          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
before PR (input == null)              51 /   56         20.4          49.0       1.0X
after PR (input == null)                8 /    9        125.2           8.0       6.1X
```
- when writing non-null decimals:
```
OpenJDK 64-Bit Server VM 1.8.0_163-b01 on Windows 7 6.1
Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
iter length 1048576:          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
before PR (input != null)              52 /   53         20.3          49.2       1.0X
after PR (input != null)               54 /   56         19.3          51.7       1.0X
```
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23012 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4944/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23012 adding @srowen too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23012 Tests probably will fail since it produces warnings. cc @felixcheung. @shaneknapp, @viirya, @shivaram, @falaki, @mengxr, @yanboliang FYI. This PR is made per http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-SparkR-CRAN-feasibility-check-server-problem-td25605.html --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22721: [SPARK-25403][SQL] Refreshes the table after inserting t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22721 **[Test build #98717 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98717/testReport)** for PR 22721 at commit [`1e62a24`](https://github.com/apache/spark/commit/1e62a24bba8aaa949f3481ae3befe2db5c286edc). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22998: [SPARK-26001][SQL]Reduce memory copy when writing decima...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/22998 @mgaido91 thank you for reviewing it. I added a test case for "write a decimal with 16 bytes and then one with less than 8". With the current change, the remaining 8 bytes are not left dirty. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
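The concern behind that test case can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the PR's code: the 16-byte slot size, the buffer layout, and the helper names `write_reusing_slot` and `write_clearing_slot` are all hypothetical. It shows why a short value written after a long one leaves stale bytes unless the slot is cleared first.

```python
SLOT_BYTES = 16  # assumed fixed slot size reserved for a wide decimal


def write_reusing_slot(buf: bytearray, offset: int, payload: bytes) -> None:
    """Copy payload into the slot and leave the rest of the slot as it was."""
    buf[offset:offset + len(payload)] = payload


def write_clearing_slot(buf: bytearray, offset: int, payload: bytes) -> None:
    """Zero the whole slot first, then copy payload into it."""
    buf[offset:offset + SLOT_BYTES] = bytes(SLOT_BYTES)
    buf[offset:offset + len(payload)] = payload


wide = bytes(range(1, 17))  # 16 non-zero bytes
narrow = b"\x2a"            # fewer than 8 bytes

# Without clearing, bytes 1..15 of the earlier value survive the second write.
dirty = bytearray(SLOT_BYTES)
write_reusing_slot(dirty, 0, wide)
write_reusing_slot(dirty, 0, narrow)

# With clearing, only the new payload remains in the slot.
clean = bytearray(SLOT_BYTES)
write_clearing_slot(clean, 0, wide)
write_clearing_slot(clean, 0, narrow)
```

A regression test along the lines described in the comment would compare the row bytes after the second write against a freshly written row holding only the short value.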
[GitHub] spark pull request #23012: [SPARK-26014][R] Deprecate R prior to version 3.4...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/23012

[SPARK-26014][R] Deprecate R prior to version 3.4 in SparkR

## What changes were proposed in this pull request?

This PR proposes to bump up the minimum version of R from 3.1 to 3.4.

R 3.1.x is too old: it was released 4.5 years ago, while R 3.4.0 was released 1.5 years ago. Considering the timing of Spark 3.0, deprecating the lower versions and bumping up R to 3.4 might be a reasonable option.

It should be good to deprecate and drop support for R < 3.4.

In practice, nothing in particular is required within the R code as far as I can tell, except:

1. https://github.com/apache/spark/blob/master/R/pkg/src-native/string_hash_code.c
2. `env` is immutable in newer versions but was mutable in some older ones, if I remember correctly. This shouldn't be a big deal on the SparkR side.
3. We will need to upgrade Jenkins's R version to 3.4, which means we're not going to test R 3.1. This should be okay because we're already not testing R 3.2, 3.3 and 3.4. We test 3.5 in AppVeyor, and 3.1 in Jenkins.

## How was this patch tested?

Jenkins tests.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-26014 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23012.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23012

commit dc2dbd923a1396ca5a7a950df35da57cc70c2ab8 Author: hyukjinkwon Date: 2018-11-12T05:39:14Z Deprecate R prior to version 3.4 in SparkR
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22721: [SPARK-25403][SQL] Refreshes the table after inserting t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22721 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22721: [SPARK-25403][SQL] Refreshes the table after inserting t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22721 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4943/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23009: SPARK-26011: pyspark app with "spark.jars.packages" conf...
Github user imatiach-msft commented on the issue: https://github.com/apache/spark/pull/23009 Jenkins test this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23009: SPARK-26011: pyspark app with "spark.jars.packages" conf...
Github user imatiach-msft commented on the issue: https://github.com/apache/spark/pull/23009 LGTM, nice find --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23011: [SPARK-26013][R][BUILD] Upgrade R tools version from 3.4...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23011 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4942/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23011: [SPARK-26013][R][BUILD] Upgrade R tools version from 3.4...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23011 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23010: [SPARK-26012][SQL]Null and '' values should not cause dy...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23010 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4941/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23011: [SPARK-26013][R][BUILD] Upgrade R tools version from 3.4...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23011 **[Test build #98716 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98716/testReport)** for PR 23011 at commit [`b94d04a`](https://github.com/apache/spark/commit/b94d04ac80052ed50830239b06a08bf5b07603e6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23010: [SPARK-26012][SQL]Null and '' values should not cause dy...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23010 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23011: [SPARK-26013][R][BUILD] Upgrade R tools version from 3.4...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23011 cc @felixcheung --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23011: [SPARK-26013][R][BUILD] Upgrade R tools version f...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/23011 [SPARK-26013][R][BUILD] Upgrade R tools version from 3.4.0 to 3.5.1 in AppVeyor build ## What changes were proposed in this pull request? R tools 3.5.1 was released a few months ago. Spark currently uses 3.4.0. We should upgrade it in the AppVeyor build. ## How was this patch tested? AppVeyor builds. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-26013 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23011.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23011 commit b94d04ac80052ed50830239b06a08bf5b07603e6 Author: hyukjinkwon Date: 2018-11-12T05:02:23Z Upgrade R tools version to 3.5.1 in AppVeyor build --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23010: [SPARK-26012][SQL]Null and '' values should not cause dy...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23010 **[Test build #98715 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98715/testReport)** for PR 23010 at commit [`1f18e27`](https://github.com/apache/spark/commit/1f18e2786a26eb64c52925d8ecff2d6a2295ca16). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23010: [SPARK-26012][SQL]Null and '' values should not c...
GitHub user eatoncys opened a pull request: https://github.com/apache/spark/pull/23010

[SPARK-26012][SQL]Null and '' values should not cause dynamic partition failure of string types

## What changes were proposed in this pull request?

Dynamic partitioning fails when both '' and null are used as dynamic partition values at the same time. For example, the test below fails before this PR:

    test("Null and '' values should not cause dynamic partition failure of string types") {
      withTable("t1", "t2") {
        spark.range(3).write.saveAsTable("t1")
        spark.sql("select id, cast(case when id = 1 then '' else null end as string) as p" +
          " from t1").write.partitionBy("p").saveAsTable("t2")
        checkAnswer(spark.table("t2").sort("id"), Seq(Row(0, null), Row(1, null), Row(2, null)))
      }
    }

The error is: 'org.apache.hadoop.fs.FileAlreadyExistsException: File already exists'.

This PR guards against file conflicts by renaming the file when a conflict occurs.

## How was this patch tested?

Newly added test.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/eatoncys/spark dynamicPartition Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23010.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23010

commit 1f18e2786a26eb64c52925d8ecff2d6a2295ca16 Author: 10129659 Date: 2018-11-12T04:41:53Z Null and '' values should not cause dynamic partition failure of string types
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
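One way to picture the collision described in this PR is the small Python sketch below. It assumes that both null and the empty string are rendered as the same default-partition directory name (`__HIVE_DEFAULT_PARTITION__` here); the helper `partition_dir` and the row data are hypothetical and are not taken from the patch. Under that assumption a writer that sees two distinct partition keys ends up targeting one directory, which would explain a "File already exists" error.

```python
# Assumed placeholder directory name for missing partition values.
DEFAULT_PARTITION_NAME = "__HIVE_DEFAULT_PARTITION__"


def partition_dir(column, value):
    """Render one partition value as a directory name (hypothetical helper)."""
    if value is None or value == "":
        value = DEFAULT_PARTITION_NAME
    return f"{column}={value}"


# Rows shaped like the test in the PR description: id 1 gets '', the others get null.
rows = [(0, None), (1, ""), (2, None)]

distinct_keys = {p for _, p in rows}                     # what the writer groups by
distinct_dirs = {partition_dir("p", p) for _, p in rows}  # where the files would go
```

Here `distinct_keys` holds two values while `distinct_dirs` holds one, so two output files would be opened under the same directory.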
[GitHub] spark issue #22962: [SPARK-25921][PySpark] Fix barrier task run without Barr...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22962 cc @jiangxb1987 @MrBago --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22962: [SPARK-25921][PySpark] Fix barrier task run without Barr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22962 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22962: [SPARK-25921][PySpark] Fix barrier task run without Barr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22962 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98714/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22962: [SPARK-25921][PySpark] Fix barrier task run without Barr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22962 **[Test build #98714 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98714/testReport)** for PR 22962 at commit [`02555b8`](https://github.com/apache/spark/commit/02555b8fbdf85c3f2b5a92420479c168e14b573c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22184 @HyukjinKwon Thank you for your comments. Yes, this is only valid when upgrading from Spark 2.3 to 2.4. I will do it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22989: [SPARK-25986][Build] Banning throw new OutOfMemoryErrors
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/22989 Sorry for the late reply, and many thanks for all the reviewers' advice. I will address the comments soon. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22962: [SPARK-25921][PySpark] Fix barrier task run without Barr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22962 This makes sense to me in general. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22962: [SPARK-25921][PySpark] Fix barrier task run witho...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22962#discussion_r232528655 --- Diff: python/pyspark/tests.py --- @@ -618,10 +618,13 @@ def test_barrier_with_python_worker_reuse(self): """ Verify that BarrierTaskContext.barrier() with reused python worker. """ +self.sc._conf.set("spark.python.work.reuse", "true") --- End diff -- @xuanyuanking, this will probably need a separate test class, since it's also related to whether and how we start the worker. You can make a new class, run a simple job to make sure workers are created and reused, test it, and stop. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22962: [SPARK-25921][PySpark] Fix barrier task run without Barr...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/22962 @HyukjinKwon Thanks for your review. The comments are addressed, and the PR description and title are updated. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22962: [SPARK-25921][PySpark] Fix barrier task run witho...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22962#discussion_r232528333 --- Diff: python/pyspark/taskcontext.py --- @@ -144,10 +144,19 @@ def __init__(self): """Construct a BarrierTaskContext, use get instead""" pass +def __new__(cls): --- End diff -- Yep, doing this in `_getOrCreate` has the same effect. I was overthinking https://github.com/apache/spark/blob/aec0af4a952df2957e21d39d1e0546a36ab7ab86/python/pyspark/taskcontext.py#L44-L45 Deleted in 02555b8. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
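As a rough illustration of why handling this in `_getOrCreate` can be enough, the Python sketch below models a parent singleton and a subclass that shares its slot. The classes `Context` and `BarrierContext` are hypothetical stand-ins, not PySpark's `TaskContext` and `BarrierTaskContext`, and the logic is an assumption about the kind of fix being discussed: when a reused worker already holds a plain parent instance, the subclass accessor replaces it instead of returning the wrong type.

```python
class Context:
    """Hypothetical parent singleton, standing in for a task context."""

    _instance = None

    @classmethod
    def _getOrCreate(cls):
        # Create the shared instance lazily and always store it on Context.
        if Context._instance is None:
            Context._instance = cls()
        return Context._instance


class BarrierContext(Context):
    """Hypothetical subclass that shares the parent's singleton slot."""

    @classmethod
    def _getOrCreate(cls):
        # A reused worker may already hold a plain Context; replace it so
        # callers always receive a BarrierContext.
        if not isinstance(Context._instance, BarrierContext):
            Context._instance = cls()
        return Context._instance


plain = Context._getOrCreate()           # first task: plain context
barrier = BarrierContext._getOrCreate()  # later barrier task on the same worker
again = BarrierContext._getOrCreate()    # second call returns the same object
```

With this shape no `__new__` override is needed on the subclass, because the type check lives in the single accessor that every caller goes through.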
[GitHub] spark issue #22962: [SPARK-25921][PySpark] Fix barrier task run without Barr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22962 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22962: [SPARK-25921][PySpark] Fix barrier task run without Barr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22962 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4940/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22962: [SPARK-25921][PySpark] Fix barrier task run witho...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22962#discussion_r232527808 --- Diff: python/pyspark/tests.py --- @@ -614,6 +614,18 @@ def context_barrier(x): times = rdd.barrier().mapPartitions(f).map(context_barrier).collect() self.assertTrue(max(times) - min(times) < 1) +def test_barrier_with_python_worker_reuse(self): +""" +Verify that BarrierTaskContext.barrier() with reused python worker. +""" +rdd = self.sc.parallelize(range(4), 4) --- End diff -- Thanks, done in 02555b8. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22962: [SPARK-25921][PySpark] Fix BarrierTaskContext while pyth...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22962 **[Test build #98714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98714/testReport)** for PR 22962 at commit [`02555b8`](https://github.com/apache/spark/commit/02555b8fbdf85c3f2b5a92420479c168e14b573c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22954 **[Test build #98713 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98713/testReport)** for PR 22954 at commit [`d9d9f98`](https://github.com/apache/spark/commit/d9d9f982d26a5dd2141515e0c9089243b7b93554). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22954 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22954 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4939/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23009: SPARK-26011: pyspark app with "spark.jars.packages" conf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23009 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22087: [SPARK-25097][ML] Support prediction on single instance ...
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22087 I also exposed GMM's predictProbability. Could you please make a final pass? @srowen @felixcheung --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23009: SPARK-26011: pyspark app with "spark.jars.package...
GitHub user shanyu opened a pull request: https://github.com/apache/spark/pull/23009 SPARK-26011: pyspark app with "spark.jars.packages" config does not work SparkSubmit determines pyspark app by the suffix of primary resource but Livy uses "spark-internal" as the primary resource when calling spark-submit, therefore args.isPython is set to false in SparkSubmit.scala. The fix is to resolve maven coordinates not only when args.isPython is true, but also when primary resource is spark-internal. Tested the patch with Livy submitting pyspark app, spark-submit, pyspark with or without packages config. Signed-off-by: Shanyu Zhao You can merge this pull request into a Git repository by running: $ git pull https://github.com/shanyu/spark shanyu-26011 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23009.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23009 commit c8424aff80e33f9a3f5a7d19a04442c7dac701a4 Author: Shanyu Zhao Date: 2018-11-12T02:57:01Z SPARK-26011: pyspark app with "spark.jars.packages" config does not work SparkSubmit determines pyspark app by the suffix of primary resource but Livy uses "spark-internal" as the primary resource when calling spark-submit, therefore args.isPython is set to false in SparkSubmit.scala. The fix is to resolve maven coordinates not only when args.isPython is true, but also when primary resource is spark-internal. Signed-off-by: Shanyu Zhao --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
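The condition change described in this PR can be sketched as two predicates. This is a simplified Python model, not the Scala in SparkSubmit.scala: the function names are hypothetical, and the suffix check is reduced to `.py` only, which is an assumption made for brevity.

```python
SPARK_INTERNAL = "spark-internal"  # primary resource name Livy passes, per the PR text


def is_python_by_suffix(primary_resource: str) -> bool:
    """Simplified stand-in for detecting a pyspark app from the file suffix."""
    return primary_resource.endswith(".py")


def resolves_packages_before(primary_resource: str) -> bool:
    """Behaviour described as the bug: only suffix-detected Python apps qualify."""
    return is_python_by_suffix(primary_resource)


def resolves_packages_after(primary_resource: str) -> bool:
    """Behaviour described as the fix: 'spark-internal' also qualifies."""
    return is_python_by_suffix(primary_resource) or primary_resource == SPARK_INTERNAL
```

A direct `spark-submit app.py` call satisfies both predicates, while a Livy submission with `spark-internal` as the primary resource satisfies only the second one.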
[GitHub] spark issue #22954: [SPARK-25981][R] Enables Arrow optimization from R DataF...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22954 Yea, I will do the follow-up work right after this one gets merged. Thanks @felixcheung. Let me address the rest of the comments and wait for the Arrow release. @BryanCutler BTW, do you know the rough expected timing for the Arrow 0.12.0 release? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232525184 --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R --- @@ -307,6 +307,64 @@ test_that("create DataFrame from RDD", { unsetHiveContext() }) +test_that("createDataFrame Arrow optimization", { + skip_if_not_installed("arrow") + skip_if_not_installed("withr") --- End diff -- Maybe we should hold it for now .. because I realised R API for Arrow requires R 3.5.x and Jenkins's one is 3.1.x if I remember this correctly. Ideally, we could probably do that via AppVeyor if everything goes fine. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22954#discussion_r232525068 --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R --- @@ -307,6 +307,64 @@ test_that("create DataFrame from RDD", { unsetHiveContext() }) +test_that("createDataFrame Arrow optimization", { + skip_if_not_installed("arrow") + skip_if_not_installed("withr") + + conf <- callJMethod(sparkSession, "conf") + arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] + + callJMethod(conf, "set", "spark.sql.execution.arrow.enabled", "false") + tryCatch({ --- End diff -- Just to inject the finally .. :-) .. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23008 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98710/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23008 **[Test build #98710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98710/testReport)** for PR 23008 at commit [`9a81879`](https://github.com/apache/spark/commit/9a818797603f5804b32202d28474493c80966f58). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23008 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22974: [SPARK-22450][WIP][Core][MLLib][FollowUp] Safely registe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22974 **[Test build #98712 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98712/testReport)** for PR 22974 at commit [`7e97e45`](https://github.com/apache/spark/commit/7e97e450e110b9cdbe3610ee03e1ea65d5575d63). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22974: [SPARK-22450][WIP][Core][MLLib][FollowUp] Safely registe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22974 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4938/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22974: [SPARK-22450][WIP][Core][MLLib][FollowUp] Safely registe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22974 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23005: [SPARK-26005] [SQL] Upgrade ANTRL from 4.7 to 4.7.1
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23005 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4937/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23005: [SPARK-26005] [SQL] Upgrade ANTRL from 4.7 to 4.7.1
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23005 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23005: [SPARK-26005] [SQL] Upgrade ANTRL from 4.7 to 4.7.1
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23005 **[Test build #98711 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98711/testReport)** for PR 23005 at commit [`4545977`](https://github.com/apache/spark/commit/45459776f2dd08f8180e152aae2702dfed190ed9). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22974: [SPARK-22450][Core][MLLib][FollowUp] Safely register Mul...
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22974 @srowen I have some spare time, and will work on it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21766: [SPARK-24803][SQL] add support for numeric
Github user wangtao605 commented on the issue: https://github.com/apache/spark/pull/21766 > @wangtao605 Do you mind documenting our behavior in our Spark SQL doc? Yes, it's ok. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23008 **[Test build #98710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98710/testReport)** for PR 23008 at commit [`9a81879`](https://github.com/apache/spark/commit/9a818797603f5804b32202d28474493c80966f58). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22979#discussion_r232520110 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala --- @@ -149,8 +156,8 @@ class UnivocityParser( case dt: DecimalType => (d: String) => nullSafeDatum(d, name, nullable, options) { datum => -val value = new BigDecimal(datum.replaceAll(",", "")) -Decimal(value, dt.precision, dt.scale) +val bigDecimal = decimalParser.parse(datum).asInstanceOf[BigDecimal] --- End diff -- Sounds good if that's not difficult. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
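To see why the diff above moves away from stripping commas, here is a small Python sketch. It is not the Scala change itself: `parse_strip_commas` mirrors the removed `replaceAll(",", "")` line, while `parse_with_separators` is a hypothetical stand-in for a locale-aware parser that takes the grouping and decimal separators explicitly.

```python
from decimal import Decimal


def parse_strip_commas(datum: str) -> Decimal:
    """Mirror of the removed approach: drop commas, then parse."""
    return Decimal(datum.replace(",", ""))


def parse_with_separators(datum: str, grouping: str, decimal_point: str) -> Decimal:
    """Hypothetical locale-aware stand-in with explicit separators."""
    normalized = datum.replace(grouping, "").replace(decimal_point, ".")
    return Decimal(normalized)


us_value = parse_strip_commas("1,000.5")                 # correct for en-US style input
de_wrong = parse_strip_commas("1.000,5")                 # silently mis-parsed
de_right = parse_with_separators("1.000,5", ".", ",")    # parsed as one thousand and a half
```

The comma-stripping version only works when the comma is the grouping separator, so input written with a comma as the decimal point is parsed as a different number without raising an error.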
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23008 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22764: [SPARK-25765][ML] Add training cost to BisectingKMeans s...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22764 @mgaido91 I'm on thanksgiving vacation, will be back to community to help code review on Nov 21st. Sorry for the delay. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org