[GitHub] spark issue #14959: [SPARK-17387][PYSPARK] Creating SparkContext() from pyth...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14959 **[Test build #66758 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66758/consoleFull)** for PR 14959 at commit [`d692e71`](https://github.com/apache/spark/commit/d692e715e9c4da3a00f18c1693b60772492a80e4). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15425: [SPARK-17816] [Core] [Branch-2.0] Fix ConcurrentM...
Github user seyfe closed the pull request at: https://github.com/apache/spark/pull/15425
[GitHub] spark issue #15425: [SPARK-17816] [Core] [Branch-2.0] Fix ConcurrentModifica...
Github user seyfe commented on the issue: https://github.com/apache/spark/pull/15425 Closing it since it's merged into 2.0.
[GitHub] spark pull request #15382: [SPARK-17810] [SQL] Default spark.sql.warehouse.d...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15382#discussion_r82878712

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -757,7 +758,10 @@ private[sql] class SQLConf extends Serializable with CatalystConf with Logging {
   def variableSubstituteDepth: Int = getConf(VARIABLE_SUBSTITUTE_DEPTH)

-  def warehousePath: String = new Path(getConf(WAREHOUSE_PATH)).toString
+  def warehousePath: String = {
+    val path = new Path(getConf(WAREHOUSE_PATH))
+    FileSystem.get(path.toUri, new Configuration()).makeQualified(path).toString
--- End diff --

There I'm looking at, for example, https://github.com/avulanov/spark/blob/ea24b59fe83c37dbab27579141b5c63cccee138d/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L141 in the test code. In non-test code I think it's the same source you copied, in SessionCatalog, line 154.

Although these code locations can also deal with a scheme vs no scheme, it seemed to be easier to deal with it upfront, where it's returned to the rest of the code from the conf object. I think it'll be the same code, same complexity either place.

The fact that `resolveURI` doesn't quite do what we want here suggests, I suppose, that lots of things in Spark aren't going to play well with a Windows path with spaces. See:

```
scala> resolveURI("file:///C:/My Programs/path")
res14: java.net.URI = file:/Users/srowen/file:/C:/My%20Programs/path

scala> resolveURI("/C:/My Programs/path")
res15: java.net.URI = file:/C:/My%20Programs/path

scala> resolveURI("C:/My Programs/path")
res16: java.net.URI = file:/Users/srowen/C:/My%20Programs/path

scala> resolveURI("/My Programs/path")
res17: java.net.URI = file:/My%20Programs/path
```

The second (possibly alternative absolute Windows path with space) and fourth examples (Linux path with space) happen to come out right.

If we're willing to also accept here that it's just what's going to work and not work, then, at least there is a working syntax for all scenarios when using this method. I believe there's an argument for further changing resolveURI to work with more variants here, but I'd have to even figure out first what is supposed to work and not work!
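One possible reading of the REPL output in srowen's comment is sketched below. It assumes that `resolveURI` first tries to parse its argument with `new URI(...)` and falls back to `java.io.File` when parsing fails or no scheme is present. The helper `resolveLike` is a hypothetical stand-in written for illustration, not Spark's code, and the results described apply to a Unix-like system.

```scala
import java.io.File
import java.net.URI
import scala.util.Try

object ResolveSketch {
  // Hypothetical stand-in for the behaviour discussed above; NOT Spark's actual resolveURI.
  def resolveLike(path: String): URI =
    Try(new URI(path)).toOption      // a raw space makes URI parsing fail
      .filter(_.getScheme != null)   // keep only strings that carry a scheme
      .getOrElse(new File(path).getAbsoluteFile.toURI) // otherwise treat it as a local file path

  def main(args: Array[String]): Unit =
    Seq("file:///C:/My Programs/path", "/My Programs/path", "file:///C:/My%20Programs/path")
      .foreach(p => println(s"$p -> ${resolveLike(p)}"))
}
```

Under this assumption, a string with an unencoded space never parses as a URI, so `file:///C:/My Programs/path` is treated as a relative file name and resolved against the working directory, which would account for the `file:/Users/srowen/file:/C:/...` result. A pre-encoded string such as `file:///C:/My%20Programs/path` parses cleanly and is returned unchanged by the sketch.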
[GitHub] spark issue #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to run in th...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15338 Jenkins add to whitelist
[GitHub] spark issue #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to run in th...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15338 **[Test build #66759 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66759/consoleFull)** for PR 15338 at commit [`cb89755`](https://github.com/apache/spark/commit/cb89755fd99ca52a926b89fedbc24fbca43dc0ad).
[GitHub] spark issue #15436: [SPARK-17875] [BUILD] Remove unneeded direct dependence ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15436 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66752/ Test FAILed.
[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14788 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66713/ Test FAILed.
[GitHub] spark pull request #15072: [SPARK-17123][SQL] Use type-widened encoder for D...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/15072#discussion_r82728542

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -53,7 +53,15 @@ import org.apache.spark.util.Utils
 private[sql] object Dataset {
   def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = {
-    new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
+    val encoder = implicitly[Encoder[T]]
+    if (encoder.clsTag.runtimeClass == classOf[Row]) {
+      // We should use the encoder generated from the executed plan rather than the existing
+      // encoder for DataFrame because the types of columns can be varied due to widening types.
+      // See SPARK-17123. This is a bit hacky. Maybe we should find a better way to do this.
+      ofRows(sparkSession, logicalPlan).asInstanceOf[Dataset[T]]
+    } else {
+      new Dataset(sparkSession, logicalPlan, encoder)
+    }
--- End diff --

Why would `except` fail? Only rows from the first dataset should be returned, or am I missing something?
[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14788 Merged build finished. Test FAILed.
[GitHub] spark pull request #15427: [SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropd...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/15427

[SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropduplicates

## What changes were proposed in this pull request?

Two issues regarding Dataset.dropduplicates:

1. Dataset.dropDuplicates should consider the columns with same column name

   We find and get the first resolved attribute from output with the given column name in `Dataset.dropDuplicates`. When there is more than one column with the same name, the other columns are put into aggregation columns instead of grouping columns.

2. Dataset.dropDuplicates should not change the output of child plan

   We create new `Alias` with new exprId in `Dataset.dropDuplicates` now. However, it causes a problem when we want to select the columns as follows:

       val ds = Seq(("a", 1), ("a", 2), ("b", 1), ("a", 1)).toDS()
       // ds("_2") will cause analysis exception
       ds.dropDuplicates("_1").select(ds("_1").as[String], ds("_2").as[Int])

Because the two issues are both related to `Dataset.dropduplicates` and the code changes are not big, they are submitted together as one PR.

## How was this patch tested?

Jenkins tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 fix-dropduplicates

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15427.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15427

commit dd6405c003ea082b1c614f2efed4d1bcb2d6f5b9
Author: Liang-Chi Hsieh
Date: 2016-10-11T06:08:44Z

    Fix Dataset.dropduplicates.
[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15285 Merged build finished. Test FAILed.
[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15285 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66716/ Test FAILed.
[GitHub] spark pull request #15297: [WIP][SPARK-9862]Handling data skew
Github user SaintBacchus commented on a diff in the pull request: https://github.com/apache/spark/pull/15297#discussion_r82730696

--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -138,13 +138,16 @@ private[spark] abstract class MapOutputTracker(conf: SparkConf) extends Logging
    * and the second item is a sequence of (shuffle block id, shuffle block size) tuples
    * describing the shuffle blocks that are stored at that block manager.
    */
-  def getMapSizesByExecutorId(shuffleId: Int, startPartition: Int, endPartition: Int)
+  def getMapSizesByExecutorId(shuffleId: Int, startPartition: Int, endPartition: Int,
+      mapid: Int = -1)
--- End diff --

It's better to use `Seq[Int]` so that many maps can be fetched at once.
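A minimal sketch of the suggestion above, using plain Scala collections rather than Spark's types. The object name, the `select` helper, and the `(mapId, blockSize)` pair representation are all hypothetical, chosen only to show how a `Seq[Int]` parameter with an empty default could replace the `-1` sentinel:

```scala
object MapIdFilterSketch {
  // (mapId, blockSize) pairs stand in for shuffle block metadata; illustrative only.
  // An empty `mapIds` means "no filtering", which takes over the role of the -1 sentinel.
  def select(blocks: Seq[(Int, Long)], mapIds: Seq[Int] = Nil): Seq[(Int, Long)] =
    if (mapIds.isEmpty) blocks
    else blocks.filter { case (mapId, _) => mapIds.contains(mapId) }
}
```

With this shape a caller can request several map outputs in one call, for example `MapIdFilterSketch.select(blocks, Seq(0, 2))`, instead of issuing one call per map id.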
[GitHub] spark pull request #15360: [SPARK-17073] [SQL] [FOLLOWUP] generate column-le...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15360#discussion_r82730881 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -358,50 +358,180 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils } } - test("generate column-level statistics and load them from hive metastore") { + test("test refreshing table stats of cached data source table by `ANALYZE TABLE` statement") { +val tableName = "tbl" +withTable(tableName) { + val tableIndent = TableIdentifier(tableName, Some("default")) + val catalog = spark.sessionState.catalog.asInstanceOf[HiveSessionCatalog] + sql(s"CREATE TABLE $tableName (key int) USING PARQUET") + sql(s"INSERT INTO $tableName SELECT 1") + sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS") + // Table lookup will make the table cached. + catalog.lookupRelation(tableIndent) + val stats1 = catalog.getCachedDataSourceTable(tableIndent).asInstanceOf[LogicalRelation] +.catalogTable.get.stats.get + assert(stats1.sizeInBytes > 0) + assert(stats1.rowCount.contains(1)) + + sql(s"INSERT INTO $tableName SELECT 2") + sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS") + catalog.lookupRelation(tableIndent) + val stats2 = catalog.getCachedDataSourceTable(tableIndent).asInstanceOf[LogicalRelation] +.catalogTable.get.stats.get + assert(stats2.sizeInBytes > stats1.sizeInBytes) + assert(stats2.rowCount.contains(2)) +} + } + + private def dataAndColStats(): (DataFrame, Seq[(StructField, ColumnStat)]) = { import testImplicits._ val intSeq = Seq(1, 2) val stringSeq = Seq("a", "bb") +val binarySeq = Seq("a", "bb").map(_.getBytes) val booleanSeq = Seq(true, false) - val data = intSeq.indices.map { i => - (intSeq(i), stringSeq(i), booleanSeq(i)) + (intSeq(i), stringSeq(i), binarySeq(i), booleanSeq(i)) } -val tableName = "table" -withTable(tableName) { - val df = data.toDF("c1", "c2", "c3") - df.write.format("parquet").saveAsTable(tableName) - val 
expectedColStatsSeq = df.schema.map { f => -val colStat = f.dataType match { - case IntegerType => -ColumnStat(InternalRow(0L, intSeq.max, intSeq.min, intSeq.distinct.length.toLong)) - case StringType => -ColumnStat(InternalRow(0L, stringSeq.map(_.length).sum / stringSeq.length.toDouble, - stringSeq.map(_.length).max.toLong, stringSeq.distinct.length.toLong)) - case BooleanType => -ColumnStat(InternalRow(0L, booleanSeq.count(_.equals(true)).toLong, - booleanSeq.count(_.equals(false)).toLong)) -} -(f, colStat) +val df = data.toDF("c1", "c2", "c3", "c4") +val expectedColStatsSeq = df.schema.map { f => + val colStat = f.dataType match { +case IntegerType => + ColumnStat(InternalRow(0L, intSeq.max, intSeq.min, intSeq.distinct.length.toLong)) +case StringType => + ColumnStat(InternalRow(0L, stringSeq.map(_.length).sum / stringSeq.length.toDouble, +stringSeq.map(_.length).max.toLong, stringSeq.distinct.length.toLong)) +case BinaryType => + ColumnStat(InternalRow(0L, binarySeq.map(_.length).sum / binarySeq.length.toDouble, +binarySeq.map(_.length).max.toLong)) +case BooleanType => + ColumnStat(InternalRow(0L, booleanSeq.count(_.equals(true)).toLong, +booleanSeq.count(_.equals(false)).toLong)) } + (f, colStat) +} +(df, expectedColStatsSeq) + } + + private def checkColStats( + tableName: String, + isDataSourceTable: Boolean, + expectedColStatsSeq: Seq[(StructField, ColumnStat)]): Unit = { +val readback = spark.table(tableName) +val stats = readback.queryExecution.analyzed.collect { + case rel: MetastoreRelation => +assert(!isDataSourceTable, "Expected a Hive serde table, but got a data source table") +rel.catalogTable.stats.get + case rel: LogicalRelation => +assert(isDataSourceTable, "Expected a data source table, but got a Hive serde table") +rel.catalogTable.get.stats.get +} +assert(stats.length == 1) +val columnStats = stats.head.colStats +assert(columnStats.size == expectedColStatsSeq.length) +expectedColStatsSeq.foreach { case (field, expectedColStat) => + 
StatisticsTest.checkColStat( +dataType = field.dataType,
[GitHub] spark pull request #15386: [SPARK-17808][PYSPARK] Upgraded version of Pyroli...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15386
[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15408 Merged build finished. Test PASSed.
[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15408 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66714/ Test PASSed.
[GitHub] spark pull request #15297: [WIP][SPARK-9862]Handling data skew
Github user SaintBacchus commented on a diff in the pull request: https://github.com/apache/spark/pull/15297#discussion_r82728585

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SkewShuffleRowRDD.scala ---
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import java.util.Arrays
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+
+class SkewCoalescedPartitioner(
+    val parent: Partitioner,
--- End diff --

Nit: code format
[GitHub] spark issue #14702: [SPARK-15694] Implement ScriptTransformation in sql/core...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14702 **[Test build #66723 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66723/consoleFull)** for PR 14702 at commit [`c7741f9`](https://github.com/apache/spark/commit/c7741f9560e619c1b3c2a30750e1386963cf5ece).
[GitHub] spark issue #15295: [SPARK-17720][SQL] introduce static SQL conf
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15295 **[Test build #66722 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66722/consoleFull)** for PR 15295 at commit [`0ad8815`](https://github.com/apache/spark/commit/0ad8815b9042aadefa506e1e106822aa4bee810f).
[GitHub] spark issue #15427: [SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropduplicat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15427 **[Test build #66724 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66724/consoleFull)** for PR 15427 at commit [`dd6405c`](https://github.com/apache/spark/commit/dd6405c003ea082b1c614f2efed4d1bcb2d6f5b9).
[GitHub] spark pull request #15360: [SPARK-17073] [SQL] [FOLLOWUP] generate column-le...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15360#discussion_r82730527 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -358,50 +358,180 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils } } - test("generate column-level statistics and load them from hive metastore") { + test("test refreshing table stats of cached data source table by `ANALYZE TABLE` statement") { +val tableName = "tbl" +withTable(tableName) { + val tableIndent = TableIdentifier(tableName, Some("default")) + val catalog = spark.sessionState.catalog.asInstanceOf[HiveSessionCatalog] + sql(s"CREATE TABLE $tableName (key int) USING PARQUET") + sql(s"INSERT INTO $tableName SELECT 1") + sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS") + // Table lookup will make the table cached. + catalog.lookupRelation(tableIndent) + val stats1 = catalog.getCachedDataSourceTable(tableIndent).asInstanceOf[LogicalRelation] +.catalogTable.get.stats.get + assert(stats1.sizeInBytes > 0) + assert(stats1.rowCount.contains(1)) + + sql(s"INSERT INTO $tableName SELECT 2") + sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS") + catalog.lookupRelation(tableIndent) + val stats2 = catalog.getCachedDataSourceTable(tableIndent).asInstanceOf[LogicalRelation] +.catalogTable.get.stats.get + assert(stats2.sizeInBytes > stats1.sizeInBytes) + assert(stats2.rowCount.contains(2)) +} + } + + private def dataAndColStats(): (DataFrame, Seq[(StructField, ColumnStat)]) = { import testImplicits._ val intSeq = Seq(1, 2) val stringSeq = Seq("a", "bb") +val binarySeq = Seq("a", "bb").map(_.getBytes) val booleanSeq = Seq(true, false) - val data = intSeq.indices.map { i => - (intSeq(i), stringSeq(i), booleanSeq(i)) + (intSeq(i), stringSeq(i), binarySeq(i), booleanSeq(i)) } -val tableName = "table" -withTable(tableName) { - val df = data.toDF("c1", "c2", "c3") - df.write.format("parquet").saveAsTable(tableName) - val 
expectedColStatsSeq = df.schema.map { f => -val colStat = f.dataType match { - case IntegerType => -ColumnStat(InternalRow(0L, intSeq.max, intSeq.min, intSeq.distinct.length.toLong)) - case StringType => -ColumnStat(InternalRow(0L, stringSeq.map(_.length).sum / stringSeq.length.toDouble, - stringSeq.map(_.length).max.toLong, stringSeq.distinct.length.toLong)) - case BooleanType => -ColumnStat(InternalRow(0L, booleanSeq.count(_.equals(true)).toLong, - booleanSeq.count(_.equals(false)).toLong)) -} -(f, colStat) +val df = data.toDF("c1", "c2", "c3", "c4") +val expectedColStatsSeq = df.schema.map { f => + val colStat = f.dataType match { +case IntegerType => + ColumnStat(InternalRow(0L, intSeq.max, intSeq.min, intSeq.distinct.length.toLong)) +case StringType => + ColumnStat(InternalRow(0L, stringSeq.map(_.length).sum / stringSeq.length.toDouble, +stringSeq.map(_.length).max.toLong, stringSeq.distinct.length.toLong)) +case BinaryType => + ColumnStat(InternalRow(0L, binarySeq.map(_.length).sum / binarySeq.length.toDouble, +binarySeq.map(_.length).max.toLong)) +case BooleanType => + ColumnStat(InternalRow(0L, booleanSeq.count(_.equals(true)).toLong, +booleanSeq.count(_.equals(false)).toLong)) } + (f, colStat) +} +(df, expectedColStatsSeq) + } + + private def checkColStats( + tableName: String, + isDataSourceTable: Boolean, + expectedColStatsSeq: Seq[(StructField, ColumnStat)]): Unit = { +val readback = spark.table(tableName) +val stats = readback.queryExecution.analyzed.collect { + case rel: MetastoreRelation => +assert(!isDataSourceTable, "Expected a Hive serde table, but got a data source table") +rel.catalogTable.stats.get + case rel: LogicalRelation => +assert(isDataSourceTable, "Expected a data source table, but got a Hive serde table") +rel.catalogTable.get.stats.get +} +assert(stats.length == 1) +val columnStats = stats.head.colStats +assert(columnStats.size == expectedColStatsSeq.length) +expectedColStatsSeq.foreach { case (field, expectedColStat) => + 
StatisticsTest.checkColStat( +dataType = field.dataType,
[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15285 **[Test build #66725 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66725/consoleFull)** for PR 15285 at commit [`7cc6935`](https://github.com/apache/spark/commit/7cc6935cfade55a54d866bf2431bb28ef2f2544a).
[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15377 Merged build finished. Test PASSed.
[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15377

**[Test build #66715 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66715/consoleFull)** for PR 15377 at commit [`7485ffa`](https://github.com/apache/spark/commit/7485ffaa3df508f35df4b878ed715eb1ece0f4db).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #15360: [SPARK-17073] [SQL] [FOLLOWUP] generate column-le...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15360#discussion_r82731661 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala --- @@ -62,7 +62,7 @@ case class AnalyzeColumnCommand( val statistics = Statistics( sizeInBytes = newTotalSize, rowCount = Some(rowCount), -colStats = columnStats ++ catalogTable.stats.map(_.colStats).getOrElse(Map())) +colStats = catalogTable.stats.map(_.colStats).getOrElse(Map()) ++ columnStats) --- End diff -- Could you leave a code comment here to emphasize it? I am just afraid this might be modified without notice. Newly computed stats should override the existing stats.
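The ordering concern in the comment above can be illustrated outside Spark. In Scala, `a ++ b` on maps keeps the right-hand value when a key appears on both sides, which is why the patch moves `columnStats` to the right. The following is only a Java analogue of that behaviour, not code from the PR; the class `StatsMerge` and the method `mergePreferringNew` are hypothetical names chosen for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Spark: mimics the override order of Scala's `existing ++ fresh`.
final class StatsMerge {
    /** Returns a new map with all existing entries; fresh entries win on key collisions. */
    static <K, V> Map<K, V> mergePreferringNew(Map<K, V> existing, Map<K, V> fresh) {
        Map<K, V> merged = new LinkedHashMap<>(existing);
        // Later puts replace earlier values for the same key,
        // so newly computed stats override the stored ones.
        merged.putAll(fresh);
        return merged;
    }
}
```

Swapping the two arguments would silently keep stale values, which is the regression the requested code comment is meant to guard against.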
[GitHub] spark pull request #15408: [SPARK-17839][CORE] Use Nio's directbuffer instea...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/15408#discussion_r82732276 --- Diff: core/src/main/java/org/apache/spark/io/NioBasedBufferedFileInputStream.java --- @@ -0,0 +1,127 @@ +/* + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.io; + +import org.apache.spark.storage.StorageUtils; + +import java.io.File; +import java.io.IOException; +import java.io.InputStream; +import java.nio.ByteBuffer; +import java.nio.channels.FileChannel; +import java.nio.file.StandardOpenOption; + +/** + * {@link InputStream} implementation which uses direct buffer + * to read a file to avoid extra copy of data between Java and + * native memory which happens when using {@link java.io.BufferedInputStream}. + * Unfortunately, this is not something already available in JDK, + * {@link sun.nio.ch.ChannelInputStream} supports reading a file using nio, + * but does not support buffering. 
+ * + */ +public final class NioBasedBufferedFileInputStream extends InputStream { + + private static int DEFAULT_BUFFER_SIZE_BYTES = 8192; + + private final ByteBuffer byteBuffer; + + private final FileChannel fileChannel; + + public NioBasedBufferedFileInputStream(File file, int bufferSizeInBytes) throws IOException { +byteBuffer = ByteBuffer.allocateDirect(bufferSizeInBytes); +fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.READ); +byteBuffer.flip(); + } + + public NioBasedBufferedFileInputStream(File file) throws IOException { +this(file, DEFAULT_BUFFER_SIZE_BYTES); + } + + /** + * Checks weather data is left to be read from the input stream. + * @return true if data is left, false otherwise + * @throws IOException + */ + private boolean refill() throws IOException { +if (!byteBuffer.hasRemaining()) { + byteBuffer.clear(); + int nRead = fileChannel.read(byteBuffer); + if (nRead == -1) { --- End diff -- It is possible to read 0 bytes and then, on the next call, read more than 0 bytes.
[GitHub] spark issue #15324: [SPARK-16872][ML] Gaussian Naive Bayes Classifier
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/15324 Jenkins, retest this please.
[GitHub] spark pull request #15072: [SPARK-17123][SQL] Use type-widened encoder for D...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15072#discussion_r82730295 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -53,7 +53,15 @@ import org.apache.spark.util.Utils private[sql] object Dataset { def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = { -new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]]) +val encoder = implicitly[Encoder[T]] +if (encoder.clsTag.runtimeClass == classOf[Row]) { + // We should use the encoder generated from the executed plan rather than the existing + // encoder for DataFrame because the types of columns can be varied due to widening types. + // See SPARK-17123. This is a bit hacky. Maybe we should find a better way to do this. + ofRows(sparkSession, logicalPlan).asInstanceOf[Dataset[T]] +} else { + new Dataset(sparkSession, logicalPlan, encoder) +} --- End diff -- Ah, here is the code I ran ```scala val dates = Seq( (new Date(0), BigDecimal.valueOf(1), new Timestamp(2)), (new Date(3), BigDecimal.valueOf(4), new Timestamp(5)) ).toDF("date", "timestamp", "decimal") val widenTypedRows = Seq( (new Timestamp(2), 10.5D, "string") ).toDF("date", "timestamp", "decimal") dates.except(widenTypedRows).collect() ``` and the error message:
```java 23:10:05.331 ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 30, Column 107: No applicable constructor/method found for actual parameters "long"; candidates are: "public static java.sql.Date org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" /* 001 */ public java.lang.Object generate(Object[] references) { /* 002 */ return new SpecificSafeProjection(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private InternalRow mutableRow; /* 009 */ private Object[] values; /* 010 */ private org.apache.spark.sql.types.StructType schema; /* 011 */ /* 012 */ public SpecificSafeProjection(Object[] references) { /* 013 */ this.references = references; /* 014 */ mutableRow = (InternalRow) references[references.length - 1]; /* 015 */ /* 016 */ this.schema = (org.apache.spark.sql.types.StructType) references[0]; /* 017 */ /* 018 */ } /* 019 */ /* 020 */ /* 021 */ /* 022 */ public java.lang.Object apply(java.lang.Object _i) { /* 023 */ InternalRow i = (InternalRow) _i; /* 024 */ /* 025 */ values = new Object[3]; /* 026 */ /* 027 */ boolean isNull2 = i.isNullAt(0); /* 028 */ long value2 = isNull2 ? -1L : (i.getLong(0)); /* 029 */ boolean isNull1 = isNull2; /* 030 */ final java.sql.Date value1 = isNull1 ? null : org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(value2); /* 031 */ isNull1 = value1 == null; /* 032 */ if (isNull1) { /* 033 */ values[0] = null; /* 034 */ } else { /* 035 */ values[0] = value1; /* 036 */ } /* 037 */ /* 038 */ boolean isNull4 = i.isNullAt(1); /* 039 */ double value4 = isNull4 ? 
-1.0 : (i.getDouble(1)); /* 040 */ /* 041 */ boolean isNull3 = isNull4; /* 042 */ java.math.BigDecimal value3 = null; /* 043 */ if (!isNull3) { /* 044 */ /* 045 */ Object funcResult = null; /* 046 */ funcResult = value4.toJavaBigDecimal(); /* 047 */ if (funcResult == null) { /* 048 */ isNull3 = true; /* 049 */ } else { /* 050 */ value3 = (java.math.BigDecimal) funcResult; /* 051 */ } /* 052 */ /* 053 */ } /* 054 */ isNull3 = value3 == null; /* 055 */ if (isNull3) { /* 056 */ values[1] = null; /* 057 */ } else { /* 058 */ values[1] = value3; /* 059 */ } /* 060 */ /* 061 */ boolean isNull6 = i.isNullAt(2); /* 062 */ UTF8String value6 = isNull6 ? null : (i.getUTF8String(2)); /* 063 */ boolean isNull5 = isNull6; /* 064 */ final java.sql.Timestamp value5 = isNull5 ? null : org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaTimestamp(value6); /* 065 */ isNull5 = value5 == null; /* 066 */ if (isNull5) { /* 067 */ values[2] = null; /* 068 */ } else { /* 069 */ values[2] = value5; /* 070 */ } /* 071 */ /* 072 */ final org.apache.spark.sql.Row value = new
[GitHub] spark issue #15386: [SPARK-17808][PYSPARK] Upgraded version of Pyrolite to 4...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15386 Merged to master
[GitHub] spark pull request #15360: [SPARK-17073] [SQL] [FOLLOWUP] generate column-le...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15360#discussion_r82731979 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -358,50 +358,180 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils } } - test("generate column-level statistics and load them from hive metastore") { + test("test refreshing table stats of cached data source table by `ANALYZE TABLE` statement") { --- End diff -- Could you deduplicate the two test cases `refreshing table stats` and `refreshing column stats` by calling the same common function?
[GitHub] spark pull request #15408: [SPARK-17839][CORE] Use Nio's directbuffer instea...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/15408#discussion_r82732392 --- Diff: core/src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java --- @@ -0,0 +1,129 @@ +/* + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.io; + +import org.apache.spark.storage.StorageUtils; + +import java.io.File; +import java.io.IOException; +import java.io.InputStream; +import java.nio.ByteBuffer; +import java.nio.channels.FileChannel; +import java.nio.file.StandardOpenOption; + +/** + * {@link InputStream} implementation which uses direct buffer + * to read a file to avoid extra copy of data between Java and + * native memory which happens when using {@link java.io.BufferedInputStream}. + * Unfortunately, this is not something already available in JDK, + * {@link sun.nio.ch.ChannelInputStream} supports reading a file using nio, + * but does not support buffering. 
+ * + * TODO: support {@link #mark(int)}/{@link #reset()} + * + */ +public final class NioBufferedFileInputStream extends InputStream { + + private static int DEFAULT_BUFFER_SIZE_BYTES = 8192; + + private final ByteBuffer byteBuffer; + + private final FileChannel fileChannel; + + public NioBufferedFileInputStream(File file, int bufferSizeInBytes) throws IOException { +byteBuffer = ByteBuffer.allocateDirect(bufferSizeInBytes); +fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.READ); +byteBuffer.flip(); + } + + public NioBufferedFileInputStream(File file) throws IOException { +this(file, DEFAULT_BUFFER_SIZE_BYTES); + } + + /** + * Checks weather data is left to be read from the input stream. + * @return true if data is left, false otherwise + * @throws IOException + */ + private boolean refill() throws IOException { +if (!byteBuffer.hasRemaining()) { + byteBuffer.clear(); + int nRead = fileChannel.read(byteBuffer); + if (nRead <= 0) { --- End diff -- Should only return false if nRead is -1, which is the contract specified by FileChannel.read(). We should make sure the thing still works if nRead is 0.
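As a small illustration of the contract mentioned in the comment above, the sketch below shows that `FileChannel.read(ByteBuffer)` can return 0 without the file being exhausted: a read into a buffer with no room left reports 0, while -1 is reserved for end-of-stream. The class `ReadContractDemo` and the method `observeReads` are hypothetical names for this example and are not part of the PR; it assumes a small regular file on a local filesystem.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not part of Spark.
final class ReadContractDemo {
    /** Performs three reads and returns the values reported by FileChannel.read. */
    static int[] observeReads(Path file, int bufferSize) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocateDirect(bufferSize);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            int first = channel.read(buffer);   // reads into the empty buffer
            int second = channel.read(buffer);  // if the buffer is now full, nothing can be read
            buffer.clear();                     // make room again
            int third = channel.read(buffer);   // reports -1 once the file is exhausted
            return new int[] {first, second, third};
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("read-contract", ".bin");
        try {
            Files.write(file, new byte[] {1, 2, 3, 4});
            // Buffer size equals file size, so the second read sees a full buffer.
            int[] results = observeReads(file, 4);
            System.out.println(results[0] + " " + results[1] + " " + results[2]);
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

With a 4-byte file and a 4-byte buffer, the middle read is the interesting one: it returns 0 even though the stream has not signalled end-of-file, so a `nRead <= 0` check conflates two different conditions.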
[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15285 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66718/ Test PASSed.
[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15285 **[Test build #66718 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66718/consoleFull)** for PR 15285 at commit [`e5676a6`](https://github.com/apache/spark/commit/e5676a6d4e60e7b7446bf525fb7003cb26efc448). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15377 **[Test build #66719 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66719/consoleFull)** for PR 15377 at commit [`df28bdd`](https://github.com/apache/spark/commit/df28bdddce5e4789a02cf7ef5dedab8b7c408630). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15408: [SPARK-17839][CORE] Use Nio's directbuffer instea...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15408#discussion_r82733349 --- Diff: core/src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java --- @@ -0,0 +1,129 @@ +/* + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.io; + +import org.apache.spark.storage.StorageUtils; + +import java.io.File; +import java.io.IOException; +import java.io.InputStream; +import java.nio.ByteBuffer; +import java.nio.channels.FileChannel; +import java.nio.file.StandardOpenOption; + +/** + * {@link InputStream} implementation which uses direct buffer + * to read a file to avoid extra copy of data between Java and + * native memory which happens when using {@link java.io.BufferedInputStream}. + * Unfortunately, this is not something already available in JDK, + * {@link sun.nio.ch.ChannelInputStream} supports reading a file using nio, + * but does not support buffering. 
+ * + * TODO: support {@link #mark(int)}/{@link #reset()} + * + */ +public final class NioBufferedFileInputStream extends InputStream { + + private static int DEFAULT_BUFFER_SIZE_BYTES = 8192; + + private final ByteBuffer byteBuffer; + + private final FileChannel fileChannel; + + public NioBufferedFileInputStream(File file, int bufferSizeInBytes) throws IOException { +byteBuffer = ByteBuffer.allocateDirect(bufferSizeInBytes); +fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.READ); +byteBuffer.flip(); + } + + public NioBufferedFileInputStream(File file) throws IOException { +this(file, DEFAULT_BUFFER_SIZE_BYTES); + } + + /** + * Checks weather data is left to be read from the input stream. + * @return true if data is left, false otherwise + * @throws IOException + */ + private boolean refill() throws IOException { +if (!byteBuffer.hasRemaining()) { + byteBuffer.clear(); + int nRead = fileChannel.read(byteBuffer); + if (nRead <= 0) { --- End diff -- The thing is, that will just result in an error on the next read(), because nothing was put in the buffer. Is that worse? I'm not even sure. But, calling it in a loop could result in an infinite loop. I'm accustomed to stopping reading when the result is 0 in cases like this.
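One possible middle ground between the two positions in this thread is to retry zero-length reads, but only a bounded number of times. The sketch below is not the code that was merged; `BoundedRefill`, `refill`, the `maxZeroReads` parameter, and the demo-only `stutteringChannel` helper are all assumptions made for illustration. It treats only -1 as end-of-stream and fails loudly rather than spinning if the channel keeps returning 0.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// Hypothetical sketch, not the implementation from the PR.
final class BoundedRefill {
    /**
     * Refills an exhausted buffer that is in read mode (as after flip()).
     * Returns false only at end of stream (-1). Zero-length reads are retried
     * up to maxZeroReads times; after that an IOException is thrown.
     */
    static boolean refill(ReadableByteChannel channel, ByteBuffer buffer, int maxZeroReads)
            throws IOException {
        if (buffer.hasRemaining()) {
            return true;
        }
        buffer.clear();
        int nRead = 0;
        int zeroReads = 0;
        while (nRead == 0) {
            nRead = channel.read(buffer);
            if (nRead == 0 && ++zeroReads > maxZeroReads) {
                buffer.flip(); // leave the buffer empty and in read mode
                throw new IOException("no progress after " + zeroReads + " zero-length reads");
            }
        }
        buffer.flip(); // on -1 nothing was written, so the buffer ends up empty
        return nRead > 0;
    }

    /** Demo-only channel: returns 0 for the first zeroReads calls, then the data, then -1. */
    static ReadableByteChannel stutteringChannel(byte[] data, int zeroReads) {
        return new ReadableByteChannel() {
            private int remainingZeroReads = zeroReads;
            private int position = 0;

            @Override
            public int read(ByteBuffer dst) {
                if (remainingZeroReads > 0) {
                    remainingZeroReads--;
                    return 0;
                }
                if (position >= data.length) {
                    return -1;
                }
                int n = Math.min(dst.remaining(), data.length - position);
                dst.put(data, position, n);
                position += n;
                return n;
            }

            @Override
            public boolean isOpen() {
                return true;
            }

            @Override
            public void close() {
            }
        };
    }
}
```

Whether a retry limit is appropriate for a blocking `FileChannel` is a judgment call; the limit here mainly addresses the infinite-loop worry while still honouring the -1 contract.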
[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15285 **[Test build #66716 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66716/consoleFull)** for PR 15285 at commit [`ef4f2b9`](https://github.com/apache/spark/commit/ef4f2b9dc1be33d56d7d4c93bddcfcc2a69a44e9). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #11459: [SPARK-13025] Allow users to set initial model in logist...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/11459 **[Test build #66726 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66726/consoleFull)** for PR 11459 at commit [`ab05aa6`](https://github.com/apache/spark/commit/ab05aa61b21a23531e43d082757a40bf2ab750d8).
[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15377 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66715/ Test PASSed.
[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15408 **[Test build #66714 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66714/consoleFull)** for PR 15408 at commit [`681ff62`](https://github.com/apache/spark/commit/681ff62409e1f6520057bdeafd991e2c12a0b232). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15285: [SPARK-17711] Compress rolled executor log
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15285 Merged build finished. Test PASSed.
[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15377 Merged build finished. Test PASSed.
[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15377 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66719/ Test PASSed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/15421 This fails on AppVeyor - any idea? ``` . Error: SPARK-17811: can create DataFrame containing NA as date and time (@test_sparkSQL.R#388) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 105.0 failed 1 times, most recent failure: Lost task 0.0 in stage 105.0 (TID 109, localhost): java.lang.NegativeArraySizeException ```
[GitHub] spark issue #15230: [SPARK-17657] [SQL] Disallow Users to Change Table Type
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15230 retest this please
[GitHub] spark pull request #15421: [SPARK-17811] SparkR cannot parallelize data.fram...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15421#discussion_r82733869 --- Diff: R/pkg/DESCRIPTION --- @@ -11,7 +11,8 @@ Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"), email = "felixche...@apache.org"), person(family = "The Apache Software Foundation", role = c("aut", "cph"))) URL: http://www.apache.org/ http://spark.apache.org/ -BugReports: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingBugReports +BugReports: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to ++Spark#ContributingtoSpark-ContributingBugReports --- End diff -- Do we know if this works if broken up into 2 lines?
[GitHub] spark issue #15324: [SPARK-16872][ML] Gaussian Naive Bayes Classifier
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15324 **[Test build #66727 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66727/consoleFull)** for PR 15324 at commit [`c6ad342`](https://github.com/apache/spark/commit/c6ad34258053a7df29e062ec6f1ab05a3b30302a).
[GitHub] spark issue #15230: [SPARK-17657] [SQL] Disallow Users to Change Table Type
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15230 cc @cloud-fan Could you review it again? Thanks!
[GitHub] spark pull request #15342: [SPARK-11560] [MLLIB] Optimize KMeans implementat...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15342#discussion_r82735096 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- @@ -499,13 +414,38 @@ object KMeans { * @param data Training points as an `RDD` of `Vector` types. * @param k Number of clusters to create. * @param maxIterations Maximum number of iterations allowed. + * @param initializationMode The initialization algorithm. This can either be "random" or + * "k-means||". (default: "k-means||") + * @param seed Random seed for cluster initialization. Default is to generate seed based + * on system time. + */ + @Since("2.1.0") + def train(data: RDD[Vector], --- End diff -- +1 @sethah
[GitHub] spark pull request #14788: [SPARK-17174][SQL] Add the support for TimestampT...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14788#discussion_r82735390 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -2374,14 +2374,14 @@ object functions { * @group datetime_funcs * @since 1.5.0 */ - def date_add(start: Column, days: Int): Column = withExpr { DateAdd(start.expr, Literal(days)) } + def date_add(start: Column, days: Int): Column = withExpr { AddDays(start.expr, Literal(days)) } --- End diff -- It's confusing to users that `date_add` adds days to the given date and `add_months` adds months to the given date. I think `add_days` and `add_months` are more consistent. Other databases (e.g. MySQL, Postgres) only have `date_add`, which adds an interval to the given date, so they don't need separate `add_days` and `add_months` functions. The function name is already released and may be hard to change, but changing the expression name to match the real logic seems good. @rxin any thoughts?
[GitHub] spark issue #15424: [SPARK-17338][SQL][follow-up] add global temp view
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15424 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66720/ Test PASSed.
[GitHub] spark issue #15426: [SPARK-17864][SQL] Mark data type APIs as stable (not De...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15426 cc @cloud-fan
[GitHub] spark pull request #15424: [SPARK-17338][SQL][follow-up] add global temp vie...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15424
[GitHub] spark issue #15426: [SPARK-17864][SQL] Mark data type APIs as stable (not De...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15426 merging to master!
[GitHub] spark pull request #15408: [SPARK-17839][CORE] Use Nio's directbuffer instea...
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15408#discussion_r82737475

    --- Diff: core/src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java ---
    @@ -0,0 +1,129 @@
    +/*
    + * Licensed under the Apache License, Version 2.0 (the "License");
    + * you may not use this file except in compliance with the License.
    + * You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.io;
    +
    +import org.apache.spark.storage.StorageUtils;
    +
    +import java.io.File;
    +import java.io.IOException;
    +import java.io.InputStream;
    +import java.nio.ByteBuffer;
    +import java.nio.channels.FileChannel;
    +import java.nio.file.StandardOpenOption;
    +
    +/**
    + * {@link InputStream} implementation which uses direct buffer
    + * to read a file to avoid extra copy of data between Java and
    + * native memory which happens when using {@link java.io.BufferedInputStream}.
    + * Unfortunately, this is not something already available in JDK,
    + * {@link sun.nio.ch.ChannelInputStream} supports reading a file using nio,
    + * but does not support buffering.
    + *
    + * TODO: support {@link #mark(int)}/{@link #reset()}
    + *
    + */
    +public final class NioBufferedFileInputStream extends InputStream {
    +
    + private static int DEFAULT_BUFFER_SIZE_BYTES = 8192;
    +
    + private final ByteBuffer byteBuffer;
    +
    + private final FileChannel fileChannel;
    +
    + public NioBufferedFileInputStream(File file, int bufferSizeInBytes) throws IOException {
    +byteBuffer = ByteBuffer.allocateDirect(bufferSizeInBytes);
    +fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.READ);
    +byteBuffer.flip();
    + }
    +
    + public NioBufferedFileInputStream(File file) throws IOException {
    +this(file, DEFAULT_BUFFER_SIZE_BYTES);
    + }
    +
    + /**
    + * Checks weather data is left to be read from the input stream.
    + * @return true if data is left, false otherwise
    + * @throws IOException
    + */
    + private boolean refill() throws IOException {
    +if (!byteBuffer.hasRemaining()) {
    + byteBuffer.clear();
    + int nRead = fileChannel.read(byteBuffer);
    + if (nRead <= 0) {
    --- End diff --

    It is pretty standard to rely only on -1. It is especially common in networking code to get 0 bytes read.
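The point of the comment above can be sketched as follows: a refill that treats only -1 as end of stream and simply retries when a read returns 0 bytes. This is a minimal standalone sketch, not the code from the pull request. The class `RefillSketch` and the helper `countBytes` are hypothetical names, and the sketch assumes a buffer size greater than zero (with a zero-capacity buffer every read returns 0 and the loop would not end).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class RefillSketch {
    // Refills an exhausted buffer from the channel.
    // Returns false only when the channel reports end of stream (-1).
    static boolean refill(FileChannel channel, ByteBuffer buffer) throws IOException {
        if (!buffer.hasRemaining()) {
            buffer.clear();
            int nRead = 0;
            // A read of 0 bytes is not end of stream, so try again.
            while (nRead == 0) {
                nRead = channel.read(buffer);
            }
            buffer.flip();
            if (nRead < 0) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical helper: counts the bytes in a file using refill().
    static long countBytes(Path path, int bufferSize) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocateDirect(bufferSize);
            buffer.flip(); // start with an empty, readable buffer
            long total = 0;
            while (refill(channel, buffer)) {
                total += buffer.remaining();
                buffer.position(buffer.limit()); // mark everything as consumed
            }
            return total;
        }
    }
}
```

A production stream would likely want a bound or a back-off on the retry loop; the sketch keeps it bare to show the -1 check in isolation.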
[GitHub] spark issue #11459: [SPARK-13025] Allow users to set initial model in logist...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/11459 **[Test build #66726 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66726/consoleFull)** for PR 11459 at commit [`ab05aa6`](https://github.com/apache/spark/commit/ab05aa61b21a23531e43d082757a40bf2ab750d8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15426: [SPARK-17864][SQL] Mark data type APIs as stable ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15426
[GitHub] spark pull request #15072: [SPARK-17123][SQL] Use type-widened encoder for D...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15072#discussion_r82737858

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -53,7 +53,15 @@ import org.apache.spark.util.Utils
     private[sql] object Dataset {
     def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = {
    -new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
    +val encoder = implicitly[Encoder[T]]
    +if (encoder.clsTag.runtimeClass == classOf[Row]) {
    + // We should use the encoder generated from the executed plan rather than the existing
    + // encoder for DataFrame because the types of columns can be varied due to widening types.
    + // See SPARK-17123. This is a bit hacky. Maybe we should find a better way to do this.
    + ofRows(sparkSession, logicalPlan).asInstanceOf[Dataset[T]]
    +} else {
    + new Dataset(sparkSession, logicalPlan, encoder)
    +}
    --- End diff --

    Ah-ha, it seems we are widening types for set operators[1].

    ```scala
    val dates = Seq(
      (new Date(0), BigDecimal.valueOf(1), new Timestamp(2)),
      (new Date(3), BigDecimal.valueOf(4), new Timestamp(5))
    ).toDF("date", "timestamp", "decimal")

    val widenTypedRows = Seq(
      (new Timestamp(2), 10.5D, "string")
    ).toDF("date", "timestamp", "decimal")

    dates.printSchema()
    dates.except(widenTypedRows).toDF().printSchema()
    ```

    prints

    ```
    root
     |-- date: date (nullable = true)
     |-- timestamp: decimal(38,18) (nullable = true)
     |-- decimal: timestamp (nullable = true)

    root
     |-- date: timestamp (nullable = true)
     |-- timestamp: double (nullable = true)
     |-- decimal: string (nullable = true)
    ```

    [1]https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L232-L249
[GitHub] spark issue #14847: [SPARK-17254][SQL] Filter can stop when the condition is...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14847 **[Test build #66730 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66730/consoleFull)** for PR 14847 at commit [`ccac04f`](https://github.com/apache/spark/commit/ccac04f1e788dfb278618a10ee9220c89df6a61d).
[GitHub] spark pull request #15072: [SPARK-17123][SQL] Use type-widened encoder for D...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15072#discussion_r82738407

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -53,7 +53,15 @@ import org.apache.spark.util.Utils
     private[sql] object Dataset {
     def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = {
    -new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
    +val encoder = implicitly[Encoder[T]]
    +if (encoder.clsTag.runtimeClass == classOf[Row]) {
    + // We should use the encoder generated from the executed plan rather than the existing
    + // encoder for DataFrame because the types of columns can be varied due to widening types.
    + // See SPARK-17123. This is a bit hacky. Maybe we should find a better way to do this.
    + ofRows(sparkSession, logicalPlan).asInstanceOf[Dataset[T]]
    +} else {
    + new Dataset(sparkSession, logicalPlan, encoder)
    +}
    --- End diff --

    Ah, it seems `intersect` looked fine only because the result did not contain any data. `intersect` also seems to fail:

    ```
    val dates = Seq(Tuple1(BigDecimal.valueOf(10.5))).toDF("decimal")
    val widenTypedRows = Seq(Tuple1(10.5D)).toDF("decimal")
    dates.intersect(widenTypedRows).collect()
    ```
[GitHub] spark issue #15342: [SPARK-11560] [MLLIB] Optimize KMeans implementation / r...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15342 **[Test build #66729 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66729/consoleFull)** for PR 15342 at commit [`ba52582`](https://github.com/apache/spark/commit/ba52582a1313be9d9febe215fe6f21b2d9be239f).
[GitHub] spark issue #15428: [SPARK-17219][ML] enchanced NaN value handling in Bucket...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15428 **[Test build #66731 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66731/consoleFull)** for PR 15428 at commit [`a3e4308`](https://github.com/apache/spark/commit/a3e43086dcf6ecee20461567e2cc506db29f80a7).
[GitHub] spark pull request #15429: [SPARK-17840] [DOCS] Add some pointers for wiki/C...
GitHub user srowen opened a pull request:

    https://github.com/apache/spark/pull/15429

    [SPARK-17840] [DOCS] Add some pointers for wiki/CONTRIBUTING.md in README.md and some warnings in PULL_REQUEST_TEMPLATE

    ## What changes were proposed in this pull request?

    Link to contributing wiki in PR template, README.md

    ## How was this patch tested?

    Doc-only change, tested by Jekyll

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-17840

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15429.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15429

commit 29941b9880719d410ba826f173739abd2091b463
Author: Sean Owen
Date: 2016-10-11T07:51:16Z

    Link to contributing wiki in PR template, README.md
[GitHub] spark issue #15428: [SPARK-17219][ML] enchanced NaN value handling in Bucket...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15428 **[Test build #66733 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66733/consoleFull)** for PR 15428 at commit [`cd8113c`](https://github.com/apache/spark/commit/cd8113c456870dd89754d0d48bfe6d0931ad05bb).
[GitHub] spark issue #15429: [SPARK-17840] [DOCS] Add some pointers for wiki/CONTRIBU...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15429 **[Test build #66732 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66732/consoleFull)** for PR 15429 at commit [`29941b9`](https://github.com/apache/spark/commit/29941b9880719d410ba826f173739abd2091b463).
[GitHub] spark issue #15388: [SPARK-17821][SQL] Support And and Or in Expression Cano...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15388 thanks, merging to master!
[GitHub] spark pull request #15428: [SPARK-17219][ML] enchanced NaN value handling in...
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15428#discussion_r82741770

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -128,8 +145,9 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
     * Binary searching in several buckets to place each data point.
     * @throws SparkException if a feature is < splits.head or > splits.last
     */
    - private[feature] def binarySearchForBuckets(splits: Array[Double], feature: Double): Double = {
    -if (feature.isNaN) {
    + private[feature] def binarySearchForBuckets
    + (splits: Array[Double], feature: Double, flag: String): Double = {
    --- End diff --

    Nit: I think the convention is to leave the open paren on the previous line.

    Doesn't this need to handle "skip" and "error"? Throw an exception on NaN if "error", or ignore it if "skip"?
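One way to picture the three modes raised in the comment above is the following Java sketch. It is an assumption-laden illustration, not the Bucketizer code under review: the class `BucketSketch`, the method `bucketFor`, the use of an empty `OptionalInt` to stand for a skipped row, and the choice of an extra trailing bucket for "keep" are all hypothetical.

```java
import java.util.Arrays;
import java.util.OptionalInt;

final class BucketSketch {
    // Returns the bucket index for feature, given sorted split points.
    // NaN handling follows the (assumed) handleInvalid mode:
    //   "keep"  -> NaN goes to an extra bucket after the regular ones
    //   "skip"  -> empty result, the caller drops the row
    //   other   -> treated as "error" and throws
    static OptionalInt bucketFor(double[] splits, double feature, String handleInvalid) {
        if (Double.isNaN(feature)) {
            switch (handleInvalid) {
                case "keep":
                    return OptionalInt.of(splits.length - 1);
                case "skip":
                    return OptionalInt.empty();
                default:
                    throw new IllegalArgumentException("NaN feature with handleInvalid=" + handleInvalid);
            }
        }
        // The last bucket is closed on the right, so the upper bound belongs to it.
        if (feature == splits[splits.length - 1]) {
            return OptionalInt.of(splits.length - 2);
        }
        int idx = Arrays.binarySearch(splits, feature);
        if (idx >= 0) {
            return OptionalInt.of(idx);
        }
        int insertPos = -idx - 1;
        if (insertPos == 0 || insertPos == splits.length) {
            throw new IllegalArgumentException("Feature " + feature + " is outside the split range");
        }
        return OptionalInt.of(insertPos - 1);
    }
}
```

With splits `{0.0, 1.0, 2.0}` there are two regular buckets, `[0, 1)` and `[1, 2]`, and under "keep" a NaN lands in a third bucket with index 2.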
[GitHub] spark pull request #15428: [SPARK-17219][ML] enchanced NaN value handling in...
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15428#discussion_r82741203

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala ---
    @@ -270,10 +270,10 @@ private[ml] trait HasFitIntercept extends Params {
     private[ml] trait HasHandleInvalid extends Params {
     /**
    - * Param for how to handle invalid entries. Options are skip (which will filter out rows with bad values), or error (which will throw an error). More options may be added later.
    + * Param for how to handle invalid entries. Options are skip (which will filter out rows with bad values), or error (which will throw an error), or keep (which will keep the bad values in certain way). More options may be added later.
    --- End diff --

    I'm neutral on the complexity that this adds, but not against it. It gets a little funny to say "keep invalid data" but I think we discussed that on the JIRA.
[GitHub] spark pull request #15297: [WIP][SPARK-9862]Handling data skew
Github user witgo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15297#discussion_r82742056

    --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
    @@ -138,13 +138,16 @@ private[spark] abstract class MapOutputTracker(conf: SparkConf) extends Logging
     * and the second item is a sequence of (shuffle block id, shuffle block size) tuples
     * describing the shuffle blocks that are stored at that block manager.
     */
    - def getMapSizesByExecutorId(shuffleId: Int, startPartition: Int, endPartition: Int)
    + def getMapSizesByExecutorId(shuffleId: Int, startPartition: Int, endPartition: Int,
    + mapid: Int = -1)
    --- End diff --

    `mapid` => `mapId`
[GitHub] spark pull request #15388: [SPARK-17821][SQL] Support And and Or in Expressi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15388
[GitHub] spark pull request #15297: [WIP][SPARK-9862]Handling data skew
Github user witgo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15297#discussion_r82743168

    --- Diff: core/src/main/scala/org/apache/spark/shuffle/ShuffleManager.scala ---
    @@ -48,7 +48,8 @@ private[spark] trait ShuffleManager {
     handle: ShuffleHandle,
     startPartition: Int,
     endPartition: Int,
    - context: TaskContext): ShuffleReader[K, C]
    + context: TaskContext,
    + mapid: Int = -1): ShuffleReader[K, C]
    --- End diff --

    `mapid` => `mapId`
[GitHub] spark issue #15405: [SPARK-15917][CORE] Added support for number of executor...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15405 **[Test build #3323 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3323/consoleFull)** for PR 15405 at commit [`bffedac`](https://github.com/apache/spark/commit/bffedac0756c98861f44dfa0967ec8477c63c4cc).
[GitHub] spark pull request #15423: [SPARK-17860][SQL] SHOW COLUMN's database conflic...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15423#discussion_r82744230

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
    @@ -168,17 +168,7 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
     * }}}
     */
     override def visitShowColumns(ctx: ShowColumnsContext): LogicalPlan = withOrigin(ctx) {
    -val table = visitTableIdentifier(ctx.tableIdentifier)
    -
    -val lookupTable = Option(ctx.db) match {
    - case None => table
    - case Some(db) if table.database.exists(_ != db) =>
    -operationNotAllowed(
    - s"SHOW COLUMNS with conflicting databases: '$db' != '${table.database.get}'",
    - ctx)
    - case Some(db) => TableIdentifier(table.identifier, Some(db.getText))
    -}
    -ShowColumnsCommand(lookupTable)
    +ShowColumnsCommand(Option(ctx.db).map(_.getText), visitTableIdentifier(ctx.tableIdentifier))
    --- End diff --

    FYI, MySQL will treat `SHOW COLUMNS FROM db1.tbl1 FROM db2` as `SHOW COLUMNS FROM tbl1 FROM db2`, i.e. if `FROM database` is specified, it will just ignore the database specified in table name, instead of reporting error.
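The MySQL behaviour described in the comment above amounts to a simple precedence rule, sketched here in Java. The class `ShowColumnsSketch` and the method `effectiveDatabase` are hypothetical names for illustration; this is neither Spark's parser nor MySQL's implementation.

```java
import java.util.Optional;

final class ShowColumnsSketch {
    // Picks the database to look the table up in.
    // An explicit `FROM database` clause wins; the qualifier on the
    // table name is used only when that clause is absent.
    static Optional<String> effectiveDatabase(Optional<String> fromDatabase,
                                              Optional<String> tableQualifier) {
        return fromDatabase.isPresent() ? fromDatabase : tableQualifier;
    }
}
```

Under this rule `SHOW COLUMNS FROM db1.tbl1 FROM db2` resolves to `db2` without an error, whereas the removed code in the diff rejected the statement when the two databases differed.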
[GitHub] spark pull request #15430: [SPARK-15957][Follow-up][ML][PySpark] Add Python ...
GitHub user yanboliang opened a pull request:

    https://github.com/apache/spark/pull/15430

    [SPARK-15957][Follow-up][ML][PySpark] Add Python API for RFormula forceIndexLabel.

    ## What changes were proposed in this pull request?

    Follow-up work of #13675, add Python API for ```RFormula forceIndexLabel```.

    ## How was this patch tested?

    Unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanboliang/spark spark-15957-python

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15430.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15430

commit e34610c216c45d9500423e8c49425ca5a327c52e
Author: Yanbo Liang
Date: 2016-10-11T08:17:11Z

    Add Python API for RFormula forceIndexLabel.
[GitHub] spark issue #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/9 I misread DB's meaning in my previous comment. I agree that the parameter settings of `initialModel`, if set, should take precedence. If it conflicts with an existing `k` then log a warning.
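The precedence rule agreed on in the comment above can be sketched like this. It is a hypothetical illustration in plain Java, not the KMeans estimator: the class `InitialModelSketch`, the method `effectiveK`, and the use of `java.util.logging` are assumptions made for the example.

```java
import java.util.OptionalInt;
import java.util.logging.Logger;

final class InitialModelSketch {
    private static final Logger LOG = Logger.getLogger(InitialModelSketch.class.getName());

    // Returns the number of clusters to use.
    // If an initial model is set, its k takes precedence over the
    // configured k, and a conflict between the two is logged as a warning.
    static int effectiveK(int configuredK, OptionalInt initialModelK) {
        if (!initialModelK.isPresent()) {
            return configuredK;
        }
        int modelK = initialModelK.getAsInt();
        if (modelK != configuredK) {
            LOG.warning("k=" + configuredK + " conflicts with initialModel (k=" + modelK
                + "); using the value from initialModel");
        }
        return modelK;
    }
}
```

Logging rather than throwing keeps existing pipelines running when a user sets both values, which matches the "log a warning" suggestion in the thread.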
[GitHub] spark pull request #15423: [SPARK-17860][SQL] SHOW COLUMN's database conflic...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15423#discussion_r82745403

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
    @@ -168,17 +168,7 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
     * }}}
     */
     override def visitShowColumns(ctx: ShowColumnsContext): LogicalPlan = withOrigin(ctx) {
    -val table = visitTableIdentifier(ctx.tableIdentifier)
    -
    -val lookupTable = Option(ctx.db) match {
    - case None => table
    - case Some(db) if table.database.exists(_ != db) =>
    -operationNotAllowed(
    - s"SHOW COLUMNS with conflicting databases: '$db' != '${table.database.get}'",
    - ctx)
    - case Some(db) => TableIdentifier(table.identifier, Some(db.getText))
    -}
    -ShowColumnsCommand(lookupTable)
    +ShowColumnsCommand(Option(ctx.db).map(_.getText), visitTableIdentifier(ctx.tableIdentifier))
    --- End diff --

    Good point!
[GitHub] spark issue #15427: [SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropduplicat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15427 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66724/ Test PASSed.
[GitHub] spark issue #15427: [SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropduplicat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15427 Merged build finished. Test PASSed.
[GitHub] spark issue #15428: [SPARK-17219][ML] enchanced NaN value handling in Bucket...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15428 **[Test build #66735 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66735/consoleFull)** for PR 15428 at commit [`5cd58b7`](https://github.com/apache/spark/commit/5cd58b776883d5185f2b050317d260b924ebef5e).
[GitHub] spark pull request #15342: [SPARK-11560] [MLLIB] Optimize KMeans implementat...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15342#discussion_r82734529 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- @@ -258,149 +252,106 @@ class KMeans private ( } } val initTimeInSeconds = (System.nanoTime() - initStartTime) / 1e9 -logInfo(s"Initialization with $initializationMode took " + "%.3f".format(initTimeInSeconds) + - " seconds.") +logInfo(f"Initialization with $initializationMode took $initTimeInSeconds%.3f seconds.") -val active = Array.fill(numRuns)(true) -val costs = Array.fill(numRuns)(0.0) - -var activeRuns = new ArrayBuffer[Int] ++ (0 until numRuns) +var converged = false +var cost = 0.0 var iteration = 0 val iterationStartTime = System.nanoTime() -instr.foreach(_.logNumFeatures(centers(0)(0).vector.size)) +instr.foreach(_.logNumFeatures(centers.head.vector.size)) -// Execute iterations of Lloyd's algorithm until all runs have converged -while (iteration < maxIterations && !activeRuns.isEmpty) { - type WeightedPoint = (Vector, Long) - def mergeContribs(x: WeightedPoint, y: WeightedPoint): WeightedPoint = { -axpy(1.0, x._1, y._1) -(y._1, x._2 + y._2) - } - - val activeCenters = activeRuns.map(r => centers(r)).toArray - val costAccums = activeRuns.map(_ => sc.doubleAccumulator) - - val bcActiveCenters = sc.broadcast(activeCenters) +// Execute iterations of Lloyd's algorithm until converged +while (iteration < maxIterations && !converged) { + val costAccum = sc.doubleAccumulator + val bcCenters = sc.broadcast(centers) // Find the sum and count of points mapping to each center val totalContribs = data.mapPartitions { points => -val thisActiveCenters = bcActiveCenters.value -val runs = thisActiveCenters.length -val k = thisActiveCenters(0).length -val dims = thisActiveCenters(0)(0).vector.size +val thisCenters = bcCenters.value +val dims = thisCenters.head.vector.size -val sums = Array.fill(runs, k)(Vectors.zeros(dims)) -val counts = Array.fill(runs, k)(0L) 
+val sums = Array.fill(thisCenters.length)(Vectors.zeros(dims)) +val counts = Array.fill(thisCenters.length)(0L) points.foreach { point => - (0 until runs).foreach { i => -val (bestCenter, cost) = KMeans.findClosest(thisActiveCenters(i), point) -costAccums(i).add(cost) -val sum = sums(i)(bestCenter) -axpy(1.0, point.vector, sum) -counts(i)(bestCenter) += 1 - } + val (bestCenter, cost) = KMeans.findClosest(thisCenters, point) + costAccum.add(cost) + val sum = sums(bestCenter) + axpy(1.0, point.vector, sum) + counts(bestCenter) += 1 } -val contribs = for (i <- 0 until runs; j <- 0 until k) yield { - ((i, j), (sums(i)(j), counts(i)(j))) -} -contribs.iterator - }.reduceByKey(mergeContribs).collectAsMap() - - bcActiveCenters.destroy(blocking = false) - - // Update the cluster centers and costs for each active run - for ((run, i) <- activeRuns.zipWithIndex) { -var changed = false -var j = 0 -while (j < k) { - val (sum, count) = totalContribs((i, j)) - if (count != 0) { -scal(1.0 / count, sum) -val newCenter = new VectorWithNorm(sum) -if (KMeans.fastSquaredDistance(newCenter, centers(run)(j)) > epsilon * epsilon) { - changed = true -} -centers(run)(j) = newCenter - } - j += 1 -} -if (!changed) { - active(run) = false - logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations") +counts.indices.filter(counts(_) > 0).map(j => (j, (sums(j), counts(j.iterator + }.reduceByKey { case ((sum1, count1), (sum2, count2)) => +axpy(1.0, sum2, sum1) +(sum1, count1 + count2) + }.collectAsMap() + + bcCenters.destroy(blocking = false) + + // Update the cluster centers and costs + converged = true + totalContribs.foreach { case (j, (sum, count)) => +scal(1.0 / count, sum) +val newCenter = new VectorWithNorm(sum) +if (converged && KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) { + converged =
[GitHub] spark issue #15424: [SPARK-17338][SQL][follow-up] add global temp view
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15424 Merged build finished. Test PASSed.
[GitHub] spark issue #15426: [SPARK-17864][SQL] Mark data type APIs as stable (not De...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15426 **[Test build #66721 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66721/consoleFull)** for PR 15426 at commit [`0cf7e72`](https://github.com/apache/spark/commit/0cf7e7211f4b8112c776f1ac6bc06d6d204e6fd8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15426: [SPARK-17864][SQL] Mark data type APIs as stable (not De...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15426 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66721/ Test PASSed.
[GitHub] spark issue #15424: [SPARK-17338][SQL][follow-up] add global temp view
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15424 **[Test build #66720 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66720/consoleFull)** for PR 15424 at commit [`0ff26d0`](https://github.com/apache/spark/commit/0ff26d0050b12917f0c801ba61d43d0ae4970f81). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15426: [SPARK-17864][SQL] Mark data type APIs as stable (not De...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15426 Merged build finished. Test PASSed.
[GitHub] spark pull request #15342: [SPARK-11560] [MLLIB] Optimize KMeans implementat...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15342#discussion_r82736118 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- @@ -531,6 +471,7 @@ object KMeans { * "k-means||". (default: "k-means||") */ @Since("0.8.0") + @deprecated("Use train method without 'runs'", "2.1.0") def train( --- End diff -- +1 @sethah
[GitHub] spark pull request #15408: [SPARK-17839][CORE] Use Nio's directbuffer instea...
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15408#discussion_r82737052

    --- Diff: core/src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java ---
    @@ -0,0 +1,129 @@
    +/*
    + * Licensed under the Apache License, Version 2.0 (the "License");
    + * you may not use this file except in compliance with the License.
    + * You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.io;
    +
    +import org.apache.spark.storage.StorageUtils;
    +
    +import java.io.File;
    +import java.io.IOException;
    +import java.io.InputStream;
    +import java.nio.ByteBuffer;
    +import java.nio.channels.FileChannel;
    +import java.nio.file.StandardOpenOption;
    +
    +/**
    + * {@link InputStream} implementation which uses direct buffer
    + * to read a file to avoid extra copy of data between Java and
    + * native memory which happens when using {@link java.io.BufferedInputStream}.
    + * Unfortunately, this is not something already available in JDK,
    + * {@link sun.nio.ch.ChannelInputStream} supports reading a file using nio,
    + * but does not support buffering.
    + *
    + * TODO: support {@link #mark(int)}/{@link #reset()}
    + *
    + */
    +public final class NioBufferedFileInputStream extends InputStream {
    +
    +  private static int DEFAULT_BUFFER_SIZE_BYTES = 8192;
    +
    +  private final ByteBuffer byteBuffer;
    +
    +  private final FileChannel fileChannel;
    +
    +  public NioBufferedFileInputStream(File file, int bufferSizeInBytes) throws IOException {
    +    byteBuffer = ByteBuffer.allocateDirect(bufferSizeInBytes);
    +    fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.READ);
    +    byteBuffer.flip();
    +  }
    +
    +  public NioBufferedFileInputStream(File file) throws IOException {
    +    this(file, DEFAULT_BUFFER_SIZE_BYTES);
    +  }
    +
    +  /**
    +   * Checks weather data is left to be read from the input stream.
    +   * @return true if data is left, false otherwise
    +   * @throws IOException
    +   */
    +  private boolean refill() throws IOException {
    +    if (!byteBuffer.hasRemaining()) {
    +      byteBuffer.clear();
    +      int nRead = fileChannel.read(byteBuffer);
    +      if (nRead <= 0) {
    --- End diff --

    It could be 0 forever though (dunno, FS error?) and then this creates an infinite loop. I went hunting through the SDK for some examples of dealing with this, because this has always been a tricky part of even `InputStream`.

    `java.nio.file.Files`:

    ```
    private static long copy(InputStream source, OutputStream sink)
        throws IOException
    {
        long nread = 0L;
        byte[] buf = new byte[BUFFER_SIZE];
        int n;
        while ((n = source.read(buf)) > 0) {
            sink.write(buf, 0, n);
            nread += n;
        }
        return nread;
    }

    ...

    private static byte[] read(InputStream source, int initialSize) throws IOException {
        int capacity = initialSize;
        byte[] buf = new byte[capacity];
        int nread = 0;
        int n;
        for (;;) {
            // read to EOF which may read more or less than initialSize (eg: file
            // is truncated while we are reading)
            while ((n = source.read(buf, nread, capacity - nread)) > 0)
                nread += n;

            // if last call to source.read() returned -1, we are done
            // otherwise, try to read one more byte; if that failed we're done too
            if (n < 0 || (n = source.read()) < 0)
                break;

            // one more byte was read; need to allocate a larger buffer
            if (capacity <= MAX_BUFFER_SIZE - capacity) {
                capacity = Math.max(capacity << 1, BUFFER_SIZE);
            } else {
                if (capacity == MAX_BUFFER_SIZE)
                    throw new OutOfMemoryError("Required array size too large");
                capacity = MAX_BUFFER_SIZE;
            }
            buf = Arrays.copyOf(buf, capacity);
            buf[nread++] = (byte)n;
        }
        return (capacity == nread) ? buf : Arrays.copyOf(buf, nread);
    }
    ```

    The first instance seems to assume that 0 bytes means it should stop. In the second instance it forces the `InputStream` to give a definitive answer by
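One way to act on the concern above is to treat only a negative return value as end of file and to cap how many consecutive zero-byte reads are tolerated. The following is a minimal sketch under stated assumptions, not the code from the PR: the class name `BoundedRefillSketch`, the helper `countBytes`, and the limit `MAX_EMPTY_READS` are hypothetical and exist only for illustration. Only `FileChannel.read(ByteBuffer)` and the standard `ByteBuffer` methods are real JDK APIs.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration; not the NioBufferedFileInputStream from the PR.
final class BoundedRefillSketch {

  // Assumed limit on consecutive zero-byte reads; the value is arbitrary.
  static final int MAX_EMPTY_READS = 16;

  /**
   * Ensures the buffer holds unread bytes.
   * Returns true if data is available, false once the channel reports EOF (-1).
   * Fails instead of spinning forever if the channel keeps returning 0.
   */
  static boolean refill(FileChannel channel, ByteBuffer buffer) throws IOException {
    if (buffer.hasRemaining()) {
      return true;
    }
    buffer.clear();
    int emptyReads = 0;
    while (true) {
      int nRead = channel.read(buffer);
      if (nRead < 0) {
        buffer.flip(); // nothing was written, so the buffer is now empty
        return false;
      }
      if (nRead > 0) {
        buffer.flip(); // expose the bytes that were just read
        return true;
      }
      if (++emptyReads >= MAX_EMPTY_READS) {
        throw new IOException("No progress after " + emptyReads + " zero-byte reads");
      }
    }
  }

  /** Reads the whole file through refill() and returns the number of bytes seen. */
  static long countBytes(Path path, int bufferSize) throws IOException {
    try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
      ByteBuffer buffer = ByteBuffer.allocateDirect(bufferSize);
      buffer.flip(); // start empty so that the first refill() reads from the file
      long total = 0;
      while (refill(channel, buffer)) {
        total += buffer.remaining();
        buffer.position(buffer.limit()); // mark everything as consumed
      }
      return total;
    }
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("refill-sketch", ".bin");
    try {
      Files.write(tmp, new byte[20000]);
      System.out.println(countBytes(tmp, 8192));
    } finally {
      Files.delete(tmp);
    }
  }
}
```

Throwing after a bounded number of empty reads is only one possible policy. Treating the first zero-byte read as end of data, as the quoted `copy` method does, is simpler but can truncate input if a zero-byte read is transient.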
[GitHub] spark pull request #14788: [SPARK-17174][SQL] Add the support for TimestampT...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/14788#discussion_r82737073 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -2374,14 +2374,14 @@ object functions { * @group datetime_funcs * @since 1.5.0 */ - def date_add(start: Column, days: Int): Column = withExpr { DateAdd(start.expr, Literal(days)) } + def date_add(start: Column, days: Int): Column = withExpr { AddDays(start.expr, Literal(days)) } --- End diff -- I don't get what you were suggesting here. Wouldn't it make more sense to make DateAdd expression support both adding Interval type and adding IntegralType (for days)?
[GitHub] spark issue #14788: [SPARK-17174][SQL] Add the support for TimestampType for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14788 **[Test build #66728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66728/consoleFull)** for PR 14788 at commit [`cd78330`](https://github.com/apache/spark/commit/cd783307fafa7987505906101e8e90148c589214).
[GitHub] spark issue #15426: [SPARK-17864][SQL] Mark data type APIs as stable (not De...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15426 LGTM
[GitHub] spark issue #11459: [SPARK-13025] Allow users to set initial model in logist...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/11459 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66726/ Test PASSed.
[GitHub] spark issue #11459: [SPARK-13025] Allow users to set initial model in logist...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/11459 Merged build finished. Test PASSed.
[GitHub] spark issue #14702: [SPARK-15694] Implement ScriptTransformation in sql/core...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14702 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66723/ Test FAILed.
[GitHub] spark issue #14702: [SPARK-15694] Implement ScriptTransformation in sql/core...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14702 **[Test build #66723 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66723/consoleFull)** for PR 14702 at commit [`c7741f9`](https://github.com/apache/spark/commit/c7741f9560e619c1b3c2a30750e1386963cf5ece). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15428: [SPARK-17219][ML] enchanced NaN value handling in...
GitHub user VinceShieh opened a pull request:

    https://github.com/apache/spark/pull/15428

    [SPARK-17219][ML] enchanced NaN value handling in Bucketizer

    ## What changes were proposed in this pull request?

    This PR is an enhancement of the PR with commit ID 57dc326bd00cf0a49da971e9c573c48ae28acaa2.
    NaN is a special value that is commonly treated as invalid. However, we find that there are cases where NaN values are also meaningful and therefore need special handling. We give users three options for dealing with NaN values: reserve an extra bucket for NaN values, remove the NaN values, or report an error, by passing "keep", "skip", or "error" (default) to setHandleInvalid.

    '''Before:
    val bucketizer: Bucketizer = new Bucketizer()
      .setInputCol("feature")
      .setOutputCol("result")
      .setSplits(splits)
    '''After:
    val bucketizer: Bucketizer = new Bucketizer()
      .setInputCol("feature")
      .setOutputCol("result")
      .setSplits(splits)
      .setHandleInvalid("keep")

    ## How was this patch tested?

    Tests added in QuantileDiscretizerSuite and BucketizerSuite

    Signed-off-by: VinceShieh

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VinceShieh/spark spark-17219_followup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15428.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15428

commit a3e43086dcf6ecee20461567e2cc506db29f80a7
Author: VinceShieh
Date: 2016-10-10T02:33:09Z

    [SPARK-17219][ML] enchance NaN value handling in Bucketizer

    This PR is an enhancement of the PR with commit ID 57dc326bd00cf0a49da971e9c573c48ae28acaa2.
    We give users three options for dealing with NaN values in the dataset: reserve an extra bucket for NaN values, remove the NaN values, or report an error, by passing "keep", "skip", or "error" (default) to setHandleInvalid.

    '''Before:
    val bucketizer: Bucketizer = new Bucketizer()
      .setInputCol("feature")
      .setOutputCol("result")
      .setSplits(splits)
    '''After:
    val bucketizer: Bucketizer = new Bucketizer()
      .setInputCol("feature")
      .setOutputCol("result")
      .setSplits(splits)
      .setHandleInvalid("skip")

    Signed-off-by: VinceShieh
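The three options described above can be illustrated without Spark. The following is a minimal stand-alone sketch, assuming that "keep" places NaN values in one extra bucket after the regular ones, that "skip" drops them, and that "error" throws. The class `NaNBucketSketch` and its methods are hypothetical and are not Spark's `Bucketizer` implementation. Only the option names "keep", "skip", and "error" come from the PR description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone illustration of the three NaN policies; not Spark code.
final class NaNBucketSketch {

  /**
   * Returns the bucket index for value, or -1 if the value should be dropped.
   * splits must be strictly increasing; bucket i covers [splits[i], splits[i + 1]),
   * except the last regular bucket, which also includes its upper bound.
   */
  static int bucketOf(double[] splits, double value, String handleInvalid) {
    if (Double.isNaN(value)) {
      switch (handleInvalid) {
        case "keep":
          return splits.length - 1; // assumed: one extra bucket after the regular ones
        case "skip":
          return -1; // caller drops the value
        case "error":
          throw new IllegalArgumentException("NaN value found");
        default:
          throw new IllegalArgumentException("Unsupported option: " + handleInvalid);
      }
    }
    if (value == splits[splits.length - 1]) {
      return splits.length - 2;
    }
    int idx = Arrays.binarySearch(splits, value);
    if (idx >= 0) {
      return idx; // value sits exactly on a lower bound
    }
    int insertPos = -idx - 1;
    if (insertPos == 0 || insertPos == splits.length) {
      throw new IllegalArgumentException("Value " + value + " is outside the splits");
    }
    return insertPos - 1;
  }

  /** Applies bucketOf to every value, omitting values that the policy drops. */
  static List<Integer> bucketize(double[] splits, double[] values, String handleInvalid) {
    List<Integer> result = new ArrayList<>();
    for (double v : values) {
      int bucket = bucketOf(splits, v, handleInvalid);
      if (bucket >= 0) {
        result.add(bucket);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    double[] splits = {0.0, 1.0, 2.0};
    double[] values = {0.5, Double.NaN, 1.5};
    System.out.println(bucketize(splits, values, "keep")); // prints [0, 2, 1]
    System.out.println(bucketize(splits, values, "skip")); // prints [0, 1]
  }
}
```

With splits `{0.0, 1.0, 2.0}` there are two regular buckets, so under the assumption above the NaN value lands in bucket 2 when "keep" is used.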
[GitHub] spark issue #15428: [SPARK-17219][ML] enchanced NaN value handling in Bucket...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15428 **[Test build #66731 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66731/consoleFull)** for PR 15428 at commit [`a3e4308`](https://github.com/apache/spark/commit/a3e43086dcf6ecee20461567e2cc506db29f80a7). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15425: [SPARK-17816] [Core] [Branch-2.0] Fix ConcurrentModifica...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15425 **[Test build #3321 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3321/consoleFull)** for PR 15425 at commit [`678ee6b`](https://github.com/apache/spark/commit/678ee6b1d6308a81a5c2d83a196144f29c80434b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14847: [SPARK-17254][SQL] Filter can stop when the condition is...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/14847 @viirya can you try to create a new operator for this optimization and make it work with whole-stage-codegen? thanks!
[GitHub] spark pull request #15072: [SPARK-17123][SQL] Use type-widened encoder for D...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/15072#discussion_r82743571 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -53,7 +53,15 @@ import org.apache.spark.util.Utils private[sql] object Dataset { def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = { -new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]]) +val encoder = implicitly[Encoder[T]] +if (encoder.clsTag.runtimeClass == classOf[Row]) { + // We should use the encoder generated from the executed plan rather than the existing + // encoder for DataFrame because the types of columns can be varied due to widening types. + // See SPARK-17123. This is a bit hacky. Maybe we should find a better way to do this. + ofRows(sparkSession, logicalPlan).asInstanceOf[Dataset[T]] +} else { + new Dataset(sparkSession, logicalPlan, encoder) +} --- End diff -- Yeah, I forgot about type widening.