[GitHub] spark pull request #15172: [SPARK-13331] AES support for over-the-wire encry...
Github user cjjnjust commented on a diff in the pull request: https://github.com/apache/spark/pull/15172#discussion_r80846949 --- Diff: common/network-common/src/main/java/org/apache/spark/network/sasl/aes/SparkAesCipher.java --- @@ -0,0 +1,270 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.network.sasl.aes; + +import java.io.IOException; +import java.security.InvalidAlgorithmParameterException; +import java.security.InvalidKeyException; +import java.security.NoSuchAlgorithmException; +import java.util.Arrays; +import java.util.Properties; +import javax.crypto.Cipher; +import javax.crypto.Mac; +import javax.crypto.SecretKey; +import javax.crypto.ShortBufferException; +import javax.crypto.spec.SecretKeySpec; +import javax.crypto.spec.IvParameterSpec; +import javax.security.sasl.SaslException; + +import org.apache.commons.crypto.cipher.CryptoCipher; +import org.apache.commons.crypto.utils.Utils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * AES cipher for encryption and decryption. + */ +public class SparkAesCipher { --- End diff -- Thanks for elaboration. I agree with you to implement AES encryption not SASL encryption. And here are reasons why AES encryption implemented like this: First of all, to enable AES encryption it has to negotiate a pair of key and input vector through SASL, that's the connection between AES and SASL Encryption. Secondly, the patch is not intended to implement a SASL encryption but to implement the AES encryption by utilizing the existing encryption code provided by SASL encryption (the SaslEncryption class, which wraps the SaslEncryptionBackend as netty channelhandlers). So the patch do implement netty channel handler while function names are wrap/unwrap against write. Lastly, I think the code cannot be simpler if implement separated channel handlers, since AES encryption needs key negotiation, Integrity check(seq number and hmac) as I mentioned before. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelationProvid...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15263 Yeah, @HyukjinKwon . This PR is just to rewrite the implementation. No need to resolve all the comments here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15257: [SPARK-17683][SQL] Support ArrayType in Literal.apply
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15257 **[Test build #66023 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66023/consoleFull)** for PR 15257 at commit [`c9cdd29`](https://github.com/apache/spark/commit/c9cdd29424c27c600a9e22c0b8ddf1a5998b6194). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14897#discussion_r80846005 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala --- @@ -37,39 +37,14 @@ import org.apache.spark.util.{MutableURLClassLoader, Utils} */ private[sql] class SharedState(val sparkContext: SparkContext) extends Logging { - /** - * Class for caching query results reused in future executions. - */ - val cacheManager: CacheManager = new CacheManager - - /** - * A listener for SQL-specific [[org.apache.spark.scheduler.SparkListenerEvent]]s. - */ - val listener: SQLListener = createListenerAndUI(sparkContext) - + // Load hive-site.xml into hadoopConf and determine the warehouse path we want to use, based on + // the config from both hive and Spark SQL. Finally set the warehouse config value to sparkConf. { val configFile = Utils.getContextOrSparkClassLoader.getResource("hive-site.xml") if (configFile != null) { sparkContext.hadoopConfiguration.addResource(configFile) } - } - - /** - * A catalog that interacts with external systems. - */ - lazy val externalCatalog: ExternalCatalog = -SharedState.reflect[ExternalCatalog, SparkConf, Configuration]( - SharedState.externalCatalogClassName(sparkContext.conf), - sparkContext.conf, - sparkContext.hadoopConfiguration) - - /** - * A classloader used to load all user-added jar. - */ - val jarClassLoader = new NonClosableMutableURLClassLoader( -org.apache.spark.util.Utils.getContextOrSparkClassLoader) - { // Set the Hive metastore warehouse path to the one we use val tempConf = new SQLConf --- End diff -- it's really bad that we put this initializing code in the middle of this class. This initializing code sets the warehouse path to sparkConf, which means it changes the state of the instance. If any fields that are declared before this initializing code and need to read this states, e.g. `externalCatalog`, they must be `lazy val`. I think it's more reasonable to less error-prone to put this initializing code in the beginning of this class. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelatio...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15263#discussion_r80845890 --- Diff: docs/sql-programming-guide.md --- @@ -1058,7 +1059,21 @@ the Data Sources API. The following options are supported: The JDBC fetch size, which determines how many rows to fetch per round trip. This can help performance on JDBC drivers which default to low fetch size (eg. Oracle with 10 rows). - + + +batchsize + + The JDBC batch size, which determines how many rows to insert per round trip. This can help performance on JDBC drivers. + + + + +isolationLevel + + The transaction isolation level, which applies to current connection. Please refer the documenation in java.sql.Connection. --- End diff -- `current connection` -> `the current connection` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelatio...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15263#discussion_r80845726 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala --- @@ -50,67 +51,48 @@ class JdbcRelationProvider extends CreatableRelationProvider JDBCRelation(jdbcOptions.url, jdbcOptions.table, parts, properties)(sqlContext.sparkSession) } - /* - * The following structure applies to this code: - * |tableExists| !tableExists - * - * Ignore | BaseRelation | CreateTable, saveTable, BaseRelation - * ErrorIfExists | ERROR | CreateTable, saveTable, BaseRelation - * Overwrite* | (DropTable, CreateTable,) | CreateTable, saveTable, BaseRelation - * | saveTable, BaseRelation | - * Append | saveTable, BaseRelation | CreateTable, saveTable, BaseRelation - * - * *Overwrite & tableExists with truncate, will not drop & create, but instead truncate - */ override def createRelation( sqlContext: SQLContext, mode: SaveMode, parameters: Map[String, String], - data: DataFrame): BaseRelation = { -val jdbcOptions = new JDBCOptions(parameters) -val url = jdbcOptions.url -val table = jdbcOptions.table - + df: DataFrame): BaseRelation = { +val options = new JDBCOptions(parameters) +val url = options.url +val table = options.table +val createTableOptions = options.createTableOptions +val isTruncate = options.isTruncate val props = new Properties() props.putAll(parameters.asJava) -val conn = JdbcUtils.createConnectionFactory(url, props)() +val conn = JdbcUtils.createConnectionFactory(url, props)() try { val tableExists = JdbcUtils.tableExists(conn, url, table) + if (tableExists) { +mode match { + case SaveMode.Overwrite => +if (isTruncate && isCascadingTruncateTable(url).contains(false)) { + // In this case, we should truncate table and then load. + truncateTable(conn, table) + saveTable(df, url, table, props) +} else { + // Otherwise, do not truncate but just drop. + dropTable(conn, table) + createTable(df, url, table, createTableOptions, conn) + saveTable(df, url, table, props) +} + case SaveMode.Append => +saveTable(df, url, table, props) - val (doCreate, doSave) = (mode, tableExists) match { -case (SaveMode.Ignore, true) => (false, false) -case (SaveMode.ErrorIfExists, true) => throw new AnalysisException( - s"Table or view '$table' already exists, and SaveMode is set to ErrorIfExists.") -case (SaveMode.Overwrite, true) => - if (jdbcOptions.isTruncate && JdbcUtils.isCascadingTruncateTable(url) == Some(false)) { -JdbcUtils.truncateTable(conn, table) -(false, true) - } else { -JdbcUtils.dropTable(conn, table) -(true, true) - } -case (SaveMode.Append, true) => (false, true) -case (_, true) => throw new IllegalArgumentException(s"Unexpected SaveMode, '$mode'," + - " for handling existing tables.") -case (_, false) => (true, true) - } + case SaveMode.ErrorIfExists => +throw new AnalysisException( + s"Table or view '$table' already exists, and SaveMode is set to ErrorIfExists.") --- End diff -- `Table or view '$table' already exists. SaveMode: ErrorIfExists.` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15213: [SPARK-17644] [CORE] Do not add failedStages when...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/15213#discussion_r80845291 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -2105,6 +2107,59 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with Timeou assert(scheduler.getShuffleDependencies(rddE) === Set(shuffleDepA, shuffleDepC)) } + test("After one stage is aborted for too many failed attempts, subsequent stages" + +"still behave correctly on fetch failures") { +def fetchFailJob: Unit = { + val rdd1 = sc.makeRDD(Array(1, 2, 3, 4), 2).map(x => (x, 1)).groupByKey() + val shuffleHandle = +rdd1.dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]].shuffleHandle + rdd1.map { +case (x, _) if (x == 1) => + throw new FetchFailedException( +BlockManagerId("1", "1", 1), shuffleHandle.shuffleId, 0, 0, "test") +case (x, _) => x + }.count() +} + +def successJob: Unit = { + object FailThisAttempt { +val _fail = new AtomicBoolean(true) + } + val rdd1 = sc.makeRDD(Array(1, 2, 3, 4), 2).map(x => (x, 1)).groupByKey() + val shuffleHandle = +rdd1.dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]].shuffleHandle + rdd1.map { +case (x, _) if (x == 1) && FailThisAttempt._fail.getAndSet(false) => + throw new FetchFailedException( +BlockManagerId("1", "1", 1), shuffleHandle.shuffleId, 0, 0, "test") + } +} + +failAfter(60.seconds) { + val e = intercept[SparkException] { +fetchFailJob + } + assert(e.getMessage.contains("org.apache.spark.shuffle.FetchFailedException")) +} + +// The following job that fails due to fetching failure will hang without +// the fix for SPARK-17644 --- End diff -- Can you change to something like "Run a second job that will fail due to a fetch failure. This job will hang without the fix for SPARK-17644." --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15213: [SPARK-17644] [CORE] Do not add failedStages when...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/15213#discussion_r80845128 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -2105,6 +2107,59 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with Timeou assert(scheduler.getShuffleDependencies(rddE) === Set(shuffleDepA, shuffleDepC)) } + test("After one stage is aborted for too many failed attempts, subsequent stages" + +"still behave correctly on fetch failures") { +def fetchFailJob: Unit = { + val rdd1 = sc.makeRDD(Array(1, 2, 3, 4), 2).map(x => (x, 1)).groupByKey() + val shuffleHandle = +rdd1.dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]].shuffleHandle + rdd1.map { +case (x, _) if (x == 1) => + throw new FetchFailedException( +BlockManagerId("1", "1", 1), shuffleHandle.shuffleId, 0, 0, "test") +case (x, _) => x + }.count() +} + +def successJob: Unit = { + object FailThisAttempt { +val _fail = new AtomicBoolean(true) + } + val rdd1 = sc.makeRDD(Array(1, 2, 3, 4), 2).map(x => (x, 1)).groupByKey() + val shuffleHandle = +rdd1.dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]].shuffleHandle + rdd1.map { +case (x, _) if (x == 1) && FailThisAttempt._fail.getAndSet(false) => + throw new FetchFailedException( +BlockManagerId("1", "1", 1), shuffleHandle.shuffleId, 0, 0, "test") + } +} + +failAfter(60.seconds) { + val e = intercept[SparkException] { +fetchFailJob + } + assert(e.getMessage.contains("org.apache.spark.shuffle.FetchFailedException")) +} + +// The following job that fails due to fetching failure will hang without +// the fix for SPARK-17644 +failAfter(60.seconds) { --- End diff -- I think a shorter timeout would be appropriate here to avoid slow-ness when this fails...maybe 10 seconds? That still seems plenty conservative since the resubmit timeout is 200 millis. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15213: [SPARK-17644] [CORE] Do not add failedStages when...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/15213#discussion_r80845459 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -2105,6 +2107,59 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with Timeou assert(scheduler.getShuffleDependencies(rddE) === Set(shuffleDepA, shuffleDepC)) } + test("After one stage is aborted for too many failed attempts, subsequent stages" + +"still behave correctly on fetch failures") { +def fetchFailJob: Unit = { --- End diff -- to make this a little more descriptive / easy to read, how about calling the helper "runJobWithPersistentFetchFailure" and then add a comment that says "Runs a job that always encounters a fetch failure, so should eventually be aborted." --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15213: [SPARK-17644] [CORE] Do not add failedStages when...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/15213#discussion_r80845506 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -2105,6 +2107,59 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with Timeou assert(scheduler.getShuffleDependencies(rddE) === Set(shuffleDepA, shuffleDepC)) } + test("After one stage is aborted for too many failed attempts, subsequent stages" + +"still behave correctly on fetch failures") { +def fetchFailJob: Unit = { + val rdd1 = sc.makeRDD(Array(1, 2, 3, 4), 2).map(x => (x, 1)).groupByKey() + val shuffleHandle = +rdd1.dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]].shuffleHandle + rdd1.map { +case (x, _) if (x == 1) => + throw new FetchFailedException( +BlockManagerId("1", "1", 1), shuffleHandle.shuffleId, 0, 0, "test") +case (x, _) => x + }.count() +} + +def successJob: Unit = { --- End diff -- and for this perhaps call it "runJobWithTemporaryFetchFailure" and then comment saying "Runs a job that encounters a single fetch failure but succeeds on the second attempt" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15213: [SPARK-17644] [CORE] Do not add failedStages when...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/15213#discussion_r80845188 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -2105,6 +2107,59 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with Timeou assert(scheduler.getShuffleDependencies(rddE) === Set(shuffleDepA, shuffleDepC)) } + test("After one stage is aborted for too many failed attempts, subsequent stages" + +"still behave correctly on fetch failures") { +def fetchFailJob: Unit = { + val rdd1 = sc.makeRDD(Array(1, 2, 3, 4), 2).map(x => (x, 1)).groupByKey() + val shuffleHandle = +rdd1.dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]].shuffleHandle + rdd1.map { +case (x, _) if (x == 1) => + throw new FetchFailedException( +BlockManagerId("1", "1", 1), shuffleHandle.shuffleId, 0, 0, "test") +case (x, _) => x + }.count() +} + +def successJob: Unit = { + object FailThisAttempt { +val _fail = new AtomicBoolean(true) + } + val rdd1 = sc.makeRDD(Array(1, 2, 3, 4), 2).map(x => (x, 1)).groupByKey() + val shuffleHandle = +rdd1.dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]].shuffleHandle + rdd1.map { +case (x, _) if (x == 1) && FailThisAttempt._fail.getAndSet(false) => + throw new FetchFailedException( +BlockManagerId("1", "1", 1), shuffleHandle.shuffleId, 0, 0, "test") + } +} + +failAfter(60.seconds) { + val e = intercept[SparkException] { +fetchFailJob + } + assert(e.getMessage.contains("org.apache.spark.shuffle.FetchFailedException")) +} + +// The following job that fails due to fetching failure will hang without +// the fix for SPARK-17644 +failAfter(60.seconds) { + val e = intercept[SparkException] { +fetchFailJob + } + assert(e.getMessage.contains("org.apache.spark.shuffle.FetchFailedException")) +} + +failAfter(60.seconds) { + try { +successJob + } catch { +case e: Throwable => fail("this job should success") --- End diff -- Can you make this a little more descriptive -- maybe "A job with one fetch failure should eventually succeed"? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15213: [SPARK-17644] [CORE] Do not add failedStages when...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/15213#discussion_r80844742 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1256,11 +1257,13 @@ class DAGScheduler( if (disallowStageRetryForTest) { abortStage(failedStage, "Fetch failure will not retry stage due to testing config", None) +abortedStage = true } else if (failedStage.failedOnFetchAndShouldAbort(task.stageAttemptId)) { abortStage(failedStage, s"$failedStage (${failedStage.name}) " + s"has failed the maximum allowable number of " + s"times: ${Stage.MAX_CONSECUTIVE_FETCH_FAILURES}. " + s"Most recent failure reason: ${failureMessage}", None) +abortedStage = true } else if (failedStages.isEmpty) { --- End diff -- Instead of having the abortedStage variable, how about re-writing the "else if" statement to be: else { if (failedStages.isEmpty) { ... stuff currently in else-if ... } failedStages += failedStage failedStages += mapStage } That eliminates the confusion of multiple abortStage variables, as @zsxwing pointed out, and also makes the relationship between (i) adding the stage to failed stages and (ii) scheduling the Resubmit event more clear. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15213: [SPARK-17644] [CORE] Do not add failedStages when...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/15213#discussion_r80844841 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -2105,6 +2107,59 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with Timeou assert(scheduler.getShuffleDependencies(rddE) === Set(shuffleDepA, shuffleDepC)) } + test("After one stage is aborted for too many failed attempts, subsequent stages" + +"still behave correctly on fetch failures") { --- End diff -- Can you add the JIRA number here? That helps with tracking tests in the future. So something like "[SPARK-17644] After one stage is aborted..." --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15213: [SPARK-17644] [CORE] Do not add failedStages when...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/15213#discussion_r80844780 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -31,6 +32,7 @@ import org.apache.spark._ import org.apache.spark.broadcast.BroadcastManager import org.apache.spark.rdd.RDD import org.apache.spark.scheduler.SchedulingMode.SchedulingMode +import org.apache.spark.shuffle.FetchFailedException --- End diff -- nit: can you group this with the next import (so import org.apache.spark.shuffle.{FetchFailedException, MetadataFetchFailedException} --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelationProvid...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15263 @gatorsmile Would this be sensible if I submit another PR to clean up options for https://github.com/apache/spark/pull/15263#issuecomment-250072008 later as strictly the purpose of this PR is to clean up `JdbcRelationProvider`? I'd like to unify all the options within `JDBCOptions` and maybe I could fix the documentation together there if you think it makes sense. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15257: [SPARK-17683][SQL] Support ArrayType in Literal.apply
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15257 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66022/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15257: [SPARK-17683][SQL] Support ArrayType in Literal.apply
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15257 **[Test build #66022 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66022/consoleFull)** for PR 15257 at commit [`c57d93e`](https://github.com/apache/spark/commit/c57d93e879cbfe6ad673f1c9cd31277c659f766d). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15257: [SPARK-17683][SQL] Support ArrayType in Literal.apply
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15257 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelatio...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15263#discussion_r80845381 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala --- @@ -50,67 +51,48 @@ class JdbcRelationProvider extends CreatableRelationProvider JDBCRelation(jdbcOptions.url, jdbcOptions.table, parts, properties)(sqlContext.sparkSession) } - /* - * The following structure applies to this code: - * |tableExists| !tableExists - * - * Ignore | BaseRelation | CreateTable, saveTable, BaseRelation - * ErrorIfExists | ERROR | CreateTable, saveTable, BaseRelation - * Overwrite* | (DropTable, CreateTable,) | CreateTable, saveTable, BaseRelation - * | saveTable, BaseRelation | - * Append | saveTable, BaseRelation | CreateTable, saveTable, BaseRelation - * - * *Overwrite & tableExists with truncate, will not drop & create, but instead truncate - */ override def createRelation( sqlContext: SQLContext, mode: SaveMode, parameters: Map[String, String], - data: DataFrame): BaseRelation = { -val jdbcOptions = new JDBCOptions(parameters) -val url = jdbcOptions.url -val table = jdbcOptions.table - + df: DataFrame): BaseRelation = { +val options = new JDBCOptions(parameters) +val url = options.url +val table = options.table +val createTableOptions = options.createTableOptions +val isTruncate = options.isTruncate val props = new Properties() props.putAll(parameters.asJava) -val conn = JdbcUtils.createConnectionFactory(url, props)() +val conn = JdbcUtils.createConnectionFactory(url, props)() try { val tableExists = JdbcUtils.tableExists(conn, url, table) + if (tableExists) { +mode match { + case SaveMode.Overwrite => +if (isTruncate && isCascadingTruncateTable(url).contains(false)) { + // In this case, we should truncate table and then load. + truncateTable(conn, table) + saveTable(df, url, table, props) +} else { + // Otherwise, do not truncate but just drop. --- End diff -- This comment looks weird. In RDBMS, users might do `drop and recreate` a table, instead of truncating it. Thus, please change it to `// Otherwise, do not truncate but just drop and recreate` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15257: [SPARK-17683][SQL] Support ArrayType in Literal.apply
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15257 **[Test build #66022 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66022/consoleFull)** for PR 15257 at commit [`c57d93e`](https://github.com/apache/spark/commit/c57d93e879cbfe6ad673f1c9cd31277c659f766d). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15257: [SPARK-17683][SQL] Support ArrayType in Literal.a...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/15257#discussion_r80845375 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala --- @@ -52,13 +53,51 @@ object Literal { case t: Timestamp => Literal(DateTimeUtils.fromJavaTimestamp(t), TimestampType) case d: Date => Literal(DateTimeUtils.fromJavaDate(d), DateType) case a: Array[Byte] => Literal(a, BinaryType) +case a: Array[_] => + val elementType = componentTypeToDataType(a.getClass.getComponentType()) + val dataType = ArrayType(elementType) + val convert = CatalystTypeConverters.createToCatalystConverter(dataType) + Literal(convert(a), dataType) case i: CalendarInterval => Literal(i, CalendarIntervalType) case null => Literal(null, NullType) case v: Literal => v case _ => throw new RuntimeException("Unsupported literal type " + v.getClass + " " + v) } + private def componentTypeToDataType(clz: Class[_]): DataType = clz match { --- End diff -- @jaceklaskowski I checked these similar code though, I couldn't find better ideas to use reflection stuffs here in `ScalaReflection` because `Literal#apply` cannot catch type signatures in runtime. If possible, any better idea? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14897: [SPARK-17338][SQL] add global temp view
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14897 **[Test build #66021 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66021/consoleFull)** for PR 14897 at commit [`a683fba`](https://github.com/apache/spark/commit/a683fba7fdff0dd101be231a84a88fb5b8f3aef4). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15254: [SPARK-17679] [PYSPARK] remove unnecessary Py4J ListConv...
Github user JasonMWhite commented on the issue: https://github.com/apache/spark/pull/15254 @JoshRosen @lins05 As requested, I've removed all remaining explicit mentions of `ListConverter` and `MapConverter` as they seemed to all be doing the same thing - getting around https://github.com/bartdag/py4j/issues/161 and https://github.com/bartdag/py4j/issues/160. I'm not familiar with the code in pyspark-ml and pyspark-mllib, but it seemed straightforward and didn't introduce any regressions in the tests. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelatio...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15263#discussion_r80844741 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala --- @@ -50,67 +51,48 @@ class JdbcRelationProvider extends CreatableRelationProvider JDBCRelation(jdbcOptions.url, jdbcOptions.table, parts, properties)(sqlContext.sparkSession) } - /* - * The following structure applies to this code: - * |tableExists| !tableExists - * - * Ignore | BaseRelation | CreateTable, saveTable, BaseRelation - * ErrorIfExists | ERROR | CreateTable, saveTable, BaseRelation - * Overwrite* | (DropTable, CreateTable,) | CreateTable, saveTable, BaseRelation - * | saveTable, BaseRelation | - * Append | saveTable, BaseRelation | CreateTable, saveTable, BaseRelation - * - * *Overwrite & tableExists with truncate, will not drop & create, but instead truncate - */ override def createRelation( sqlContext: SQLContext, mode: SaveMode, parameters: Map[String, String], - data: DataFrame): BaseRelation = { -val jdbcOptions = new JDBCOptions(parameters) -val url = jdbcOptions.url -val table = jdbcOptions.table - + df: DataFrame): BaseRelation = { +val options = new JDBCOptions(parameters) +val url = options.url +val table = options.table +val createTableOptions = options.createTableOptions +val isTruncate = options.isTruncate val props = new Properties() props.putAll(parameters.asJava) -val conn = JdbcUtils.createConnectionFactory(url, props)() +val conn = JdbcUtils.createConnectionFactory(url, props)() try { val tableExists = JdbcUtils.tableExists(conn, url, table) + if (tableExists) { +mode match { + case SaveMode.Overwrite => +if (isTruncate && isCascadingTruncateTable(url).contains(false)) { + // In this case, we should truncate table and then load. + truncateTable(conn, table) + saveTable(df, url, table, props) +} else { + // Otherwise, do not truncate but just drop. + dropTable(conn, table) + createTable(df, url, table, createTableOptions, conn) + saveTable(df, url, table, props) +} + case SaveMode.Append => +saveTable(df, url, table, props) - val (doCreate, doSave) = (mode, tableExists) match { -case (SaveMode.Ignore, true) => (false, false) -case (SaveMode.ErrorIfExists, true) => throw new AnalysisException( - s"Table or view '$table' already exists, and SaveMode is set to ErrorIfExists.") -case (SaveMode.Overwrite, true) => - if (jdbcOptions.isTruncate && JdbcUtils.isCascadingTruncateTable(url) == Some(false)) { -JdbcUtils.truncateTable(conn, table) -(false, true) - } else { -JdbcUtils.dropTable(conn, table) -(true, true) - } -case (SaveMode.Append, true) => (false, true) -case (_, true) => throw new IllegalArgumentException(s"Unexpected SaveMode, '$mode'," + - " for handling existing tables.") -case (_, false) => (true, true) - } + case SaveMode.ErrorIfExists => +throw new AnalysisException( + s"Table or view '$table' already exists, and SaveMode is set to ErrorIfExists.") - if (doCreate) { -val schema = JdbcUtils.schemaString(data, url) -// To allow certain options to append when create a new table, which can be -// table_options or partition_options. -// E.g., "CREATE TABLE t (name string) ENGINE=InnoDB DEFAULT CHARSET=utf8" -val createtblOptions = jdbcOptions.createTableOptions -val sql = s"CREATE TABLE $table ($schema) $createtblOptions" -val statement = conn.createStatement -try { - statement.executeUpdate(sql) -} finally { - statement.close() + case SaveMode.Ignore => // Just ignore this case. --- End diff -- Please explicitly explain the behavior. `// Just ignore this case` is almost useless. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the
[GitHub] spark issue #15275: [SPARK-17702][SQL] Code generation including too many mu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15275 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15275: [SPARK-17702][SQL] Code generation including too many mu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15275 **[Test build #66020 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66020/consoleFull)** for PR 15275 at commit [`858a3ec`](https://github.com/apache/spark/commit/858a3ec54f78b4865c19ec46aebc9320b36f02b5). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15275: [SPARK-17702][SQL] Code generation including too many mu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15275 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66020/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15273: [SPARK-17673] [SQL] Incorrect exchange reuse with RowDat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15273 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66018/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15273: [SPARK-17673] [SQL] Incorrect exchange reuse with RowDat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15273 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15273: [SPARK-17673] [SQL] Incorrect exchange reuse with RowDat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15273 **[Test build #66018 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66018/consoleFull)** for PR 15273 at commit [`e2dfb25`](https://github.com/apache/spark/commit/e2dfb25aeef021788208a0f05420d211916d2db5). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15273: [SPARK-17673] [SQL] Incorrect exchange reuse with RowDat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15273 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66017/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15273: [SPARK-17673] [SQL] Incorrect exchange reuse with RowDat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15273 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15273: [SPARK-17673] [SQL] Incorrect exchange reuse with RowDat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15273 **[Test build #66017 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66017/consoleFull)** for PR 15273 at commit [`f0df3a0`](https://github.com/apache/spark/commit/f0df3a005be301ac627beb53a233ece06269ced8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelationProvid...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15263 Good idea. Yes I will. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelationProvid...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15263 Regarding the JDBC options, which properties are used for the table reading? Which ones are used for writing? Can you please improve the document and make users easily find the answer? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15258: [SPARK-17689][SQL][STREAMING][WIP] added excludeFiles op...
Github user ScrapCodes commented on the issue: https://github.com/apache/spark/pull/15258 @frreiss That is correct, I will look into it. Moving the PR to WIP, as at least I would like to document these situations. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15239: [SPARK-17665][SPARKR] Support options/mode all for read/...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/15239 re: test - I mean for each function we are adding the `...` param, that we have a test for calling it without the extra stuff, ie. one test with ``` write.json(df, jsonPath) ``` and one test with ``` write.json(df, jsonPath, compression = T) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15275: [SPARK-17702][SQL] Code generation including too many mu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15275 **[Test build #66020 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66020/consoleFull)** for PR 15275 at commit [`858a3ec`](https://github.com/apache/spark/commit/858a3ec54f78b4865c19ec46aebc9320b36f02b5). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15275: [SPARK-17702][SQL] Code generation including too ...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/15275 [SPARK-17702][SQL] Code generation including too many mutable states exceeds JVM size limit. ## What changes were proposed in this pull request? Code generation including too many mutable states exceeds JVM size limit to extract values from `references` into fields in the constructor. We should split the generated extractions in the constructor into smaller functions. ## How was this patch tested? I added some tests to check if the generated codes for the expressions exceed or not. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ueshin/apache-spark issues/SPARK-17702 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15275.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15275 commit 4c78ca91d0c0e096e5b0668f35a0d24457e36586 Author: Takuya UESHINDate: 2016-09-28T04:19:34Z Add a test to reproduce the issue. commit 858a3ec54f78b4865c19ec46aebc9320b36f02b5 Author: Takuya UESHIN Date: 2016-09-28T04:26:38Z Split wide constructor into blocks due to JVM code size limit. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15239: [SPARK-17665][SPARKR] Support options/mode all fo...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15239#discussion_r80841590 --- Diff: R/pkg/R/utils.R --- @@ -342,7 +342,8 @@ varargsToStrEnv <- function(...) { for (name in names(pairs)) { value <- pairs[[name]] if (!(is.logical(value) || is.numeric(value) || is.character(value) || is.null(value))) { - stop("value[", value, "] in key[", name, "] is not convertable to string.") + stop(paste0("Unsupported type for ", name, " : ", class(value), + ". Supported types are logical, numeric, character and null.")) --- End diff -- `NULL` instead of `null` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15239: [SPARK-17665][SPARKR] Support options/mode all fo...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15239#discussion_r80841566 --- Diff: R/pkg/R/SQLContext.R --- @@ -328,6 +328,7 @@ setMethod("toDF", signature(x = "RDD"), #' It goes through the entire dataset once to determine the schema. #' #' @param path Path of file to read. A vector of multiple paths is allowed. +#' @param ... additional external data source specific named properties. --- End diff -- this is odd - was it not complaining about this missing? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14646: [SPARK-17058] [build] Add maven snapshots-and-staging pr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14646 **[Test build #66019 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66019/consoleFull)** for PR 14646 at commit [`09d96be`](https://github.com/apache/spark/commit/09d96bec29c9ac9a58966c603c2c6b3b642eb9d2). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15239: [SPARK-17665][SPARKR] Support options/mode all fo...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15239#discussion_r80841401 --- Diff: R/pkg/R/DataFrame.R --- @@ -55,6 +55,19 @@ setMethod("initialize", "SparkDataFrame", function(.Object, sdf, isCached) { .Object }) +#' Set options/mode and then return the write object +#' @noRd +setWriteOptions <- function(write, path = NULL, mode = 'error', ...) { +options <- varargsToStrEnv(...) +if (!is.null(path)) { + options[["path"]] <- path +} +jmode <- convertToJSaveMode(mode) +write <- callJMethod(write, "mode", jmode) +write <- callJMethod(write, "options", options) +write --- End diff -- do you think if it make sense to have a generic write method? ie. include `write <- callJMethod(x@sdf, "write")` and `invisible(callJMethod(write, source, path))` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15239: [SPARK-17665][SPARKR] Support options/mode all fo...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15239#discussion_r80841143 --- Diff: R/pkg/R/utils.R --- @@ -334,6 +334,27 @@ varargsToEnv <- function(...) { env } +# Utility function to capture the varargs into environment object but all values are converted +# into string. +varargsToStrEnv <- function(...) { + pairs <- list(...) + env <- new.env() + for (name in names(pairs)) { +value <- pairs[[name]] +if (!(is.logical(value) || is.numeric(value) || is.character(value) || is.null(value))) { + stop("value[", value, "] in key[", name, "] is not convertable to string.") --- End diff -- I think "... logical, character, numeric and NULL" as these are the R names. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15273: [SPARK-17673] [SQL] Incorrect exchange reuse with RowDat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15273 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66014/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15273: [SPARK-17673] [SQL] Incorrect exchange reuse with RowDat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15273 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15232: [SPARK-17499][SPARKR][FOLLOWUP] Check null first ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15232 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15273: [SPARK-17673] [SQL] Incorrect exchange reuse with RowDat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15273 **[Test build #66014 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66014/consoleFull)** for PR 15273 at commit [`c0e08b8`](https://github.com/apache/spark/commit/c0e08b84c5d600789fbafa38692669022ae6973d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14818: [SPARK-17157][SPARKR][WIP]: Add multiclass logistic regr...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/14818 make sense, I think then we should have a separate function --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15247: [SPARK-17672] Spark 2.0 history server web Ui takes too ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15247 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66013/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15247: [SPARK-17672] Spark 2.0 history server web Ui takes too ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15247 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15247: [SPARK-17672] Spark 2.0 history server web Ui takes too ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15247 **[Test build #66013 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66013/consoleFull)** for PR 15247 at commit [`7e437a1`](https://github.com/apache/spark/commit/7e437a1abf9b2a9461400eecea375ae6cd6043b7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15272: [SPARK-17698] [SQL] Join predicates should not contain f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15272 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15272: [SPARK-17698] [SQL] Join predicates should not contain f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15272 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66010/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15272: [SPARK-17698] [SQL] Join predicates should not contain f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15272 **[Test build #66010 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66010/consoleFull)** for PR 15272 at commit [`b65c926`](https://github.com/apache/spark/commit/b65c9263da58e703785356bc5c3a64d3e3ecbc0e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15258: [SPARK-17689][SQL][STREAMING] added excludeFiles option ...
Github user frreiss commented on the issue: https://github.com/apache/spark/pull/15258 This change allows FileInputStream to consume partial outputs of a system such as Hadoop or another copy of Spark, provided that the system adheres rigidly to the write policy of recent versions of Hadoop. That is: First, write to a temporary file. Then close and flush the temporary file. Then rename the temporary file, using one of the newer, atomic HDFS APIs for renaming files. I worry that users might write data in a subtly different way that does not follow this procedure 100%, which could result in Spark reading incorrect data every once in a while. I recommend documenting under exactly what conditions the "ignore temporary files" option guarantees correct behavior. Also, it would be a good idea to include a mode in which FileInputStream will ignore a directory of files until the special file _SUCCESS appears, indicating that the directory is complete. Otherwise, Spark could end up consuming partial results from failed jobs. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelationProvid...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15263 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66012/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelationProvid...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15263 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelationProvid...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15263 **[Test build #66012 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66012/consoleFull)** for PR 15263 at commit [`bbd7f25`](https://github.com/apache/spark/commit/bbd7f25eae18656a2c99471ab5350a3a50264aa2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelationProvid...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15263 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelationProvid...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15263 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66011/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelationProvid...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15263 **[Test build #66011 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66011/consoleFull)** for PR 15263 at commit [`e3c49ce`](https://github.com/apache/spark/commit/e3c49ce554afd5de2fb6e61469dc004f9f6098fa). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15258: [SPARK-17689][SQL][STREAMING] added excludeFiles ...
Github user frreiss commented on a diff in the pull request: https://github.com/apache/spark/pull/15258#discussion_r80838376 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala --- @@ -50,6 +50,19 @@ class ListingFileCatalog( refresh() + /** + * Often HDFS create temporary files while copying to a new directory or writing new content. + * These files are unintentionally picked up by streaming - causing job failures. This option lets + * HDFS skip these files matching the configured regex-patterns from being picked up by Streaming + * Job. + */ + private lazy val excludeFiles: Set[String] = parameters +.getOrElse("excludeFiles", ".*._COPYING_,_temporary").split(",").toSet + + private def isExcludedFile(path: Path): Boolean = { +excludeFiles.map(path.getName.matches).fold(false)(_ || _) --- End diff -- Probably better to precompile a regex here, as this method will be called a lot (100X per second X number of files under the root directory, by default) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15271: [SPARK-17666] Ensure that RecordReaders are closed by da...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15271 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15271: [SPARK-17666] Ensure that RecordReaders are closed by da...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15271 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66009/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15271: [SPARK-17666] Ensure that RecordReaders are closed by da...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15271 **[Test build #66009 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66009/consoleFull)** for PR 15271 at commit [`c0621db`](https://github.com/apache/spark/commit/c0621dbbd558e3715ea3b0c250913c1b94e34478). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class HadoopFileLinesReader(` * `class RecordReaderIterator[T](` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15273: [SPARK-17673] [SQL] Incorrect exchange reuse with RowDat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15273 **[Test build #66018 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66018/consoleFull)** for PR 15273 at commit [`e2dfb25`](https://github.com/apache/spark/commit/e2dfb25aeef021788208a0f05420d211916d2db5). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15213 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15213 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66006/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15213 **[Test build #66006 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66006/consoleFull)** for PR 15213 at commit [`f91d86f`](https://github.com/apache/spark/commit/f91d86f92a7070b0e8ed63773ecf1020975bc2fb). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15274: [SPARK-17699] Support for parsing JSON string col...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/15274#discussion_r80837055 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -467,3 +469,26 @@ case class JsonTuple(children: Seq[Expression]) } } +/** + * Converts an json input string to a [[StructType]] with the specified schema. + */ +case class JsonToStruct(schema: StructType, options: Map[String, String], child: Expression) --- End diff -- Ah, yes, it definitly should. Let me update. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15273: [SPARK-17673] [SQL] Incorrect exchange reuse with RowDat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15273 **[Test build #66017 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66017/consoleFull)** for PR 15273 at commit [`f0df3a0`](https://github.com/apache/spark/commit/f0df3a005be301ac627beb53a233ece06269ced8). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15267: [SPARK-17667] [YARN][WIP]Make locking fine grained in Ya...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15267 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15267: [SPARK-17667] [YARN][WIP]Make locking fine grained in Ya...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15267 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66015/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15267: [SPARK-17667] [YARN][WIP]Make locking fine grained in Ya...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15267 **[Test build #66015 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66015/consoleFull)** for PR 15267 at commit [`859718c`](https://github.com/apache/spark/commit/859718c95e3864855be98c5d09e74c652eda31b9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15274: [SPARK-17699] Support for parsing JSON string col...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/15274#discussion_r80836637 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -467,3 +469,26 @@ case class JsonTuple(children: Seq[Expression]) } } +/** + * Converts an json input string to a [[StructType]] with the specified schema. + */ +case class JsonToStruct(schema: StructType, options: Map[String, String], child: Expression) --- End diff -- Should this override `ExpectsInputTypes`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14638: [SPARK-11374][SQL] Support `skip.header.line.count` opti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14638 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14638: [SPARK-11374][SQL] Support `skip.header.line.count` opti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14638 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66008/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14638: [SPARK-11374][SQL] Support `skip.header.line.count` opti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14638 **[Test build #66008 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66008/consoleFull)** for PR 14638 at commit [`51a83cf`](https://github.com/apache/spark/commit/51a83cfaccf967ea5329c5609dc606c1527b364b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14897: [SPARK-17338][SQL] add global temp view
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14897 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66007/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14897: [SPARK-17338][SQL] add global temp view
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14897 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14897: [SPARK-17338][SQL] add global temp view
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14897 **[Test build #66007 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66007/consoleFull)** for PR 14897 at commit [`6183400`](https://github.com/apache/spark/commit/6183400275837f51a84b6259e7aa5e011d77c2ea). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15102: [SPARK-17346][SQL] Add Kafka source for Structured Strea...
Github user marmbrus commented on the issue: https://github.com/apache/spark/pull/15102 FYI: #15274 adds support for parsing JSON from the key/value into a Spark SQL `StructType` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15274: [SPARK-17699] Support for parsing JSON string columns
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15274 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66016/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15274: [SPARK-17699] Support for parsing JSON string columns
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15274 **[Test build #66016 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66016/consoleFull)** for PR 15274 at commit [`62f56a7`](https://github.com/apache/spark/commit/62f56a7e4529b35f58a229097b012bc984fd458f). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15274: [SPARK-17699] Support for parsing JSON string columns
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15274 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15274: [SPARK-17699] Support for parsing JSON string columns
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15274 **[Test build #66016 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66016/consoleFull)** for PR 15274 at commit [`62f56a7`](https://github.com/apache/spark/commit/62f56a7e4529b35f58a229097b012bc984fd458f). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15274: [SPARK-17699] Support for parsing JSON string col...
GitHub user marmbrus opened a pull request: https://github.com/apache/spark/pull/15274 [SPARK-17699] Support for parsing JSON string columns Spark SQL has great support for reading text files that contain JSON data. However, in many cases the JSON data is just one column amongst others. This is particularly true when reading from sources such as Kafka. This PR adds a new functions `from_json` that converts a string column into a nested `StructType` with a user specified schema. Example usage: ```scala val df = Seq("""{"a": 1}""").toDS() val schema = new StructType().add("a", IntegerType) df.select(from_json($"value", schema) as 'json) // => [json: ] ``` This PR adds support for java, scala and python. I leveraged our existing JSON parsing support by moving it into catalyst (so that we could define expressions using it). I left SQL out for now, because I'm not sure how users would specify a schema. You can merge this pull request into a Git repository by running: $ git pull https://github.com/marmbrus/spark jsonParser Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15274.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15274 commit 62f56a7e4529b35f58a229097b012bc984fd458f Author: Michael ArmbrustDate: 2016-09-28T02:49:22Z [SPARK-17699] Support for parsing JSON string columns --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15267: [SPARK-17667] [YARN][WIP]Make locking fine grained in Ya...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15267 **[Test build #66015 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66015/consoleFull)** for PR 15267 at commit [`859718c`](https://github.com/apache/spark/commit/859718c95e3864855be98c5d09e74c652eda31b9). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on the issue: https://github.com/apache/spark/pull/9 @yinxusen Looking good. I left a few small comments, and we should take care of the checking of initial model params in the read/write test now. After that, I think it will be ready to merge. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r80834369 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -300,15 +301,23 @@ private[ml] object DefaultParamsWriter { paramMap: Option[JValue] = None): String = { val uid = instance.uid val cls = instance.getClass.getName -val params = instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]] +val params = instance.extractParamMap().toSeq + .filter(_.param.name != "initialModel").asInstanceOf[Seq[ParamPair[Any]]] val jsonParams = paramMap.getOrElse(render(params.map { case ParamPair(p, v) => p.name -> parse(p.jsonEncode(v)) }.toList)) +// If the instance has an "initialModel" param and the param is defined, then the initial model +// will be saved along with the instance. +val initialModelFlag = + instance.hasParam("initialModel") && instance.isDefined(instance.getParam("initialModel")) val basicMetadata = ("class" -> cls) ~ ("timestamp" -> System.currentTimeMillis()) ~ ("sparkVersion" -> sc.version) ~ ("uid" -> uid) ~ - ("paramMap" -> jsonParams) + ("paramMap" -> jsonParams) ~ + // TODO: Figure out more robust way to detect the existing of the initialModel. --- End diff -- This test should probably be called "Initialize using a trained model" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r80835092 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -17,12 +17,19 @@ package org.apache.spark.ml.clustering +import scala.util.Random + import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.Model --- End diff -- unused. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r80835086 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -17,12 +17,19 @@ package org.apache.spark.ml.clustering +import scala.util.Random + import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.Model import org.apache.spark.ml.linalg.{Vector, Vectors} -import org.apache.spark.ml.util.DefaultReadWriteTest -import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans} +import org.apache.spark.ml.param.{ParamMap, ParamPair} +import org.apache.spark.ml.util.{DefaultReadWriteTest, Identifiable} +import org.apache.spark.ml.util.TestingUtils._ +import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, KMeansModel => MLlibKMeansModel} +import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors} import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} +import org.apache.spark.sql.types.StructType --- End diff -- unused. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r80835233 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -17,12 +17,19 @@ package org.apache.spark.ml.clustering +import scala.util.Random + import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.Model import org.apache.spark.ml.linalg.{Vector, Vectors} -import org.apache.spark.ml.util.DefaultReadWriteTest -import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans} +import org.apache.spark.ml.param.{ParamMap, ParamPair} +import org.apache.spark.ml.util.{DefaultReadWriteTest, Identifiable} --- End diff -- unused. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15267: [SPARK-17667] [YARN][WIP]Make locking fine grained in Ya...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/15267 ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15189: [SPARK-17549][sql] Coalesce cached relation stats...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/15189#discussion_r80835147 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala --- @@ -232,4 +232,29 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext { val columnTypes2 = List.fill(length2)(IntegerType) val columnarIterator2 = GenerateColumnAccessor.generate(columnTypes2) } + + test("SPARK-17549: cached table size should be correctly calculated") { +val data = spark.sparkContext.parallelize(1 to 10, 5).map { i => (i, i.toLong) } + .toDF("col1", "col2") +val plan = spark.sessionState.executePlan(data.logicalPlan).sparkPlan +val cached = InMemoryRelation(true, 5, MEMORY_ONLY, plan, None) + +// Materialize the data. +val expectedAnswer = data.collect() +checkAnswer(cached, expectedAnswer) + +// Check that the right size was calculated. +val expectedColSizes = expectedAnswer.size * (INT.defaultSize + LONG.defaultSize) +assert(cached.statistics.sizeInBytes === expectedColSizes) + +// Create a projection of the cached data and make sure the statistics are correct. +val projected = cached.withOutput(Seq(plan.output.last)) +assert(projected.statistics.sizeInBytes === expectedAnswer.size * LONG.defaultSize) --- End diff -- Thanks for confirming. So we should be fine with the previous patch. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15248: [SPARK-17671] Spark 2.0 history server summary page is s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15248 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66003/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15248: [SPARK-17671] Spark 2.0 history server summary page is s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15248 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15248: [SPARK-17671] Spark 2.0 history server summary page is s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15248 **[Test build #66003 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66003/consoleFull)** for PR 15248 at commit [`6c84507`](https://github.com/apache/spark/commit/6c84507fc60fb021af3593a7b87ad6e9fb10e752). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15189: [SPARK-17549][sql] Coalesce cached relation stats...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/15189#discussion_r80834862 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala --- @@ -232,4 +232,29 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext { val columnTypes2 = List.fill(length2)(IntegerType) val columnarIterator2 = GenerateColumnAccessor.generate(columnTypes2) } + + test("SPARK-17549: cached table size should be correctly calculated") { +val data = spark.sparkContext.parallelize(1 to 10, 5).map { i => (i, i.toLong) } + .toDF("col1", "col2") +val plan = spark.sessionState.executePlan(data.logicalPlan).sparkPlan +val cached = InMemoryRelation(true, 5, MEMORY_ONLY, plan, None) + +// Materialize the data. +val expectedAnswer = data.collect() +checkAnswer(cached, expectedAnswer) + +// Check that the right size was calculated. +val expectedColSizes = expectedAnswer.size * (INT.defaultSize + LONG.defaultSize) +assert(cached.statistics.sizeInBytes === expectedColSizes) + +// Create a projection of the cached data and make sure the statistics are correct. +val projected = cached.withOutput(Seq(plan.output.last)) +assert(projected.statistics.sizeInBytes === expectedAnswer.size * LONG.defaultSize) --- End diff -- I double checked the code. The `output` of an `InMemoryRelation` always represent the materialized dataset. So, it should not be a set of the underlying dataset's column set. When we scan this relation in `InMemoryTableScanExec`, we will push the selection to the scan. So, even we use `withOutput` in `CacheManager`'s `useCachedData`, we should be fine to still use the original stats because we are not changing the dataset. If you look at the implementation of this method ``` def useCachedData(plan: LogicalPlan): LogicalPlan = { plan transformDown { case currentFragment => lookupCachedData(currentFragment) .map(_.cachedRepresentation.withOutput(currentFragment.output)) .getOrElse(currentFragment) } } ``` `lookupCachedData` is implemented using `sameResult`. So, we are just applying a equivalent output (attributes in this output list may have cosmetic variations but they should be equivalent to the original attributes of this dataset). Although we may have different outputs, they are still representing the same dataset. So, seems it is fine if they have the same accumulator. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org