[GitHub] spark issue #16828: [SPARK-19484][SQL]continue work to create hive table wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16828 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16828: [SPARK-19484][SQL]continue work to create hive table wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16828 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72488/
[GitHub] spark issue #16828: [SPARK-19484][SQL]continue work to create hive table wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16828 **[Test build #72488 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72488/testReport)** for PR 16828 at commit [`40f2896`](https://github.com/apache/spark/commit/40f28965f48c2de2cb02eb5a26a52a4ca27bda5b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/16750#discussion_r99760937 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -298,6 +299,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * `timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSZZ`): sets the string that * indicates a timestamp format. Custom date formats follow the formats at * `java.text.SimpleDateFormat`. This applies to timestamp type. + * `timeZone` (default session local timezone): sets the string that indicates a timezone --- End diff -- I'd like to use `timeZone` for the option key, the same as `spark.sql.session.timeZone` for the config key for the session local timezone. What do you think?
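The timezone sensitivity that motivates a per-session `timeZone` option can be sketched with plain `java.text.SimpleDateFormat` (the class the option docs point to for format syntax); `parseAt` is a hypothetical helper, not Spark code:

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

// Parse the same timestamp string under two different timezones: the
// resulting epoch instant shifts by the zone offset, which is why the
// session-local timezone must reach the CSV/JSON parsers.
def parseAt(zoneId: String, ts: String): Long = {
  val fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
  fmt.setTimeZone(TimeZone.getTimeZone(zoneId))
  fmt.parse(ts).getTime
}

val utcMillis = parseAt("UTC", "2017-02-07T00:00:00")
val laMillis = parseAt("America/Los_Angeles", "2017-02-07T00:00:00")
// Los Angeles is UTC-8 in February, so the instants are 8 hours apart.
val diffHours = (laMillis - utcMillis) / (1000L * 60 * 60)
```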
[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/16750#discussion_r99760946 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala --- @@ -31,10 +31,11 @@ import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, CompressionCodecs * Most of these map directly to Jackson's internal options, specified in [[JsonParser.Feature]]. */ private[sql] class JSONOptions( -@transient private val parameters: CaseInsensitiveMap) +@transient private val parameters: CaseInsensitiveMap, defaultTimeZoneId: String) --- End diff -- I put the `timeZone` option every time a `JSONOptions` (or `CSVOptions`) was created, but the same contains-key check logic was repeated many times, as @HyukjinKwon mentioned. So I modified it to pass the default timezone id to `JSONOptions` and `CSVOptions`.
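The refactoring described here (centralizing the contains-key check by passing a default timezone id into the options class) can be sketched in plain Scala; `OptionsSketch` is a hypothetical stand-in for `JSONOptions`/`CSVOptions`, with a lower-casing map standing in for `CaseInsensitiveMap`:

```scala
import java.util.TimeZone

// The options class receives a default timezone id and falls back to it
// when the (case-insensitive) "timeZone" key is absent, so callers no
// longer repeat the contains-key check at every construction site.
class OptionsSketch(parameters: Map[String, String], defaultTimeZoneId: String) {
  private val lower = parameters.map { case (k, v) => (k.toLowerCase, v) }
  val timeZone: TimeZone =
    TimeZone.getTimeZone(lower.getOrElse("timezone", defaultTimeZoneId))
}

val withOption = new OptionsSketch(Map("timeZone" -> "UTC"), "America/Los_Angeles")
val withoutOption = new OptionsSketch(Map.empty, "America/Los_Angeles")
```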
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16740 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72487/
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16740 Merged build finished. Test PASSed.
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16740 **[Test build #72487 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72487/testReport)** for PR 16740 at commit [`37c41aa`](https://github.com/apache/spark/commit/37c41aa598bd0c2e3cf9e42f217233498ffdac23). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16809: [SPARK-19463][SQL]refresh cache after the InsertIntoHado...
Github user windpiger commented on the issue: https://github.com/apache/spark/pull/16809 thanks a lot! It seems that adding a REFRESH command keeps the default behavior unmodified: if users want to refresh, they call the command manually. @gatorsmile @sameeragarwal @hvanhovell @davies shall we rethink the default behavior? Is it more reasonable to refresh automatically after an insert, or to require a manual refresh?
[GitHub] spark pull request #16800: [SPARK-19456][SparkR]:Add LinearSVC R API
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/16800#discussion_r99755967 --- Diff: R/pkg/inst/tests/testthat/test_mllib_classification.R --- @@ -27,6 +27,44 @@ absoluteSparkPath <- function(x) { file.path(sparkHome, x) } +test_that("spark.linearSvc", { + df <- suppressWarnings(createDataFrame(iris)) + training <- df[df$Species %in% c("versicolor", "virginica"), ] + model <- spark.linearSvc(training, Species ~ ., regParam = 0.01, maxIter = 10) + summary <- summary(model) + expect_equal(summary$coefficients, list(-0.1563083, -0.460648, 0.2276626, 1.055085), + tolerance = 1e-2) + expect_equal(summary$intercept, -0.06004978, tolerance = 1e-2) + + # Test prediction with string label + prediction <- predict(model, training) + expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), "character") + expected <- c("versicolor", "versicolor", "versicolor", "virginica", "virginica", +"virginica", "virginica", "virginica", "virginica", "virginica") + expect_equal(sort(as.list(take(select(prediction, "prediction"), 10))[[1]]), expected) --- End diff -- Add `sort` to make it stable.
[GitHub] spark pull request #16800: [SPARK-19456][SparkR]:Add LinearSVC R API
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/16800#discussion_r99755912 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LinearSVCWrapper.scala --- @@ -0,0 +1,149 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.r + +import org.apache.hadoop.fs.Path +import org.json4s._ +import org.json4s.JsonDSL._ +import org.json4s.jackson.JsonMethods._ +import org.apache.spark.ml.{Pipeline, PipelineModel} +import org.apache.spark.ml.classification.{LinearSVC, LinearSVCModel} +import org.apache.spark.ml.feature.{IndexToString, RFormula} +import org.apache.spark.ml.r.RWrapperUtils._ +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} + +private[r] class LinearSVCWrapper private ( +val pipeline: PipelineModel, +val features: Array[String], +val labels: Array[String]) extends MLWritable { + import LinearSVCWrapper._ + + private val svcModel: LinearSVCModel = +pipeline.stages(1).asInstanceOf[LinearSVCModel] --- End diff -- The last stage is id_to_index. So I need to use stages(1) to get the fitted model.
[GitHub] spark issue #16800: [SPARK-19456][SparkR][WIP]:Add LinearSVC R API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16800 **[Test build #72492 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72492/testReport)** for PR 16800 at commit [`e6eea1d`](https://github.com/apache/spark/commit/e6eea1df1acd3d01ea3c989a6cfe0823a4f99fb2).
[GitHub] spark pull request #16800: [SPARK-19456][SparkR][WIP]:Add LinearSVC R API
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/16800#discussion_r99755781 --- Diff: R/pkg/R/generics.R --- @@ -1376,6 +1376,10 @@ setGeneric("spark.kstest", function(data, ...) { standardGeneric("spark.kstest") #' @export setGeneric("spark.lda", function(data, ...) { standardGeneric("spark.lda") }) +#' @rdname spark.linearSvc +#' @export +setGeneric("spark.linearSvc", function(data, formula, ...) { standardGeneric("spark.linearSvc") }) --- End diff -- `svmLinear` looks fine. I will change the files tomorrow. Thanks!
[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...
Github user budde commented on the issue: https://github.com/apache/spark/pull/16797 > BTW, what behavior do we expect if a parquet file has two columns whose lower-cased names are identical? I can take a look at how Spark handled this prior to 2.1, although I'm not sure if the behavior we'll see there was the result of a conscious decision or "undefined" behavior.
[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...
Github user budde commented on the issue: https://github.com/apache/spark/pull/16797 > how about we add a new SQL command to refresh the table schema in metastore by inferring schema with data files? This is a compatibility issue and we should have provided a way for users to migrate, before the 2.1 release. I think this approach is much simpler than adding a flag. While I think introducing a command for inferring and storing the table's case-sensitive schema as a property would be a welcome addition, I think requiring this property to be there in order for Spark SQL to function properly with case-sensitive data files could really restrict the settings Spark SQL can be used in. If a user wanted to use Spark SQL to query over an existing warehouse containing hundreds or even thousands of tables, under the suggested approach a Spark job would have to be run to infer the schema of each and every table. While file formats such as Parquet store their schemas as metadata, there still could potentially be millions of files to inspect for the warehouse. A less amenable format like JSON might require scanning all the data in the warehouse. This also doesn't cover the use case @viirya pointed out where the user may not have write access to the metastore they are querying against. In this case, the user would have to rely on the warehouse administrator to create the Spark schema property for every table they wish to query. > For tables created by hive, as hive is a case-insensitive system, will the parquet files have mixed-case schema? I think the Hive Metastore has become a bit of an open standard for maintaining a data warehouse catalog since so many tools integrate with it. I wouldn't assume that the underlying data pointed to by an external metastore was created or managed by Hive itself. For example, we maintain a Hive Metastore that catalogs case-sensitive files written by our Spark-based ETL pipeline, which parses case classes from string data and writes them as Parquet.
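The lower-casing ambiguity under discussion can be made concrete with a tiny collision check over column names (a hypothetical sketch, not Spark's actual resolution logic):

```scala
// Two Parquet columns whose lower-cased names collide cannot be resolved
// by a case-insensitive lookup: "userid" maps to two distinct source
// columns, so any query touching that name is ambiguous.
val parquetFields = Seq("userId", "userid", "name")

val collisions = parquetFields
  .groupBy(_.toLowerCase)
  .collect { case (lower, names) if names.size > 1 => lower -> names }
```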
[GitHub] spark issue #16376: [SPARK-18967][SCHEDULER] compute locality levels even if...
Github user kayousterhout commented on the issue: https://github.com/apache/spark/pull/16376 Awesome, always enthusiastic about fixing minor nits!! I merged this into master. I didn't merge it into 2.1 but I don't feel strongly about it.
[GitHub] spark pull request #16376: [SPARK-18967][SCHEDULER] compute locality levels ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16376
[GitHub] spark pull request #16476: [SPARK-19084][SQL] Implement expression field
Github user gczsjdy commented on a diff in the pull request: https://github.com/apache/spark/pull/16476#discussion_r99753353 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala --- @@ -20,8 +20,12 @@ package org.apache.spark.sql.catalyst.expressions import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.analysis.TypeCheckResult import org.apache.spark.sql.catalyst.expressions.codegen._ +import org.apache.spark.sql.catalyst.util.TypeUtils import org.apache.spark.sql.types._ +import scala.annotation.tailrec --- End diff -- Sorry, it's my fault. I will run dev/scalastyle first before I push my changes next time.
[GitHub] spark issue #16497: [SPARK-19118] [SQL] Percentile support for frequency dis...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16497 **[Test build #72491 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72491/testReport)** for PR 16497 at commit [`58e7a63`](https://github.com/apache/spark/commit/58e7a635d2d138fe6809f6bb6898322e1ea956e0).
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r99753046 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala --- @@ -125,10 +139,17 @@ case class Percentile( buffer: OpenHashMap[Number, Long], input: InternalRow): OpenHashMap[Number, Long] = { val key = child.eval(input).asInstanceOf[Number] +val frqValue = frequencyExpression.eval(input) // Null values are ignored in counts map. -if (key != null) { - buffer.changeValue(key, 1L, _ + 1L) +if (key != null && frqValue != null) { + val frqLong = frqValue.asInstanceOf[Number].longValue() + // add only when frequency is positive + if (frqLong > 0) { +buffer.changeValue(key, frqLong, _ + frqLong) + } else if ( frqLong < 0 ) { --- End diff -- Done
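The frequency-weighted update rule in the diff above can be sketched outside Spark with a plain mutable map standing in for `OpenHashMap` (hypothetical `update` helper; same null/positive/negative/zero handling):

```scala
import scala.collection.mutable

// Null keys and null frequencies are skipped, positive frequencies
// accumulate into the counts map, zero contributes nothing, and a
// negative frequency is rejected outright.
def update(buffer: mutable.Map[Int, Long], key: Option[Int], frq: Option[Long]): Unit = {
  for (k <- key; f <- frq) {
    if (f > 0) {
      buffer(k) = buffer.getOrElse(k, 0L) + f
    } else if (f < 0) {
      throw new IllegalArgumentException(s"Negative frequency $f for key $k")
    } // f == 0: ignored
  }
}

val buffer = mutable.Map.empty[Int, Long]
update(buffer, Some(10), Some(3L))
update(buffer, Some(10), Some(2L))
update(buffer, None, Some(5L)) // null key: ignored

val negativeRejected =
  try { update(buffer, Some(7), Some(-1L)); false }
  catch { case _: IllegalArgumentException => true }
```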
[GitHub] spark issue #16758: [SPARK-19413][SS] MapGroupsWithState for arbitrary state...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/16758 I addressed all the comments. However, @zsxwing, our offline discussion of throwing an error on `.update(null)` ran into a problem. Since it's typed as S, the behavior is odd when S is a primitive type. See the failing test. When the type is Int, `get` returns 0 when the state does not exist. That's very non-intuitive.
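The primitive-type pitfall described here is visible in plain Scala: casting a stored `null` to a generic type instantiated with `Int` unboxes to the default value 0, so "state absent" is indistinguishable from "state == 0" (a sketch of the type-system issue, not the actual state store API):

```scala
// Generic accessor standing in for a typed state getter: the stored value
// is erased to Any, and the cast back to S happens at the call site.
def getState[S](stored: Any): S = stored.asInstanceOf[S]

// Unboxing null to a primitive Int silently yields 0 rather than failing,
// so an absent state and a genuinely stored zero look identical.
val absent: Int = getState[Int](null)
val present: Int = getState[Int](0)
```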
[GitHub] spark issue #16758: [SPARK-19413][SS] MapGroupsWithState for arbitrary state...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16758 **[Test build #72490 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72490/testReport)** for PR 16758 at commit [`b32dcd1`](https://github.com/apache/spark/commit/b32dcd170a55c655c292d2cc535893c02a86c296).
[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16797 The proposal to restore schema inference with finer grained control on when it is performed sounds reasonable to me. The case I'm most interested in is turning off schema inference entirely, because we do not use parquet files with upper-case characters in their column names. BTW, what behavior do we expect if a parquet file has two columns whose lower-cased names are identical?
[GitHub] spark issue #16827: [SPARK-19482][CORE] Fail it if 'spark.master' is set wit...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/16827 Working on UT failure.
[GitHub] spark issue #16827: [SPARK-19482][CORE] Fail it if 'spark.master' is set wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16827 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72486/
[GitHub] spark issue #16827: [SPARK-19482][CORE] Fail it if 'spark.master' is set wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16827 Merged build finished. Test FAILed.
[GitHub] spark issue #16827: [SPARK-19482][CORE] Fail it if 'spark.master' is set wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16827 **[Test build #72486 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72486/testReport)** for PR 16827 at commit [`8198db3`](https://github.com/apache/spark/commit/8198db36574a1eb122c2d2303058b9424006f62e). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16828: [SPARK-19484][SQL]continue work to create hive table wit...
Github user windpiger commented on the issue: https://github.com/apache/spark/pull/16828 cc @gatorsmile @cloud-fan
[GitHub] spark issue #16376: [SPARK-18967][SCHEDULER] compute locality levels even if...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16376 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72484/
[GitHub] spark issue #16476: [SPARK-19084][SQL] Implement expression field
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16476 **[Test build #72489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72489/testReport)** for PR 16476 at commit [`a8d22e3`](https://github.com/apache/spark/commit/a8d22e382e07b1bc6acccded13648ab0d22ed255).
[GitHub] spark issue #16376: [SPARK-18967][SCHEDULER] compute locality levels even if...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16376 Merged build finished. Test PASSed.
[GitHub] spark issue #16828: [SPARK-19484][SQL]continue work to create hive table wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16828 **[Test build #72488 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72488/testReport)** for PR 16828 at commit [`40f2896`](https://github.com/apache/spark/commit/40f28965f48c2de2cb02eb5a26a52a4ca27bda5b).
[GitHub] spark issue #16376: [SPARK-18967][SCHEDULER] compute locality levels even if...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16376 **[Test build #72484 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72484/testReport)** for PR 16376 at commit [`cf8271e`](https://github.com/apache/spark/commit/cf8271e4c69efa573aabf891d5465736c7c79184). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16828: [SPARK-19484][SQL]continue work to create hive ta...
GitHub user windpiger opened a pull request: https://github.com/apache/spark/pull/16828 [SPARK-19484][SQL]continue work to create hive table with an empty schema ## What changes were proposed in this pull request? After [SPARK-19279](https://issues.apache.org/jira/browse/SPARK-19279), we can no longer create a Hive table with an empty schema, so we should tighten the condition used when creating a hive table in https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L835 That is, if a CatalogTable `t` has an empty schema and there is no `spark.sql.schema.numParts` property (or its value is 0), we should not add a default `col` schema; if we did, a table with an empty schema would be created, which is not what we expect. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/windpiger/spark improveCreateTableWithEmptySchema Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16828.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16828 commit 40f28965f48c2de2cb02eb5a26a52a4ca27bda5b Author: windpiger Date: 2017-02-07T06:00:46Z [SPARK-19484][SQL]continue work to create hive table with an empty schema
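The condition described above can be sketched as follows. This is a hypothetical simplification, not the actual `HiveClientImpl` code; the helper name `shouldAddDefaultColSchema` and its parameters are illustrative only:

```scala
// Hypothetical sketch of the tightened check (not the actual Spark source).
// A placeholder `col` schema should only be added when the table properties
// actually carry a serialized Spark schema, i.e. `spark.sql.schema.numParts` > 0.
def shouldAddDefaultColSchema(schemaIsEmpty: Boolean, numPartsProp: Option[String]): Boolean = {
  val numParts = numPartsProp.flatMap(s => scala.util.Try(s.toInt).toOption).getOrElse(0)
  // If the schema is empty and no schema parts are stored, adding a default
  // `col` column would silently create a table with an effectively empty schema.
  schemaIsEmpty && numParts > 0
}
```

Under this sketch, a table with an empty schema and no `spark.sql.schema.numParts` property gets no placeholder column, which is the behavior the PR argues for.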
[GitHub] spark issue #16819: [SPARK-16441][YARN] Set maxNumExecutor depends on yarn c...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/16819 It reduces the number of calls to [CoarseGrainedSchedulerBackend.requestTotalExecutors()](https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L491) after applying this PR: [before](https://github.com/apache/spark/files/756685/doNotApply-PR-16819.txt) [after-apply-this-PR](https://github.com/apache/spark/files/756684/apply-PR-16819.txt) The full log can be found [here](https://issues.apache.org/jira/secure/attachment/12851318/SPARK-16441-compare-apply-PR-16819.zip).
[GitHub] spark pull request #16762: [SPARK-19419] [SPARK-19420] Fix the cross join de...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16762#discussion_r99749089 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala --- @@ -213,7 +213,12 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] { } // This join could be very slow or OOM joins.BroadcastNestedLoopJoinExec( --- End diff -- when we broadcast, is it still a cross join?
[GitHub] spark pull request #16758: [SPARK-19413][SS] MapGroupsWithState for arbitrar...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16758#discussion_r99749006 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/KeyedStateImpl.scala --- @@ -0,0 +1,57 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.streaming + +import org.apache.spark.sql.KeyedState + +/** Internal implementation of the [[KeyedState]] interface */ --- End diff -- I have mentioned that the trait KeyedState is not thread-safe
[GitHub] spark pull request #16758: [SPARK-19413][SS] MapGroupsWithState for arbitrar...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16758#discussion_r99749020 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala --- @@ -313,6 +313,56 @@ case class MapGroups( outputObjAttr: Attribute, child: LogicalPlan) extends UnaryNode with ObjectProducer +/** Internal class representing State */ +trait LogicalKeyedState[S] + +/** Factory for constructing new `MapGroupsWithState` nodes. */ +object MapGroupsWithState { + def apply[K: Encoder, V: Encoder, S: Encoder, U: Encoder]( + func: (Any, Iterator[Any], LogicalKeyedState[Any]) => Iterator[Any], + groupingAttributes: Seq[Attribute], + dataAttributes: Seq[Attribute], + child: LogicalPlan): LogicalPlan = { + val mapped = new MapGroupsWithState( + func, + UnresolvedDeserializer(encoderFor[K].deserializer, groupingAttributes), + UnresolvedDeserializer(encoderFor[V].deserializer, dataAttributes), + groupingAttributes, + dataAttributes, + CatalystSerde.generateObjAttr[U], + encoderFor[S].resolveAndBind().deserializer, + encoderFor[S].namedExpressions, + child) + CatalystSerde.serialize[U](mapped) + } +} + +/** + * Applies func to each unique group in `child`, based on the evaluation of `groupingAttributes`, + * while using state data. + * Func is invoked with an object representation of the grouping key and an iterator containing the + * object representation of all the rows with that key. + * + * @param keyDeserializer used to extract the key object for each group. + * @param valueDeserializer used to extract the items in the iterator from an input row. 
+ * @param groupingAttributes used to group the data + * @param dataAttributes used to read the data + * @param outputObjAttr used to define the output object + * @param stateDeserializer used to deserialize state before calling `func` + * @param stateSerializer used to serialize updated state after calling `func` + */ +case class MapGroupsWithState( + func: (Any, Iterator[Any], LogicalKeyedState[Any]) => Iterator[Any], + keyDeserializer: Expression, + valueDeserializer: Expression, + groupingAttributes: Seq[Attribute], + dataAttributes: Seq[Attribute], + outputObjAttr: Attribute, + stateDeserializer: Expression, + stateSerializer: Seq[NamedExpression], + child: LogicalPlan) extends UnaryNode with ObjectProducer + + --- End diff -- removed
[GitHub] spark pull request #16758: [SPARK-19413][SS] MapGroupsWithState for arbitrar...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16758#discussion_r99748948 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/KeyedState.scala --- @@ -0,0 +1,134 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.annotation.{Experimental, InterfaceStability} +import org.apache.spark.sql.catalyst.plans.logical.LogicalKeyedState + +/** + * :: Experimental :: + * + * Wrapper class for interacting with keyed state data in `mapGroupsWithState` and + * `flatMapGroupsWithState` operations on + * [[KeyValueGroupedDataset]]. + * + * Detail description on `[map/flatMap]GroupsWithState` operation + * + * Both, `mapGroupsWithState` and `flatMapGroupsWithState` in [[KeyValueGroupedDataset]] + * will invoke the user-given function on each group (defined by the grouping function in + * `Dataset.groupByKey()`) while maintaining user-defined per-group state between invocations. + * For a static batch Dataset, the function will be invoked once per group. For a streaming + * Dataset, the function will be invoked for each group repeatedly in every trigger. 
+ * That is, in every batch of the [[streaming.StreamingQuery StreamingQuery]], + * the function will be invoked once for each group that has data in the batch. + * + * The function is invoked with the following parameters. + * - The key of the group. + * - An iterator containing all the values for this key. + * - A user-defined state object set by previous invocations of the given function. + * In case of a batch Dataset, there is only one invocation and the state object will be empty as + * there is no prior state. Essentially, for batch Datasets, `[map/flatMap]GroupsWithState` + * is equivalent to `[map/flatMap]Groups`. + * + * Important points to note about the function. + * - In a trigger, the function will be called only for the groups present in the batch. So do not + * assume that the function will be called in every trigger for every group that has state. + * - There is no guaranteed ordering of values in the iterator in the function, neither with + * batch, nor with streaming Datasets. + * - All the data will be shuffled before applying the function. + * + * Important points to note about using KeyedState. + * - The value of the state cannot be null. So updating state with null is the same as removing it. + * - Operations on `KeyedState` are not thread-safe. This is to avoid memory barriers. + * - If `remove()` is called, then `exists()` will return `false`, and --- End diff -- done.
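To make the contract quoted above concrete, here is a hedged sketch of a mapping function that follows the documented `KeyedState` semantics (`getOption`/`update` as described in this PR's scaladoc). The function name and the Dataset wiring are illustrative, and running it requires a Spark session:

```scala
import org.apache.spark.sql.KeyedState

// Running count per key, respecting the contract above: state is read with
// getOption (empty on the first invocation), never updated with null, and
// persisted for the next trigger with update().
def countPerKey(key: String, values: Iterator[Int], state: KeyedState[Long]): String = {
  val previous = state.getOption.getOrElse(0L)
  val updated = previous + values.size
  state.update(updated)  // updating with null would be treated as remove()
  s"$key -> $updated"
}

// Illustrative wiring (ds: Dataset[(String, Int)]):
// ds.groupByKey(_._1).mapGroupsWithState(countPerKey _)
```

On a static batch Dataset this sketch behaves like `mapGroups`, since there is no prior state to read.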
[GitHub] spark pull request #16762: [SPARK-19419] [SPARK-19420] Fix the cross join de...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16762#discussion_r99748912 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala --- @@ -339,6 +340,33 @@ case class BroadcastNestedLoopJoinExec( ) } + protected override def doPrepare(): Unit = { + if (!sqlContext.conf.crossJoinEnabled) { + joinType match { + case Cross => // Do nothing + case Inner => --- End diff -- can you add comments to say when we will hit this branch?
[GitHub] spark pull request #16758: [SPARK-19413][SS] MapGroupsWithState for arbitrar...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16758#discussion_r99748834 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala --- @@ -46,8 +46,13 @@ object UnsupportedOperationChecker { "Queries without streaming sources cannot be executed with writeStream.start()")(plan) } +/** Collect all the streaming aggregates in a sub plan */ +def collectStreamingAggregates(subplan: LogicalPlan): Seq[Aggregate] = { + subplan.collect { case a @ Aggregate(_, _, _) if a.isStreaming => a } --- End diff -- done.
[GitHub] spark pull request #16762: [SPARK-19419] [SPARK-19420] Fix the cross join de...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16762#discussion_r99748730 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala --- @@ -339,6 +340,33 @@ case class BroadcastNestedLoopJoinExec( ) } + protected override def doPrepare(): Unit = { + if (!sqlContext.conf.crossJoinEnabled) { + joinType match { + case Cross => // Do nothing + case Inner => + if (condition.isEmpty) { + throw new AnalysisException( + s"""Detected cartesian product for INNER join between logical plans + |${left.treeString(false).trim} + |and + |${right.treeString(false).trim} + |Join condition is missing or trivial. + |Use the CROSS JOIN syntax to allow cartesian products between these relations. + """.stripMargin) + } + case _ if !withinBroadcastThreshold || condition.isEmpty => + throw new AnalysisException( + s"""Both sides of this join are outside the broadcasting threshold and + |computing it could be prohibitively expensive. To explicitly enable it, + |Please set ${SQLConf.CROSS_JOINS_ENABLED.key} = true. --- End diff -- shall we support `CROSS OUTER JOIN`, `CROSS LEFT JOIN`, etc?
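For context on the error message being discussed: a user who genuinely wants the cartesian product can opt in either per join or globally. A minimal sketch against the Spark 2.x API (`Dataset.crossJoin` and the `spark.sql.crossJoin.enabled` configuration both exist there; the data and session setup are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("cross-join").getOrCreate()
import spark.implicits._

val left  = Seq(1, 2).toDF("a")
val right = Seq("x", "y").toDF("b")

// Explicit opt-in per join: never trips the cartesian-product check.
val explicit = left.crossJoin(right)  // 2 x 2 = 4 rows

// Global opt-in: allows implicit cartesian products such as a condition-less inner join.
spark.conf.set("spark.sql.crossJoin.enabled", "true")
val implicitCross = left.join(right)
```

The review question above asks whether the SQL parser should additionally accept spellings like `CROSS OUTER JOIN`; the sketch only covers the forms known to exist.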
[GitHub] spark pull request #16758: [SPARK-19413][SS] MapGroupsWithState for arbitrar...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16758#discussion_r99748711 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala --- @@ -235,3 +234,79 @@ case class StateStoreSaveExec( override def outputPartitioning: Partitioning = child.outputPartitioning } + + +/** Physical operator for executing streaming mapGroupsWithState. */ +case class MapGroupsWithStateExec( +func: (Any, Iterator[Any], LogicalKeyedState[Any]) => Iterator[Any], +keyDeserializer: Expression, +valueDeserializer: Expression, +groupingAttributes: Seq[Attribute], +dataAttributes: Seq[Attribute], +outputObjAttr: Attribute, +stateId: Option[OperatorStateId], +stateDeserializer: Expression, +stateSerializer: Seq[NamedExpression], +child: SparkPlan) extends UnaryExecNode with ObjectProducerExec with StateStoreWriter { + + override def outputPartitioning: Partitioning = child.outputPartitioning + + /** Distribute by grouping attributes */ + override def requiredChildDistribution: Seq[Distribution] = +ClusteredDistribution(groupingAttributes) :: Nil + + /** Ordering needed for using GroupingIterator */ + override def requiredChildOrdering: Seq[Seq[SortOrder]] = +Seq(groupingAttributes.map(SortOrder(_, Ascending))) + + override protected def doExecute(): RDD[InternalRow] = { +child.execute().mapPartitionsWithStateStore[InternalRow]( + getStateId.checkpointLocation, + operatorId = getStateId.operatorId, + storeVersion = getStateId.batchId, + groupingAttributes.toStructType, + child.output.toStructType, + sqlContext.sessionState, + Some(sqlContext.streams.stateStoreCoordinator)) { (store, iter) => +val numTotalStateRows = longMetric("numTotalStateRows") +val numUpdatedStateRows = longMetric("numUpdatedStateRows") +val numOutputRows = longMetric("numOutputRows") + +val groupedIter = GroupedIterator(iter, groupingAttributes, child.output) + +val getKeyObj = ObjectOperator.deserializeRowToObject(keyDeserializer, 
groupingAttributes) + val getKey = GenerateUnsafeProjection.generate(groupingAttributes, child.output) --- End diff -- done.
[GitHub] spark pull request #16758: [SPARK-19413][SS] MapGroupsWithState for arbitrar...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16758#discussion_r99748668 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala --- @@ -58,6 +58,8 @@ trait StateStore { */ def remove(condition: UnsafeRow => Boolean): Unit + def remove(key: UnsafeRow): Unit --- End diff -- done
[GitHub] spark pull request #16758: [SPARK-19413][SS] MapGroupsWithState for arbitrar...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16758#discussion_r99748496 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/KeyedState.scala --- @@ -0,0 +1,134 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.annotation.{Experimental, InterfaceStability} +import org.apache.spark.sql.catalyst.plans.logical.LogicalKeyedState + +/** + * :: Experimental :: + * + * Wrapper class for interacting with keyed state data in `mapGroupsWithState` and + * `flatMapGroupsWithState` operations on + * [[KeyValueGroupedDataset]]. + * + * Detail description on `[map/flatMap]GroupsWithState` operation + * + * Both, `mapGroupsWithState` and `flatMapGroupsWithState` in [[KeyValueGroupedDataset]] + * will invoke the user-given function on each group (defined by the grouping function in + * `Dataset.groupByKey()`) while maintaining user-defined per-group state between invocations. + * For a static batch Dataset, the function will be invoked once per group. For a streaming + * Dataset, the function will be invoked for each group repeatedly in every trigger. 
+ * That is, in every batch of the [[streaming.StreamingQuery StreamingQuery]], + * the function will be invoked once for each group that has data in the batch. + * + * The function is invoked with the following parameters. + * - The key of the group. + * - An iterator containing all the values for this key. + * - A user-defined state object set by previous invocations of the given function. + * In case of a batch Dataset, there is only one invocation and the state object will be empty as + * there is no prior state. Essentially, for batch Datasets, `[map/flatMap]GroupsWithState` + * is equivalent to `[map/flatMap]Groups`. + * + * Important points to note about the function. + * - In a trigger, the function will be called only for the groups present in the batch. So do not + * assume that the function will be called in every trigger for every group that has state. + * - There is no guaranteed ordering of values in the iterator in the function, neither with + * batch, nor with streaming Datasets. + * - All the data will be shuffled before applying the function. + * + * Important points to note about using KeyedState. + * - The value of the state cannot be null. So updating state with null is the same as removing it. + * - Operations on `KeyedState` are not thread-safe. This is to avoid memory barriers. + * - If `remove()` is called, then `exists()` will return `false`, and + * `getOption()` will return `None`. + * - After `update(newState)` is called, `exists()` will return `true`, + * and `getOption()` will return `Some(...)`. + * + * Scala example of using `KeyedState` in `mapGroupsWithState`: + * {{{ + * // A mapping function that maintains an integer state for string keys and returns a string. + * def mappingFunction(key: String, value: Iterable[Int], state: KeyedState[Int]): Option[String] = { --- End diff -- done
[GitHub] spark pull request #16758: [SPARK-19413][SS] MapGroupsWithState for arbitrar...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16758#discussion_r99748491 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/KeyedState.scala --- @@ -0,0 +1,134 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.annotation.{Experimental, InterfaceStability} +import org.apache.spark.sql.catalyst.plans.logical.LogicalKeyedState + +/** + * :: Experimental :: + * + * Wrapper class for interacting with keyed state data in `mapGroupsWithState` and + * `flatMapGroupsWithState` operations on + * [[KeyValueGroupedDataset]]. + * + * Detail description on `[map/flatMap]GroupsWithState` operation + * + * Both, `mapGroupsWithState` and `flatMapGroupsWithState` in [[KeyValueGroupedDataset]] + * will invoke the user-given function on each group (defined by the grouping function in + * `Dataset.groupByKey()`) while maintaining user-defined per-group state between invocations. + * For a static batch Dataset, the function will be invoked once per group. For a streaming + * Dataset, the function will be invoked for each group repeatedly in every trigger. 
+ * That is, in every batch of the [[streaming.StreamingQuery StreamingQuery]], + * the function will be invoked once for each group that has data in the batch. + * + * The function is invoked with the following parameters. + * - The key of the group. + * - An iterator containing all the values for this key. + * - A user-defined state object set by previous invocations of the given function. + * In case of a batch Dataset, there is only one invocation and the state object will be empty as + * there is no prior state. Essentially, for batch Datasets, `[map/flatMap]GroupsWithState` + * is equivalent to `[map/flatMap]Groups`. + * + * Important points to note about the function. + * - In a trigger, the function will be called only for the groups present in the batch. So do not + * assume that the function will be called in every trigger for every group that has state. + * - There is no guaranteed ordering of values in the iterator in the function, neither with + * batch, nor with streaming Datasets. + * - All the data will be shuffled before applying the function. + * + * Important points to note about using KeyedState. + * - The value of the state cannot be null. So updating state with null is the same as removing it. + * - Operations on `KeyedState` are not thread-safe. This is to avoid memory barriers. + * - If `remove()` is called, then `exists()` will return `false`, and + * `getOption()` will return `None`. + * - After `update(newState)` is called, `exists()` will return `true`, + * and `getOption()` will return `Some(...)`. + * + * Scala example of using `KeyedState` in `mapGroupsWithState`: + * {{{ + * // A mapping function that maintains an integer state for string keys and returns a string. + * def mappingFunction(key: String, value: Iterable[Int], state: KeyedState[Int]): Option[String] = { + * // Check if state exists + * if (state.exists) { + * val existingState = state.get // Get the existing state + * val shouldRemove = ... 
// Decide whether to remove the state + * if (shouldRemove) { + * state.remove() // Remove the state + * } else { + * val newState = ... + * state.update(newState)// Set the new state + * } + * } else { + * val initialState = ... + * state.update(initialState) // Set the initial state + * } + * ... // return something + * } + * + * }}} + * + * Java example of using `KeyedState`: + * {{{ + * // A mapping function that maintains an integer state for string keys and returns a string. + * MapGroupsWithStateFunction
[GitHub] spark pull request #16758: [SPARK-19413][SS] MapGroupsWithState for arbitrar...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16758#discussion_r99748329 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/KeyedState.scala --- @@ -0,0 +1,134 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.annotation.{Experimental, InterfaceStability} +import org.apache.spark.sql.catalyst.plans.logical.LogicalKeyedState + +/** + * :: Experimental :: + * + * Wrapper class for interacting with keyed state data in `mapGroupsWithState` and + * `flatMapGroupsWithState` operations on + * [[KeyValueGroupedDataset]]. + * + * Detail description on `[map/flatMap]GroupsWithState` operation + * + * Both, `mapGroupsWithState` and `flatMapGroupsWithState` in [[KeyValueGroupedDataset]] + * will invoke the user-given function on each group (defined by the grouping function in + * `Dataset.groupByKey()`) while maintaining user-defined per-group state between invocations. + * For a static batch Dataset, the function will be invoked once per group. For a streaming + * Dataset, the function will be invoked for each group repeatedly in every trigger. 
+ * That is, in every batch of the [[streaming.StreamingQuery StreamingQuery]], + * the function will be invoked once for each group that has data in the batch. + * + * The function is invoked with the following parameters. + * - The key of the group. + * - An iterator containing all the values for this key. + * - A user-defined state object set by previous invocations of the given function. + * In case of a batch Dataset, there is only one invocation and the state object will be empty as --- End diff -- done
[GitHub] spark pull request #16758: [SPARK-19413][SS] MapGroupsWithState for arbitrar...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16758#discussion_r99748256 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala --- @@ -111,6 +111,25 @@ class UnsupportedOperationsSuite extends SparkFunSuite { outputMode = Complete, expectedMsgs = Seq("distinct aggregation")) + // MapGroupsWithState: Not supported after a streaming aggregation + val att = new AttributeReference(name = "a", dataType = LongType)() + assertSupportedInStreamingPlan( --- End diff -- done
[GitHub] spark pull request #16758: [SPARK-19413][SS] MapGroupsWithState for arbitrar...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16758#discussion_r99748308 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/KeyedState.scala --- @@ -0,0 +1,134 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.annotation.{Experimental, InterfaceStability} +import org.apache.spark.sql.catalyst.plans.logical.LogicalKeyedState + +/** + * :: Experimental :: + * + * Wrapper class for interacting with keyed state data in `mapGroupsWithState` and + * `flatMapGroupsWithState` operations on + * [[KeyValueGroupedDataset]]. + * + * Detail description on `[map/flatMap]GroupsWithState` operation + * + * Both, `mapGroupsWithState` and `flatMapGroupsWithState` in [[KeyValueGroupedDataset]] + * will invoke the user-given function on each group (defined by the grouping function in + * `Dataset.groupByKey()`) while maintaining user-defined per-group state between invocations. + * For a static batch Dataset, the function will be invoked once per group. For a streaming + * Dataset, the function will be invoked for each group repeatedly in every trigger. 
+ * That is, in every batch of the [[streaming.StreamingQuery StreamingQuery]], + * the function will be invoked once for each group that has data in the batch. + * + * The function is invoked with following parameters. + * - The key of the group. + * - An iterator containing all the values for this key. + * - A user-defined state object set by previous invocations of the given function. + * In case of a batch Dataset, there is only invocation and state object will be empty as + * there is no prior state. Essentially, for batch Datasets, `[map/flatMap]GroupsWithState` + * is equivalent to `[map/flatMap]Groups`. + * + * Important points to note about the function. + * - In a trigger, the function will be called only the groups present in the batch. So do not + *assume that the function will be called in every trigger for every group that has state. + * - There is no guaranteed ordering of values in the iterator in the function, neither with + *batch, nor with streaming Datasets. + * - All the data will be shuffled before applying the function. + * + * Important points to note about using KeyedState. + * - The value of the state cannot be null. So updating state with null is same as removing it. + * - Operations on `KeyedState` are not thread-safe. This is to avoid memory barriers. + * - If the `remove()` is called, then `exists()` will return `false`, and + *`getOption()` will return `None`. --- End diff -- done
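The `KeyedState` rules listed above (updating with null behaves as removal; after `remove()`, `exists` is `false` and `getOption()` returns `None`) can be modeled with a minimal stand-alone cell. This is a sketch of the documented contract only; the name `StateCell` is invented here, and the real implementation (`KeyedStateImpl`) carries extra bookkeeping flags.

```scala
// Hypothetical stand-in for the documented KeyedState contract.
// S >: Null restricts S to nullable (reference) types so "no state"
// can be represented as null, matching the described semantics.
final class StateCell[S >: Null](private var value: S) {
  def exists: Boolean = value != null
  def get: S = value
  def getOption: Option[S] = Option(value) // None when no state is set
  def update(newValue: S): Unit =
    if (newValue == null) remove() // null update is the same as removing
    else value = newValue
  def remove(): Unit = value = null
}
```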
[GitHub] spark pull request #16758: [SPARK-19413][SS] MapGroupsWithState for arbitrar...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16758#discussion_r99748260 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/MapGroupsWithStateSuite.scala --- @@ -0,0 +1,240 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.streaming + +import org.scalatest.BeforeAndAfterAll + +import org.apache.spark.SparkException +import org.apache.spark.sql.KeyedState +import org.apache.spark.sql.catalyst.streaming.InternalOutputModes._ +import org.apache.spark.sql.execution.streaming.{KeyedStateImpl, MemoryStream} +import org.apache.spark.sql.execution.streaming.state.StateStore + +/** Class to check custom state types */ +case class RunningCount(count: Long) + +class MapGroupsWithStateSuite extends StreamTest with BeforeAndAfterAll { + + import testImplicits._ + + override def afterAll(): Unit = { +super.afterAll() +StateStore.stop() + } + + test("state - get, exists, update, remove") { +var state: KeyedStateImpl[String] = null + +def testState( +expectedData: Option[String], +shouldBeUpdated: Boolean = false, +shouldBeRemoved: Boolean = false + ): Unit = { + if (expectedData.isDefined) { +assert(state.exists) +assert(state.get === expectedData.get) + } else { +assert(!state.exists) +assert(state.get === null) + } + assert(state.isUpdated === shouldBeUpdated) + assert(state.isRemoved === shouldBeRemoved) +} + +// Updating empty state +state = KeyedStateImpl[String](null) +testState(None) +state.update("") +testState(Some(""), shouldBeUpdated = true) + +// Updating exiting state +state = KeyedStateImpl[String]("2") +testState(Some("2")) +state.update("3") +testState(Some("3"), shouldBeUpdated = true) + +// Removing state +state.remove() +testState(None, shouldBeRemoved = true, shouldBeUpdated = false) +state.remove() // should be still callable +state.update("4") +testState(Some("4"), shouldBeRemoved = false, shouldBeUpdated = true) + +// Updating by null is same as remove +state.update(null) +testState(None, shouldBeRemoved = true, shouldBeUpdated = false) + } + + test("flatMapGroupsWithState - streaming") { +// Function to maintain running count up to 2, and then remove the count +// Returns the data and the count if state is defined, otherwise does not 
return anything +val stateFunc = (key: String, values: Iterator[String], state: KeyedState[RunningCount]) => { + + var count = Option(state.get).map(_.count).getOrElse(0L) + values.size --- End diff -- done
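The state function quoted in the test above keeps a running count per key and, once the count reaches 3, removes the state and emits nothing for that key in that trigger, so the next occurrence restarts the count at 1. A stand-alone simulation of that behavior (no Spark; `KeyedState` modeled as an `Option` in a mutable map, names illustrative):

```scala
// Illustrative simulation of the suite's running-count state function.
object RunningCountSketch {
  // Returns (new state, emitted rows); None means "remove the state".
  def stateFunc(key: String, values: Iterator[String],
                prior: Option[Long]): (Option[Long], Seq[(String, String)]) = {
    val count = prior.getOrElse(0L) + values.size
    if (count == 3) (None, Seq.empty)              // remove state, emit nothing
    else (Some(count), Seq(key -> count.toString)) // update state, emit count
  }

  // One trigger over a batch, applying state updates/removals to the store.
  def trigger(batch: Seq[String],
              store: scala.collection.mutable.Map[String, Long]): Seq[(String, String)] =
    batch.groupBy(identity).toSeq.flatMap { case (k, vs) =>
      val (next, out) = stateFunc(k, vs.iterator, store.get(k))
      next match {
        case Some(c) => store(k) = c
        case None    => store.remove(k)
      }
      out
    }
}
```

Three triggers of `Seq("a")`, `Seq("a", "b")`, `Seq("a", "b")` reproduce the `CheckLastBatch` expectations in the quoted test: "a" reaches 3 on the third trigger, its state is removed, and only `("b", "2")` is emitted.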
[GitHub] spark pull request #16758: [SPARK-19413][SS] MapGroupsWithState for arbitrar...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16758#discussion_r99747470 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/MapGroupsWithStateSuite.scala --- @@ -0,0 +1,240 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.streaming + +import org.scalatest.BeforeAndAfterAll + +import org.apache.spark.SparkException +import org.apache.spark.sql.KeyedState +import org.apache.spark.sql.catalyst.streaming.InternalOutputModes._ +import org.apache.spark.sql.execution.streaming.{KeyedStateImpl, MemoryStream} +import org.apache.spark.sql.execution.streaming.state.StateStore + +/** Class to check custom state types */ +case class RunningCount(count: Long) + +class MapGroupsWithStateSuite extends StreamTest with BeforeAndAfterAll { + + import testImplicits._ + + override def afterAll(): Unit = { +super.afterAll() +StateStore.stop() + } + + test("state - get, exists, update, remove") { +var state: KeyedStateImpl[String] = null + +def testState( +expectedData: Option[String], +shouldBeUpdated: Boolean = false, +shouldBeRemoved: Boolean = false + ): Unit = { + if (expectedData.isDefined) { +assert(state.exists) +assert(state.get === expectedData.get) + } else { +assert(!state.exists) +assert(state.get === null) + } + assert(state.isUpdated === shouldBeUpdated) + assert(state.isRemoved === shouldBeRemoved) +} + +// Updating empty state +state = KeyedStateImpl[String](null) +testState(None) +state.update("") +testState(Some(""), shouldBeUpdated = true) + +// Updating exiting state +state = KeyedStateImpl[String]("2") +testState(Some("2")) +state.update("3") +testState(Some("3"), shouldBeUpdated = true) + +// Removing state +state.remove() +testState(None, shouldBeRemoved = true, shouldBeUpdated = false) +state.remove() // should be still callable +state.update("4") +testState(Some("4"), shouldBeRemoved = false, shouldBeUpdated = true) + +// Updating by null is same as remove +state.update(null) +testState(None, shouldBeRemoved = true, shouldBeUpdated = false) + } + + test("flatMapGroupsWithState - streaming") { +// Function to maintain running count up to 2, and then remove the count +// Returns the data and the count if state is defined, otherwise does not 
return anything +val stateFunc = (key: String, values: Iterator[String], state: KeyedState[RunningCount]) => { + + var count = Option(state.get).map(_.count).getOrElse(0L) + values.size + if (count == 3) { +state.remove() +Iterator.empty + } else { +state.update(RunningCount(count)) +Iterator((key, count.toString)) + } +} + +val inputData = MemoryStream[String] +val result = + inputData.toDS() +.groupByKey(x => x) +.flatMapGroupsWithState(stateFunc) // State: Int, Out: (Str, Str) + +testStream(result, Append)( + AddData(inputData, "a"), + CheckLastBatch(("a", "1")), + assertNumStateRows(total = 1, updated = 1), + AddData(inputData, "a", "b"), + CheckLastBatch(("a", "2"), ("b", "1")), + assertNumStateRows(total = 2, updated = 2), + StopStream, + StartStream(), + AddData(inputData, "a", "b"), // should remove state for "a" and not return anything for a + CheckLastBatch(("b", "2")), + assertNumStateRows(total = 1, updated = 2), + StopStream, + StartStream(), + AddData(inputData, "a", "c"), // should recreate state for "a" and return count as 1 and +
[GitHub] spark pull request #16758: [SPARK-19413][SS] MapGroupsWithState for arbitrar...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16758#discussion_r99747455 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/MapGroupsWithStateSuite.scala ---
[GitHub] spark pull request #16758: [SPARK-19413][SS] MapGroupsWithState for arbitrar...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16758#discussion_r99747354 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/KeyedStateImpl.scala --- @@ -0,0 +1,57 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.streaming + +import org.apache.spark.sql.KeyedState + +/** Internal implementation of the [[KeyedState]] interface */ +private[sql] case class KeyedStateImpl[S](private var value: S) extends KeyedState[S] { + private var updated: Boolean = false // whether value has been updated (but not removed) + private var removed: Boolean = false // whether value has been removed + + // = Public API = + override def exists: Boolean = { value != null } + + override def get: S = value + + override def update(newValue: S): Unit = { +if (newValue == null) { + remove() +} else { + value = newValue + updated = true + removed = false +} + } + + override def remove(): Unit = { +value = null.asInstanceOf[S] +updated = false +removed = true + } + + override def toString: String = "KeyedState($value)" --- End diff -- done
[GitHub] spark issue #16686: [SPARK-18682][SS] Batch Source for Kafka
Github user tdas commented on the issue: https://github.com/apache/spark/pull/16686 @zsxwing please merge if you think your concerns were addressed correctly.
[GitHub] spark issue #16686: [SPARK-18682][SS] Batch Source for Kafka
Github user tdas commented on the issue: https://github.com/apache/spark/pull/16686 LGTM!
[GitHub] spark issue #16762: [SPARK-19419] [SPARK-19420] Fix the cross join detection
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16762 Cross join is a physical concept. In Spark 2.0, we detected it the way this PR does. In Spark 2.1, we moved the detection into the Optimizer. Basically, this PR changes it back.
[GitHub] spark pull request #16815: [SPARK-19407][SS] defaultFS is used FileSystem.ge...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16815
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16740 @sethah Thanks for the comments. OK, added more tests to cover all families. It's not possible to test every family and link combination, if that's what you mean: the tweedie family supports a whole family of links. The links tested now include identity, log, inverse, logit, and mu^0.4. This should be enough to prevent any non-general changes from accidentally passing the tests.
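For reference, the links named in the comment above are standard GLM link functions. The sketch below is illustrative math only, not Spark ML's internal API (the `Links` object and `Link` case class are invented here): each link is g(mu) with its inverse, and the tweedie family's power link g(mu) = mu^p covers the mu^0.4 case mentioned.

```scala
// Illustrative GLM link functions: g maps the mean mu to the linear
// predictor eta; inv maps eta back to mu.
object Links {
  case class Link(g: Double => Double, inv: Double => Double)

  val identity = Link(mu => mu, eta => eta)
  val log      = Link(math.log, math.exp)
  val inverse  = Link(mu => 1.0 / mu, eta => 1.0 / eta)
  val logit    = Link(mu => math.log(mu / (1.0 - mu)),
                      eta => 1.0 / (1.0 + math.exp(-eta)))
  // Power link for the tweedie family; p = 0.4 gives the mu^0.4 link above.
  def power(p: Double): Link =
    Link(mu => math.pow(mu, p), eta => math.pow(eta, 1.0 / p))
}
```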
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16740 **[Test build #72487 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72487/testReport)** for PR 16740 at commit [`37c41aa`](https://github.com/apache/spark/commit/37c41aa598bd0c2e3cf9e42f217233498ffdac23).
[GitHub] spark issue #16815: [SPARK-19407][SS] defaultFS is used FileSystem.get inste...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/16815 LGTM. Merging to master and 2.1.
[GitHub] spark issue #16827: [SPARK-19482][CORE] Fail it if 'spark.master' is set wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16827 Merged build finished. Test FAILed.
[GitHub] spark issue #16827: [SPARK-19482][CORE] Fail it if 'spark.master' is set wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16827 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72485/ Test FAILed.
[GitHub] spark issue #16827: [SPARK-19482][CORE] Fail it if 'spark.master' is set wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16827 **[Test build #72485 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72485/testReport)** for PR 16827 at commit [`b50a0bd`](https://github.com/apache/spark/commit/b50a0bd4b6df3aae7a8226e216ab46c1ea393c99). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16827: [SPARK-19482][CORE] Fail it if 'spark.master' is set wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16827 Merged build finished. Test FAILed.
[GitHub] spark issue #16827: [SPARK-19482][CORE] Fail it if 'spark.master' is set wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16827 **[Test build #72483 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72483/testReport)** for PR 16827 at commit [`0cc5997`](https://github.com/apache/spark/commit/0cc599708d213c6aeaf8ad1a748323980444eb15). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16827: [SPARK-19482][CORE] Fail it if 'spark.master' is set wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16827 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72483/ Test FAILed.
[GitHub] spark issue #16747: SPARK-16636 Add CalendarIntervalType to documentation
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16747 CC @rxin, if we are going to expose `CalendarInterval` and `CalendarIntervalType` officially, shall we move `CalendarInterval` to the same package as `Decimal`, or create a new class as the external representation of `CalendarIntervalType`?
[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16744 Merged build finished. Test PASSed.
[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16744 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72481/ Test PASSed.
[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16744

**[Test build #72481 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72481/testReport)** for PR 16744 at commit [`5823740`](https://github.com/apache/spark/commit/58237404c1dc755e4085d2cfaff7e629165f70d7).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16744 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72480/
[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16744 Merged build finished. Test PASSed.
[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16744

**[Test build #72480 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72480/testReport)** for PR 16744 at commit [`d23365b`](https://github.com/apache/spark/commit/d23365bcb0cefba80bb6b6e56d55e3319f9cf432).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16747: SPARK-16636 Add CalendarIntervalType to documentation
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16747

Then, it looks okay to me as it describes the current state. I just checked it after building the doc with this, and we can already use it as below:

```scala
scala> sql("SELECT interval 1 second").schema(0).dataType.getClass
res0: Class[_ <: org.apache.spark.sql.types.DataType] = class org.apache.spark.sql.types.CalendarIntervalType$

scala> sql("SELECT interval 1 second").collect()(0).get(0).getClass
res1: Class[_] = class org.apache.spark.unsafe.types.CalendarInterval
```

```scala
scala> val rdd = spark.sparkContext.parallelize(Seq(Row(new CalendarInterval(0, 0))))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[0] at parallelize at <console>:32

scala> spark.createDataFrame(rdd, StructType(StructField("a", CalendarIntervalType) :: Nil))
res1: org.apache.spark.sql.DataFrame = [a: calendarinterval]
```

Another meta concern is that `org.apache.spark.unsafe.types.CalendarInterval` seems undocumented in both scaladoc and javadoc (the entire `unsafe` module). Once we document this as a weak promise for this API, we might have to keep it for backward compatibility. Maybe just describe it as a SQL-dedicated type, or note it as not supported for now with a `Note:`, rather than documenting it fully?
[GitHub] spark issue #16827: [SPARK-19482][CORE] Fail it if 'spark.master' is set wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16827 **[Test build #72486 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72486/testReport)** for PR 16827 at commit [`8198db3`](https://github.com/apache/spark/commit/8198db36574a1eb122c2d2303058b9424006f62e).
[GitHub] spark issue #16827: [SPARK-19482][CORE] Fail it if 'spark.master' is set wit...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/16827 retest this please.
[GitHub] spark pull request #16827: [SPARK-19482][CORE] Fail it if 'spark.master' is ...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/16827#discussion_r99743099

```diff
--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -779,6 +781,30 @@ private[spark] object SparkConf extends Logging {
   }

   /**
+   * check if the given config key-value is valid.
+   */
+  def checkValidSetting(key: String, value: String, conf: SparkConf): Unit = {
+    if (key.equals("spark.master")) {
+      // First, there is no need to set 'spark.master' multiple times with different values.
+      // Second, it is possible for users to set a different 'spark.master' in code than with
+      // the `spark-submit` command, which will confuse users.
+      // So we should check once whether 'spark.master' already exists in the settings and
+      // whether the previous value is the same as the current value. Throw an
+      // IllegalArgumentException when the previous value differs from the current value.
+      val previousOne = try {
+        Some(conf.get(key))
+      } catch {
+        case e: NoSuchElementException =>
+          None
+      }
+      if (previousOne.isDefined && !previousOne.get.equals(value)) {
+        throw new IllegalArgumentException(s"'spark.master' should not be set with different " +
+          s"value, previous value is ${previousOne.get} and current value is $value")
+      }
```

Here we choose to throw an IllegalArgumentException to fail the job. Alternatively, we could log a warning in a gentler way.
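The check discussed in this diff can be sketched more idiomatically with an `Option` for the previous value (as `SparkConf.getOption` would provide), avoiding the try/catch around `conf.get`. This is a hypothetical simplification for illustration, not the code the PR proposes:

```scala
// Hypothetical sketch of the 'spark.master' consistency check.
// `previous` stands in for conf.getOption(key); names are illustrative only.
object MasterSettingCheck {
  def checkValidSetting(key: String, value: String, previous: Option[String]): Unit = {
    if (key == "spark.master") {
      previous.foreach { prev =>
        // Fail fast when 'spark.master' is being reset to a conflicting value.
        if (prev != value) {
          throw new IllegalArgumentException(
            s"'spark.master' should not be set with different values; " +
              s"previous value is $prev and current value is $value")
        }
      }
    }
  }
}
```

Setting the same value twice, or setting it for the first time, passes silently; only a conflicting reset throws.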
[GitHub] spark pull request #16737: [SPARK-19397] [SQL] Make option names of LIBSVM a...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16737#discussion_r99742193

```diff
--- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMOptions.scala ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.libsvm
+
+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
+
+/**
+ * Options for the LibSVM data source.
+ */
+private[libsvm] class LibSVMOptions(@transient private val parameters: CaseInsensitiveMap)
+  extends Serializable {
+
+  import LibSVMOptions._
+
+  def this(parameters: Map[String, String]) = this(new CaseInsensitiveMap(parameters))
+
+  /**
+   * Number of features. If unspecified or nonpositive, the number of features will be determined
+   * automatically at the cost of one additional pass.
+   */
+  val numFeatures = parameters.get(NUM_FEATURES).map(_.toInt)
```

sure
[GitHub] spark pull request #16737: [SPARK-19397] [SQL] Make option names of LIBSVM a...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16737#discussion_r99742158

```diff
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala ---
@@ -125,6 +124,25 @@ class TextSuite extends QueryTest with SharedSQLContext {
     }
   }

+  test("case insensitive option") {
+    val extraOptions = Map[String, String](
+      "mApReDuCe.output.fileoutputformat.compress" -> "true",
```

The case becomes lowercase when we build the DataSource.
[GitHub] spark issue #16795: [SPARK-19409][BUILD][test-maven] Fix ParquetAvroCompatib...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16795 @srowen and @liancheng Could you review this PR again?
[GitHub] spark pull request #16795: [SPARK-19409][BUILD][test-maven] Fix ParquetAvroC...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/16795#discussion_r99740767

```diff
--- Diff: resource-managers/mesos/pom.xml ---
@@ -49,6 +49,13 @@
+    <dependency>
+      <groupId>org.apache.spark</groupId>
+      <artifactId>spark-tags_${scala.binary.version}</artifactId>
+      <type>test-jar</type>
+      <scope>test</scope>
+    </dependency>
```

This is needed to fix the following error due to `org.apache.spark.tags.ExtendedYarnTest`:

```
[info] Running Spark tests using Maven with these arguments: -Phadoop-2.3 -Phive -Pyarn -Pmesos -Phive-thriftserver -Pkinesis-asl -Dtest.exclude.tags=org.apache.spark.tags.ExtendedHiveTest,org.apache.spark.tags.ExtendedYarnTest test --fail-at-end
...
[INFO] Spark Project Mesos FAILURE [ 10.687 s]
...
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test (default-test) on project spark-mesos_2.11: Execution default-test of goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test failed: There was an error in the forked process
[ERROR] java.lang.RuntimeException: Unable to load category: org.apache.spark.tags.ExtendedYarnTest
[ERROR] at org.apache.maven.surefire.group.match.SingleGroupMatcher.loadGroupClasses(SingleGroupMatcher.java:139)
[ERROR] at ...
[ERROR] Caused by: java.lang.ClassNotFoundException: org.apache.spark.tags.ExtendedYarnTest
...
```
[GitHub] spark issue #16815: [SPARK-19407][SS] defaultFS is used FileSystem.get inste...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16815 yea! - I found this earlier but forgot to track it.
[GitHub] spark issue #16827: [SPARK-19482][CORE] Fail it if 'spark.master' is set wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16827 **[Test build #72485 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72485/testReport)** for PR 16827 at commit [`b50a0bd`](https://github.com/apache/spark/commit/b50a0bd4b6df3aae7a8226e216ab46c1ea393c99).
[GitHub] spark issue #16795: [SPARK-19409][BUILD][test-maven] Fix ParquetAvroCompatib...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16795 Merged build finished. Test PASSed.
[GitHub] spark issue #16795: [SPARK-19409][BUILD][test-maven] Fix ParquetAvroCompatib...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16795 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72479/
[GitHub] spark pull request #16827: [SPARK-19482][CORE] Fail it if 'spark.master' is ...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/16827#discussion_r99739390

```diff
--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -779,6 +781,31 @@ private[spark] object SparkConf extends Logging {
   }

   /**
+   * check if the given config key-value is valid.
+   */
+  private def checkValidSetting(key: String, value: String, conf: SparkConf): Unit = {
+    key match {
+      case "spark.master" =>
+        // First, there is no need to set 'spark.master' multiple times with different values.
+        // Second, it is possible for users to set a different 'spark.master' in code than with
+        // the `spark-submit` command, which will confuse users.
+        // So we should check once whether 'spark.master' already exists in the settings and
+        // whether the previous value is the same as the current value. Throw an
+        // IllegalArgumentException when the previous value differs from the current value.
+        val previousOne = try {
+          Some(conf.get(key))
+        } catch {
+          case e: NoSuchElementException =>
+            None
+        }
+        if (previousOne.isDefined && !previousOne.get.equals(value)) {
+          throw new IllegalArgumentException(s"'spark.master' should not be set with different " +
+            s"value, previous value is ${previousOne.get} and current value is $value")
+        }
```

Here we choose to throw an IllegalArgumentException to fail the job. Alternatively, we could log a warning in a gentler way.
[GitHub] spark issue #16795: [SPARK-19409][BUILD][test-maven] Fix ParquetAvroCompatib...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16795

**[Test build #72479 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72479/testReport)** for PR 16795 at commit [`941bf00`](https://github.com/apache/spark/commit/941bf00e7a41c7e64c99c9499acfd5deafc04f39).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16376: [SPARK-18967][SCHEDULER] compute locality levels even if...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16376 **[Test build #72484 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72484/testReport)** for PR 16376 at commit [`cf8271e`](https://github.com/apache/spark/commit/cf8271e4c69efa573aabf891d5465736c7c79184).
[GitHub] spark issue #16827: [SPARK-19482][CORE] Fail it if 'spark.master' is set wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16827 **[Test build #72483 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72483/testReport)** for PR 16827 at commit [`0cc5997`](https://github.com/apache/spark/commit/0cc599708d213c6aeaf8ad1a748323980444eb15).
[GitHub] spark issue #16376: [SPARK-18967][SCHEDULER] compute locality levels even if...
Github user squito commented on the issue: https://github.com/apache/spark/pull/16376 yes, I think this is ready (I just noticed a couple of minor nits with a fresh read but no real changes)
[GitHub] spark issue #16656: [SPARK-18116][DStream] Report stream input information a...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/16656 cc @tdas also.
[GitHub] spark pull request #16827: [SPARK-19482][CORE] Fail it if 'spark.master' is ...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/16827#discussion_r99738940

```diff
--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -779,6 +781,30 @@ private[spark] object SparkConf extends Logging {
   }

   /**
+   * check if the given config key-value is valid.
+   */
+  private def checkValidSetting(key: String, value: String, conf: SparkConf): Unit = {
+    key match {
+      case "spark.master" =>
+        // First, there is no need to set 'spark.master' multiple times with different values.
+        // Second, it is possible for users to set a different 'spark.master' in code than with
+        // the `spark-submit` command, which will confuse users.
+        // So we should check once whether 'spark.master' already exists in the settings and
+        // whether the previous value is the same as the current value. Throw an
+        // IllegalArgumentException when the previous value differs from the current value.
+        val previousOne = try {
+          Some(conf.get(key))
+        } catch {
+          case e: NoSuchElementException =>
+            None
+        }
+        if (previousOne.isDefined && !previousOne.get.equals(value)) {
+          throw new IllegalArgumentException("'spark.master' should not be ")
+        }
```

Here we choose to throw an IllegalArgumentException to fail the job. Alternatively, we could log a warning in a gentler way.
[GitHub] spark pull request #16827: [SPARK-19482][CORE] Fail it if 'spark.master' is ...
GitHub user uncleGen opened a pull request: https://github.com/apache/spark/pull/16827

[SPARK-19482][CORE] Fail it if 'spark.master' is set with different value

## What changes were proposed in this pull request?

First, there is no need to set 'spark.master' multiple times with different values. Second, it is possible for users to set a different 'spark.master' in code than with the `spark-submit` command, which will confuse users. So, we should check once whether 'spark.master' already exists in the settings and whether the previous value is the same as the current value. Throw an IllegalArgumentException when the previous value differs from the current value.

## How was this patch tested?

Added a new unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uncleGen/spark SPARK-19482

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16827.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16827

commit 0cc599708d213c6aeaf8ad1a748323980444eb15
Author: uncleGen
Date: 2017-02-07T03:26:56Z

    Fail it if 'spark.master' is set with different value
[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16797 If the use case suggested by @budde, where we want to infer the schema but not attempt to write it back as a property, makes sense, then the new SQL command approach might not work for it. But in that case, someone who is permitted would actually need to migrate and update the table property.
[GitHub] spark issue #16747: SPARK-16636 Add CalendarIntervalType to documentation
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16747 Actually `CalendarInterval` is already exposed to users, e.g. we can call `collect` on a DataFrame with a `CalendarIntervalType` field and get rows containing `CalendarInterval`. We don't support reading/writing `CalendarIntervalType`, though. So I'm OK with adding documentation for it.
[GitHub] spark issue #16171: [SPARK-18739][ML][PYSPARK] Classification and regression...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16171 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72482/
[GitHub] spark issue #16171: [SPARK-18739][ML][PYSPARK] Classification and regression...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16171 Merged build finished. Test PASSed.
[GitHub] spark issue #16171: [SPARK-18739][ML][PYSPARK] Classification and regression...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16171

**[Test build #72482 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72482/testReport)** for PR 16171 at commit [`bc85645`](https://github.com/apache/spark/commit/bc856454c7826cec4e3761352cd68e058e6a90ec).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16762: [SPARK-19419] [SPARK-19420] Fix the cross join detection
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16762 is CROSS JOIN a logical or physical concept?
[GitHub] spark issue #16795: [SPARK-19409][BUILD][test-maven] Fix ParquetAvroCompatib...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16795 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72473/