[GitHub] [spark] AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586574117 Can one of the admins verify this patch?
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379880577 ## File path: R/pkg/tests/fulltests/test_mllib_classification.R ## @@ -488,4 +488,36 @@ test_that("spark.naiveBayes", { expect_equal(class(collect(predictions)$clicked[1]), "character") }) +test_that("spark.fmClassifier", { + df <- withColumn( +suppressWarnings(createDataFrame(iris)), +"Species", otherwise(when(column("Species") == "Setosa", "Setosa"), "Not-Setosa") + ) + + model1 <- spark.fmClassifier( +df, Species ~ ., +regParam = 0.01, maxIter = 10, fitLinear = TRUE, factorSize = 3 + ) + + prediction1 <- predict(model1, df) + expect_is(prediction1, "SparkDataFrame") + expect_equal(summary(model1)$factorSize, 3) + + # Test model save/load + if (windows_with_hadoop()) { +modelPath <- tempfile(pattern = "spark-fmclassifier", fileext = ".tmp") +write.ml(model1, modelPath) +model2 <- read.ml(modelPath) + +expect_is(model2, "FMClassificationModel") + +prediction2 <- predict(model2, df) +expect_equal( + collect(drop(prediction1, c("rawPrediction", "probability"))), + collect(drop(prediction2, c("rawPrediction", "probability"))) +) + } +}) + + Review comment: nit: delete extra line This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379880985 ## File path: R/pkg/R/mllib_classification.R ## @@ -649,3 +655,155 @@ setMethod("write.ml", signature(object = "NaiveBayesModel", path = "character"), function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + + +#' Factorization Machines Classification Model +#' +#' \code{spark.fmClassifier} fits a factorization classification model against a SparkDataFrame. +#' Users can call \code{summary} to print a summary of the fitted model, \code{predict} to make +#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models. +#' Only categorical data is supported. +#' +#' @param data a \code{SparkDataFrame} of observations and labels for model fitting. +#' @param formula a symbolic description of the model to be fitted. Currently only a few formula +#'operators are supported, including '~', '.', ':', '+', and '-'. +#' @param factorSize dimensionality of the factors. +#' @param fitLinear whether to fit linear term. # TODO Can we express this with formula? +#' @param regParam the regularization parameter. +#' @param miniBatchFraction the mini-batch fraction parameter. +#' @param initStd the standard deviation of initial coefficients. +#' @param maxIter maximum iteration number. +#' @param stepSize stepSize parameter. +#' @param tol convergence tolerance of iterations. +#' @param solver solver parameter, supported options: "gd" (minibatch gradient descent) or "adamW". +#' @param thresholds in binary classification, in range [0, 1]. If the estimated probability of +#' class label 1 is > threshold, then predict 1, else 0. A high threshold +#' encourages the model to predict 0 more often; a low threshold encourages the +#' model to predict 1 more often. Note: Setting this with threshold p is +#' equivalent to setting thresholds c(1-p, p). +#' @param seed seed parameter for weights initialization. +#' @param handleInvalid How to handle invalid data (unseen labels or NULL values) in features and +#' label column of string type. +#' Supported options: "skip" (filter out rows with invalid data), +#' "error" (throw an error), "keep" (put invalid data in +#' a special additional bucket, at index numLabels). Default +#' is "error". +#' @param ... additional arguments passed to the method. +#' @return \code{spark.fmClassifier} returns a fitted Factorization Machines Classification Model. 
+#' @rdname spark.fmClassifier +#' @aliases spark.fmClassifier,SparkDataFrame,formula-method +#' @name spark.fmClassifier +#' @seealso \link{read.ml} +#' @examples +#' \dontrun{ +#' df <- read.df("data/mllib/sample_binary_classification_data.txt", source = "libsvm") +#' +#' # fit Factorization Machines Classification Model +#' model <- spark.fmClassifier( +#'df, label ~ features, +#'regParam = 0.01, maxIter = 10, fitLinear = TRUE +#' ) +#' +#' # get the summary of the model +#' summary(model) +#' +#' # make predictions +#' predictions <- predict(model, df) +#' +#' # save and load the model +#' path <- "path/to/model" +#' write.ml(model, path) +#' savedModel <- read.ml(path) +#' summary(savedModel) +#' } +#' @note spark.fmClassifier since 3.0.0 +setMethod("spark.fmClassifier", signature(data = "SparkDataFrame", formula = "formula"), + function(data, formula, factorSize = 8, fitLinear = TRUE, regParam = 0.0, + miniBatchFraction = 1.0, initStd = 0.01, maxIter = 100, stepSize=1.0, + tol = 1e-6, solver = c("adamW", "gd"), thresholds = NULL, seed = NULL, + handleInvalid = c("error", "keep", "skip")) { Review comment: any reason why ```fitIntercept``` is not here? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
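One likely answer, judging from the wrapper code quoted later in this thread (val fitIntercept = rFormula.hasIntercept), is that the intercept is meant to be controlled through the formula itself rather than by a separate argument. A sketch of what that looks like, assuming Spark's RFormula follows the usual R convention where appending - 1 drops the intercept:

```r
# Intercept fitted (default behaviour of the formula)
model_a <- spark.fmClassifier(df, label ~ features)

# Intercept dropped via the formula, the usual R idiom; the wrapper
# later reads this through rFormula.hasIntercept
model_b <- spark.fmClassifier(df, label ~ features - 1)
```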
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379879918 ## File path: examples/src/main/r/ml/fmClassifier.R ## @@ -0,0 +1,38 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# To run this example use +# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R Review comment: decisionTree.R -> fmClassifier.R This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379880122 ## File path: examples/src/main/r/ml/fmClassifier.R ## @@ -0,0 +1,38 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# To run this example use +# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R + +# Load SparkR library into your R session +library(SparkR) + +# Initialize SparkSession +sparkR.session(appName = "SparkR-ML-fmclasfier-example") + +# $example on:classification$ +# Load training data +df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm") +training <- df +test <- df + +# Fit a FM classification model +model <- spark.fmClassifier(df, label ~ features) + Review comment: add ```summary(model)``` as an example too? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379880075 ## File path: examples/src/main/r/ml/fmClassifier.R ## @@ -0,0 +1,38 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# To run this example use +# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R + +# Load SparkR library into your R session +library(SparkR) + +# Initialize SparkSession +sparkR.session(appName = "SparkR-ML-fmclasfier-example") + +# $example on:classification$ +# Load training data +df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm") +training <- df +test <- df + +# Fit a FM classification model +model <- spark.fmClassifier(df, label ~ features) + +# Prediction +predictions <- predict(model, test) Review comment: add ```head(predictions)```? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
maropu commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586677693 Since this fix is trivial, it looks fine to me as long as the tests pass. cc: @srowen
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379880740 ## File path: R/pkg/R/mllib_classification.R ## @@ -649,3 +655,155 @@ setMethod("write.ml", signature(object = "NaiveBayesModel", path = "character"), function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + + +#' Factorization Machines Classification Model +#' +#' \code{spark.fmClassifier} fits a factorization classification model against a SparkDataFrame. +#' Users can call \code{summary} to print a summary of the fitted model, \code{predict} to make +#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models. +#' Only categorical data is supported. +#' +#' @param data a \code{SparkDataFrame} of observations and labels for model fitting. +#' @param formula a symbolic description of the model to be fitted. Currently only a few formula +#'operators are supported, including '~', '.', ':', '+', and '-'. +#' @param factorSize dimensionality of the factors. +#' @param fitLinear whether to fit linear term. # TODO Can we express this with formula? +#' @param regParam the regularization parameter. +#' @param miniBatchFraction the mini-batch fraction parameter. +#' @param initStd the standard deviation of initial coefficients. +#' @param maxIter maximum iteration number. +#' @param stepSize stepSize parameter. +#' @param tol convergence tolerance of iterations. +#' @param solver solver parameter, supported options: "gd" (minibatch gradient descent) or "adamW". +#' @param thresholds in binary classification, in range [0, 1]. If the estimated probability of +#' class label 1 is > threshold, then predict 1, else 0. A high threshold +#' encourages the model to predict 0 more often; a low threshold encourages the +#' model to predict 1 more often. Note: Setting this with threshold p is +#' equivalent to setting thresholds c(1-p, p). +#' @param seed seed parameter for weights initialization. +#' @param handleInvalid How to handle invalid data (unseen labels or NULL values) in features and +#' label column of string type. +#' Supported options: "skip" (filter out rows with invalid data), +#' "error" (throw an error), "keep" (put invalid data in +#' a special additional bucket, at index numLabels). Default +#' is "error". +#' @param ... additional arguments passed to the method. +#' @return \code{spark.fmClassifier} returns a fitted Factorization Machines Classification Model. +#' @rdname spark.fmClassifier +#' @aliases spark.fmClassifier,SparkDataFrame,formula-method +#' @name spark.fmClassifier +#' @seealso \link{read.ml} +#' @examples +#' \dontrun{ +#' df <- read.df("data/mllib/sample_binary_classification_data.txt", source = "libsvm") +#' +#' # fit Factorization Machines Classification Model +#' model <- spark.fmClassifier( +#'df, label ~ features, +#'regParam = 0.01, maxIter = 10, fitLinear = TRUE +#' ) +#' +#' # get the summary of the model +#' summary(model) +#' +#' # make predictions +#' predictions <- predict(model, df) +#' +#' # save and load the model +#' path <- "path/to/model" +#' write.ml(model, path) +#' savedModel <- read.ml(path) +#' summary(savedModel) +#' } +#' @note spark.fmClassifier since 3.0.0 Review comment: 3.1.0? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
maropu commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586677656 ok to test
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379880095 ## File path: examples/src/main/r/ml/fmClassifier.R ## @@ -0,0 +1,38 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# To run this example use +# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R + +# Load SparkR library into your R session +library(SparkR) + +# Initialize SparkSession +sparkR.session(appName = "SparkR-ML-fmclasfier-example") + +# $example on:classification$ +# Load training data +df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm") +training <- df +test <- df + +# Fit a FM classification model +model <- spark.fmClassifier(df, label ~ features) + +# Prediction +predictions <- predict(model, test) +# $example off:classification$ Review comment: add ```sparkR.session.stop()```? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
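Taken together, the last three review comments (summary, head, session stop) would make the tail of fmClassifier.R look roughly like the sketch below, reusing the df and test variables from the quoted diff:

```r
# Fit a FM classification model
model <- spark.fmClassifier(df, label ~ features)

# Print a summary of the fitted model (suggested addition)
summary(model)

# Prediction
predictions <- predict(model, test)

# Show the first rows of the prediction DataFrame (suggested addition)
head(predictions)
# $example off:classification$

# Stop the SparkSession (suggested addition)
sparkR.session.stop()
```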
[GitHub] [spark] maropu commented on issue #27495: [SPARK-28880][SQL] Support ANSI nested bracketed comments
maropu commented on issue #27495: [SPARK-28880][SQL] Support ANSI nested bracketed comments URL: https://github.com/apache/spark/pull/27495#issuecomment-586678147 Looks fine now except for one comment.
[GitHub] [spark] liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#discussion_r379881028 ## File path: python/pyspark/sql/dataframe.py ## @@ -2153,6 +2153,59 @@ def transform(self, func): "should have been DataFrame." % type(result) return result +@since(3.1) +def sameSemantics(self, other): +""" +Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and +therefore return same results. + +.. note:: The equality comparison here is simplified by tolerating the cosmetic differences +such as attribute names. + +.. note::This API can compare both :class:`DataFrame`\\s very fast but can still return +`False` on the :class:`DataFrame` that return the same results, for instance, from +different plans. Such false negative semantic can be useful when caching as an example. + +>>> df1 = spark.range(100) +>>> df2 = spark.range(100) +>>> df3 = spark.range(100) +>>> df4 = spark.range(100) +>>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2)) +True +>>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2)) +False +>>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2)) +True +""" +if not isinstance(other, DataFrame): +raise ValueError("other parameter should be of DataFrame; however, got %s" + % type(other)) +return self._jdf.sameSemantics(other._jdf) + +@since(3.1) +def semanticHash(self): +""" +Returns a hash code of the logical query plan against this :class:`DataFrame`. + +.. note:: Unlike the standard hash code, the hash is calculated against the query plan +simplified by tolerating the cosmetic differences such as attribute names. + +>>> df1 = spark.range(100) +>>> df2 = spark.range(100) +>>> df3 = spark.range(100) +>>> df4 = spark.range(100) +>>> df1.withColumn("col1", df1.id * 2).semanticHash() == \ +df2.withColumn("col1", df2.id * 2).semanticHash() +True +>>> df1.withColumn("col1", df1.id * 2).semanticHash() == \ +df3.withColumn("col1", df3.id + 2).semanticHash() +False Review comment: ``` Failed example: df1.withColumn("col1", df1.id * 2).semanticHash() == df3.withColumn("col1", df3.id + 2).semanticHash() Differences (ndiff with -expected +actual): - False + True ``` Now we have another unexpected result. (Note L2176 passed, which is expected.) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586677912 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586677913 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23250/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586677912 Merged build finished. Test PASSed.
[GitHub] [spark] maropu commented on a change in pull request #27495: [SPARK-28880][SQL] Support ANSI nested bracketed comments
maropu commented on a change in pull request #27495: [SPARK-28880][SQL] Support ANSI nested bracketed comments URL: https://github.com/apache/spark/pull/27495#discussion_r379882153 ## File path: sql/core/src/test/resources/sql-tests/inputs/postgreSQL/comments.sql ## @@ -47,4 +45,5 @@ Now just one deep... */ 'deeply nested example' AS sixth; --QUERY-DELIMITER-END -/* and this is the end of the file */ +-- [SPARK-30824] Support submit sql content only contains comments. Review comment: Is this an ANSI-related issue? https://issues.apache.org/jira/browse/SPARK-30824 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586678768 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118494/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586678766 Merged build finished. Test PASSed.
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379880677 ## File path: R/pkg/tests/fulltests/test_mllib_classification.R ## @@ -488,4 +488,36 @@ test_that("spark.naiveBayes", { expect_equal(class(collect(predictions)$clicked[1]), "character") }) +test_that("spark.fmClassifier", { + df <- withColumn( +suppressWarnings(createDataFrame(iris)), +"Species", otherwise(when(column("Species") == "Setosa", "Setosa"), "Not-Setosa") + ) + + model1 <- spark.fmClassifier( +df, Species ~ ., +regParam = 0.01, maxIter = 10, fitLinear = TRUE, factorSize = 3 + ) + + prediction1 <- predict(model1, df) + expect_is(prediction1, "SparkDataFrame") Review comment: Can we also check the predict result here? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
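A sketch of what such a check could look like, following the pattern of the spark.naiveBayes test just above it; the decoded label column name (prediction) and the two class labels are assumptions based on the IndexToString stage and the Setosa/Not-Setosa recoding in the test fixture:

```r
# Hypothetical assertions on the decoded prediction labels
prediction_labels <- collect(prediction1)$prediction
expect_equal(class(prediction_labels[1]), "character")
expect_true(all(prediction_labels %in% c("Setosa", "Not-Setosa")))
```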
[GitHub] [spark] liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#discussion_r379881773 ## File path: python/pyspark/sql/dataframe.py ## @@ -2153,6 +2153,59 @@ def transform(self, func): "should have been DataFrame." % type(result) return result +@since(3.1) +def sameSemantics(self, other): +""" +Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and +therefore return same results. + +.. note:: The equality comparison here is simplified by tolerating the cosmetic differences +such as attribute names. + +.. note::This API can compare both :class:`DataFrame`\\s very fast but can still return +`False` on the :class:`DataFrame` that return the same results, for instance, from +different plans. Such false negative semantic can be useful when caching as an example. + +>>> df1 = spark.range(100) +>>> df2 = spark.range(100) +>>> df3 = spark.range(100) +>>> df4 = spark.range(100) +>>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2)) +True +>>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2)) +False +>>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2)) +True +""" +if not isinstance(other, DataFrame): +raise ValueError("other parameter should be of DataFrame; however, got %s" + % type(other)) +return self._jdf.sameSemantics(other._jdf) + +@since(3.1) +def semanticHash(self): +""" +Returns a hash code of the logical query plan against this :class:`DataFrame`. + +.. note:: Unlike the standard hash code, the hash is calculated against the query plan +simplified by tolerating the cosmetic differences such as attribute names. + +>>> df1 = spark.range(100) +>>> df2 = spark.range(100) +>>> df3 = spark.range(100) +>>> df4 = spark.range(100) +>>> df1.withColumn("col1", df1.id * 2).semanticHash() == \ +df2.withColumn("col1", df2.id * 2).semanticHash() +True +>>> df1.withColumn("col1", df1.id * 2).semanticHash() == \ +df3.withColumn("col1", df3.id + 2).semanticHash() +False Review comment: More tests: ``` >>> df1=spark.range(100) >>> df2=spark.range(100) >>> df3=spark.range(100) >>> df11=df1.withColumn("col1", df1.id +1) >>> df21=df2.withColumn("col1", df2.id -1) >>> df31=df3.withColumn("col1", df3.id *2) >>> df32=df3.withColumn("col1", df3.id +2) >>> df33=df3.withColumn("col1", df3.id /2) >>> df34=df3.withColumn("col1", df3.id -2) >>> df11.semanticHash() 1855039936 >>> df21.semanticHash() 1855039936 >>> df31.semanticHash() -1719131362 >>> df32.semanticHash() -1719131362 >>> df32.semanticHash() -1719131362 >>> df34.semanticHash() -706037631 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379881557 ## File path: mllib/src/main/scala/org/apache/spark/ml/r/FMClassifierWrapper.scala ## @@ -0,0 +1,176 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.r + +import org.apache.hadoop.fs.Path +import org.json4s._ +import org.json4s.JsonDSL._ +import org.json4s.jackson.JsonMethods._ + +import org.apache.spark.ml.{Pipeline, PipelineModel} +import org.apache.spark.ml.classification.{FMClassificationModel, FMClassifier} +import org.apache.spark.ml.feature.{IndexToString, RFormula} +import org.apache.spark.ml.r.RWrapperUtils._ +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} + +private[r] class FMClassifierWrapper private ( +val pipeline: PipelineModel, +val features: Array[String], +val labels: Array[String]) extends MLWritable { + import FMClassifierWrapper._ + + private val fmClassificationModel: FMClassificationModel = +pipeline.stages(1).asInstanceOf[FMClassificationModel] + + lazy val rFeatures: Array[String] = if (fmClassificationModel.getFitIntercept) { +Array("(Intercept)") ++ features + } else { +features + } + + lazy val rCoefficients: Array[Double] = if (fmClassificationModel.getFitIntercept) { +Array(fmClassificationModel.intercept) ++ fmClassificationModel.linear.toArray + } else { +fmClassificationModel.linear.toArray + } + + lazy val rFactors = fmClassificationModel.factors.toArray + + lazy val numClasses: Int = fmClassificationModel.numClasses + + lazy val numFeatures: Int = fmClassificationModel.numFeatures + + lazy val factorSize: Int = fmClassificationModel.getFactorSize + + def transform(dataset: Dataset[_]): DataFrame = { +pipeline.transform(dataset) + .drop(PREDICTED_LABEL_INDEX_COL) + .drop(fmClassificationModel.getFeaturesCol) + .drop(fmClassificationModel.getLabelCol) + } + + override def write: MLWriter = new FMClassifierWrapper.FMClassifierWrapperWriter(this) +} + +private[r] object FMClassifierWrapper + extends MLReadable[FMClassifierWrapper] { + + val PREDICTED_LABEL_INDEX_COL = "pred_label_idx" + val PREDICTED_LABEL_COL = "prediction" + + def fit( // scalastyle:ignore + data: DataFrame, + formula: String, + factorSize: Int, + fitLinear: Boolean, + regParam: Double, + miniBatchFraction: Double, + initStd: Double, + maxIter: Int, + stepSize: Double, + tol: Double, + solver: String, + seed: String, + thresholds: Array[Double], + handleInvalid: String): FMClassifierWrapper = { + +val rFormula = new RFormula() + .setFormula(formula) + .setForceIndexLabel(true) + .setHandleInvalid(handleInvalid) +checkDataColumns(rFormula, data) +val rFormulaModel = rFormula.fit(data) + +val 
fitIntercept = rFormula.hasIntercept + +// get labels and feature names from output schema +val (features, labels) = getFeaturesAndLabels(rFormulaModel, data) + +// assemble and fit the pipeline +val fmc = new FMClassifier() + .setFactorSize(factorSize) + .setFitLinear(fitLinear) + .setRegParam(regParam) + .setMiniBatchFraction(miniBatchFraction) + .setInitStd(initStd) + .setMaxIter(maxIter) + .setTol(tol) + .setSolver(solver) + .setFitIntercept(fitIntercept) + .setFeaturesCol(rFormula.getFeaturesCol) + .setLabelCol(rFormula.getLabelCol) + .setPredictionCol(PREDICTED_LABEL_INDEX_COL) Review comment: add ```setStepSize```? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail:
[GitHub] [spark] MaxGekk commented on a change in pull request #27596: [WIP] Fix getting of time components before 1582 year
MaxGekk commented on a change in pull request #27596: [WIP] Fix getting of time components before 1582 year URL: https://github.com/apache/spark/pull/27596#discussion_r379881626 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala ## @@ -290,32 +293,38 @@ class DateTimeUtilsSuite extends SparkFunSuite with Matchers with SQLHelper { } test("hours") { -var input = date(2015, 3, 18, 13, 2, 11, 0, TimeZonePST) -assert(getHours(input, TimeZonePST) === 13) -assert(getHours(input, TimeZoneGMT) === 20) -input = date(2015, 12, 8, 2, 7, 9, 0, TimeZonePST) -assert(getHours(input, TimeZonePST) === 2) -assert(getHours(input, TimeZoneGMT) === 10) +var input = date(2015, 3, 18, 13, 2, 11, 0, zonePST) +assert(getHours(input, zonePST) === 13) +assert(getHours(input, zoneGMT) === 20) +input = date(2015, 12, 8, 2, 7, 9, 0, zonePST) +assert(getHours(input, zonePST) === 2) +assert(getHours(input, zoneGMT) === 10) +input = date(10, 1, 1, 0, 0, 0, 0, zonePST) +assert(getHours(input, zonePST) === 0) Review comment: Before the changes: ```sql spark-sql> select hour(timestamp '0010-01-01 00:00:00'); 23 ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
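The same behaviour is easy to reproduce from SparkR, which gives a quick way to check the fix; a sketch, assuming the session runs in the America/Los_Angeles time zone as in the test:

```r
# Before the fix this returned 23 for a pre-1582 timestamp in PST;
# after the fix it should return 0
head(sql("SELECT hour(TIMESTAMP '0010-01-01 00:00:00')"))
```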
[GitHub] [spark] beliefer commented on a change in pull request #27592: [SPARK-30840][CORE][SQL] Add version property for ConfigEntry and ConfigBuilder
beliefer commented on a change in pull request #27592: [SPARK-30840][CORE][SQL] Add version property for ConfigEntry and ConfigBuilder URL: https://github.com/apache/spark/pull/27592#discussion_r379881659 ## File path: core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala ## @@ -74,7 +76,8 @@ private[spark] abstract class ConfigEntry[T] ( def defaultValue: Option[T] = None override def toString: String = { -s"ConfigEntry(key=$key, defaultValue=$defaultValueString, doc=$doc, public=$isPublic)" +s"ConfigEntry(key=$key, defaultValue=$defaultValueString, doc=$doc, " + + s"public=$isPublic, version = $version)" Review comment: OK
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379880497 ## File path: mllib/src/main/scala/org/apache/spark/ml/r/FMClassifierWrapper.scala ## @@ -0,0 +1,176 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.r + +import org.apache.hadoop.fs.Path +import org.json4s._ +import org.json4s.JsonDSL._ +import org.json4s.jackson.JsonMethods._ + +import org.apache.spark.ml.{Pipeline, PipelineModel} +import org.apache.spark.ml.classification.{FMClassificationModel, FMClassifier} +import org.apache.spark.ml.feature.{IndexToString, RFormula} +import org.apache.spark.ml.r.RWrapperUtils._ +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} + +private[r] class FMClassifierWrapper private ( +val pipeline: PipelineModel, +val features: Array[String], +val labels: Array[String]) extends MLWritable { + import FMClassifierWrapper._ + + private val fmClassificationModel: FMClassificationModel = +pipeline.stages(1).asInstanceOf[FMClassificationModel] + + lazy val rFeatures: Array[String] = if (fmClassificationModel.getFitIntercept) { +Array("(Intercept)") ++ features + } else { +features + } + + lazy val rCoefficients: Array[Double] = if (fmClassificationModel.getFitIntercept) { +Array(fmClassificationModel.intercept) ++ fmClassificationModel.linear.toArray + } else { +fmClassificationModel.linear.toArray + } + + lazy val rFactors = fmClassificationModel.factors.toArray + + lazy val numClasses: Int = fmClassificationModel.numClasses + + lazy val numFeatures: Int = fmClassificationModel.numFeatures + + lazy val factorSize: Int = fmClassificationModel.getFactorSize + + def transform(dataset: Dataset[_]): DataFrame = { +pipeline.transform(dataset) + .drop(PREDICTED_LABEL_INDEX_COL) + .drop(fmClassificationModel.getFeaturesCol) + .drop(fmClassificationModel.getLabelCol) + } + + override def write: MLWriter = new FMClassifierWrapper.FMClassifierWrapperWriter(this) +} + +private[r] object FMClassifierWrapper + extends MLReadable[FMClassifierWrapper] { + + val PREDICTED_LABEL_INDEX_COL = "pred_label_idx" + val PREDICTED_LABEL_COL = "prediction" + + def fit( // scalastyle:ignore + data: DataFrame, + formula: String, + factorSize: Int, + fitLinear: Boolean, + regParam: Double, + miniBatchFraction: Double, + initStd: Double, + maxIter: Int, + stepSize: Double, + tol: Double, + solver: String, + seed: String, + thresholds: Array[Double], + handleInvalid: String): FMClassifierWrapper = { + +val rFormula = new RFormula() + .setFormula(formula) + .setForceIndexLabel(true) + .setHandleInvalid(handleInvalid) +checkDataColumns(rFormula, data) +val rFormulaModel = rFormula.fit(data) + +val 
fitIntercept = rFormula.hasIntercept + +// get labels and feature names from output schema +val (features, labels) = getFeaturesAndLabels(rFormulaModel, data) + +// assemble and fit the pipeline +val fmc = new FMClassifier() + .setFactorSize(factorSize) + .setFitLinear(fitLinear) + .setRegParam(regParam) + .setMiniBatchFraction(miniBatchFraction) + .setInitStd(initStd) + .setMaxIter(maxIter) + .setTol(tol) + .setSolver(solver) + .setFitIntercept(fitIntercept) + .setFeaturesCol(rFormula.getFeaturesCol) + .setLabelCol(rFormula.getLabelCol) + .setPredictionCol(PREDICTED_LABEL_INDEX_COL) + +if (seed != null) { Review comment: ```if (seed != null && seed.length > 0)```? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail:
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586676945 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586676946 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23249/ Test PASSed.
[GitHub] [spark] SparkQA commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
SparkQA commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586677822 **[Test build #118494 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118494/testReport)** for PR 27594 at commit [`0bb6301`](https://github.com/apache/spark/commit/0bb630176af979773779ff2516a8f82d37970150).
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR URL: https://github.com/apache/spark/pull/27571#discussion_r379882027 ## File path: R/pkg/R/mllib_regression.R ## @@ -540,3 +546,150 @@ setMethod("write.ml", signature(object = "AFTSurvivalRegressionModel", path = "c function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + + +#' Factorization Machines Regression Model Model Review comment: nit: ```Model Model``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR URL: https://github.com/apache/spark/pull/27571#discussion_r379882145 ## File path: R/pkg/R/mllib_regression.R ## @@ -540,3 +546,150 @@ setMethod("write.ml", signature(object = "AFTSurvivalRegressionModel", path = "c function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + + +#' Factorization Machines Regression Model Model +#' +#' \code{spark.fmRegressor} fits a factorization regression model against a SparkDataFrame. +#' Users can call \code{predict} to make +#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models. Review comment: I guess also mention ```summary``` here? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
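For reference, a minimal sketch of the usage pattern the paragraph would then describe, assuming the spark.fmRegressor interface from this PR and the sample dataset used elsewhere in the diff:

```r
df <- read.df("data/mllib/sample_linear_regression_data.txt", source = "libsvm")

# Fit, summarize and predict
model <- spark.fmRegressor(df, label ~ features, regParam = 0.01, maxIter = 10)
summary(model)
predictions <- predict(model, df)
head(predictions)

# Save and reload the fitted model
modelPath <- tempfile(pattern = "spark-fmregressor", fileext = ".tmp")
write.ml(model, modelPath)
savedModel <- read.ml(modelPath)
summary(savedModel)
```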
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR URL: https://github.com/apache/spark/pull/27571#discussion_r379882502 ## File path: R/pkg/tests/fulltests/test_mllib_regression.R ## @@ -551,4 +551,33 @@ test_that("spark.survreg", { } }) + +test_that("spark.fmRegressor", { + df <- suppressWarnings(createDataFrame(iris)) + + model <- spark.fmRegressor( +df, Sepal_Width ~ ., +regParam = 0.01, maxIter = 10, fitLinear = TRUE + ) + + prediction1 <- predict(model, df) + expect_is(prediction1, "SparkDataFrame") + + # Test model save/load + if (windows_with_hadoop()) { +modelPath <- tempfile(pattern = "spark-fmregressor", fileext = ".tmp") +write.ml(model, modelPath) +model2 <- read.ml(modelPath) + +expect_is(model2, "FMRegressionModel") + +prediction2 <- predict(model2, df) +expect_equal( + collect(prediction1), + collect(prediction2) +) + } +}) + + Review comment: nit: delete extra line This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR URL: https://github.com/apache/spark/pull/27571#discussion_r379881943 ## File path: R/pkg/R/mllib_regression.R ## @@ -540,3 +546,150 @@ setMethod("write.ml", signature(object = "AFTSurvivalRegressionModel", path = "c function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + + Review comment: nit: delete extra line? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR URL: https://github.com/apache/spark/pull/27571#discussion_r379882615 ## File path: examples/src/main/r/ml/fmRegressor.R ## @@ -0,0 +1,40 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# To run this example use +# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R Review comment: change this ```decisionTree.R```? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR URL: https://github.com/apache/spark/pull/27571#discussion_r379882600 ## File path: examples/src/main/r/ml/fmRegressor.R ## @@ -0,0 +1,40 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# To run this example use +# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R + +# Load SparkR library into your R session +library(SparkR) + +# Initialize SparkSession +sparkR.session(appName = "SparkR-ML-fmRegressor-example") + +# $example on +# Load training data +df <- read.df("data/mllib/sample_linear_regression_data.txt", source = "libsvm") +training_test <- randomSplit(df, c(0.7, 0.3)) +training <- training_test[[1]] +test <- training_test[[2]] + + Review comment: nit: delete extra line This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR URL: https://github.com/apache/spark/pull/27571#discussion_r379883014 ## File path: mllib/src/main/scala/org/apache/spark/ml/r/FMRegressorWrapper.scala ## @@ -0,0 +1,157 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.r + +import org.apache.hadoop.fs.Path +import org.json4s._ +import org.json4s.JsonDSL._ +import org.json4s.jackson.JsonMethods._ + +import org.apache.spark.ml.{Pipeline, PipelineModel} +import org.apache.spark.ml.attribute.AttributeGroup +import org.apache.spark.ml.feature.RFormula +import org.apache.spark.ml.r.RWrapperUtils._ +import org.apache.spark.ml.regression.{FMRegressionModel, FMRegressor} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} + +private[r] class FMRegressorWrapper private ( +val pipeline: PipelineModel, +val features: Array[String]) extends MLWritable { + import FMRegressorWrapper._ + + private val fmRegressionModel: FMRegressionModel = +pipeline.stages(1).asInstanceOf[FMRegressionModel] + + lazy val rFeatures: Array[String] = if (fmRegressionModel.getFitIntercept) { +Array("(Intercept)") ++ features + } else { +features + } + + lazy val rCoefficients: Array[Double] = if (fmRegressionModel.getFitIntercept) { +Array(fmRegressionModel.intercept) ++ fmRegressionModel.linear.toArray + } else { +fmRegressionModel.linear.toArray + } + + lazy val rFactors = fmRegressionModel.factors.toArray + + lazy val numFeatures: Int = fmRegressionModel.numFeatures + + lazy val factorSize: Int = fmRegressionModel.getFactorSize + + def transform(dataset: Dataset[_]): DataFrame = { +pipeline.transform(dataset) + .drop(fmRegressionModel.getFeaturesCol) + } + + override def write: MLWriter = new FMRegressorWrapper.FMRegressorWrapperWriter(this) +} + +private[r] object FMRegressorWrapper + extends MLReadable[FMRegressorWrapper] { + + def fit( // scalastyle:ignore + data: DataFrame, + formula: String, + factorSize: Int, + fitLinear: Boolean, + regParam: Double, + miniBatchFraction: Double, + initStd: Double, + maxIter: Int, + stepSize: Double, + tol: Double, + solver: String, + seed: String, + stringIndexerOrderType: String): FMRegressorWrapper = { + +val rFormula = new RFormula() + .setFormula(formula) + .setStringIndexerOrderType(stringIndexerOrderType) +checkDataColumns(rFormula, data) +val rFormulaModel = rFormula.fit(data) + +val fitIntercept = rFormula.hasIntercept + +// get feature names from output schema +val schema = rFormulaModel.transform(data).schema +val featureAttrs = AttributeGroup.fromStructField(schema(rFormulaModel.getFeaturesCol)) + .attributes.get +val features = featureAttrs.map(_.name.get) + +// assemble and fit the pipeline +val 
fmr = new FMRegressor() + .setFactorSize(factorSize) + .setFitLinear(fitLinear) + .setRegParam(regParam) + .setMiniBatchFraction(miniBatchFraction) + .setInitStd(initStd) + .setMaxIter(maxIter) + .setTol(tol) + .setSolver(solver) + .setFitIntercept(fitIntercept) + .setFeaturesCol(rFormula.getFeaturesCol) + +if (seed != null) { Review comment: also check ```seed.length > 0```? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
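For illustration, a minimal sketch of the guard the reviewer is suggesting. The quoted hunk stops right at `if (seed != null) {`, so the body of that branch is not shown; the `setSeed` call below is an assumption made for the sketch (by analogy with the other SparkR wrappers), not a quote from the PR.

```scala
// Hypothetical sketch, not the PR's code: treat `seed` as set only when it is
// both non-null and non-empty, as the reviewer suggests.
if (seed != null && seed.length > 0) {
  fmr.setSeed(seed.toLong)  // assumed setter; the actual branch body is not in the quoted diff
}
```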
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR URL: https://github.com/apache/spark/pull/27571#discussion_r379882860 ## File path: examples/src/main/r/ml/fmRegressor.R ## @@ -0,0 +1,40 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# To run this example use +# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R + +# Load SparkR library into your R session +library(SparkR) + +# Initialize SparkSession +sparkR.session(appName = "SparkR-ML-fmRegressor-example") + +# $example on +# Load training data +df <- read.df("data/mllib/sample_linear_regression_data.txt", source = "libsvm") +training_test <- randomSplit(df, c(0.7, 0.3)) +training <- training_test[[1]] +test <- training_test[[2]] + + +# Fit a FM regression model +model <- spark.fmRegressor(training, label ~ features) + +# Prediction +predictions <- predict(model, test) Review comment: same as the classifier example, I guess add ```summary(model)```, ```head(predictions)``` and also add ```sparkR.session.stop()``` in the end? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR URL: https://github.com/apache/spark/pull/27571#discussion_r379882920 ## File path: mllib/src/main/scala/org/apache/spark/ml/r/FMRegressorWrapper.scala ## @@ -0,0 +1,157 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.r + +import org.apache.hadoop.fs.Path +import org.json4s._ +import org.json4s.JsonDSL._ +import org.json4s.jackson.JsonMethods._ + +import org.apache.spark.ml.{Pipeline, PipelineModel} +import org.apache.spark.ml.attribute.AttributeGroup +import org.apache.spark.ml.feature.RFormula +import org.apache.spark.ml.r.RWrapperUtils._ +import org.apache.spark.ml.regression.{FMRegressionModel, FMRegressor} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} + +private[r] class FMRegressorWrapper private ( +val pipeline: PipelineModel, +val features: Array[String]) extends MLWritable { + import FMRegressorWrapper._ + + private val fmRegressionModel: FMRegressionModel = +pipeline.stages(1).asInstanceOf[FMRegressionModel] + + lazy val rFeatures: Array[String] = if (fmRegressionModel.getFitIntercept) { +Array("(Intercept)") ++ features + } else { +features + } + + lazy val rCoefficients: Array[Double] = if (fmRegressionModel.getFitIntercept) { +Array(fmRegressionModel.intercept) ++ fmRegressionModel.linear.toArray + } else { +fmRegressionModel.linear.toArray + } + + lazy val rFactors = fmRegressionModel.factors.toArray + + lazy val numFeatures: Int = fmRegressionModel.numFeatures + + lazy val factorSize: Int = fmRegressionModel.getFactorSize + + def transform(dataset: Dataset[_]): DataFrame = { +pipeline.transform(dataset) + .drop(fmRegressionModel.getFeaturesCol) + } + + override def write: MLWriter = new FMRegressorWrapper.FMRegressorWrapperWriter(this) +} + +private[r] object FMRegressorWrapper + extends MLReadable[FMRegressorWrapper] { + + def fit( // scalastyle:ignore + data: DataFrame, + formula: String, + factorSize: Int, + fitLinear: Boolean, + regParam: Double, + miniBatchFraction: Double, + initStd: Double, + maxIter: Int, + stepSize: Double, + tol: Double, + solver: String, + seed: String, + stringIndexerOrderType: String): FMRegressorWrapper = { + +val rFormula = new RFormula() + .setFormula(formula) + .setStringIndexerOrderType(stringIndexerOrderType) +checkDataColumns(rFormula, data) +val rFormulaModel = rFormula.fit(data) + +val fitIntercept = rFormula.hasIntercept + +// get feature names from output schema +val schema = rFormulaModel.transform(data).schema +val featureAttrs = AttributeGroup.fromStructField(schema(rFormulaModel.getFeaturesCol)) + .attributes.get +val features = featureAttrs.map(_.name.get) + +// assemble and fit the pipeline +val 
fmr = new FMRegressor() + .setFactorSize(factorSize) + .setFitLinear(fitLinear) + .setRegParam(regParam) + .setMiniBatchFraction(miniBatchFraction) + .setInitStd(initStd) + .setMaxIter(maxIter) + .setTol(tol) + .setSolver(solver) + .setFitIntercept(fitIntercept) + .setFeaturesCol(rFormula.getFeaturesCol) Review comment: add ```setStepSize```? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
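Applied to the quoted hunk, the suggestion would roughly look like the excerpt below. All setters except the marked one are copied from the PR's builder chain; `setStepSize` is the addition being asked for, and `stepSize` is already a parameter of the surrounding `fit` method, so the excerpt only makes sense in that context.

```scala
// Excerpt-style sketch of the builder chain with the suggested setter added.
val fmr = new FMRegressor()
  .setFactorSize(factorSize)
  .setFitLinear(fitLinear)
  .setRegParam(regParam)
  .setMiniBatchFraction(miniBatchFraction)
  .setInitStd(initStd)
  .setMaxIter(maxIter)
  .setStepSize(stepSize)   // suggested addition: forward the stepSize argument
  .setTol(tol)
  .setSolver(solver)
  .setFitIntercept(fitIntercept)
  .setFeaturesCol(rFormula.getFeaturesCol)
```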
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR URL: https://github.com/apache/spark/pull/27571#discussion_r379882476 ## File path: R/pkg/tests/fulltests/test_mllib_regression.R ## @@ -551,4 +551,33 @@ test_that("spark.survreg", { } }) + +test_that("spark.fmRegressor", { + df <- suppressWarnings(createDataFrame(iris)) + + model <- spark.fmRegressor( +df, Sepal_Width ~ ., +regParam = 0.01, maxIter = 10, fitLinear = TRUE + ) + + prediction1 <- predict(model, df) + expect_is(prediction1, "SparkDataFrame") Review comment: I guess we may want to check the predict result too? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR URL: https://github.com/apache/spark/pull/27571#discussion_r379882343 ## File path: R/pkg/R/mllib_regression.R ## @@ -540,3 +546,150 @@ setMethod("write.ml", signature(object = "AFTSurvivalRegressionModel", path = "c function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + + +#' Factorization Machines Regression Model Model +#' +#' \code{spark.fmRegressor} fits a factorization regression model against a SparkDataFrame. +#' Users can call \code{predict} to make +#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models. +#' +#' @param data a \code{SparkDataFrame} of observations and labels for model fitting. +#' @param formula a symbolic description of the model to be fitted. Currently only a few formula +#'operators are supported, including '~', '.', ':', '+', and '-'. +#' @param factorSize dimensionality of the factors. +#' @param fitLinear whether to fit linear term. # TODO Can we express this with formula? +#' @param regParam the regularization parameter. +#' @param miniBatchFraction the mini-batch fraction parameter. +#' @param initStd the standard deviation of initial coefficients. +#' @param maxIter maximum iteration number. +#' @param stepSize stepSize parameter. +#' @param tol convergence tolerance of iterations. +#' @param solver solver parameter, supported options: "gd" (minibatch gradient descent) or "adamW". +#' @param seed seed parameter for weights initialization. +#' @param stringIndexerOrderType how to order categories of a string feature column. This is used to +#' decide the base level of a string feature as the last category +#' after ordering is dropped when encoding strings. Supported options +#' are "frequencyDesc", "frequencyAsc", "alphabetDesc", and +#' "alphabetAsc". The default value is "frequencyDesc". When the +#' ordering is set to "alphabetDesc", this drops the same category +#' as R when encoding strings. +#' @param ... additional arguments passed to the method. +#' @return \code{spark.fmRegressor} returns a fitted Factorization Machines Regression Model. 
+#' +#' @rdname spark.fmRegressor +#' @aliases spark.fmRegressor,SparkDataFrame,formula-method +#' @name spark.fmRegressor +#' @seealso \link{read.ml} +#' @examples +#' \dontrun{ +#' df <- read.df("data/mllib/sample_linear_regression_data.txt", source = "libsvm") +#' +#' # fit Factorization Machines Regression Model +#' model <- spark.fmRegressor( +#'df, label ~ features, +#'regParam = 0.01, maxIter = 10, fitLinear = TRUE +#' ) +#' +#' # get the summary of the model +#' summary(model) +#' +#' # make predictions +#' predictions <- predict(model, df) +#' +#' # save and load the model +#' path <- "path/to/model" +#' write.ml(model, path) +#' savedModel <- read.ml(path) +#' summary(savedModel) +#' } +#' @note spark.fmRegressor since 3.1.0 +setMethod("spark.fmRegressor", signature(data = "SparkDataFrame", formula = "formula"), + function(data, formula, factorSize = 8, fitLinear = TRUE, regParam = 0.0, + miniBatchFraction = 1.0, initStd = 0.01, maxIter = 100, stepSize=1.0, + tol = 1e-6, solver = c("adamW", "gd"), seed = NULL, + stringIndexerOrderType = c("frequencyDesc", "frequencyAsc", + "alphabetDesc", "alphabetAsc")) { + +formula <- paste(deparse(formula), collapse = "") + +if (!is.null(seed)) { + seed <- as.character(as.integer(seed)) +} + +solver <- match.arg(solver) +stringIndexerOrderType <- match.arg(stringIndexerOrderType) + +jobj <- callJStatic("org.apache.spark.ml.r.FMRegressorWrapper", +"fit", +data@sdf, +formula, +as.integer(factorSize), +as.logical(fitLinear), +as.numeric(regParam), +as.numeric(miniBatchFraction), +as.numeric(initStd), +as.integer(maxIter), +as.numeric(stepSize), +as.numeric(tol), +solver, +seed, +stringIndexerOrderType) +new("FMRegressionModel", jobj = jobj) + }) + + Review comment: nit: delete extra line?
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR URL: https://github.com/apache/spark/pull/27571#discussion_r379882358 ## File path: R/pkg/R/mllib_regression.R ## @@ -540,3 +546,150 @@ setMethod("write.ml", signature(object = "AFTSurvivalRegressionModel", path = "c function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + + +#' Factorization Machines Regression Model Model +#' +#' \code{spark.fmRegressor} fits a factorization regression model against a SparkDataFrame. +#' Users can call \code{predict} to make +#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models. +#' +#' @param data a \code{SparkDataFrame} of observations and labels for model fitting. +#' @param formula a symbolic description of the model to be fitted. Currently only a few formula +#'operators are supported, including '~', '.', ':', '+', and '-'. +#' @param factorSize dimensionality of the factors. +#' @param fitLinear whether to fit linear term. # TODO Can we express this with formula? +#' @param regParam the regularization parameter. +#' @param miniBatchFraction the mini-batch fraction parameter. +#' @param initStd the standard deviation of initial coefficients. +#' @param maxIter maximum iteration number. +#' @param stepSize stepSize parameter. +#' @param tol convergence tolerance of iterations. +#' @param solver solver parameter, supported options: "gd" (minibatch gradient descent) or "adamW". +#' @param seed seed parameter for weights initialization. +#' @param stringIndexerOrderType how to order categories of a string feature column. This is used to +#' decide the base level of a string feature as the last category +#' after ordering is dropped when encoding strings. Supported options +#' are "frequencyDesc", "frequencyAsc", "alphabetDesc", and +#' "alphabetAsc". The default value is "frequencyDesc". When the +#' ordering is set to "alphabetDesc", this drops the same category +#' as R when encoding strings. +#' @param ... additional arguments passed to the method. +#' @return \code{spark.fmRegressor} returns a fitted Factorization Machines Regression Model. 
+#' +#' @rdname spark.fmRegressor +#' @aliases spark.fmRegressor,SparkDataFrame,formula-method +#' @name spark.fmRegressor +#' @seealso \link{read.ml} +#' @examples +#' \dontrun{ +#' df <- read.df("data/mllib/sample_linear_regression_data.txt", source = "libsvm") +#' +#' # fit Factorization Machines Regression Model +#' model <- spark.fmRegressor( +#'df, label ~ features, +#'regParam = 0.01, maxIter = 10, fitLinear = TRUE +#' ) +#' +#' # get the summary of the model +#' summary(model) +#' +#' # make predictions +#' predictions <- predict(model, df) +#' +#' # save and load the model +#' path <- "path/to/model" +#' write.ml(model, path) +#' savedModel <- read.ml(path) +#' summary(savedModel) +#' } +#' @note spark.fmRegressor since 3.1.0 +setMethod("spark.fmRegressor", signature(data = "SparkDataFrame", formula = "formula"), + function(data, formula, factorSize = 8, fitLinear = TRUE, regParam = 0.0, + miniBatchFraction = 1.0, initStd = 0.01, maxIter = 100, stepSize=1.0, + tol = 1e-6, solver = c("adamW", "gd"), seed = NULL, + stringIndexerOrderType = c("frequencyDesc", "frequencyAsc", + "alphabetDesc", "alphabetAsc")) { + +formula <- paste(deparse(formula), collapse = "") + +if (!is.null(seed)) { + seed <- as.character(as.integer(seed)) +} + +solver <- match.arg(solver) +stringIndexerOrderType <- match.arg(stringIndexerOrderType) + +jobj <- callJStatic("org.apache.spark.ml.r.FMRegressorWrapper", +"fit", +data@sdf, +formula, +as.integer(factorSize), +as.logical(fitLinear), +as.numeric(regParam), +as.numeric(miniBatchFraction), +as.numeric(initStd), +as.integer(maxIter), +as.numeric(stepSize), +as.numeric(tol), +solver, +seed, +stringIndexerOrderType) +new("FMRegressionModel", jobj = jobj) + }) + + +# Returns the summary of a FM Regression model produced by \code{spark.fmRegressor} + +#' @param
[GitHub] [spark] maropu commented on issue #27592: [SPARK-30840][CORE][SQL] Add version property for ConfigEntry and ConfigBuilder
maropu commented on issue #27592: [SPARK-30840][CORE][SQL] Add version property for ConfigEntry and ConfigBuilder URL: https://github.com/apache/spark/pull/27592#issuecomment-586677152 For reviewers, can you add a screenshot of an HTML document generated by `gen-sql-config-docs.py` in the PR description? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586676866 **[Test build #118493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118493/testReport)** for PR 27565 at commit [`61f7ca1`](https://github.com/apache/spark/commit/61f7ca11af14d399d0e2512c51c2f37c4aa4a38f). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a change in pull request #27596: [WIP] Fix getting of time components before 1582 year
MaxGekk commented on a change in pull request #27596: [WIP] Fix getting of time components before 1582 year URL: https://github.com/apache/spark/pull/27596#discussion_r379881509 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala ## @@ -290,32 +293,38 @@ class DateTimeUtilsSuite extends SparkFunSuite with Matchers with SQLHelper { } test("hours") { -var input = date(2015, 3, 18, 13, 2, 11, 0, TimeZonePST) -assert(getHours(input, TimeZonePST) === 13) -assert(getHours(input, TimeZoneGMT) === 20) -input = date(2015, 12, 8, 2, 7, 9, 0, TimeZonePST) -assert(getHours(input, TimeZonePST) === 2) -assert(getHours(input, TimeZoneGMT) === 10) +var input = date(2015, 3, 18, 13, 2, 11, 0, zonePST) +assert(getHours(input, zonePST) === 13) +assert(getHours(input, zoneGMT) === 20) +input = date(2015, 12, 8, 2, 7, 9, 0, zonePST) +assert(getHours(input, zonePST) === 2) +assert(getHours(input, zoneGMT) === 10) +input = date(10, 1, 1, 0, 0, 0, 0, zonePST) +assert(getHours(input, zonePST) === 0) Review comment: This is new test. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu commented on a change in pull request #27592: [SPARK-30840][CORE][SQL] Add version property for ConfigEntry and ConfigBuilder
maropu commented on a change in pull request #27592: [SPARK-30840][CORE][SQL] Add version property for ConfigEntry and ConfigBuilder URL: https://github.com/apache/spark/pull/27592#discussion_r379881204 ## File path: core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala ## @@ -74,7 +76,8 @@ private[spark] abstract class ConfigEntry[T] ( def defaultValue: Option[T] = None override def toString: String = { -s"ConfigEntry(key=$key, defaultValue=$defaultValueString, doc=$doc, public=$isPublic)" +s"ConfigEntry(key=$key, defaultValue=$defaultValueString, doc=$doc, " + + s"public=$isPublic, version = $version)" Review comment: nit: `version=$version` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
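The nit concerns only the spacing inside the interpolated string; with it applied, the quoted `toString` would read roughly as in this sketch:

```scala
// Sketch with the nit applied: "version=$version" instead of "version = $version".
override def toString: String = {
  s"ConfigEntry(key=$key, defaultValue=$defaultValueString, doc=$doc, " +
    s"public=$isPublic, version=$version)"
}
```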
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586676945 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586676946 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23249/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#discussion_r379882384 ## File path: python/pyspark/sql/dataframe.py ## @@ -2153,6 +2153,59 @@ def transform(self, func): "should have been DataFrame." % type(result) return result +@since(3.1) +def sameSemantics(self, other): +""" +Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and +therefore return same results. + +.. note:: The equality comparison here is simplified by tolerating the cosmetic differences +such as attribute names. + +.. note::This API can compare both :class:`DataFrame`\\s very fast but can still return +`False` on the :class:`DataFrame` that return the same results, for instance, from +different plans. Such false negative semantic can be useful when caching as an example. + +>>> df1 = spark.range(100) +>>> df2 = spark.range(100) +>>> df3 = spark.range(100) +>>> df4 = spark.range(100) +>>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2)) +True +>>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2)) +False +>>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2)) +True +""" +if not isinstance(other, DataFrame): +raise ValueError("other parameter should be of DataFrame; however, got %s" + % type(other)) +return self._jdf.sameSemantics(other._jdf) + +@since(3.1) +def semanticHash(self): +""" +Returns a hash code of the logical query plan against this :class:`DataFrame`. + +.. note:: Unlike the standard hash code, the hash is calculated against the query plan +simplified by tolerating the cosmetic differences such as attribute names. + +>>> df1 = spark.range(100) +>>> df2 = spark.range(100) +>>> df3 = spark.range(100) +>>> df4 = spark.range(100) +>>> df1.withColumn("col1", df1.id * 2).semanticHash() == \ +df2.withColumn("col1", df2.id * 2).semanticHash() +True +>>> df1.withColumn("col1", df1.id * 2).semanticHash() == \ +df3.withColumn("col1", df3.id + 2).semanticHash() +False Review comment: Same behavior for dataframe from `spark.read.load()` ``` >>> df4=spark.read.load(csv_file_path, format="csv", inferSchema="true", header="true") >>> df4.schema StructType(List(StructField(bool_col,BooleanType,true),StructField(float_col,DoubleType,true),StructField(double_col,DoubleType,true),StructField(int_col,IntegerType,true),StructField(long_col,IntegerType,true))) >>> df4.withColumn("col1", df4.int_col *2).semanticHash() -1746346451 >>> df4.withColumn("col1", df4.int_col +2).semanticHash() -1746346451 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a change in pull request #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
HeartSaVioR commented on a change in pull request #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#discussion_r379874303 ## File path: docs/monitoring.md ## @@ -95,6 +95,44 @@ The history server can be configured as follows: +### Applying compaction of old event log files + +A long-running streaming application can bring a huge single event log file which may cost a lot to maintain and +also requires a bunch of resource to replay per each update in Spark History Server. + +Enabling spark.eventLog.rolling.enabled and spark.eventLog.rolling.maxFileSize would +let you have multiple event log files instead of single huge event log file which may help some scenarios on its own, +but it still doesn't help you reducing the overall size of logs. + +Spark History Server can apply 'compaction' on the rolling event log files to reduce the overall size of +logs, via setting the configuration spark.history.fs.eventLog.rolling.maxFilesToRetain on the +Spark History Server. + +When the compaction happens, History Server lists all the available event log files, and considers the event log files older than +retained log files as a target of compaction. For example, if the application A has 5 event log files and +spark.history.fs.eventLog.rolling.maxFilesToRetain is set to 2, first 3 log files will be selected to be compacted. + +Once it selects the files, it analyzes these files to figure out which events can be excluded, and rewrites these files Review comment: I'll try to rephrase - maybe we can refer as "target" or "candidates" instead of "files". This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
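Since the quoted monitoring.md text only names the configuration keys, here is a hedged sketch of how an application might enable rolling event logs so that compaction has files to work on; the values are illustrative only, and the history-server setting is shown as a comment because it belongs to the History Server's configuration rather than the application's.

```scala
import org.apache.spark.sql.SparkSession

// Application side: write rolling event log files instead of one huge file.
// Illustrative values, not recommendations.
val spark = SparkSession.builder()
  .appName("rolling-event-log-sketch")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.rolling.enabled", "true")
  .config("spark.eventLog.rolling.maxFileSize", "128m")
  .getOrCreate()

// History Server side (e.g. in its spark-defaults.conf), mirroring the doc's example
// where only the 2 newest event log files are kept as-is and older ones are compacted:
//   spark.history.fs.eventLog.rolling.maxFilesToRetain  2
```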
[GitHub] [spark] HeartSaVioR commented on a change in pull request #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
HeartSaVioR commented on a change in pull request #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#discussion_r379874393 ## File path: docs/monitoring.md ## @@ -95,6 +95,44 @@ The history server can be configured as follows: +### Applying compaction of old event log files + +A long-running streaming application can bring a huge single event log file which may cost a lot to maintain and +also requires a bunch of resource to replay per each update in Spark History Server. + +Enabling spark.eventLog.rolling.enabled and spark.eventLog.rolling.maxFileSize would +let you have multiple event log files instead of single huge event log file which may help some scenarios on its own, +but it still doesn't help you reducing the overall size of logs. + +Spark History Server can apply 'compaction' on the rolling event log files to reduce the overall size of +logs, via setting the configuration spark.history.fs.eventLog.rolling.maxFilesToRetain on the +Spark History Server. + +When the compaction happens, History Server lists all the available event log files, and considers the event log files older than +retained log files as a target of compaction. For example, if the application A has 5 event log files and +spark.history.fs.eventLog.rolling.maxFilesToRetain is set to 2, first 3 log files will be selected to be compacted. + +Once it selects the files, it analyzes these files to figure out which events can be excluded, and rewrites these files +into one compact file with discarding some events. Once rewriting is done, original log files will be deleted. Review comment: Yeah it wouldn't matter for the logic as listing event log files would take the "last" compact file, and the right side of event log files. And I'd agree to worth to mention the deletion is best effort. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a change in pull request #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
HeartSaVioR commented on a change in pull request #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#discussion_r379874393 ## File path: docs/monitoring.md ## @@ -95,6 +95,44 @@ The history server can be configured as follows: +### Applying compaction of old event log files + +A long-running streaming application can bring a huge single event log file which may cost a lot to maintain and +also requires a bunch of resource to replay per each update in Spark History Server. + +Enabling spark.eventLog.rolling.enabled and spark.eventLog.rolling.maxFileSize would +let you have multiple event log files instead of single huge event log file which may help some scenarios on its own, +but it still doesn't help you reducing the overall size of logs. + +Spark History Server can apply 'compaction' on the rolling event log files to reduce the overall size of +logs, via setting the configuration spark.history.fs.eventLog.rolling.maxFilesToRetain on the +Spark History Server. + +When the compaction happens, History Server lists all the available event log files, and considers the event log files older than +retained log files as a target of compaction. For example, if the application A has 5 event log files and +spark.history.fs.eventLog.rolling.maxFilesToRetain is set to 2, first 3 log files will be selected to be compacted. + +Once it selects the files, it analyzes these files to figure out which events can be excluded, and rewrites these files +into one compact file with discarding some events. Once rewriting is done, original log files will be deleted. Review comment: Yeah it wouldn't matter for the entire logic as listing event log files would take the "last" compact file, and the right side of event log files. But we don't retry deleting them. I'd agree to worth to mention the deletion is best effort. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a change in pull request #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
HeartSaVioR commented on a change in pull request #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#discussion_r379874443 ## File path: docs/monitoring.md ## @@ -95,6 +95,44 @@ The history server can be configured as follows: +### Applying compaction of old event log files + +A long-running streaming application can bring a huge single event log file which may cost a lot to maintain and +also requires a bunch of resource to replay per each update in Spark History Server. + +Enabling spark.eventLog.rolling.enabled and spark.eventLog.rolling.maxFileSize would +let you have multiple event log files instead of single huge event log file which may help some scenarios on its own, +but it still doesn't help you reducing the overall size of logs. + +Spark History Server can apply 'compaction' on the rolling event log files to reduce the overall size of +logs, via setting the configuration spark.history.fs.eventLog.rolling.maxFilesToRetain on the +Spark History Server. + +When the compaction happens, History Server lists all the available event log files, and considers the event log files older than +retained log files as a target of compaction. For example, if the application A has 5 event log files and +spark.history.fs.eventLog.rolling.maxFilesToRetain is set to 2, first 3 log files will be selected to be compacted. + +Once it selects the files, it analyzes these files to figure out which events can be excluded, and rewrites these files +into one compact file with discarding some events. Once rewriting is done, original log files will be deleted. + +The compaction tries to exclude the events which point to the outdated things like jobs, and so on. As of now, below describes +the candidates of events to be excluded: + +* Events for the job which is finished, and related stage/tasks events +* Events for the executor which is terminated +* Events for the SQL execution which is finished, and related job/stage/tasks events + +but the details can be changed afterwards. Review comment: OK agreed. Once we change the logic we may just need to change here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a change in pull request #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
HeartSaVioR commented on a change in pull request #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#discussion_r379874489 ## File path: docs/monitoring.md ## @@ -95,6 +95,44 @@ The history server can be configured as follows: +### Applying compaction of old event log files + +A long-running streaming application can bring a huge single event log file which may cost a lot to maintain and +also requires a bunch of resource to replay per each update in Spark History Server. + +Enabling spark.eventLog.rolling.enabled and spark.eventLog.rolling.maxFileSize would +let you have multiple event log files instead of single huge event log file which may help some scenarios on its own, +but it still doesn't help you reducing the overall size of logs. + +Spark History Server can apply 'compaction' on the rolling event log files to reduce the overall size of +logs, via setting the configuration spark.history.fs.eventLog.rolling.maxFilesToRetain on the +Spark History Server. + +When the compaction happens, History Server lists all the available event log files, and considers the event log files older than +retained log files as a target of compaction. For example, if the application A has 5 event log files and +spark.history.fs.eventLog.rolling.maxFilesToRetain is set to 2, first 3 log files will be selected to be compacted. + +Once it selects the files, it analyzes these files to figure out which events can be excluded, and rewrites these files +into one compact file with discarding some events. Once rewriting is done, original log files will be deleted. + +The compaction tries to exclude the events which point to the outdated things like jobs, and so on. As of now, below describes +the candidates of events to be excluded: + +* Events for the job which is finished, and related stage/tasks events +* Events for the executor which is terminated +* Events for the SQL execution which is finished, and related job/stage/tasks events + +but the details can be changed afterwards. Review comment: I thought the effect is intuitive as we "exclude" events during rewriting, but if explicitly mentioning would make it clearer, let's do it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a change in pull request #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
HeartSaVioR commented on a change in pull request #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#discussion_r379874721 ## File path: docs/monitoring.md ## @@ -95,6 +95,44 @@ The history server can be configured as follows: +### Applying compaction of old event log files + +A long-running streaming application can bring a huge single event log file which may cost a lot to maintain and +also requires a bunch of resource to replay per each update in Spark History Server. + +Enabling spark.eventLog.rolling.enabled and spark.eventLog.rolling.maxFileSize would +let you have multiple event log files instead of single huge event log file which may help some scenarios on its own, +but it still doesn't help you reducing the overall size of logs. + +Spark History Server can apply 'compaction' on the rolling event log files to reduce the overall size of +logs, via setting the configuration spark.history.fs.eventLog.rolling.maxFilesToRetain on the +Spark History Server. + +When the compaction happens, History Server lists all the available event log files, and considers the event log files older than +retained log files as a target of compaction. For example, if the application A has 5 event log files and +spark.history.fs.eventLog.rolling.maxFilesToRetain is set to 2, first 3 log files will be selected to be compacted. + +Once it selects the files, it analyzes these files to figure out which events can be excluded, and rewrites these files +into one compact file with discarding some events. Once rewriting is done, original log files will be deleted. + +The compaction tries to exclude the events which point to the outdated things like jobs, and so on. As of now, below describes +the candidates of events to be excluded: + +* Events for the job which is finished, and related stage/tasks events +* Events for the executor which is terminated +* Events for the SQL execution which is finished, and related job/stage/tasks events + +but the details can be changed afterwards. + +Please note that Spark History Server may not compact the old event log files if figures out not a lot of space Review comment: We already described which events are the candidates in above, so it's saying "Please note that Spark History Server may not compact the old event log files if figures out not a lot of space would be reduced during compaction because these event log files majorly fill with running jobs or SQL executions." Does it answer your question? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a change in pull request #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
HeartSaVioR commented on a change in pull request #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#discussion_r379874734 ## File path: docs/monitoring.md ## @@ -95,6 +95,44 @@ The history server can be configured as follows: +### Applying compaction of old event log files + +A long-running streaming application can bring a huge single event log file which may cost a lot to maintain and +also requires a bunch of resource to replay per each update in Spark History Server. + +Enabling spark.eventLog.rolling.enabled and spark.eventLog.rolling.maxFileSize would +let you have multiple event log files instead of single huge event log file which may help some scenarios on its own, +but it still doesn't help you reducing the overall size of logs. + +Spark History Server can apply 'compaction' on the rolling event log files to reduce the overall size of +logs, via setting the configuration spark.history.fs.eventLog.rolling.maxFilesToRetain on the +Spark History Server. + +When the compaction happens, History Server lists all the available event log files, and considers the event log files older than +retained log files as a target of compaction. For example, if the application A has 5 event log files and +spark.history.fs.eventLog.rolling.maxFilesToRetain is set to 2, first 3 log files will be selected to be compacted. + +Once it selects the files, it analyzes these files to figure out which events can be excluded, and rewrites these files +into one compact file with discarding some events. Once rewriting is done, original log files will be deleted. + +The compaction tries to exclude the events which point to the outdated things like jobs, and so on. As of now, below describes +the candidates of events to be excluded: + +* Events for the job which is finished, and related stage/tasks events +* Events for the executor which is terminated +* Events for the SQL execution which is finished, and related job/stage/tasks events + +but the details can be changed afterwards. + +Please note that Spark History Server may not compact the old event log files if figures out not a lot of space +would be reduced during compaction. For streaming query (including Structured Streaming) we normally expect compaction Review comment: OK looks redundant. Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] WinkerDu commented on a change in pull request #26971: [SPARK-30320][SQL] Fix insert overwrite to DataSource table with dynamic partition error
WinkerDu commented on a change in pull request #26971: [SPARK-30320][SQL] Fix insert overwrite to DataSource table with dynamic partition error URL: https://github.com/apache/spark/pull/26971#discussion_r379874760 ## File path: core/src/main/scala/org/apache/spark/internal/config/package.scala ## @@ -1521,4 +1521,10 @@ package object config { .bytesConf(ByteUnit.BYTE) .createOptional + private[spark] val MAX_LOCAL_TASK_FAILURES = ConfigBuilder("spark.task.local.maxFailures") +.doc("The max failure times for a task while SparkContext running in Local mode, " + Review comment: In the UT class InsertWithMultipleTaskAttemptSuite, I don't expect speculative tasks to be launched in local mode. Actually, I made a customized commit protocol named "InsertExceptionCommitProtocol" in InsertWithMultipleTaskAttemptSuite, which overrides the commitTask method to fail the first task commit on purpose and then behaves normally for subsequent commits. This scenario is similar to what happens when a speculative task attempt fails. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
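To make the description above concrete, a hedged, self-contained sketch of what such a test-only commit protocol might look like follows. The class name comes from the comment, but the base class choice and the one-shot flag are assumptions for the sketch, not the PR's actual code.

```scala
import java.util.concurrent.atomic.AtomicBoolean

import org.apache.hadoop.mapreduce.TaskAttemptContext

import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage
import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol

// Hypothetical sketch: fail the very first task commit on purpose, then behave
// normally, imitating a failed task attempt followed by a successful retry.
class InsertExceptionCommitProtocol(jobId: String, path: String)
  extends HadoopMapReduceCommitProtocol(jobId, path) {

  override def commitTask(taskContext: TaskAttemptContext): TaskCommitMessage = {
    if (InsertExceptionCommitProtocol.failNextCommit.compareAndSet(true, false)) {
      throw new RuntimeException("injected failure on the first task commit")
    }
    super.commitTask(taskContext)
  }
}

object InsertExceptionCommitProtocol {
  // One-shot flag; usable in local mode because driver and executors share a JVM.
  val failNextCommit = new AtomicBoolean(true)
}
```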
[GitHub] [spark] SparkQA commented on issue #27596: [WIP] Fix getting of time components before 1582 year
SparkQA commented on issue #27596: [WIP] Fix getting of time components before 1582 year URL: https://github.com/apache/spark/pull/27596#issuecomment-586669414 **[Test build #118483 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118483/testReport)** for PR 27596 at commit [`4759725`](https://github.com/apache/spark/commit/475972516fc2fe50496d480eab334a29e05c50bd). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #27596: [WIP] Fix getting of time components before 1582 year
SparkQA removed a comment on issue #27596: [WIP] Fix getting of time components before 1582 year URL: https://github.com/apache/spark/pull/27596#issuecomment-586657190 **[Test build #118483 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118483/testReport)** for PR 27596 at commit [`4759725`](https://github.com/apache/spark/commit/475972516fc2fe50496d480eab334a29e05c50bd). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27596: [WIP] Fix getting of time components before 1582 year
AmplabJenkins removed a comment on issue #27596: [WIP] Fix getting of time components before 1582 year URL: https://github.com/apache/spark/pull/27596#issuecomment-586669521 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27596: [WIP] Fix getting of time components before 1582 year
AmplabJenkins commented on issue #27596: [WIP] Fix getting of time components before 1582 year URL: https://github.com/apache/spark/pull/27596#issuecomment-586669521 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27596: [WIP] Fix getting of time components before 1582 year
AmplabJenkins commented on issue #27596: [WIP] Fix getting of time components before 1582 year URL: https://github.com/apache/spark/pull/27596#issuecomment-586669523 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118483/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27596: [WIP] Fix getting of time components before 1582 year
AmplabJenkins removed a comment on issue #27596: [WIP] Fix getting of time components before 1582 year URL: https://github.com/apache/spark/pull/27596#issuecomment-586669523 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118483/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586669868 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23244/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586669867 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586669868 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23244/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586669867 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #27429: [SPARK-28330][SQL] Support ANSI SQL: result offset clause in query expression
SparkQA commented on issue #27429: [SPARK-28330][SQL] Support ANSI SQL: result offset clause in query expression URL: https://github.com/apache/spark/pull/27429#issuecomment-586670026 **[Test build #118486 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118486/testReport)** for PR 27429 at commit [`14ca8d3`](https://github.com/apache/spark/commit/14ca8d326fcc5035da5b7a99772aacff8edad163). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` case class WorkerDecommission(` * ` case class DecommissionExecutor(executorId: String) extends CoarseGrainedClusterMessage` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
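For context, SPARK-28330 is about letting a SQL query skip a fixed number of leading result rows. A minimal sketch of the kind of query the feature targets, assuming the syntax lands as `LIMIT ... OFFSET ...` on top of Spark SQL (the table and data below are made up, and the query only works on a build that already includes the feature):

```scala
import org.apache.spark.sql.SparkSession

object OffsetClauseSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("offset-clause-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical table: ten rows with an increasing id.
    (0 until 10).toDF("id").createOrReplaceTempView("t")

    // Skip the first 3 rows of the ordered result, then return the next 4.
    // The OFFSET clause itself is what this PR proposes, so the query assumes
    // the feature is available in the build under test.
    spark.sql("SELECT id FROM t ORDER BY id LIMIT 4 OFFSET 3").show()

    spark.stop()
  }
}
```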
[GitHub] [spark] SparkQA commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
SparkQA commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586670141 **[Test build #118487 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118487/testReport)** for PR 27398 at commit [`841d4d0`](https://github.com/apache/spark/commit/841d4d0c2de6fae1d8cc9dbaee424079630a0540). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27429: [SPARK-28330][SQL] Support ANSI SQL: result offset clause in query expression
AmplabJenkins removed a comment on issue #27429: [SPARK-28330][SQL] Support ANSI SQL: result offset clause in query expression URL: https://github.com/apache/spark/pull/27429#issuecomment-586670154 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27429: [SPARK-28330][SQL] Support ANSI SQL: result offset clause in query expression
AmplabJenkins commented on issue #27429: [SPARK-28330][SQL] Support ANSI SQL: result offset clause in query expression URL: https://github.com/apache/spark/pull/27429#issuecomment-586670157 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118486/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27429: [SPARK-28330][SQL] Support ANSI SQL: result offset clause in query expression
AmplabJenkins removed a comment on issue #27429: [SPARK-28330][SQL] Support ANSI SQL: result offset clause in query expression URL: https://github.com/apache/spark/pull/27429#issuecomment-586670157 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118486/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #27429: [SPARK-28330][SQL] Support ANSI SQL: result offset clause in query expression
SparkQA removed a comment on issue #27429: [SPARK-28330][SQL] Support ANSI SQL: result offset clause in query expression URL: https://github.com/apache/spark/pull/27429#issuecomment-586657759 **[Test build #118486 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118486/testReport)** for PR 27429 at commit [`14ca8d3`](https://github.com/apache/spark/commit/14ca8d326fcc5035da5b7a99772aacff8edad163). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27429: [SPARK-28330][SQL] Support ANSI SQL: result offset clause in query expression
AmplabJenkins commented on issue #27429: [SPARK-28330][SQL] Support ANSI SQL: result offset clause in query expression URL: https://github.com/apache/spark/pull/27429#issuecomment-586670154 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
HeartSaVioR commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586671121 @dongjoon-hyun @tgravescs @gaborgsomogyi I've addressed the review comments and updated the PR description. Since the comments are not only about typos or syntax, I guess my changes may not be enough. Please take a second look and provide suggestions. Thanks in advance! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586670958 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118487/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586670956 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
SparkQA commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586670941 **[Test build #118487 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118487/testReport)** for PR 27398 at commit [`841d4d0`](https://github.com/apache/spark/commit/841d4d0c2de6fae1d8cc9dbaee424079630a0540). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586670958 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118487/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
SparkQA removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586670141 **[Test build #118487 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118487/testReport)** for PR 27398 at commit [`841d4d0`](https://github.com/apache/spark/commit/841d4d0c2de6fae1d8cc9dbaee424079630a0540). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586670956 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] kiszk commented on issue #27577: [DOC] add config naming guideline
kiszk commented on issue #27577: [DOC] add config naming guideline URL: https://github.com/apache/spark/pull/27577#issuecomment-586671458 Beyond this PR, can we create a sanity checker? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
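A checker along those lines could start as a simple pattern check over proposed config keys. A minimal sketch, assuming the guideline reduces to dot-separated camelCase segments under the `spark.` prefix (the rules encoded here are illustrative, not the actual guideline text):

```scala
object ConfigNameChecker {
  // Illustrative rule only: dot-separated camelCase segments under "spark.".
  private val NamePattern = "spark\\.([a-z][a-zA-Z0-9]*\\.)*[a-z][a-zA-Z0-9]*"

  def violations(key: String): Seq[String] = {
    val errors = scala.collection.mutable.ArrayBuffer.empty[String]
    if (!key.matches(NamePattern))
      errors += s"'$key' does not match the spark.<segment>.<camelCaseSegment> pattern"
    if (key.contains("_") || key.contains("-"))
      errors += s"'$key' uses '_' or '-'; prefer camelCase segments"
    errors.toSeq
  }

  def main(args: Array[String]): Unit = {
    Seq("spark.sql.ansi.enabled", "spark.sql.my_new_flag").foreach { key =>
      val errs = violations(key)
      if (errs.isEmpty) println(s"OK    $key") else errs.foreach(e => println(s"WARN  $e"))
    }
  }
}
```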
[GitHub] [spark] AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586671326 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23245/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
SparkQA commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586671228 **[Test build #118488 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118488/testReport)** for PR 27398 at commit [`803663f`](https://github.com/apache/spark/commit/803663fd3e8f6d73ee5731f0f5a0228e4f65d776). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586671325 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586671326 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23245/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586671325 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586671777 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118488/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
SparkQA commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586671757 **[Test build #118488 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118488/testReport)** for PR 27398 at commit [`803663f`](https://github.com/apache/spark/commit/803663fd3e8f6d73ee5731f0f5a0228e4f65d776). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586671777 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118488/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
SparkQA removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586671228 **[Test build #118488 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118488/testReport)** for PR 27398 at commit [`803663f`](https://github.com/apache/spark/commit/803663fd3e8f6d73ee5731f0f5a0228e4f65d776). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586671776 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586671776 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586674848 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586674848 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586674850 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23246/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586674850 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23246/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586674793 **[Test build #118489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118489/testReport)** for PR 27565 at commit [`a1d4ba1`](https://github.com/apache/spark/commit/a1d4ba1f33c81435da84cbeee3c7e579e5dd8061). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
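As background, the methods this PR proposes expose the analyzer's notion of plan equivalence: two Datasets are "semantically equal" when their analyzed plans are the same up to cosmetic differences such as attribute ids. A minimal usage sketch, assuming the API lands on `Dataset` as `sameSemantics` and `semanticHash` (the latter presumably the intended spelling of the title's "sementicHash"):

```scala
import org.apache.spark.sql.SparkSession

object SameSemanticsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("same-semantics-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

    // The same logical transformation written twice; the analyzed plans should
    // be equivalent even though the Dataset objects differ.
    val q1 = df.filter($"id" > 0).select($"name")
    val q2 = df.filter($"id" > 0).select($"name")

    // Proposed API (assumed signatures): sameSemantics returns Boolean,
    // semanticHash returns Int.
    println(q1.sameSemantics(q2))
    println(q1.semanticHash() == q2.semanticHash())

    spark.stop()
  }
}
```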
[GitHub] [spark] MaxGekk commented on a change in pull request #27596: [WIP] Fix getting of time components before 1582 year
MaxGekk commented on a change in pull request #27596: [WIP] Fix getting of time components before 1582 year URL: https://github.com/apache/spark/pull/27596#discussion_r379880203 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala ## @@ -51,7 +51,6 @@ trait TimeZoneAwareExpression extends Expression { /** Returns a copy of this expression with the specified timeZoneId. */ def withTimeZone(timeZoneId: String): TimeZoneAwareExpression - @transient lazy val timeZone: TimeZone = DateTimeUtils.getTimeZone(timeZoneId.get) Review comment: Finally, all expressions bound to the legacy `TimeZone` are gone. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
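The removed line bound the expression to the legacy `java.util.TimeZone` API; the direction of the change, as the comment suggests, is to resolve time zones through `java.time.ZoneId` instead. A minimal sketch of a ZoneId-based binding, with the trait and members simplified (these names are illustrative, not the exact Spark code):

```scala
import java.time.ZoneId
import java.util.TimeZone

// Simplified stand-in for a time-zone-aware expression: it carries an optional
// time zone id and resolves it lazily to a java.time.ZoneId.
trait ZoneIdAwareSketch {
  def timeZoneId: Option[String]

  // Resolves names like "UTC", "Europe/Berlin", or legacy short ids such as "PST".
  @transient lazy val zoneId: ZoneId =
    ZoneId.of(timeZoneId.get, ZoneId.SHORT_IDS)

  // If a caller still needs the legacy type, it can be derived on demand
  // instead of being cached as a separate lazy val.
  def legacyTimeZone: TimeZone = TimeZone.getTimeZone(zoneId)
}

object ZoneIdAwareSketchDemo extends App {
  val expr = new ZoneIdAwareSketch { val timeZoneId = Some("Europe/Berlin") }
  println(expr.zoneId)         // Europe/Berlin
  println(expr.legacyTimeZone) // legacy TimeZone for the same zone
}
```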
[GitHub] [spark] SparkQA commented on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay
SparkQA commented on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay URL: https://github.com/apache/spark/pull/24936#issuecomment-586675997 **[Test build #118490 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118490/testReport)** for PR 24936 at commit [`f99a528`](https://github.com/apache/spark/commit/f99a528fafa9f59a37de8fd7b824c4168d7f4e26). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
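As background for the proposed metric: in Structured Streaming the watermark is derived from the maximum observed event time minus the allowed delay, and rows whose event time falls behind the watermark are treated as too late. A minimal sketch of counting such rows over a plain collection (timestamps and threshold are made up; this is not the streaming machinery itself):

```scala
import java.time.{Duration, Instant}

object LateRowCountSketch {
  final case class Event(id: String, eventTime: Instant)

  def main(args: Array[String]): Unit = {
    val allowedDelay = Duration.ofMinutes(10)

    val events = Seq(
      Event("a", Instant.parse("2020-02-15T10:00:00Z")),
      Event("b", Instant.parse("2020-02-15T10:12:00Z")),
      Event("c", Instant.parse("2020-02-15T09:55:00Z")) // far behind the others
    )

    // Watermark: max observed event time minus the allowed delay.
    val maxEventTime = events.map(_.eventTime).maxBy(_.toEpochMilli)
    val watermark = maxEventTime.minus(allowedDelay)

    // The quantity of interest: rows whose event time is behind the watermark.
    val lateRows = events.count(_.eventTime.isBefore(watermark))
    println(s"watermark=$watermark lateRows=$lateRows")
  }
}
```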
[GitHub] [spark] AmplabJenkins removed a comment on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay
AmplabJenkins removed a comment on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay URL: https://github.com/apache/spark/pull/24936#issuecomment-586676073 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23247/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay
AmplabJenkins commented on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay URL: https://github.com/apache/spark/pull/24936#issuecomment-586676073 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23247/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay
AmplabJenkins removed a comment on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay URL: https://github.com/apache/spark/pull/24936#issuecomment-586676070 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org