[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2020-01-10 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] 
Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r365503334
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
 ##
 @@ -442,11 +488,17 @@ object RFormulaModel extends MLReadable[RFormulaModel] {
   val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
 
   val dataPath = new Path(path, "data").toString
-  val data = sparkSession.read.parquet(dataPath).select("label", "terms", "hasIntercept").head()
+  val df = sparkSession.read.parquet(dataPath)
+  val columns = Seq("label", "terms", "hasIntercept", "evalExprs")
+  val data = df.select(columns.intersect(df.columns).map(df(_)): _*)
+    .head()
   val label = data.getString(0)
   val terms = data.getAs[Seq[Seq[String]]](1)
   val hasIntercept = data.getBoolean(2)
-  val resolvedRFormula = ResolvedRFormula(label, terms, hasIntercept)
+  // Check for old saved models (spark version < 3), they do not have evalExprs.
+  val (major, minor) = VersionUtils.majorMinorVersion(metadata.sparkVersion)
+  val evalExprs = if (major.toInt < 3) Seq[String]() else data.getAs[Seq[String]](3)
+  val resolvedRFormula = ResolvedRFormula(label, terms, hasIntercept, evalExprs)
 
 Review comment:
   Thanks, added a test case for loading a model saved with Spark 2.4.4.
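For illustration, the backward-compatible load above can be sketched without Spark: request a superset of fields and keep only those present in the saved data, so that pre-3.0 models (which lack `evalExprs`) still load and fall back to an empty expression list. This is a hypothetical simplification; the names below are illustrative, not from the PR.

```scala
// Sketch: select only the requested fields that actually exist in the saved
// data, mimicking columns.intersect(df.columns) in the real reader.
object ColumnCompat {
  def selectExisting(requested: Seq[String], saved: Map[String, Any]): Map[String, Any] =
    requested.filter(saved.contains).map(c => c -> saved(c)).toMap
}

val oldModel = Map[String, Any](
  "label" -> "y", "terms" -> Seq(Seq("a")), "hasIntercept" -> true)
val newModel = oldModel + ("evalExprs" -> Seq("log(a)"))
val wanted = Seq("label", "terms", "hasIntercept", "evalExprs")

// Older models simply yield fewer fields; the caller falls back to a default.
val oldFields = ColumnCompat.selectExisting(wanted, oldModel)
val newFields = ColumnCompat.selectExisting(wanted, newModel)
val evalExprs = oldFields.getOrElse("evalExprs", Seq.empty[String])
```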


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2020-01-10 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] 
Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r365503223
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
 ##
 @@ -344,29 +360,28 @@ class RFormulaModel private[feature](
 private[ml] val pipelineModel: PipelineModel)
   extends Model[RFormulaModel] with RFormulaBase with MLWritable {
 
+  private val foldExprs = (df: DataFrame) => resolvedFormula.evalExprs.foldLeft(df) _
+
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     checkCanTransform(dataset.schema)
-    transformLabel(pipelineModel.transform(dataset))
+    val withExprs = foldExprs(dataset.toDF) {
+      case (df, colname) => df.withColumn(colname, expr(colname))
+    }
+    val withFeatures = pipelineModel.transform(withExprs)
+    val withLabel = transformLabel(withFeatures)
+    foldExprs(withLabel) {
+      case (df, colname) => df.drop(col(s"`$colname`"))
 
 Review comment:
   Indeed, `foldLeft` is unnecessary. Switched to `drop(colNames: String*)`.
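For illustration, why the fold is redundant once a varargs-style drop is available (a minimal pure-Scala sketch in which a list of column names stands in for a DataFrame):

```scala
// Dropping expression columns one at a time with foldLeft vs. a single
// one-pass removal; both yield the same remaining columns.
val columns = Seq("features", "label", "a+b", "log(c)")
val exprCols = Seq("a+b", "log(c)")

// foldLeft version: one "drop" per expression column.
val viaFold = exprCols.foldLeft(columns)((cols, c) => cols.filterNot(_ == c))

// varargs-style version: remove them all at once.
val viaVarargs = columns.filterNot(exprCols.contains)
```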





[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2020-01-10 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] 
Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r365503203
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
 ##
 @@ -344,29 +360,28 @@ class RFormulaModel private[feature](
 private[ml] val pipelineModel: PipelineModel)
   extends Model[RFormulaModel] with RFormulaBase with MLWritable {
 
+  private val foldExprs = (df: DataFrame) => resolvedFormula.evalExprs.foldLeft(df) _
+
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     checkCanTransform(dataset.schema)
-    transformLabel(pipelineModel.transform(dataset))
+    val withExprs = foldExprs(dataset.toDF) {
+      case (df, colname) => df.withColumn(colname, expr(colname))
+    }
 
 Review comment:
   Thanks, I wasn't aware of that function. Switched to `withColumns`.





[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2019-09-10 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] 
Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r323005147
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/RFormulaParser.scala
 ##
 @@ -269,40 +310,78 @@ private[ml] object RFormulaParser extends RegexParsers {
   }
 
   private val intercept: Parser[Term] =
-    "([01])".r ^^ { case a => Intercept(a == "1") }
+    skipSpace("([01])".r) ^^ { case a => Intercept(a == "1") }
 
   private val columnRef: Parser[ColumnRef] =
-    "([a-zA-Z]|\\.[a-zA-Z_])[a-zA-Z0-9._]*".r ^^ { case a => ColumnRef(a) }
+    skipSpace("([a-zA-Z]|\\.[a-zA-Z_])[a-zA-Z0-9._]*".r) ^^ { case a => ColumnRef(a) }
 
-  private val empty: Parser[ColumnRef] = "" ^^ { case a => ColumnRef("") }
+  private val empty: Parser[ColumnRef] = skipSpace("".r) ^^ { case a => ColumnRef("") }
 
-  private val label: Parser[ColumnRef] = columnRef | empty
+  private val label: Parser[Label] = evalExpr | columnRef | empty
 
-  private val dot: Parser[Term] = "\\.".r ^^ { case _ => Dot }
+  private val dot: Parser[Term] = skipSpace("\\.".r) ^^ { case _ => Dot }
 
-  private val parens: Parser[Term] = "(" ~> expr <~ ")"
+  private val parens: Parser[Term] = skipSpace("\\(".r) ~> expr <~ skipSpace("\\)".r)
 
-  private val term: Parser[Term] = parens | intercept | columnRef | dot
+  private val term: Parser[Term] = evalExpr | parens | intercept | columnRef | dot
 
-  private val pow: Parser[Term] = term ~ "^" ~ "^[1-9]\\d*".r ^^ {
+  private val pow: Parser[Term] = term ~ "^" ~ skipSpace("^[1-9]\\d*".r) ^^ {
     case base ~ "^" ~ degree => power(base, degree.toInt)
   } | term
 
-  private val interaction: Parser[Term] = pow * (":" ^^^ { interact _ })
+  private val interaction: Parser[Term] = pow * (skipSpace("\\:".r) ^^^ { interact _ })
 
-  private val factor = interaction * ("*" ^^^ { cross _ })
+  private val factor = interaction * (skipSpace("\\*".r) ^^^ { cross _ })
 
-  private val sum = factor * ("+" ^^^ { add _ } |
-    "-" ^^^ { subtract _ })
+  private val sum = factor * (skipSpace("\\+".r) ^^^ { add _ } |
+    skipSpace("\\-".r) ^^^ { subtract _ })
 
   private val expr = (sum | term)
 
-  private val formula: Parser[ParsedRFormula] =
-    (label ~ "~" ~ expr) ^^ { case r ~ "~" ~ t => ParsedRFormula(r, t.asTerms.terms) }
+  private val formula: Parser[ParsedRFormula] = (label ~ skipSpace("\\~".r) ~ expr) ^^ {
+    case r ~ "~" ~ t => ParsedRFormula(r, t.asTerms.terms) }
 
   def parse(value: String): ParsedRFormula = parseAll(formula, value) match {
     case Success(result, _) => result
     case failure: NoSuccess => throw new IllegalArgumentException(
       "Could not parse formula: " + value)
   }
 }
+
+/**
+ * Parser for evaluated expressions in a formula. An evaluated expression is
+ * any alphanumeric identifiers followed by (), e.g. `func123(a+b, func())`,
+ * or anything inside `I()`. A valid expression is any string-parentheses product,
+ * such that if there are any parentheses ('(' or ')') they're all balanced.
+ */
+private[ml] trait EvalExprParser extends RegexParsers {
 
 Review comment:
   It looks like Spark SQL functions skip the whitespace between the function identifier and the parenthesis, so I enabled skipping the same whitespace.
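As an illustration only (the real grammar uses Scala's RegexParsers; this toy stand-in just shows the shape of a `skipSpace`-style helper):

```scala
// Toy skipSpace: drop leading whitespace, then try to match the token
// pattern at the start of the remaining input, returning (match, rest).
def skipSpace(pattern: String): String => Option[(String, String)] = { input =>
  val trimmed = input.dropWhile(_.isWhitespace)
  pattern.r.findPrefixMatchOf(trimmed).map(m => (m.matched, trimmed.drop(m.end)))
}

// Same column-reference pattern as in the formula grammar.
val colRef = skipSpace("([a-zA-Z]|\\.[a-zA-Z_])[a-zA-Z0-9._]*")
val parsed = colRef("   Petal.Width + x")
```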





[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2019-09-09 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] 
Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r322199371
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
 ##
 @@ -404,6 +429,30 @@ class RFormulaModel private[feature](
       s"Label column already exists and is not of type ${NumericType.simpleString}.")
   }
 
+  private def foldExprs(dataframe: DataFrame)(f: (DataFrame, String) => DataFrame): DataFrame =
+    resolvedFormula.evalExprs.foldLeft(dataframe)(f)
 
 Review comment:
   Got it now, thanks.





[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2019-09-04 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] 
Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r320793640
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/RFormulaParser.scala
 ##
 @@ -269,40 +310,78 @@ private[ml] object RFormulaParser extends RegexParsers {
   }
 
   private val intercept: Parser[Term] =
-    "([01])".r ^^ { case a => Intercept(a == "1") }
+    skipSpace("([01])".r) ^^ { case a => Intercept(a == "1") }
 
   private val columnRef: Parser[ColumnRef] =
-    "([a-zA-Z]|\\.[a-zA-Z_])[a-zA-Z0-9._]*".r ^^ { case a => ColumnRef(a) }
+    skipSpace("([a-zA-Z]|\\.[a-zA-Z_])[a-zA-Z0-9._]*".r) ^^ { case a => ColumnRef(a) }
 
-  private val empty: Parser[ColumnRef] = "" ^^ { case a => ColumnRef("") }
+  private val empty: Parser[ColumnRef] = skipSpace("".r) ^^ { case a => ColumnRef("") }
 
-  private val label: Parser[ColumnRef] = columnRef | empty
+  private val label: Parser[Label] = evalExpr | columnRef | empty
 
-  private val dot: Parser[Term] = "\\.".r ^^ { case _ => Dot }
+  private val dot: Parser[Term] = skipSpace("\\.".r) ^^ { case _ => Dot }
 
-  private val parens: Parser[Term] = "(" ~> expr <~ ")"
+  private val parens: Parser[Term] = skipSpace("\\(".r) ~> expr <~ skipSpace("\\)".r)
 
-  private val term: Parser[Term] = parens | intercept | columnRef | dot
+  private val term: Parser[Term] = evalExpr | parens | intercept | columnRef | dot
 
-  private val pow: Parser[Term] = term ~ "^" ~ "^[1-9]\\d*".r ^^ {
+  private val pow: Parser[Term] = term ~ "^" ~ skipSpace("^[1-9]\\d*".r) ^^ {
     case base ~ "^" ~ degree => power(base, degree.toInt)
   } | term
 
-  private val interaction: Parser[Term] = pow * (":" ^^^ { interact _ })
+  private val interaction: Parser[Term] = pow * (skipSpace("\\:".r) ^^^ { interact _ })
 
-  private val factor = interaction * ("*" ^^^ { cross _ })
+  private val factor = interaction * (skipSpace("\\*".r) ^^^ { cross _ })
 
-  private val sum = factor * ("+" ^^^ { add _ } |
-    "-" ^^^ { subtract _ })
+  private val sum = factor * (skipSpace("\\+".r) ^^^ { add _ } |
+    skipSpace("\\-".r) ^^^ { subtract _ })
 
   private val expr = (sum | term)
 
-  private val formula: Parser[ParsedRFormula] =
-    (label ~ "~" ~ expr) ^^ { case r ~ "~" ~ t => ParsedRFormula(r, t.asTerms.terms) }
+  private val formula: Parser[ParsedRFormula] = (label ~ skipSpace("\\~".r) ~ expr) ^^ {
+    case r ~ "~" ~ t => ParsedRFormula(r, t.asTerms.terms) }
 
   def parse(value: String): ParsedRFormula = parseAll(formula, value) match {
     case Success(result, _) => result
     case failure: NoSuccess => throw new IllegalArgumentException(
       "Could not parse formula: " + value)
   }
 }
+
+/**
+ * Parser for evaluated expressions in a formula. An evaluated expression is
+ * any alphanumeric identifiers followed by (), e.g. `func123(a+b, func())`,
+ * or anything inside `I()`. A valid expression is any string-parentheses product,
+ * such that if there are any parentheses ('(' or ')') they're all balanced.
+ */
+private[ml] trait EvalExprParser extends RegexParsers {
 
 Review comment:
   This one is responsible for parsing the function identifier, the parentheses, and everything inside the parentheses. Whatever is inside the parentheses is parsed with its whitespace, so there is no need to skip there. Function identifiers don't contain whitespace either. One thing to additionally skip might be the whitespace between the function identifier and the opening parenthesis, such as:
   
   `y ~ I ( log(b) )`
   or
   `y ~ exp (b)`
   
   Currently, the above formulas would fail to parse due to the space between the parenthesis and the function identifier, e.g. `I (` and `exp (`. Should these be allowed? It's not hard to enable them, but I haven't given much thought to possible unwanted side effects.
   





[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2019-09-04 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] 
Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r320783880
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/RFormulaParser.scala
 ##
 @@ -247,9 +273,24 @@ private[ml] case class Terms(terms: Seq[Term]) extends Term {
 
 /**
  * Limited implementation of R formula parsing. Currently supports: '~', '+', '-', '.', ':',
- * '*', '^'.
+ * '*', '^', 'I()'.
  */
-private[ml] object RFormulaParser extends RegexParsers {
+private[ml] object RFormulaParser extends RegexParsers with EvalExprParser {
+
+  /**
+   * Whether to skip whitespace in literals and regex is currently only achieved with
+   * a global switch, and by default it's skipped. We'd like it to be skipped for most parsers,
 
 Review comment:
   Oh, when I re-read that comment I realized it didn't make much sense, sorry about that! Let me try to rephrase.
   
   From the perspective of RFormulaParser, it shouldn't be configurable. That switch refers to the whitespace-skipping behaviour of RegexParsers. RegexParsers doesn't provide a way to enable or disable whitespace skipping for individual parsers defined inside it. There is only a class-wide skipWhitespace flag which enables or disables whitespace skipping for all parsers, and the default behaviour is to skip all whitespace. So, to be able to parse whitespace in selected parsers, the default behaviour is disabled and whitespace is skipped explicitly in the remaining parsers.
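A toy illustration of the class-wide switch described above (this is not the actual RegexParsers API, just the shape of the trade-off: one flag for all token matchers, no per-token setting):

```scala
// One class-wide flag governs whitespace handling for every token matcher,
// mirroring RegexParsers' skipWhitespace behaviour.
class Tokenizer(skipWhitespace: Boolean) {
  def token(pattern: String, input: String): Option[String] = {
    val s = if (skipWhitespace) input.dropWhile(_.isWhitespace) else input
    pattern.r.findPrefixOf(s)
  }
}

val skipping = new Tokenizer(skipWhitespace = true)
val strict = new Tokenizer(skipWhitespace = false)
val hit = skipping.token("[a-z]+", "   abc")
val miss = strict.token("[a-z]+", "   abc")
```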





[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2019-09-04 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] 
Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r320774065
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
 ##
 @@ -404,6 +429,30 @@ class RFormulaModel private[feature](
       s"Label column already exists and is not of type ${NumericType.simpleString}.")
   }
 
+  private def foldExprs(dataframe: DataFrame)(f: (DataFrame, String) => DataFrame): DataFrame =
+    resolvedFormula.evalExprs.foldLeft(dataframe)(f)
 
 Review comment:
   I couldn't quite understand this one. Did you mean to save it as a temporary variable and return that?





[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2019-09-04 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] 
Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r320772910
 
 

 ##
 File path: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaParserSuite.scala
 ##
 @@ -198,4 +198,31 @@ class RFormulaParserSuite extends SparkFunSuite {
 "Petal.Length:Petal.Width"),
   schema)
   }
+
+  test("parse skip whitespace") {
+val schema = (new StructType)
+  .add("a", "int", true)
+  .add("b", "long", false)
+  .add("c", "string", true)
+checkParse(" ~a+  b :  c  ", "", Seq("a", "b:c"))
+checkParse(" ~ a  * b", "", Seq("a", "b", "a:b"))
+checkParse("~ (  a +b  )^  2", "", Seq("a", "b", "a:b"))
+checkParse("~  .  ^ 2  - a-b  -  c", "", Seq("a:b", "a:c", "b:c"), schema)
+checkParse("~ ( a) *  ( (  (b ) : c )  )", "", Seq("a", "b:c", "a:b:c"))
+  }
+
+  test("parse functions") {
+checkParse("y ~ I(a+b) + c", "y", Seq("a+b", "c"))
+checkParse("y ~ I(a+b)*c", "y", Seq("a+b", "c", "a+b:c"))
+checkParse("y ~ (I((a+b)) + c)^2", "y", Seq("(a+b)", "c", "(a+b):c"))
+checkParse("y ~ I(log(a)*(log(a)*2)) + b", "y", Seq("log(a)*(log(a)*2)", "b"))
+checkParse("y ~ exp(a) + (b + c)", "y", Seq("exp(a)", "b", "c"))
+checkParse("log(y) ~ a + log(b)", "log(y)", Seq("a", "log(b)"))
+checkParse("I(c+d) ~ a + log(b)", "c+d", Seq("a", "log(b)"))
 
 Review comment:
   also added tests for those functions





[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2019-09-04 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] 
Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r320772635
 
 

 ##
 File path: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaParserSuite.scala
 ##
 @@ -198,4 +198,31 @@ class RFormulaParserSuite extends SparkFunSuite {
 "Petal.Length:Petal.Width"),
   schema)
   }
+
+  test("parse skip whitespace") {
+val schema = (new StructType)
+  .add("a", "int", true)
+  .add("b", "long", false)
+  .add("c", "string", true)
+checkParse(" ~a+  b :  c  ", "", Seq("a", "b:c"))
+checkParse(" ~ a  * b", "", Seq("a", "b", "a:b"))
+checkParse("~ (  a +b  )^  2", "", Seq("a", "b", "a:b"))
+checkParse("~  .  ^ 2  - a-b  -  c", "", Seq("a:b", "a:c", "b:c"), schema)
+checkParse("~ ( a) *  ( (  (b ) : c )  )", "", Seq("a", "b:c", "a:b:c"))
+  }
+
+  test("parse functions") {
+checkParse("y ~ I(a+b) + c", "y", Seq("a+b", "c"))
+checkParse("y ~ I(a+b)*c", "y", Seq("a+b", "c", "a+b:c"))
+checkParse("y ~ (I((a+b)) + c)^2", "y", Seq("(a+b)", "c", "(a+b):c"))
 
 Review comment:
   Added additional tests with whitespace





[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2019-09-04 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] 
Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r320709444
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
 ##
 @@ -126,7 +127,13 @@ private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol with
 /**
  * :: Experimental ::
  * Implements the transforms required for fitting a dataset against an R model formula. Currently
- * we support a limited subset of the R operators, including '~', '.', ':', '+', '-', '*' and '^'.
+ * we support a limited subset of the R operators, including '~', '.', ':', '+', '-', '*', '^'
+ * and 'I()'. Arithmetic expressions which use spark functions or registered UDFs are
 
 Review comment:
   Since the `expr` function is used, I think these should be called Spark SQL functions, so I'm changing the wording accordingly.





[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2019-09-04 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] 
Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r320706146
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
 ##
 @@ -216,9 +226,14 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String)
       col
     }
 
-    // First we index each string column referenced by the input terms.
+    // Add evaluated expressions to the dataset
+    val selectedCols = resolvedFormula.evalExprs
+      .map(col => expr(col).alias(col)) ++ dataset.columns.map(col(_))
 
 Review comment:
   Most of the time the name stays the same, but there are still some cases where the name might change. For example, when casting is involved:
   
   ```scala
   import org.apache.spark.sql._
   import org.apache.spark.sql.functions._
   
   val spark = SparkSession.builder()
     .master("local")
     .getOrCreate()
   import spark.implicits._
   Seq(1, 2, 3, 4, 5).toDF.select(expr("log(value)")).show()
   
   +-------------------------------+
   |LOG(E(), CAST(value AS DOUBLE))|
   +-------------------------------+
   |                            0.0|
   |             0.6931471805599453|
   |             1.0986122886681098|
   |             1.3862943611198906|
   |             1.6094379124341003|
   +-------------------------------+
   ```
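The naming pitfall can also be sketched without Spark: an engine may rewrite an expression's auto-generated column name, so aliasing with the original source text keeps the column addressable by its formula term. The canonicalizer below is a toy; only the `log` rewrite mimics Spark's actual output, and the helper names are hypothetical.

```scala
// Explicit alias vs. engine-generated name. Spark names an unaliased column
// after the analyzed expression, e.g. "LOG(E(), CAST(value AS DOUBLE))".
final case class Column(name: String, sourceExpr: String)

def canonicalName(sourceExpr: String): String =
  if (sourceExpr.startsWith("log(") && sourceExpr.endsWith(")"))
    s"LOG(E(), CAST(${sourceExpr.stripPrefix("log(").stripSuffix(")")} AS DOUBLE))"
  else sourceExpr

def withAlias(e: String): Column = Column(e, e)        // expr(e).alias(e)
def withoutAlias(e: String): Column = Column(canonicalName(e), e)
```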





[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2019-07-17 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] 
Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r304395510
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
 ##
 @@ -362,6 +381,16 @@ class RFormulaModel private[feature](
 case _ => true
   }
   StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
+} else if (resolvedFormula.evalExprs.contains(resolvedFormula.label)) {
+  val spark = SparkSession.builder().getOrCreate()
+  val dummyRDD = spark.sparkContext.parallelize(Seq(Row.empty))
+  val dummyDF = spark.createDataFrame(dummyRDD, schema)
+    .withColumn(resolvedFormula.label, expr(resolvedFormula.label))
+  val nullable = dummyDF.schema(resolvedFormula.label).dataType match {
 
 Review comment:
   I'm resolving this as this part got removed due to the suggested changes.





[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2019-07-17 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] 
Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r304395783
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
 ##
 @@ -614,3 +652,80 @@ private object VectorAttributeRewriter extends MLReadable[VectorAttributeRewrite
     }
   }
 }
+
+/**
+ * Utility transformer for adding expressions to dataframe using `expr` spark function
+ *
+ * @param exprsToSelect set of string expressions to be added as a column to the dataframe.
+ *                      The name of the columns will be identical to the expression
+ */
+private class ExprSelector(
 
 Review comment:
   Got it, thanks! I removed the hidden stage. 





[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2019-07-16 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] 
Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r303864360
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
 ##
 @@ -362,6 +381,16 @@ class RFormulaModel private[feature](
 case _ => true
   }
   StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
+} else if (resolvedFormula.evalExprs.contains(resolvedFormula.label)) {
+  val spark = SparkSession.builder().getOrCreate()
+  val dummyRDD = spark.sparkContext.parallelize(Seq(Row.empty))
+  val dummyDF = spark.createDataFrame(dummyRDD, schema)
+    .withColumn(resolvedFormula.label, expr(resolvedFormula.label))
+  val nullable = dummyDF.schema(resolvedFormula.label).dataType match {
 
 Review comment:
   In order to create a schema with arbitrary types, I actually borrowed this pattern from here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala#L79
   
   However, the datatype check to determine whether the field should be nullable seems unnecessary here, because only `NumericType` and `BooleanType` seem to be supported, so I modified this to return the schema without checking the datatype: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala#L390
  
   
   





[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2019-07-16 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] 
Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r303856796
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
 ##
 @@ -614,3 +652,80 @@ private object VectorAttributeRewriter extends MLReadable[VectorAttributeRewrite
     }
   }
 }
+
+/**
+ * Utility transformer for adding expressions to dataframe using `expr` spark function
+ *
+ * @param exprsToSelect set of string expressions to be added as a column to the dataframe.
+ *                      The name of the columns will be identical to the expression
+ */
+private class ExprSelector(
 
 Review comment:
   As you suspect, having an extra hidden stage isn't really essential here. I 
only added it to have less coupling between RFormula and RFormulaModel classes. 
This can be very well done without it. 
   
   Roughly, this is what RFormula and RFormulaModel classes are doing;
   ```scala
   
   class RFormula(..., formula) {
  def fit(df) = {
val parsedFormula = parse(formula)
var stages = ArrayBuffer()
   
val featureColumns = parsedFormula.terms.map {
...
stages += OneHotEncoder()
...
}
   stages += VectorAssembler(featureColumns)
   val pipeline = Pipeline(stages)
   RFormulaModel(parsedFormula, pipeline)
 }
   }
   
   class RFormulaModel(parsedFormula, pipeline) {
 def transform(df) = {
   val withFeatures = pipeline.transform(df)
   transformLabel(withFeatures)
 }
   }
   ```
   In order to assemble arithmetic expressions into a feature column with `VectorAssembler`, the dataframe transformed by `RFormulaModel` needs to contain those columns. One option is to add these transformations directly inside the `RFormulaModel.transform` method; another is to add a pipelined stage inside the `RFormula.fit` method. Since all the other feature-column transformations are done in `RFormula.fit`, I chose to add a pipelined stage there.
   
   But indeed, having a transformer just for executing a couple of `expr` functions could be too much. If you think the extra stage is unnecessary, let me know and I'll move its transformations into the `RFormulaModel` class.
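The naming convention this relies on — each formula expression becomes a derived column whose name is the expression text itself, so a downstream assembler can find it by name — can be sketched without Spark. The following pure-Python sketch is only an analogy (function and variable names are illustrative, not Spark's API); in Spark itself this step would be one `withColumn(e, expr(e))` per expression.

```python
def select_exprs(rows, exprs_to_select):
    """Add one derived column per expression string to every row.

    Mimics the proposed ExprSelector stage: the new column's name is
    the expression text itself, e.g. "x * y".
    """
    out = []
    for row in rows:
        enriched = dict(row)
        for e in exprs_to_select:
            # Evaluate the expression with the row's columns in scope
            # (Spark would do df.withColumn(e, expr(e)) instead).
            enriched[e] = eval(e, {"__builtins__": {}}, dict(row))
        out.append(enriched)
    return out

rows = [{"x": 2.0, "y": 3.0}]
result = select_exprs(rows, ["x * y", "x + 1"])
print(result[0]["x * y"])  # 6.0
print(result[0]["x + 1"])  # 3.0
```

Because the column name equals the expression string, the assembler stage needs no extra bookkeeping to locate the derived features.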






[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2019-06-25 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r297161375
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/RFormulaParser.scala
 ##
 @@ -269,40 +310,78 @@ private[ml] object RFormulaParser extends RegexParsers {
   }
 
   private val intercept: Parser[Term] =
-"([01])".r ^^ { case a => Intercept(a == "1") }
+"([01])".r.n ^^ { case a => Intercept(a == "1") }
 
 Review comment:
   Got it, I replaced it with a normal function call. 





[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2019-06-24 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r296620061
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/RFormulaParser.scala
 ##
 @@ -269,40 +310,78 @@ private[ml] object RFormulaParser extends RegexParsers {
   }
 
   private val intercept: Parser[Term] =
-"([01])".r ^^ { case a => Intercept(a == "1") }
+"([01])".r.n ^^ { case a => Intercept(a == "1") }
 
 Review comment:
   Just introduced a shorter way of skipping the whitespace around tokens. It uses the implicit class defined inside the object at line 290:
   
   ```scala
   private val space = "[ \\n]*".r

   /* Utility function for skipping whitespace around a regex. */
   private implicit class RegexParserUtils(val r: Regex) {
     def n: Parser[String] = (space ~> r <~ space)
   }
   ```
   Given a regex, it returns a string parser which skips the surrounding whitespace.
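For readers outside Scala's parser combinators, the `space ~> r <~ space` trick can be sketched in plain Python with the standard `re` module. This is only an analogy (the names below are illustrative): a token matcher consumes optional whitespace before and after the token's own pattern and reports the next parse position.

```python
import re

SPACE = re.compile(r"[ \n]*")

def token(pattern):
    """Return a matcher that skips whitespace around `pattern`,
    mirroring the `.n` helper: space ~> r <~ space."""
    rx = re.compile(pattern)

    def match(text, pos=0):
        pos = SPACE.match(text, pos).end()      # consume leading space
        m = rx.match(text, pos)
        if m is None:
            return None                          # token itself must match
        pos = SPACE.match(text, m.end()).end()   # consume trailing space
        return m.group(0), pos

    return match

intercept = token(r"[01]")
print(intercept("  1  + x"))  # ('1', 5)
```

The combinator form in Scala composes the same three steps declaratively; the Python version just makes the sequencing explicit.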





[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2019-06-24 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r296616210
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
 ##
 @@ -362,6 +381,16 @@ class RFormulaModel private[feature](
 case _ => true
   }
  StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
+} else if (resolvedFormula.evalExprs.contains(resolvedFormula.label)) {
+  val spark = SparkSession.builder().getOrCreate()
+  val dummyRDD = spark.sparkContext.parallelize(Seq(Row.empty))
+  val dummyDF = spark.createDataFrame(dummyRDD, schema)
+    .withColumn(resolvedFormula.label, expr(resolvedFormula.label))
+  val nullable = dummyDF.schema(resolvedFormula.label).dataType match {
+    case _: NumericType | BooleanType => false
 
 Review comment:
   I thought so too! Then I got this error:
   
   ```
   [error] .../spark/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala:390: pattern type is incompatible with expected type;
   [error]  found   : org.apache.spark.sql.types.NumericType.type
   [error]  required: org.apache.spark.sql.types.DataType
   [error] Note: if you intended to match against the class, try `case _: NumericType`
   [error] case NumericType | BooleanType => false
   ```
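The compiler's hint boils down to matching against the *class* of the runtime type rather than against one particular singleton object. A Spark-free sketch of the same nullable decision in Python — the class names below mimic Spark's type hierarchy but are local stand-ins, not the real `org.apache.spark.sql.types` classes:

```python
# Toy stand-ins for Spark's DataType hierarchy.
class DataType: pass
class NumericType(DataType): pass    # abstract in Spark; never a value itself
class IntegerType(NumericType): pass
class DoubleType(NumericType): pass
class BooleanType(DataType): pass
class StringType(DataType): pass

def nullable_for(dtype):
    """Numeric and boolean column types are non-nullable; everything
    else is nullable -- the analogue of `case _: NumericType | BooleanType`."""
    # isinstance checks the class of the value ("match against the class"),
    # which is what scalac's `case _: NumericType` hint amounts to; an
    # identity check like `dtype is NumericType` would be the failing
    # singleton-style match.
    return not isinstance(dtype, (NumericType, BooleanType))

print(nullable_for(IntegerType()))  # False
print(nullable_for(StringType()))   # True
```

In Scala, `case NumericType` tries to match the companion object `NumericType.type`, which is not a `DataType` value, hence the type error; `case _: NumericType` matches any instance of the class.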





[GitHub] [spark] ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2019-06-24 Thread GitBox
ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r296614697
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
 ##
 @@ -126,7 +127,13 @@ private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol with
 /**
  * :: Experimental ::
  * Implements the transforms required for fitting a dataset against an R model formula. Currently
- * we support a limited subset of the R operators, including '~', '.', ':', '+', '-', '*' and '^'.
+ * we support a limited subset of the R operators, including '~', '.', ':', '+', '-', '*', '^'
+ * and 'I()'. Arithmetic expressions which use spark functions or registered udf's are
 
 Review comment:
   Fixed.

