[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19993 @hhbyyh thanks for the review. I see that for some classes there are ongoing PRs. Thus I cannot change them now in order to have a common place and a common test. Should I wait for those PRs to be merged then? Or were you suggesting something different? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19993: [SPARK-22799][ML] Bucketizer should throw excepti...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19993#discussion_r157414340

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -140,10 +140,10 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   * by `inputCol`. A warning will be printed if both are set.
   */
  private[feature] def isBucketizeMultipleColumns(): Boolean = {
-   if (isSet(inputCols) && isSet(inputCol)) {
-     logWarning("Both `inputCol` and `inputCols` are set, we ignore `inputCols` and this " +
-       "`Bucketizer` only map one column specified by `inputCol`")
-     false
+   if (isSet(inputCols) && isSet(inputCol) || isSet(inputCols) && isSet(outputCol) ||
+       isSet(inputCol) && isSet(outputCols)) {
+     throw new IllegalArgumentException("Both `inputCol` and `inputCols` are set, `Bucketizer` " +
+       "only supports setting either `inputCol` or `inputCols`.")
--- End diff --

Thanks, I will add these checks while doing the changes requested by @hhbyyh.
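The mutually-exclusive parameter check under discussion can be sketched outside Spark. This is a minimal model: `MiniBucketizer` and its `Option`-based fields are hypothetical stand-ins for Spark's `Param`/`isSet` machinery; only the validation logic mirrors the diff.

```scala
// Hypothetical stand-in for Bucketizer's single- vs multi-column parameters.
case class MiniBucketizer(
    inputCol: Option[String] = None,
    inputCols: Option[Seq[String]] = None,
    outputCol: Option[String] = None,
    outputCols: Option[Seq[String]] = None) {

  // Returns true when operating on multiple columns; throws if single-column
  // and multi-column parameters are mixed, as the new diff proposes.
  def isBucketizeMultipleColumns(): Boolean = {
    val mixed = (inputCols.isDefined && inputCol.isDefined) ||
      (inputCols.isDefined && outputCol.isDefined) ||
      (inputCol.isDefined && outputCols.isDefined)
    if (mixed) {
      throw new IllegalArgumentException(
        "`Bucketizer` only supports setting either `inputCol` or `inputCols`.")
    }
    inputCols.isDefined
  }
}
```

Throwing instead of logging a warning (the old behavior) makes the misconfiguration fail fast rather than silently ignoring `inputCols`.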
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r157415970

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
@@ -112,15 +112,14 @@ case class Like(left: Expression, right: Expression) extends StringRegexExpressi
  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    val patternClass = classOf[Pattern].getName
    val escapeFunc = StringUtils.getClass.getName.stripSuffix("$") + ".escapeLikeRegex"
-   val pattern = ctx.freshName("pattern")
    if (right.foldable) {
      val rVal = right.eval()
      if (rVal != null) {
        val regexStr =
          StringEscapeUtils.escapeJava(escape(rVal.asInstanceOf[UTF8String].toString()))
-       ctx.addMutableState(patternClass, pattern,
-         s"""$pattern = ${patternClass}.compile("$regexStr");""")
+       val pattern = ctx.addMutableState(patternClass, "patternLike",
+         v => s"""$v = ${patternClass}.compile("$regexStr");""", forceInline = true)
--- End diff --

My only concern is about point 2: I think it is a dangerous thing to do. What if we generate a lot of frequently used variables? I think it is safer at the moment to consider only 1 and 3 when deciding whether to inline or not. In the future, with a different codegen method, we might define a threshold above which we generate an array for the given class and otherwise use plain variables, which IMHO would be the best option, but at the moment it is not feasible.
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r157518488

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
@@ -112,15 +112,14 @@ case class Like(left: Expression, right: Expression) extends StringRegexExpressi
  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    val patternClass = classOf[Pattern].getName
    val escapeFunc = StringUtils.getClass.getName.stripSuffix("$") + ".escapeLikeRegex"
-   val pattern = ctx.freshName("pattern")
    if (right.foldable) {
      val rVal = right.eval()
      if (rVal != null) {
        val regexStr =
          StringEscapeUtils.escapeJava(escape(rVal.asInstanceOf[UTF8String].toString()))
-       ctx.addMutableState(patternClass, pattern,
-         s"""$pattern = ${patternClass}.compile("$regexStr");""")
+       val pattern = ctx.addMutableState(patternClass, "patternLike",
+         v => s"""$v = ${patternClass}.compile("$regexStr");""", forceInline = true)
--- End diff --

If 3 is a precondition for 2, then it is ok. Thanks for the explanation.
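The API shape being discussed (an `addMutableState` that generates the variable name itself and takes the initializer as a function of that name) can be modeled in isolation. This is a simplified toy registry, not Spark's actual `CodegenContext`; the field names here are illustrative only.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model of the SPARK-18016 addMutableState API shape (not the real class).
class MiniCodegenContext {
  private var counter = 0
  // (javaType, variableName) pairs for states declared as plain fields.
  val inlinedMutableStates = ArrayBuffer.empty[(String, String)]
  val initCodes = ArrayBuffer.empty[String]

  private def freshName(prefix: String): String = { counter += 1; s"${prefix}_$counter" }

  // Returns the generated variable name, so callers no longer pick names
  // themselves; the init snippet is built from the fresh name via `init`.
  def addMutableState(javaType: String, namePrefix: String,
      init: String => String, forceInline: Boolean = false): String = {
    val name = freshName(namePrefix)
    inlinedMutableStates += ((javaType, name))
    initCodes += init(name)
    name
  }
}
```

Returning the fresh name is what lets the context decide later whether a state is inlined as a field or compacted into an array, which is exactly the inlining-heuristic trade-off debated above.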
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19811 LGTM too, thanks @kiszk and @bdrillard! This is a very important PR IMHO
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r157691450

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -99,6 +99,17 @@ object TypeCoercion {
    case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) =>
      Some(TimestampType)
+
+   case (t1 @ ArrayType(pointType1, nullable1), t2 @ ArrayType(pointType2, nullable2))
+       if t1.sameType(t2) =>
--- End diff --

why is this `if t1.sameType(t2)` needed?
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r157749463

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -99,6 +99,17 @@ object TypeCoercion {
    case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) =>
      Some(TimestampType)
+
+   case (t1 @ ArrayType(pointType1, nullable1), t2 @ ArrayType(pointType2, nullable2))
+       if t1.sameType(t2) =>
+     val dataType = findTightestCommonType(pointType1, pointType2).get
--- End diff --

I think the reason is nested structures, for instance an array of arrays or an array of maps.
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r157749627

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -99,6 +99,17 @@ object TypeCoercion {
    case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) =>
      Some(TimestampType)
+
+   case (t1 @ ArrayType(pointType1, nullable1), t2 @ ArrayType(pointType2, nullable2))
+       if t1.sameType(t2) =>
+     val dataType = findTightestCommonType(pointType1, pointType2).get
--- End diff --

Then @bdrillard, can we also add some test cases for nested structures?
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r157787266

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -99,6 +99,17 @@ object TypeCoercion {
    case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) =>
      Some(TimestampType)
+
+   case (t1 @ ArrayType(pointType1, nullable1), t2 @ ArrayType(pointType2, nullable2))
+       if t1.sameType(t2) =>
+     val dataType = findTightestCommonType(pointType1, pointType2).get
--- End diff --

@gczsjdy I think it is needed, because IIUC `sameType` does not check that the nullability is the same. So it might vary, and calling `findTightestCommonType` handles that.
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r157794266

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -158,11 +169,6 @@ object TypeCoercion {
    findTightestCommonType(t1, t2)
      .orElse(findWiderTypeForDecimal(t1, t2))
      .orElse(stringPromotion(t1, t2))
-     .orElse((t1, t2) match {
-       case (ArrayType(et1, containsNull1), ArrayType(et2, containsNull2)) =>
-         findWiderTypeForTwo(et1, et2).map(ArrayType(_, containsNull1 || containsNull2))
-       case _ => None
-     })
--- End diff --

It makes sense, but I'd like your implementation in `findTightestCommonType` to replicate this logic, i.e. removing the `sameType` guard and using `findWiderTypeForTwo`, in order to allow casting an array of int to an array of long. What do you think?
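The recursive behavior being requested, widening array element types rather than requiring `sameType`, can be sketched with a toy type lattice. The types below are hypothetical stand-ins, not Spark's real `DataType` hierarchy; only the recursion pattern mirrors the removed `orElse` branch.

```scala
// Toy type lattice to illustrate recursive widening for arrays.
sealed trait DType
case object IntT extends DType
case object LongT extends DType
case object DoubleT extends DType
case class ArrayT(element: DType, containsNull: Boolean) extends DType

def widerType(t1: DType, t2: DType): Option[DType] = (t1, t2) match {
  case (a, b) if a == b => Some(a)
  case (IntT, LongT) | (LongT, IntT) => Some(LongT)
  case (IntT, DoubleT) | (DoubleT, IntT) => Some(DoubleT)
  case (LongT, DoubleT) | (DoubleT, LongT) => Some(DoubleT)
  // Recurse into element types and OR the nullability flags, instead of
  // demanding that both sides already have the same element type.
  case (ArrayT(e1, n1), ArrayT(e2, n2)) =>
    widerType(e1, e2).map(ArrayT(_, n1 || n2))
  case _ => None
}
```

With the `sameType` guard, `array<int>` vs `array<long>` would fail to coerce; with the recursion above it widens to `array<long>`, and nested cases (array of arrays) fall out for free.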
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/20023

[SPARK-22036][SQL] Decimal multiplication with high precision/scale often returns NULL

## What changes were proposed in this pull request?

When there is an operation between Decimals and the result is a number which is not representable exactly with the result's precision and scale, Spark returns `NULL`. This was done to reflect Hive's behavior, but it is against SQL ANSI 2011, which states that "If the result cannot be represented exactly in the result type, then whether it is rounded or truncated is implementation-defined". Moreover, Hive has since changed its behavior in order to respect the standard, thanks to HIVE-15331. Therefore, the PR proposes to:

- update the rules that determine the result precision and scale according to the new ones introduced in Hive by HIVE-15331, which reflect SQLServer's behavior;
- round the result of an operation when it is not representable exactly with the result's precision and scale, instead of returning `NULL`.

## How was this patch tested?

Modified and added UTs. Comparisons with results of Hive and SQLServer.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-22036

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20023.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20023

commit 3037d4aa6afc4d7630d86d29b8dd7d7d724cc990
Author: Marco Gaido
Date: 2017-12-17T21:45:06Z

    [SPARK-22036][SQL] Decimal multiplication with high precision/scale often returns NULL
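The precision/scale adjustment described above can be sketched for the multiplication case. This is a standalone illustration, not the PR's actual code: the `p1 + p2 + 1` / `s1 + s2` result type is the usual SQLServer/Hive rule, and `MinAdjustedScale = 6` is assumed here as the minimum scale those systems keep when precision overflows.

```scala
val MaxPrecision = 38
val MinAdjustedScale = 6 // assumed, per the Hive/SQLServer convention

// Result type of Decimal(p1, s1) * Decimal(p2, s2): precision p1 + p2 + 1 and
// scale s1 + s2. When precision exceeds 38, sacrifice scale (down to a minimum)
// to preserve the integral digits, instead of returning NULL for the value.
def multiplyResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
  val precision = p1 + p2 + 1
  val scale = s1 + s2
  if (precision <= MaxPrecision) {
    (precision, scale)
  } else {
    val intDigits = precision - scale
    val minScale = math.min(scale, MinAdjustedScale)
    (MaxPrecision, math.max(MaxPrecision - intDigits, minScale))
  }
}
```

For example, `10.30 * 3.000` is `Decimal(4, 2) * Decimal(4, 3)`, giving `Decimal(9, 5)`, which can hold `30.90000` exactly; only when precision overflows 38 does the scale get reduced and the value rounded.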
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157809630

--- Diff: examples/src/main/scala/org/apache/spark/examples/DFSReadWriteTest.scala ---
@@ -49,12 +49,10 @@ object DFSReadWriteTest {
  }

  private def printUsage(): Unit = {
-   val usage: String = "DFS Read-Write Test\n" +
-     "\n" +
-     "Usage: localFile dfsDir\n" +
-     "\n" +
-     "localFile - (string) local file to use in test\n" +
-     "dfsDir - (string) DFS directory for read/write tests\n"
+   val usage = s"""DFS Read-Write Test
--- End diff --

here you should use

```
"""
  |...
""".stripMargin
```

otherwise you introduce a lot of spaces
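The point can be shown concretely; without `stripMargin`, the source file's indentation leaks into the string (the text below is a shortened stand-in for the example's real usage message):

```scala
// Multi-line literal carrying leading source indentation: the problem.
val withSpaces =
  """DFS Read-Write Test
     Usage: localFile dfsDir"""

// With margin markers and stripMargin, the indentation is removed.
val clean =
  """|DFS Read-Write Test
     |Usage: localFile dfsDir""".stripMargin
```

`stripMargin` drops everything up to and including the `|` on each line, so the result is independent of how deeply the literal is indented in the source.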
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157810171

--- Diff: examples/src/main/scala/org/apache/spark/examples/LocalALS.scala ---
@@ -95,7 +95,7 @@ object LocalALS {
  def showWarning() {
    System.err.println(
-     """WARN: This is a naive implementation of ALS and is given as an example!
+     s"""WARN: This is a naive implementation of ALS and is given as an example!
--- End diff --

please revert the change here and do the same in all similar places, since there is no variable to interpolate
[GitHub] spark pull request #20021: [SPARK-22668][SQL] Ensure no global variables in ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20021#discussion_r157812984

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -930,6 +930,16 @@ class CodegenContext {
    // inline execution if only one block
    blocks.head
  } else {
+   if (Utils.isTesting) {
+     // Passing global variables to the split method is dangerous, as any mutating to it is
+     // ignored and may lead to unexpected behavior.
+     val mutableStateNames = mutableStates.map(_._2).toSet
--- End diff --

after finally merging SPARK-18016, this should be the union of `arrayCompactedMutableStates.flatMap(_.arrayNames)` and `inlinedMutableStates.map(_._2)`, I think
[GitHub] spark issue #19340: [SPARK-22119][ML] Add cosine distance to KMeans
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19340

hi @Kevin-Ferret, thanks for looking at this. Yes, you are right, I have not changed the method for updating the centroids. The current method seems to me the most widely adopted for cosine similarity too. Indeed, the same approach is used in RapidMiner (https://docs.rapidminer.com/latest/studio/operators/modeling/segmentation/k_means.html) and also in this paper (https://s3.amazonaws.com/academia.edu.documents/32952068/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf?AWSAccessKeyId=AKIAIWOWYYGZ2Y53UL3A&Expires=1513706450&Signature=MFPcahadw35IpP2o0v%2F51xW7KOM%3D&response-content-disposition=inline%3B%20filename%3DSimilarity_Measures_for_Text_Document_Cl.pdf). I think this is the right place to discuss it, since it is related to the implementation I am proposing in the PR. Thanks.
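For reference, the measure under discussion can be written down in a few lines. This is a plain illustration of cosine similarity on dense vectors, not the PR's actual implementation or Spark's `Vector` API:

```scala
// Cosine similarity between two dense vectors; cosine distance is 1 - similarity.
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}
```

Because the measure is invariant to vector magnitude, centroids computed as plain means still make sense, which is the point made above about keeping the existing centroid-update method.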
[GitHub] spark pull request #20021: [SPARK-22668][SQL] Ensure no global variables in ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20021#discussion_r157820471

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -930,6 +930,18 @@ class CodegenContext {
    // inline execution if only one block
    blocks.head
  } else {
+   if (Utils.isTesting) {
+     // Passing global variables to the split method is dangerous, as any mutating to it is
+     // ignored and may lead to unexpected behavior.
+     // We don't need to check `arrayCompactedMutableStates` here, as it results to array access
--- End diff --

what if we declare a variable with the same name as one of the `arrayCompactedMutableStates`? Let's say that we have:

```
public class Foo {
  private Object[] ourArray;

  private void ourMethod() {
    Object[] ourArray = new Object[1];
    ourSplitFunction(ourArray);
  }

  private void ourSplitFunction(Object[] ourArray) {
    ourArray[0] = null;
  }
}
```
[GitHub] spark pull request #20021: [SPARK-22668][SQL] Ensure no global variables in ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20021#discussion_r157822361

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -930,6 +930,18 @@ class CodegenContext {
    // inline execution if only one block
    blocks.head
  } else {
+   if (Utils.isTesting) {
+     // Passing global variables to the split method is dangerous, as any mutating to it is
+     // ignored and may lead to unexpected behavior.
+     // We don't need to check `arrayCompactedMutableStates` here, as it results to array access
--- End diff --

I don't think so, but currently there is also no place that creates the problem this assertion is being introduced for. Of course this case is very unlikely, but since we are introducing the check, the effort to also cover this very remote corner case is very low...
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19993 @hhbyyh I created a common infrastructure. Please let me know if I have to modify it. Meanwhile, I'd like to discuss where to put the common UTs: do you have any specific idea about the right place? Thanks.
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19993 @hhbyyh thanks for the comments. I have already addressed all of them, but I am waiting to push until we settle the question of the UT. Honestly, I don't think `checkParam` is the best place: checking the exception requires setting the parameters and invoking `transform`, while `checkParam` is very generic and currently sets no parameters. What do you think?
[GitHub] spark pull request #20021: [SPARK-22668][SQL] Ensure no global variables in ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20021#discussion_r157962547

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -930,6 +930,18 @@ class CodegenContext {
    // inline execution if only one block
    blocks.head
  } else {
+   if (Utils.isTesting) {
--- End diff --

As you said, it _may_ lead to unexpected behavior, but likely it doesn't. So I think the best option is to assert this only in testing, where it might help find _potential_ bugs. In production, IMHO it is overkill to throw an exception for a situation which most likely is not a problem.
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r157967667

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -158,11 +169,6 @@ object TypeCoercion {
    findTightestCommonType(t1, t2)
      .orElse(findWiderTypeForDecimal(t1, t2))
      .orElse(stringPromotion(t1, t2))
-     .orElse((t1, t2) match {
-       case (ArrayType(et1, containsNull1), ArrayType(et2, containsNull2)) =>
-         findWiderTypeForTwo(et1, et2).map(ArrayType(_, containsNull1 || containsNull2))
-       case _ => None
-     })
--- End diff --

I like your idea @gczsjdy!
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 @cloud-fan @dongjoon-hyun @gatorsmile @rxin @viirya I saw you worked on these files; maybe you can help review the PR. For further details about the reasons for this PR, please refer to the e-mail I sent to the dev mailing list. Thank you.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 @cloud-fan yes, Hive changed its behavior and, more importantly, we are currently not compliant with the SQL standard. So Spark is returning results which differ from Hive's and are not compliant with the SQL standard. This is why I proposed this change.
[GitHub] spark issue #20021: [SPARK-22668][SQL] Ensure no global variables in argumen...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20021 Honestly, I much preferred performing the check only in testing rather than throwing an exception in production. IMHO it is overkill to throw an exception in production; moreover, in the remote case that we forget one place where this check can throw (which is perfectly possible and not an issue per se), doing so would also cause a regression. Thus, honestly, I am strongly against this solution.
[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19929 kindly ping @cloud-fan @gatorsmile @HyukjinKwon @zero323
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19993 thanks @hhbyyh, I updated the PR according to your suggestion and previous comments.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 thanks for looking at this @hvanhovell. The reasons why I didn't introduce a configuration variable for this behavior are:

1. As far as I know, there is currently no way to reliably read a configuration in catalyst;
2. Also in Hive, the behavior was changed without introducing any configuration to switch back to the previous behavior;
3. Many people are complaining about the current Spark behavior in the JIRA, so the previous behavior seems to be neither desired by nor useful to users.

Let me know if you don't agree with these arguments. Thanks.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 @hvanhovell, as far as point 1 is concerned, I was referring to [this comment](https://github.com/apache/spark/pull/19449#pullrequestreview-67789784) and [this PR](https://github.com/apache/spark/pull/18568), where it is explicitly stated that using `SQLConf` here is not safe and shouldn't be done. Let me know if I am missing something. I am sorry, but I think I haven't fully understood what you meant by

> Or you can wire up the rules using the SessionStateBuilder

May I kindly ask you to elaborate this sentence a bit more? Thank you very much.
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158157137

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -1526,15 +1526,15 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
    checkAnswer(sql("select 10.30 * 3.00"),
      Row(BigDecimal("30.9000", new MathContext(38))))
    checkAnswer(sql("select 10.30 * 3.000"),
-     Row(null))
--- End diff --

The third case, i.e. when we go outside the representable range of values, is never checked in the current codebase. I haven't added a test for it because I was waiting for feedback from the community about how to handle it, and I focused this PR only on points 1 and 2. But I can add a test case for it and eventually change it in a future PR to address point 3 from the e-mail. Thanks.
[GitHub] spark pull request #19993: [SPARK-22799][ML] Bucketizer should throw excepti...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19993#discussion_r158158375

--- Diff: mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala ---
@@ -430,4 +433,49 @@ object ParamsSuite extends SparkFunSuite {
    require(copyReturnType === obj.getClass,
      s"${clazz.getName}.copy should return ${clazz.getName} instead of ${copyReturnType.getName}.")
  }
+
+ /**
+  * Checks that the class throws an exception in case both `inputCols` and `inputCol` are set and
+  * in case both `outputCols` and `outputCol` are set.
+  * These checks are performed only whether the class extends respectively both `HasInputCols` and
+  * `HasInputCol` and both `HasOutputCols` and `HasOutputCol`.
+  *
+  * @param paramsClass The Class to be checked
+  * @param spark A `SparkSession` instance to use
+  */
+ def checkMultiColumnParams(paramsClass: Class[_ <: Params], spark: SparkSession): Unit = {
+   import spark.implicits._
+   // create fake input Dataset
+   val feature1 = Array(-1.0, 0.0, 1.0)
+   val feature2 = Array(1.0, 0.0, -1.0)
+   val df = feature1.zip(feature2).toSeq.toDF("feature1", "feature2")
--- End diff --

The reason why I created the dataframe inside the method was to control the names of its columns; otherwise we cannot ensure that those columns exist. I think the type check is performed later, thus it is not a problem here. What do you think?

I preferred to use `paramsClass: Class[_ <: Params]` because I need a clean instance for each of the two checks: if an instance were passed, I could not enforce that it is clean, i.e. that no parameters were already set, and I would also need to copy it to create new instances, since otherwise the second check would be influenced by the first one. What do you think? Thanks.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 Thank you for your review @dongjoon-hyun. I think what we can do is add more tests to the whitelist in `HiveCompatibilitySuite`, updating them according to HIVE-15331. Were you thinking of this or something different? Thanks.
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158224551

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala ---
@@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType {
    case DoubleType => DoubleDecimal
  }

+ private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match {
+   case v: Short => fromBigDecimal(BigDecimal(v))
+   case v: Int => fromBigDecimal(BigDecimal(v))
+   case v: Long => fromBigDecimal(BigDecimal(v))
+   case _ => forType(literal.dataType)
+ }
+
+ private[sql] def fromBigDecimal(d: BigDecimal): DecimalType = {
+   DecimalType(Math.max(d.precision, d.scale), d.scale)
+ }
+
  private[sql] def bounded(precision: Int, scale: Int): DecimalType = {
    DecimalType(min(precision, MAX_PRECISION), min(scale, MAX_SCALE))
  }
+
+ // scalastyle:off line.size.limit
+ /**
+  * Decimal implementation is based on Hive's one, which is itself inspired to SQLServer's one.
+  * In particular, when a result precision is greater than {@link #MAX_PRECISION}, the
+  * corresponding scale is reduced to prevent the integral part of a result from being truncated.
+  *
+  * For further reference, please see
+  * https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/.
+  *
+  * @param precision
+  * @param scale
+  * @return
+  */
+ // scalastyle:on line.size.limit
+ private[sql] def adjustPrecisionScale(precision: Int, scale: Int): DecimalType = {
+   // Assumptions:
+   // precision >= scale
+   // scale >= 0
--- End diff --

I can add it even though it is not needed... there is no way we can violate those constraints. If you believe it is better to use assert, I will do that.
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158225279

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala ---
@@ -243,17 +248,43 @@ object DecimalPrecision extends TypeCoercionRule {
    // Promote integers inside a binary expression with fixed-precision decimals to decimals,
    // and fixed-precision decimals in an expression with floats / doubles to doubles
    case b @ BinaryOperator(left, right) if left.dataType != right.dataType =>
-     (left.dataType, right.dataType) match {
-       case (t: IntegralType, DecimalType.Fixed(p, s)) =>
-         b.makeCopy(Array(Cast(left, DecimalType.forType(t)), right))
-       case (DecimalType.Fixed(p, s), t: IntegralType) =>
-         b.makeCopy(Array(left, Cast(right, DecimalType.forType(t))))
-       case (t, DecimalType.Fixed(p, s)) if isFloat(t) =>
-         b.makeCopy(Array(left, Cast(right, DoubleType)))
-       case (DecimalType.Fixed(p, s), t) if isFloat(t) =>
-         b.makeCopy(Array(Cast(left, DoubleType), right))
-       case _ =>
-         b
-     }
+     nondecimalLiteralAndDecimal(b).lift((left, right)).getOrElse(
+       nondecimalNonliteralAndDecimal(b).applyOrElse((left.dataType, right.dataType),
+         (_: (DataType, DataType)) => b))
  }

+ /**
+  * Type coercion for BinaryOperator in which one side is a non-decimal literal numeric, and the
+  * other side is a decimal.
+  */
+ private def nondecimalLiteralAndDecimal(
--- End diff --

Yes, it is. If we don't introduce this, we have a failure in the Hive compatibility tests, because Hive uses the exact precision and scale needed by the literals, while before this change we were using conservative values for each type. For instance, if we have `select 123.12345*3`, before this change `3` would have been interpreted as `Decimal(10, 0)`, which is the type for integers. After the change, `3` becomes `Decimal(1, 0)`, as in Hive. This avoids requiring more precision than is actually needed.
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158225412 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match { --- End diff -- yes, please see my comment above for an example. Thanks.
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158225505 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match { +case v: Short => fromBigDecimal(BigDecimal(v)) --- End diff -- No, please see my comments above.
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158225632 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match { +case v: Short => fromBigDecimal(BigDecimal(v)) +case v: Int => fromBigDecimal(BigDecimal(v)) +case v: Long => fromBigDecimal(BigDecimal(v)) +case _ => forType(literal.dataType) + } + + private[sql] def fromBigDecimal(d: BigDecimal): DecimalType = { +DecimalType(Math.max(d.precision, d.scale), d.scale) + } + private[sql] def bounded(precision: Int, scale: Int): DecimalType = { DecimalType(min(precision, MAX_PRECISION), min(scale, MAX_SCALE)) } + // scalastyle:off line.size.limit + /** + * Decimal implementation is based on Hive's one, which is itself inspired to SQLServer's one. + * In particular, when a result precision is greater than {@link #MAX_PRECISION}, the + * corresponding scale is reduced to prevent the integral part of a result from being truncated. + * + * For further reference, please see + * https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/. + * + * @param precision + * @param scale + * @return + */ + // scalastyle:on line.size.limit + private[sql] def adjustPrecisionScale(precision: Int, scale: Int): DecimalType = { +// Assumptions: +// precision >= scale +// scale >= 0 +if (precision <= MAX_PRECISION) { + // Adjustment only needed when we exceed max precision + DecimalType(precision, scale) --- End diff -- this is prevented outside this function.
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158225832 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match { +case v: Short => fromBigDecimal(BigDecimal(v)) +case v: Int => fromBigDecimal(BigDecimal(v)) +case v: Long => fromBigDecimal(BigDecimal(v)) +case _ => forType(literal.dataType) + } + + private[sql] def fromBigDecimal(d: BigDecimal): DecimalType = { +DecimalType(Math.max(d.precision, d.scale), d.scale) + } + private[sql] def bounded(precision: Int, scale: Int): DecimalType = { DecimalType(min(precision, MAX_PRECISION), min(scale, MAX_SCALE)) } + // scalastyle:off line.size.limit + /** + * Decimal implementation is based on Hive's one, which is itself inspired to SQLServer's one. + * In particular, when a result precision is greater than {@link #MAX_PRECISION}, the + * corresponding scale is reduced to prevent the integral part of a result from being truncated. + * + * For further reference, please see + * https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/. + * + * @param precision + * @param scale + * @return + */ + // scalastyle:on line.size.limit + private[sql] def adjustPrecisionScale(precision: Int, scale: Int): DecimalType = { +// Assumptions: +// precision >= scale +// scale >= 0 +if (precision <= MAX_PRECISION) { + // Adjustment only needed when we exceed max precision + DecimalType(precision, scale) +} else { + // Precision/scale exceed maximum precision. Result must be adjusted to MAX_PRECISION. 
+ val intDigits = precision - scale + // If original scale less than MINIMUM_ADJUSTED_SCALE, use original scale value; otherwise + // preserve at least MINIMUM_ADJUSTED_SCALE fractional digits + val minScaleValue = Math.min(scale, MINIMUM_ADJUSTED_SCALE) --- End diff -- It is the `MINIMUM_ADJUSTED_SCALE`. We can't have a scale lower than that, even when avoiding a precision loss would require it.
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158226546 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match { +case v: Short => fromBigDecimal(BigDecimal(v)) +case v: Int => fromBigDecimal(BigDecimal(v)) +case v: Long => fromBigDecimal(BigDecimal(v)) +case _ => forType(literal.dataType) + } + + private[sql] def fromBigDecimal(d: BigDecimal): DecimalType = { +DecimalType(Math.max(d.precision, d.scale), d.scale) + } + private[sql] def bounded(precision: Int, scale: Int): DecimalType = { DecimalType(min(precision, MAX_PRECISION), min(scale, MAX_SCALE)) } + // scalastyle:off line.size.limit + /** + * Decimal implementation is based on Hive's one, which is itself inspired to SQLServer's one. + * In particular, when a result precision is greater than {@link #MAX_PRECISION}, the + * corresponding scale is reduced to prevent the integral part of a result from being truncated. + * + * For further reference, please see + * https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/. + * + * @param precision + * @param scale + * @return + */ + // scalastyle:on line.size.limit + private[sql] def adjustPrecisionScale(precision: Int, scale: Int): DecimalType = { +// Assumptions: +// precision >= scale +// scale >= 0 +if (precision <= MAX_PRECISION) { + // Adjustment only needed when we exceed max precision + DecimalType(precision, scale) +} else { + // Precision/scale exceed maximum precision. Result must be adjusted to MAX_PRECISION. 
+ val intDigits = precision - scale + // If original scale less than MINIMUM_ADJUSTED_SCALE, use original scale value; otherwise + // preserve at least MINIMUM_ADJUSTED_SCALE fractional digits + val minScaleValue = Math.min(scale, MINIMUM_ADJUSTED_SCALE) + val adjustedScale = Math.max(MAX_PRECISION - intDigits, minScaleValue) --- End diff -- It is `max` because we take either the scale that preserves enough room for `intDigits`, i.e. the part on the left of the dot, or `minScaleValue`, which is the minimum scale we guarantee to provide.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 @gatorsmile, please refer to the [e-mail to the dev mail list](https://mail-archives.apache.org/mod_mbox/spark-dev/201712.mbox/%3CCAEorWNAJ4TxJR9NBcgSFMD_VxTg8qVxusjP%2BAJP-x%2BJV9zH-yA%40mail.gmail.com%3E) for further details. I ran the script I added to the tests in this PR; the results are:
- Hive behaves exactly as Spark after this PR;
- SQLServer behaves the same, even though on additions and subtractions it seems to maintain one more precision digit in some cases (I am running SQLServer 2017; since Hive's implementation, and therefore this one too, is inspired by SQLServer 2005, there might have been a small behavior change in this case). Anyway, unlike Hive and Spark, it throws an exception in case 3 described in the e-mail (this is compliant with the SQL standard; point 3 of the e-mail is out of scope of this PR, and I will create another PR for it once we agree on how to handle that case);
- Oracle and Postgres have nearly infinite precision, so it is nearly impossible to provoke a rounding on them. If we force a precision loss on them (point 3 of the e-mail, out of scope of this PR), they throw an exception (compliant with the SQL standard and SQLServer).
Here are the outputs of the queries.
**Hive 2.3.0 (same as Spark after PR)** ``` 0: jdbc:hive2://localhost:1> create table decimals_test(id int, a decimal(38,18), b decimal(38,18)); No rows affected (2.085 seconds) 0: jdbc:hive2://localhost:1> insert into decimals_test values (1, 100.0, 999.0), (2, 12345.123, 12345.123), (3, 0.1234567891011, 1234.1), (4, 123456789123456789.0, 1.123456789123456789); No rows affected (14.054 seconds) 0: jdbc:hive2://localhost:1> select id, a+b, a-b, a*b, a/b from decimals_test order by id; +-+---+---+++ | id | _c1 | _c2 |_c3 |_c4 | +-+---+---+++ | 1 | 1099.0| -899.0 | 99900.00 | 0.100100 | | 2 | 24690.24600 | 0E-17 | 152402061.885129 | 1.00 | | 3 | 1234.2234567891011| -1233.9765432108989 | 152.358023 | 0.000100 | | 4 | 123456789123456790.12345678912345679 | 123456789123456787.87654321087654321 | 138698367904130467.515623 | 109890109097814272.043109 | +-+---+---+++ ``` **SQLServer 2017** ``` 1> create table decimals_test(id int, a decimal(38,18), b decimal(38,18)); 2> insert into decimals_test values (1, 100.0, 999.0), (2, 12345.123, 12345.123), (3, 0.1234567891011, 1234.1), (4, 123456789123456789.0, 1.123456789123456789); 3> select id, a+b, a-b, a*b, a/b from decimals_test order by id; 4> GO (4 rows affected) id --- 1 1099.00 -899.00 99900.00 .100100 2 24690.246000 .00 152402061.885129 1.00 3 1234.22345678910110 -1233.97654321089890 152.358023 .000100 4123456789123456790.123456789123456789 123456789123456787.876543210876543211138698367904130467.515623 109890109097814272.043109 ``` **Postgres and Oracle** ``` postgres=# create table decimals_test(id int, a decimal(38,18), b decimal(38,18)); CREATE TABLE postgres=# insert into decimals_test values (1, 100.0, 999.0), (2, 12345.123, 12345.123), (3, 0.1234567891011, 1234.1), (4, 123456789123456789.0, 1.123456789123456789); INSERT 0 4 postgres=# select id, a+
[GitHub] spark issue #20021: [SPARK-22668][SQL] Ensure no global variables in argumen...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20021 LGTM too, thanks!
[GitHub] spark issue #19940: [SPARK-22750][SQL] Reuse mutable states when possible
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19940 Jenkins, retest this please
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158242030 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match { +case v: Short => fromBigDecimal(BigDecimal(v)) +case v: Int => fromBigDecimal(BigDecimal(v)) +case v: Long => fromBigDecimal(BigDecimal(v)) +case _ => forType(literal.dataType) + } + + private[sql] def fromBigDecimal(d: BigDecimal): DecimalType = { +DecimalType(Math.max(d.precision, d.scale), d.scale) + } + private[sql] def bounded(precision: Int, scale: Int): DecimalType = { DecimalType(min(precision, MAX_PRECISION), min(scale, MAX_SCALE)) } + // scalastyle:off line.size.limit + /** + * Decimal implementation is based on Hive's one, which is itself inspired to SQLServer's one. + * In particular, when a result precision is greater than {@link #MAX_PRECISION}, the + * corresponding scale is reduced to prevent the integral part of a result from being truncated. + * + * For further reference, please see + * https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/. + * + * @param precision + * @param scale + * @return + */ + // scalastyle:on line.size.limit + private[sql] def adjustPrecisionScale(precision: Int, scale: Int): DecimalType = { +// Assumptions: +// precision >= scale +// scale >= 0 +if (precision <= MAX_PRECISION) { + // Adjustment only needed when we exceed max precision + DecimalType(precision, scale) +} else { + // Precision/scale exceed maximum precision. Result must be adjusted to MAX_PRECISION. 
+ val intDigits = precision - scale + // If original scale less than MINIMUM_ADJUSTED_SCALE, use original scale value; otherwise + // preserve at least MINIMUM_ADJUSTED_SCALE fractional digits + val minScaleValue = Math.min(scale, MINIMUM_ADJUSTED_SCALE) --- End diff -- Yes, sorry, my answer was very poor; let me rephrase. `scale` is the scale we need to represent the values without any precision loss. What we are doing here is saying that the lower bound for the scale is either the scale needed to correctly represent the value or `MINIMUM_ADJUSTED_SCALE`, whichever is smaller. After this, in the line below, the scale we will actually use is the max between the precision digits left over after reserving the integral part and this `minScaleValue`: i.e. even when the number of digits needed on the left of the dot would force us to use a scale lower than `MINIMUM_ADJUSTED_SCALE`, we enforce that at least `MINIMUM_ADJUSTED_SCALE` is maintained. We can't let the scale drop below this threshold, even when that would be needed to guarantee we don't lose digits on the left of the dot. Please refer also to the blog post I linked in the comment above for a further (hopefully better) explanation.
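The interplay of the two bounds discussed in this thread is easier to follow as a small Python transliteration of the quoted Scala diff. This is a sketch, not the actual Spark code; the constants match the values named in the discussion.

```python
MAX_PRECISION = 38
MINIMUM_ADJUSTED_SCALE = 6

def adjust_precision_scale(precision, scale):
    """Sketch of adjustPrecisionScale; assumes precision >= scale >= 0."""
    if precision <= MAX_PRECISION:
        # No adjustment needed while we fit in the maximum precision.
        return (precision, scale)
    int_digits = precision - scale
    # Lower bound for the scale: the needed scale, capped at MINIMUM_ADJUSTED_SCALE.
    min_scale_value = min(scale, MINIMUM_ADJUSTED_SCALE)
    # Reserve room for the integral part, but never go below min_scale_value.
    adjusted_scale = max(MAX_PRECISION - int_digits, min_scale_value)
    return (MAX_PRECISION, adjusted_scale)
```

For example, a result needing (45, 10) keeps only 38 - 35 = 3 fractional digits by the integral-part rule, but the floor of 6 wins, yielding (38, 6).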
[GitHub] spark issue #19940: [SPARK-22750][SQL] Reuse mutable states when possible
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19940 the test errors are unrelated to this change. Any other comments @cloud-fan @kiszk @viirya ?
[GitHub] spark issue #19940: [SPARK-22750][SQL] Reuse mutable states when possible
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19940 Jenkins, retest this please
[GitHub] spark pull request #19940: [SPARK-22750][SQL] Reuse mutable states when poss...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19940#discussion_r158336014 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -1193,7 +1196,8 @@ case class ToUTCTimestamp(left: Expression, right: Expression) val dtu = DateTimeUtils.getClass.getName.stripSuffix("$") val tzTerm = ctx.addMutableState(tzClass, "tz", v => s"""$v = $dtu.getTimeZone("$tz");""") -val utcTerm = ctx.addMutableState(tzClass, "utc", +val utcTerm = "tzUTC" +ctx.addImmutableStateIfNotExists(tzClass, utcTerm, v => s"""$v = $dtu.getTimeZone("UTC");""") --- End diff -- yes, there is no difference between them in practice. But I think that being consistent would be better for readability
[GitHub] spark pull request #20043: [SPARK-22856][SQL] Add wrappers for codegen outpu...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20043#discussion_r158361160 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala --- @@ -321,7 +322,12 @@ case class IsNull(child: Expression) extends UnaryExpression with Predicate { override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { val eval = child.genCode(ctx) -ExprCode(code = eval.code, isNull = "false", value = eval.isNull) +val value = if ("true" == s"${eval.isNull}" || "false" == s"${eval.isNull}") { --- End diff -- or `eval.isNull == Literal("true")`? Or even better, we could create a `LiteralTrue = Literal("true")` and an equivalent for false, and use them?
[GitHub] spark pull request #20043: [SPARK-22856][SQL] Add wrappers for codegen outpu...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20043#discussion_r158361417 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -56,7 +56,36 @@ import org.apache.spark.util.{ParentClassLoader, Utils} * @param value A term for a (possibly primitive) value of the result of the evaluation. Not * valid if `isNull` is set to `true`. */ -case class ExprCode(var code: String, var isNull: String, var value: String) +case class ExprCode(var code: String, var isNull: ExprValue, var value: ExprValue) + + +// An abstraction that represents the evaluation result of [[ExprCode]]. +abstract class ExprValue --- End diff -- IMHO I prefer this approach because in the future we might need to distinguish these two cases, thus I think it is a good thing to keep them distinct.
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19993 thanks @hhbyyh, I removed the two overly strict checks.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 @gatorsmile I answered your comments about DB2 in the e-mail. @cloud-fan that would help, but not solve the problem; it would just shift the problem to bigger numbers. As you can see from the e-mail, DB2's behavior is actually in accordance with the SQL standard and the other DBs; it just has a smaller maximum precision. And the case of throwing an exception is point 3 of my e-mails and is out of scope of this PR, because I think we should first discuss which is the right approach in that case, and then I can create a PR for it. Therefore DB2's behavior is also aligned with the other SQL engines and the SQL standard, and Spark is the only one currently behaving differently.
[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19929 @gatorsmile I added the test, but I didn't get what needs to be updated in `registerPython`. Could you explain, please? Thanks.
[GitHub] spark pull request #19929: [SPARK-22629][PYTHON] Add deterministic flag to p...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19929#discussion_r158539355 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonUDF.scala --- @@ -29,9 +29,12 @@ case class PythonUDF( func: PythonFunction, dataType: DataType, children: Seq[Expression], -evalType: Int) +evalType: Int, +udfDeterministic: Boolean = true) --- End diff -- no, it is not needed
[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19929 thank you @cloud-fan, changed.
[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19929 Jenkins, retest this please
[GitHub] spark pull request #19929: [SPARK-22629][PYTHON] Add deterministic flag to p...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19929#discussion_r158579661 --- Diff: python/pyspark/sql/tests.py --- @@ -434,6 +434,16 @@ def test_udf_with_array_type(self): self.assertEqual(list(range(3)), l1) self.assertEqual(1, l2) +def test_nondeterministic_udf(self): +from pyspark.sql.functions import udf +import random +udf_random_col = udf(lambda: int(100 * random.random()), IntegerType()).asNondeterministic() +df = self.spark.createDataFrame([Row(1)]).select(udf_random_col().alias('RAND')) +random.seed(1234) +udf_add_ten = udf(lambda rand: rand + 10, IntegerType()) +[row] = df.withColumn('RAND_PLUS_TEN', udf_add_ten('RAND')).collect() +self.assertEqual(row[0] + 10, row[1]) --- End diff -- I am not sure why, but setting the seed doesn't seem to take effect. I will remove setting the seed.
[GitHub] spark pull request #19929: [SPARK-22629][PYTHON] Add deterministic flag to p...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19929#discussion_r158579754 --- Diff: python/pyspark/sql/functions.py --- @@ -2075,9 +2075,10 @@ class PandasUDFType(object): def udf(f=None, returnType=StringType()): --- End diff -- I followed what was done for the Scala UDF, where this parameter is not added but there is a method to set it. If we add a parameter here, I'd suggest adding it to the Scala API as well.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 @gatorsmile that is scenario 3. I will explain why, and afterwards I will post a correction of the summary in my last e-mail, because I made a mistake about how DB2 computes the result precision and scale; sorry for that. Anyway, what you showed is an example of point 3 because DB2 computes the result type as `DECIMAL( MIN(31, p1 + p2), MIN(31, s1 + s2) )`. Therefore, in your case the result type was `DECIMAL(31, 31)`. Since your result had digits to the left of the decimal point, it was out of the range of the representable values and an overflow exception was thrown. You can reproduce case 2 as follows:
```
db2 => create table decimals_test (id int, a decimal(31,31), b decimal(31,31))
DB20000I The SQL command completed successfully.
db2 => insert into decimals_test values(1, 0.12345678912345678912345689, 0.12345678912345678912345)
DB20000I The SQL command completed successfully.
db2 => select a*b from dd
1
-
.0152415787806736785461049526020
```
As you can see, a truncation occurred. Now, let me amend my table to summarize the behavior of the many DBs:
1. **Rules to determine precision and scale**
 - *Hive, SQLServer (and Spark after the PR)*: I won't include the exact formulas; the relevant part is that when the needed precision is higher than the maximum value, we use the maximum available value (38) as precision, and as scale the maximum between the needed scale (computed according to the relevant formula) and a minimum guaranteed value for the scale, which is 6.
 - *DB2*: computes the result type as `DECIMAL( MIN(31, p1 + p2), MIN(31, s1 + s2) )`.
 - *Postgres and Oracle*: NA
 - *SQL ANSI 2011*: no indication
 - *Spark now*: if the precision needed is more than 38, use 38 as precision; use the needed scale without any adjustment.
2. **Behavior in case of precision loss but result in the range of the representable values**
 - *Oracle, Hive, SQLServer (and Spark after the PR)*: round the result.
 - *DB2*: truncates the result (and sets a warning flag).
 - *Postgres*: NA, it has infinite precision...
 - *SQL ANSI 2011*: either truncate or round the value.
 - *Spark now*: returns NULL.
3. **Behavior in case of result out of the range of the representable values (i.e. overflow)**
 - *Oracle, DB2, SQLServer*: throw an exception.
 - *Postgres*: NA, it has nearly infinite precision...
 - *SQL ANSI 2011*: an exception should be raised.
 - *Spark now, Hive*: return NULL (for Hive, there is an open ticket to make it compliant with the SQL standard).
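For multiplication, the two rule families in point 1 can be compared side by side. The DB2 formula is the one quoted above; the Hive/SQLServer-style exact result type `(p1 + p2 + 1, s1 + s2)` for a multiply is an assumption of this sketch, taken from those systems' documented rules rather than from this thread:

```python
DB2_MAX = 31
SPARK_MAX = 38
MIN_ADJUSTED_SCALE = 6

def db2_multiply_type(p1, s1, p2, s2):
    # DB2: DECIMAL( MIN(31, p1 + p2), MIN(31, s1 + s2) )
    return (min(DB2_MAX, p1 + p2), min(DB2_MAX, s1 + s2))

def hive_multiply_type(p1, s1, p2, s2):
    # Exact result type first, then adjusted when it exceeds the max precision.
    precision, scale = p1 + p2 + 1, s1 + s2
    if precision <= SPARK_MAX:
        return (precision, scale)
    int_digits = precision - scale
    min_scale = min(scale, MIN_ADJUSTED_SCALE)
    return (SPARK_MAX, max(SPARK_MAX - int_digits, min_scale))
```

Multiplying two `decimal(38,18)` columns yields `DECIMAL(31, 31)` under the DB2 rule (the overflowing case above) and `DECIMAL(38, 6)` under the Hive/SQLServer-style rule.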
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 Thanks for your analysis @gatorsmile. Actually, the rule you specified for Oracle is what it uses when casting, rather than when doing arithmetic operations. Yes, DB2 has rather different rules to define the output type of operations. Anyway, we can have a behavior practically identical to DB2's by changing the value of `MINIMUM_ADJUSTED_SCALE` to 31. Therefore, instead of using the configuration you pointed out, I'd propose a configuration for `MINIMUM_ADJUSTED_SCALE`: by changing it we can obtain both the behavior of Hive and SQLServer and the one of DB2. What do you think? The reason I am suggesting this is that my first concern is not Hive compliance, but SQL standard compliance. Indeed, as you can see from the summary, on point 1 there is no uniform behavior (this is fine with respect to the SQL standard, since it gives freedom here). But on point 2 we are the only ones not compliant with the SQL standard, and having this behavior by default doesn't look like the right thing to do IMHO. On point 3, only we and Hive are not compliant; thus I think that should be changed too. But in that case we can't use the same flag, because it would be inconsistent. What do you think? I can understand and agree that losing precision looks scary. But to me returning `NULL` is even scarier: indeed, `NULL` is what should be returned if either of the two operands is `NULL`. Thus queries that rely on this when running on other DBs might return very bad results. For instance, think of a report where we join a prices table and a sold_product table per country. In this use case, we can assume that if the result is `NULL`, it means there was no sold product in that country, and then coalesce the output of the multiplication to 0. This would work well on any DB but Spark.
With my proposal of making `MINIMUM_ADJUSTED_SCALE` tunable, each customer can decide (query by query) how much precision loss they can tolerate. And if we agree to change the point 3 behavior to the SQL standard one, then when it is not possible to meet their requirements we throw an exception, giving them the choice of what to do: allow more precision loss, change their input data type, etc. This is the safer way IMHO. I would be happy to help improve the test cases. May I just kindly ask how you meant to do that? What would you like to see tested more? Would you like me to add more test cases in the scope of this PR, or to open a new one for that? Thank you for taking the time to read my long messages. I just want us to make the best choice, and to give you all the elements I have so we can decide for the best together. Thank you.
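The round-versus-raise alternatives discussed here (points 2 and 3) can both be observed with Python's standard `decimal` module, used purely as an analogy for the engine behaviors, not as Spark code: a context either silently rounds to its precision, or raises when the `Inexact` signal is trapped.

```python
from decimal import Context, Decimal, Inexact

# Rounding behavior (Oracle, Hive, SQLServer): the result is rounded to fit.
rounding_ctx = Context(prec=38)
rounded = rounding_ctx.divide(Decimal(1), Decimal(3))  # 38 significant digits

# Strict behavior (SQL standard on intolerable precision loss): raise instead.
strict_ctx = Context(prec=38, traps=[Inexact])
try:
    strict_ctx.divide(Decimal(1), Decimal(3))
    raised = False
except Inexact:
    raised = True
```

Returning `NULL`, Spark's current behavior, has no analogue in the `decimal` module at all, which is part of the argument above: it conflates precision loss with a null operand.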
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 Thanks @gatorsmile. Then should I create a follow-up PR for #20008 covering cases 2 and 3 before moving forward with this PR, or can we proceed with this PR and the test cases it adds?
[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19929 @gatorsmile, yes, the reason why the seed doesn't work lies in the way Python UDFs are executed, i.e. a new Python process is created for each partition to evaluate the Python UDF. Thus the seed is set only on the driver, but not in the process where the UDF is executed. What I am saying can be easily confirmed by this:
```
>>> from pyspark.sql.functions import udf
>>> import os
>>> pid_udf = udf(lambda: str(os.getpid()))
>>> spark.range(2).select(pid_udf()).show()
+--+
|()|
+--+
| 4132|
| 4130|
+--+
>>> os.getpid()
4070
```
Therefore there is no easy way to set the seed. If I set it inside the UDF, the UDF would become deterministic. Therefore I think that the best option is the current test.
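The driver/worker isolation described above can be reproduced with plain Python processes, with no Spark involved. The snippet below is a stand-alone illustration of why a seed set on the driver cannot influence a separately started worker process:

```python
import random
import subprocess
import sys

random.seed(1234)
driver_value = random.random()  # deterministic: the seed is set in this process

# A separate interpreter, like a pyspark UDF worker, starts with its own RNG
# state; the seed set above never reaches it.
worker_output = subprocess.run(
    [sys.executable, "-c", "import random; print(random.random())"],
    capture_output=True, text=True, check=True,
).stdout.strip()
worker_value = float(worker_output)
```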
[GitHub] spark issue #19929: [SPARK-22901][PYTHON] Add deterministic flag to pyspark ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19929 @gatorsmile done, thanks!
[GitHub] spark pull request #20084: [SPARK-22904][SQL] Add tests for decimal operatio...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/20084 [SPARK-22904][SQL] Add tests for decimal operations and string casts ## What changes were proposed in this pull request? Test coverage for arithmetic operations leading to:
1. Precision loss
2. Overflow
Moreover, tests for casting bad strings to other input types and for using bad strings as operands of some functions. ## How was this patch tested? Added tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mgaido91/spark SPARK-22904 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20084.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20084 commit 7a9650aa4737768a4428de65841050ca8b157528 Author: Marco Gaido Date: 2017-12-26T17:44:01Z [SPARK-22904][SQL] Add tests for decimal operations and string casts
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 @gatorsmile created #20084 for adding tests. Thanks.
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19993 @MLnick @viirya @WeichenXu123 any further comments on this?
[GitHub] spark issue #20084: [SPARK-22904][SQL] Add tests for decimal operations and ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20084 cc @gatorsmile
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r158843869 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala --- @@ -158,11 +213,8 @@ object TypeCoercion { findTightestCommonType(t1, t2) .orElse(findWiderTypeForDecimal(t1, t2)) .orElse(stringPromotion(t1, t2)) - .orElse((t1, t2) match { -case (ArrayType(et1, containsNull1), ArrayType(et2, containsNull2)) => - findWiderTypeForTwo(et1, et2).map(ArrayType(_, containsNull1 || containsNull2)) -case _ => None - }) + .orElse(findWiderTypeForTwoComplex(t1, t2, findWiderTypeForTwo)) + --- End diff -- this line should be removed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r158844162 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala --- @@ -389,6 +389,25 @@ class TypeCoercionSuite extends AnalysisTest { widenTest(StringType, MapType(IntegerType, StringType, true), None) widenTest(ArrayType(IntegerType), StructType(Seq()), None) +widenTest( + ArrayType(StringType, containsNull=true), --- End diff -- please add a whitespace before and after `=` and the same in the following lines --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r158844322 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala --- @@ -389,6 +389,25 @@ class TypeCoercionSuite extends AnalysisTest { widenTest(StringType, MapType(IntegerType, StringType, true), None) widenTest(ArrayType(IntegerType), StructType(Seq()), None) +widenTest( + ArrayType(StringType, containsNull=true), + ArrayType(StringType, containsNull=false), + Some(ArrayType(StringType, containsNull=true))) +widenTest( + MapType(StringType, StringType, valueContainsNull=true), + MapType(StringType, StringType, valueContainsNull=false), + Some(MapType(StringType, StringType, valueContainsNull=true))) --- End diff -- please add a test also for struct --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r158844437 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala --- @@ -389,6 +389,25 @@ class TypeCoercionSuite extends AnalysisTest { widenTest(StringType, MapType(IntegerType, StringType, true), None) widenTest(ArrayType(IntegerType), StructType(Seq()), None) +widenTest( + ArrayType(StringType, containsNull=true), + ArrayType(StringType, containsNull=false), + Some(ArrayType(StringType, containsNull=true))) +widenTest( + MapType(StringType, StringType, valueContainsNull=true), + MapType(StringType, StringType, valueContainsNull=false), + Some(MapType(StringType, StringType, valueContainsNull=true))) + +widenTest( + StructType(StructField("a", ArrayType(StringType, containsNull=true)) :: Nil), + StructType(StructField("a", ArrayType(StringType, containsNull=false)) :: Nil), + Some(StructType(StructField("a", ArrayType(StringType, containsNull=true)) :: Nil))) +widenTest( + StructType(StructField("a", MapType(StringType, StringType, valueContainsNull=true)) :: Nil), + StructType(StructField("a", MapType(StringType, StringType, valueContainsNull=false)) :: Nil), + Some(StructType( +StructField("a", MapType(StringType, StringType, valueContainsNull=true)) :: Nil))) --- End diff -- thanks, may you please add also two tests having Map and Array as outer types? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r158844477 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala --- @@ -469,12 +488,21 @@ class TypeCoercionSuite extends AnalysisTest { ArrayType(ArrayType(IntegerType), containsNull = false), ArrayType(ArrayType(LongType), containsNull = false), Some(ArrayType(ArrayType(LongType), containsNull = false))) +widenTestWithStringPromotion( + StructType(StructField("a", ArrayType(LongType)) :: Nil), + StructType(StructField("a", ArrayType(StringType)) :: Nil), + Some(StructType(StructField("a", ArrayType(StringType)) :: Nil))) --- End diff -- please add an analogous test also for map and array --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r158844501 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala --- @@ -469,12 +488,21 @@ class TypeCoercionSuite extends AnalysisTest { ArrayType(ArrayType(IntegerType), containsNull = false), ArrayType(ArrayType(LongType), containsNull = false), Some(ArrayType(ArrayType(LongType), containsNull = false))) +widenTestWithStringPromotion( + StructType(StructField("a", ArrayType(LongType)) :: Nil), + StructType(StructField("a", ArrayType(StringType)) :: Nil), + Some(StructType(StructField("a", ArrayType(StringType)) :: Nil))) + // Without string promotion widenTestWithoutStringPromotion(IntegerType, StringType, None) widenTestWithoutStringPromotion(StringType, TimestampType, None) widenTestWithoutStringPromotion(ArrayType(LongType), ArrayType(StringType), None) widenTestWithoutStringPromotion(ArrayType(StringType), ArrayType(TimestampType), None) +widenTestWithoutStringPromotion( + StructType(StructField("a", ArrayType(LongType)) :: Nil), + StructType(StructField("a", ArrayType(StringType)) :: Nil), + None) --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r158844539 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala --- @@ -469,12 +488,21 @@ class TypeCoercionSuite extends AnalysisTest { ArrayType(ArrayType(IntegerType), containsNull = false), ArrayType(ArrayType(LongType), containsNull = false), Some(ArrayType(ArrayType(LongType), containsNull = false))) +widenTestWithStringPromotion( + StructType(StructField("a", ArrayType(LongType)) :: Nil), + StructType(StructField("a", ArrayType(StringType)) :: Nil), + Some(StructType(StructField("a", ArrayType(StringType)) :: Nil))) + --- End diff -- this empty line is not needed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over Struct...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20010 @bdrillard yes, after addressing my last comments I think we can build this. My only concern is that we are not covering the case of nested complex structures in the wider type, but maybe we can address this later. I don't have permissions to trigger a build. You might check who the last committers working on this are and ask them. Thanks.
[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20029 From your analysis, what seems never to be closed is the client used to interact with the metastore. This might be a problem we are not aware of in normal SQL applications, since we have only one client in those cases. What you are doing in your fix is avoiding creating a client for each `HiveSessionBuilder`, thus: 1. this would mean that we are creating more than one `SessionBuilder`, i.e. more than one `SparkSession`, which is not true as far as I know; 2. every session would share the same client to connect to the metastore, which is wrong IMHO. Please let me know if I misunderstood or got something wrong.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 @cloud-fan thanks for your feedback. Honestly I don't have a strong opinion on how we should behave in that case. That's why I wanted some feedback from the community on that point, while this PR covers only the other case, i.e. rounding in case of precision loss, in accordance with Hive and SQLServer's behavior. Anyway, IIUC this PR is now blocked by #18853, which introduces `spark.sql.typeCoercion.mode`; that property would decide whether to keep the previous behavior or the one proposed here.
[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20029 > The hiveClient created for the resourceLoader is only used to addJar, which is, in turn, to add Jar to the shared IsolatedClientLoader. Then we can just use the shared hive client for this purpose. @liufengdb does it mean that we are creating more than one SparkSession in the thriftserver?
[GitHub] spark issue #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over Struct...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20010 @gczsjdy @bdrillard the test errors are valid. In some cases an exception is expected to be thrown, but it isn't, due to the fix. So they should be updated accordingly.
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19993 @jkbradley maybe you can also help reviewing this, thanks.
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19492 kindly ping @viirya @HyukjinKwon
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r159119696 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala --- @@ -148,6 +160,61 @@ object TypeCoercion { case (l, r) => None } + /** + * Case 2 type widening over complex types. `widerTypeFunc` is a function that finds the wider + * type over point types. The `widerTypeFunc` specifies behavior over whether types should be + * promoted to StringType. + */ + private def findWiderTypeForTwoComplex( + t1: DataType, + t2: DataType, + widerTypeFunc: (DataType, DataType) => Option[DataType]): Option[DataType] = { +(t1, t2) match { + case (_, _) if t1 == t2 => Some(t1) + case (NullType, _) => Some(t1) + case (_, NullType) => Some(t1) + + case (ArrayType(pointType1, nullable1), ArrayType(pointType2, nullable2)) => +val dataType = widerTypeFunc.apply(pointType1, pointType2) + +dataType.map(ArrayType(_, nullable1 || nullable2)) + + case (MapType(keyType1, valueType1, nullable1), MapType(keyType2, valueType2, nullable2)) => +val keyType = widerTypeFunc.apply(keyType1, keyType2) +val valueType = widerTypeFunc.apply(valueType1, valueType2) + +if (keyType.nonEmpty && valueType.nonEmpty) { + Some(MapType(keyType.get, valueType.get, nullable1 || nullable2)) +} else { + None +} + + case (StructType(fields1), StructType(fields2)) => +val fieldTypes = fields1.zip(fields2).map { case (f1, f2) => --- End diff -- this requires that fields with the same name must also be in the same position. Is this assumption correct? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
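The concern about `fields1.zip(fields2)` can be demonstrated with a small sketch (Python used here for brevity; the field names are illustrative): `zip` pairs elements purely by position, so two structs with the same fields in different orders get matched incorrectly.

```python
# Two struct schemas with identical fields, but in different order.
fields1 = [("a", "int"), ("b", "string")]
fields2 = [("b", "string"), ("a", "int")]

# zip pairs by position and ignores names: ("a", "int") from the first
# struct is matched with ("b", "string") from the second, not with the
# other struct's ("a", "int").
paired = list(zip(fields1, fields2))
print(paired[0])  # (('a', 'int'), ('b', 'string'))
```

So the widening logic is only correct under the assumption that fields with the same name also appear in the same position, which is the point of the review question.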
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 thanks @gatorsmile, do you have any suggestion for the conf name?
[GitHub] spark issue #19224: [SPARK-20990][SQL] Read all JSON documents in files when...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19224 yes, next week I will rebase it on top of the current master. Thanks @gatorsmile
[GitHub] spark pull request #20174: [SPARK-22951][SQL] aggregate should not produce e...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20174#discussion_r160114031 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala --- @@ -102,10 +102,12 @@ case class HashAggregateExec( val beforeAgg = System.nanoTime() val hasInput = iter.hasNext - val res = if (!hasInput && groupingExpressions.nonEmpty) { -// This is a grouped aggregate and the input iterator is empty, + val res = if (!hasInput && (groupingExpressions.nonEmpty || resultExpressions.isEmpty)) { +// The input iterator is empty, and this is a grouped aggregate or without result columns, // so return an empty iterator. Iterator.empty + } else if (!hasInput && resultExpressions.isEmpty) { --- End diff -- This is covered in the previous condition, isn't it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r160115072 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala --- @@ -462,27 +507,139 @@ class TypeCoercionSuite extends AnalysisTest { ArrayType(DoubleType, containsNull = false), Some(ArrayType(DoubleType, containsNull = true))) widenTestWithStringPromotion( - ArrayType(TimestampType, containsNull = false), - ArrayType(StringType, containsNull = true), + ArrayType(ArrayType(IntegerType), containsNull = true), + ArrayType(ArrayType(LongType), containsNull = false), + Some(ArrayType(ArrayType(LongType), containsNull = true))) --- End diff -- I think this is more coherent with the behavior in other parts and this is the behavior I would expect. But I think that we should follow @gatorsmile's suggestion and check Hive's behavior first. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19993: [SPARK-22799][ML] Bucketizer should throw excepti...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19993#discussion_r160143590 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala --- @@ -249,6 +250,27 @@ object ParamValidators { def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = { (value: Array[T]) => value.length > lowerBound } + + /** + * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If + * this is not true, an `IllegalArgumentException` is raised. + * @param model + */ + private[spark] def checkMultiColumnParams(model: Params): Unit = { +model match { + case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) => +raiseIncompatibleParamsException("inputCols", "inputCol") + case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) => +raiseIncompatibleParamsException("outputCols", "outputCol") + case _ => +} --- End diff -- I think we can use that method once merged, thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r160147366 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -117,6 +117,7 @@ object DecimalType extends AbstractDataType { val MAX_SCALE = 38 val SYSTEM_DEFAULT: DecimalType = DecimalType(MAX_PRECISION, 18) val USER_DEFAULT: DecimalType = DecimalType(10, 0) + val MINIMUM_ADJUSTED_SCALE = 6 --- End diff -- Yes, I followed Hive's implementation which works like this and applies this 6 digits minimum to all operations. This means that SQLServer allows to round more digits than us in those cases, ie. we ensure at least 6 digits for the scale, while SQLServer doesn't. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
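For reference, here is a sketch of the Hive-style adjustment being discussed (Python; the constant names mirror the diff, the function name `adjust_precision_scale` is hypothetical): when a result's precision exceeds the 38-digit maximum, the scale is reduced to make room for the integral digits, but at least `MINIMUM_ADJUSTED_SCALE` fractional digits are kept (or the original scale, if that is smaller).

```python
MAX_PRECISION = 38
MINIMUM_ADJUSTED_SCALE = 6

def adjust_precision_scale(precision, scale):
    # Result type fits within the maximum precision: keep it unchanged.
    if precision <= MAX_PRECISION:
        return precision, scale
    # Otherwise give the integral part the digits it needs, while
    # guaranteeing at least min(scale, MINIMUM_ADJUSTED_SCALE) digits
    # for the fractional part.
    int_digits = precision - scale
    min_scale = min(scale, MINIMUM_ADJUSTED_SCALE)
    adjusted_scale = max(MAX_PRECISION - int_digits, min_scale)
    return MAX_PRECISION, adjusted_scale

# e.g. Decimal(38,10) * Decimal(38,10): raw result type is (77, 20)
print(adjust_precision_scale(77, 20))  # -> (38, 6)
```

This shows the trade-off raised in the thread: the 6-digit floor means Spark (following Hive) may round away more integral headroom than SQLServer, which has no such minimum.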
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19492 any more comments on this @viirya @HyukjinKwon?
[GitHub] spark issue #19224: [SPARK-20990][SQL] Read all JSON documents in files when...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19224 @gatorsmile I think it is now ready for review, thanks.
[GitHub] spark pull request #20189: [SPARK-22975] MetricsReporter should not throw ex...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/20189 [SPARK-22975] MetricsReporter should not throw exception when there was no progress reported
## What changes were proposed in this pull request?
`MetricsReporter` assumes that there has been some progress for the query, i.e. `lastProgress` is not null. If this is not true, as can happen under particular conditions, a `NullPointerException` can be thrown. The PR checks whether there is a `lastProgress`; if not, it returns a default value for the metrics.
## How was this patch tested?
added UT
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mgaido91/spark SPARK-22975
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20189.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #20189
commit 1185a513bd807df03a24b22370903d17301fd415
Author: Marco Gaido
Date: 2018-01-08T22:28:12Z
[SPARK-22975] MetricsReporter should not throw exception when there was no progress reported
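The shape of the fix, sketched outside Spark (Python; the class and attribute names are illustrative, not the actual `MetricsReporter` API): each gauge falls back to a default value when the query has not reported any progress yet.

```python
class MetricsReporterSketch:
    """Hypothetical reporter that reads gauges off a streaming query."""

    def __init__(self, query):
        self.query = query

    def gauge(self, extract, default=0.0):
        # last_progress is None until the first trigger completes;
        # without this guard, extracting a metric would fail on the
        # None access (the NPE described in the PR).
        progress = self.query.last_progress
        return default if progress is None else extract(progress)

class FakeQuery:
    last_progress = None  # no progress reported yet

reporter = MetricsReporterSketch(FakeQuery())
print(reporter.gauge(lambda p: p["inputRowsPerSecond"]))  # 0.0
```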
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r160361785 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -117,6 +117,7 @@ object DecimalType extends AbstractDataType { val MAX_SCALE = 38 val SYSTEM_DEFAULT: DecimalType = DecimalType(MAX_PRECISION, 18) val USER_DEFAULT: DecimalType = DecimalType(10, 0) + val MINIMUM_ADJUSTED_SCALE = 6 --- End diff -- @gatorsmile what about `spark.sql.decimalOperations.mode` which defaults to `native` and accepts also `hive` (and in future also `sql2011` for throwing exception instead of returning NULL)? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r160394589 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -117,6 +117,7 @@ object DecimalType extends AbstractDataType { val MAX_SCALE = 38 val SYSTEM_DEFAULT: DecimalType = DecimalType(MAX_PRECISION, 18) val USER_DEFAULT: DecimalType = DecimalType(10, 0) + val MINIMUM_ADJUSTED_SCALE = 6 --- End diff -- ok, I'll go with that, thanks @cloud-fan. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20174 @liancheng @liufengdb if the problem affects only the `dropDuplicates` method, can't we just check there if there are no columns and avoid doing anything at all in that case?
[GitHub] spark issue #20189: [SPARK-22975] MetricsReporter should not throw exception...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20189 @tdas @zsxwing may I kindly ask if you might take a look at this? Thanks!
[GitHub] spark pull request #20219: [SPARK-23025][SQL] Support Null type in scala ref...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/20219 [SPARK-23025][SQL] Support Null type in scala reflection
## What changes were proposed in this pull request?
Add support for `Null` type in the `schemaFor` method for Scala reflection.
## How was this patch tested?
Added UT
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mgaido91/spark SPARK-23025
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20219.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #20219
commit 98010236c013b24c0ecfa6357efaacba1a30ee84
Author: Marco Gaido
Date: 2018-01-10T12:08:37Z
[SPARK-23025][SQL] Support Null type in scala reflection
[GitHub] spark pull request #20219: [SPARK-23025][SQL] Support Null type in scala ref...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20219#discussion_r160687957 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala --- @@ -1441,6 +1441,13 @@ class DatasetSuite extends QueryTest with SharedSQLContext { assert(e.getCause.isInstanceOf[NullPointerException]) } } + + test("SPARK-23025: Add support for null type in scala reflection") { +val data = Seq(("a", null)) +checkDataset( + data.toDS(), + data: _*) --- End diff -- sure, since in this file in many places this syntax is used, should I change also other usages accordingly? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20174 I see, thanks for your answer @liancheng
[GitHub] spark pull request #20174: [SPARK-22951][SQL] fix aggregation after dropDupl...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20174#discussion_r160772523 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala --- @@ -198,6 +198,20 @@ class ReplaceOperatorSuite extends PlanTest { comparePlans(optimized, correctAnswer) } + test("add one grouping key if necessary when replace Deduplicate with Aggregate") { +val input = LocalRelation() +val query = Deduplicate(Seq.empty, input) // dropDuplicates() +val optimized = Optimize.execute(query.analyze) + +val correctAnswer = --- End diff -- nit: this can be all in one line --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20226: [SPARK-23034][SQL][UI] Display tablename for `HiveTableS...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20226 LGTM too
[GitHub] spark issue #20189: [SPARK-22975][SS] MetricsReporter should not throw excep...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20189 @zsxwing the test you suggested fails on Jenkins while it passes locally. I am not sure about the reason, since I cannot reproduce it. I had the same test failure in my previous attempts; back then, the problem was that running other test cases before this one triggered the issue. May I ask your advice on what to do? Can I revert to my previous test?
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 any more comments @gatorsmile @cloud-fan @dongjoon-hyun @viirya @hvanhovell ?
[GitHub] spark issue #19224: [SPARK-20990][SQL] Read all JSON documents in files when...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19224 kindly ping @gatorsmile