[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19993 @hhbyyh thanks for the review. I see that for some classes there are ongoing PRs. Thus I cannot change them now in order to have a common place and a common test. Should I wait for those PRs to be merged then? Or were you suggesting something different? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19993: [SPARK-22799][ML] Bucketizer should throw excepti...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19993#discussion_r157414340

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -140,10 +140,10 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   * by `inputCol`. A warning will be printed if both are set.
   */
  private[feature] def isBucketizeMultipleColumns(): Boolean = {
-   if (isSet(inputCols) && isSet(inputCol)) {
-     logWarning("Both `inputCol` and `inputCols` are set, we ignore `inputCols` and this " +
-       "`Bucketizer` only map one column specified by `inputCol`")
-     false
+   if (isSet(inputCols) && isSet(inputCol) || isSet(inputCols) && isSet(outputCol) ||
+       isSet(inputCol) && isSet(outputCols)) {
+     throw new IllegalArgumentException("Both `inputCol` and `inputCols` are set, `Bucketizer` " +
+       "only supports setting either `inputCol` or `inputCols`.")
--- End diff --

Thanks, I will add these checks while doing the changes requested by @hhbyyh.
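The mutually-exclusive parameter check under discussion can be sketched outside Spark. This is a minimal model: `MiniBucketizer` and its `Option`-based fields are hypothetical stand-ins for Spark's `Param`/`isSet` machinery; only the validation logic mirrors the diff.

```scala
// Hypothetical stand-in for Bucketizer's single- vs multi-column parameters.
case class MiniBucketizer(
    inputCol: Option[String] = None,
    inputCols: Option[Seq[String]] = None,
    outputCol: Option[String] = None,
    outputCols: Option[Seq[String]] = None) {

  // Returns true when operating on multiple columns; throws if single-column
  // and multi-column parameters are mixed, as the new diff proposes.
  def isBucketizeMultipleColumns(): Boolean = {
    val mixed = (inputCols.isDefined && inputCol.isDefined) ||
      (inputCols.isDefined && outputCol.isDefined) ||
      (inputCol.isDefined && outputCols.isDefined)
    if (mixed) {
      throw new IllegalArgumentException(
        "`Bucketizer` only supports setting either `inputCol` or `inputCols`.")
    }
    inputCols.isDefined
  }
}
```

Throwing instead of logging a warning (the old behavior) makes the misconfiguration fail fast rather than silently ignoring `inputCols`.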
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r157415970

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
@@ -112,15 +112,14 @@ case class Like(left: Expression, right: Expression) extends StringRegexExpressi
  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    val patternClass = classOf[Pattern].getName
    val escapeFunc = StringUtils.getClass.getName.stripSuffix("$") + ".escapeLikeRegex"
-   val pattern = ctx.freshName("pattern")
    if (right.foldable) {
      val rVal = right.eval()
      if (rVal != null) {
        val regexStr =
          StringEscapeUtils.escapeJava(escape(rVal.asInstanceOf[UTF8String].toString()))
-       ctx.addMutableState(patternClass, pattern,
-         s"""$pattern = ${patternClass}.compile("$regexStr");""")
+       val pattern = ctx.addMutableState(patternClass, "patternLike",
+         v => s"""$v = ${patternClass}.compile("$regexStr");""", forceInline = true)
--- End diff --

My only concern is about point 2: I think it is a dangerous thing to do. What if we generate a lot of frequently used variables? I think it is safer at the moment to consider only 1 and 3 when deciding whether to inline or not. In the future, with a different codegen method, we might define a threshold above which we generate an array for the given class and otherwise use plain variables, which IMHO would be the best option, but at the moment it is not feasible.
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r157518488

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
@@ -112,15 +112,14 @@ case class Like(left: Expression, right: Expression) extends StringRegexExpressi
  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    val patternClass = classOf[Pattern].getName
    val escapeFunc = StringUtils.getClass.getName.stripSuffix("$") + ".escapeLikeRegex"
-   val pattern = ctx.freshName("pattern")
    if (right.foldable) {
      val rVal = right.eval()
      if (rVal != null) {
        val regexStr =
          StringEscapeUtils.escapeJava(escape(rVal.asInstanceOf[UTF8String].toString()))
-       ctx.addMutableState(patternClass, pattern,
-         s"""$pattern = ${patternClass}.compile("$regexStr");""")
+       val pattern = ctx.addMutableState(patternClass, "patternLike",
+         v => s"""$v = ${patternClass}.compile("$regexStr");""", forceInline = true)
--- End diff --

If 3 is a precondition for 2, then it is ok. Thanks for the explanation.
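The API shape being discussed (an `addMutableState` that generates the variable name itself and takes the initializer as a function of that name) can be modeled in isolation. This is a simplified toy registry, not Spark's actual `CodegenContext`; the field names here are illustrative only.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model of the SPARK-18016 addMutableState API shape (not the real class).
class MiniCodegenContext {
  private var counter = 0
  // (javaType, variableName) pairs for states declared as plain fields.
  val inlinedMutableStates = ArrayBuffer.empty[(String, String)]
  val initCodes = ArrayBuffer.empty[String]

  private def freshName(prefix: String): String = { counter += 1; s"${prefix}_$counter" }

  // Returns the generated variable name, so callers no longer pick names
  // themselves; the init snippet is built from the fresh name via `init`.
  def addMutableState(javaType: String, namePrefix: String,
      init: String => String, forceInline: Boolean = false): String = {
    val name = freshName(namePrefix)
    inlinedMutableStates += ((javaType, name))
    initCodes += init(name)
    name
  }
}
```

Returning the fresh name is what lets the context decide later whether a state is inlined as a field or compacted into an array, which is exactly the inlining-heuristic trade-off debated above.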
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19811 LGTM too, thanks @kiszk and @bdrillard! This is a very important PR IMHO
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r157691450

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -99,6 +99,17 @@ object TypeCoercion {
    case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) =>
      Some(TimestampType)
+
+   case (t1 @ ArrayType(pointType1, nullable1), t2 @ ArrayType(pointType2, nullable2))
+       if t1.sameType(t2) =>
--- End diff --

why is this `if t1.sameType(t2)` needed?
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r157749463

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -99,6 +99,17 @@ object TypeCoercion {
    case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) =>
      Some(TimestampType)
+
+   case (t1 @ ArrayType(pointType1, nullable1), t2 @ ArrayType(pointType2, nullable2))
+       if t1.sameType(t2) =>
+     val dataType = findTightestCommonType(pointType1, pointType2).get
--- End diff --

I think the reason is nested structures, for instance an array of arrays or an array of maps.
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r157749627

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -99,6 +99,17 @@ object TypeCoercion {
    case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) =>
      Some(TimestampType)
+
+   case (t1 @ ArrayType(pointType1, nullable1), t2 @ ArrayType(pointType2, nullable2))
+       if t1.sameType(t2) =>
+     val dataType = findTightestCommonType(pointType1, pointType2).get
--- End diff --

Then @bdrillard, can we also add some test cases for nested structures?
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r157787266

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -99,6 +99,17 @@ object TypeCoercion {
    case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) =>
      Some(TimestampType)
+
+   case (t1 @ ArrayType(pointType1, nullable1), t2 @ ArrayType(pointType2, nullable2))
+       if t1.sameType(t2) =>
+     val dataType = findTightestCommonType(pointType1, pointType2).get
--- End diff --

@gczsjdy I think it is needed, because IIUC `sameType` does not check that the nullability is the same. So it might vary, and calling `findTightestCommonType` handles that.
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r157794266

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -158,11 +169,6 @@ object TypeCoercion {
    findTightestCommonType(t1, t2)
      .orElse(findWiderTypeForDecimal(t1, t2))
      .orElse(stringPromotion(t1, t2))
-     .orElse((t1, t2) match {
-       case (ArrayType(et1, containsNull1), ArrayType(et2, containsNull2)) =>
-         findWiderTypeForTwo(et1, et2).map(ArrayType(_, containsNull1 || containsNull2))
-       case _ => None
-     })
--- End diff --

It makes sense, but I'd like your implementation in `findTightestCommonType` to replicate this logic, i.e. removing the `sameType` guard and using `findWiderTypeForTwo`, in order to allow casting an array of int to an array of long. What do you think?
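The recursive behavior being requested, widening array element types rather than requiring `sameType`, can be sketched with a toy type lattice. The types below are hypothetical stand-ins, not Spark's real `DataType` hierarchy; only the recursion pattern mirrors the removed `orElse` branch.

```scala
// Toy type lattice to illustrate recursive widening for arrays.
sealed trait DType
case object IntT extends DType
case object LongT extends DType
case object DoubleT extends DType
case class ArrayT(element: DType, containsNull: Boolean) extends DType

def widerType(t1: DType, t2: DType): Option[DType] = (t1, t2) match {
  case (a, b) if a == b => Some(a)
  case (IntT, LongT) | (LongT, IntT) => Some(LongT)
  case (IntT, DoubleT) | (DoubleT, IntT) => Some(DoubleT)
  case (LongT, DoubleT) | (DoubleT, LongT) => Some(DoubleT)
  // Recurse into element types and OR the nullability flags, instead of
  // demanding that both sides already have the same element type.
  case (ArrayT(e1, n1), ArrayT(e2, n2)) =>
    widerType(e1, e2).map(ArrayT(_, n1 || n2))
  case _ => None
}
```

With the `sameType` guard, `array<int>` vs `array<long>` would fail to coerce; with the recursion above it widens to `array<long>`, and nested cases (array of arrays) fall out for free.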
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/20023

[SPARK-22036][SQL] Decimal multiplication with high precision/scale often returns NULL

## What changes were proposed in this pull request?

When there is an operation between Decimals and the result is a number which is not representable exactly with the result's precision and scale, Spark returns `NULL`. This was done to reflect Hive's behavior, but it is against SQL ANSI 2011, which states that "If the result cannot be represented exactly in the result type, then whether it is rounded or truncated is implementation-defined". Moreover, Hive has since changed its behavior in order to respect the standard, thanks to HIVE-15331. Therefore, the PR proposes to:

- update the rules that determine the result precision and scale according to the new ones introduced in Hive by HIVE-15331, which reflect SQLServer's behavior;
- round the result of an operation when it is not representable exactly with the result's precision and scale, instead of returning `NULL`.

## How was this patch tested?

Modified and added UTs. Comparisons with results of Hive and SQLServer.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-22036

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20023.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20023

commit 3037d4aa6afc4d7630d86d29b8dd7d7d724cc990
Author: Marco Gaido
Date: 2017-12-17T21:45:06Z

    [SPARK-22036][SQL] Decimal multiplication with high precision/scale often returns NULL
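The precision/scale adjustment described above can be sketched for the multiplication case. This is a standalone illustration, not the PR's actual code: the `p1 + p2 + 1` / `s1 + s2` result type is the usual SQLServer/Hive rule, and `MinAdjustedScale = 6` is assumed here as the minimum scale those systems keep when precision overflows.

```scala
val MaxPrecision = 38
val MinAdjustedScale = 6 // assumed, per the Hive/SQLServer convention

// Result type of Decimal(p1, s1) * Decimal(p2, s2): precision p1 + p2 + 1 and
// scale s1 + s2. When precision exceeds 38, sacrifice scale (down to a minimum)
// to preserve the integral digits, instead of returning NULL for the value.
def multiplyResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
  val precision = p1 + p2 + 1
  val scale = s1 + s2
  if (precision <= MaxPrecision) {
    (precision, scale)
  } else {
    val intDigits = precision - scale
    val minScale = math.min(scale, MinAdjustedScale)
    (MaxPrecision, math.max(MaxPrecision - intDigits, minScale))
  }
}
```

For example, `10.30 * 3.000` is `Decimal(4, 2) * Decimal(4, 3)`, giving `Decimal(9, 5)`, which can hold `30.90000` exactly; only when precision overflows 38 does the scale get reduced and the value rounded.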
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157809630

--- Diff: examples/src/main/scala/org/apache/spark/examples/DFSReadWriteTest.scala ---
@@ -49,12 +49,10 @@ object DFSReadWriteTest {
  }

  private def printUsage(): Unit = {
-   val usage: String = "DFS Read-Write Test\n" +
-     "\n" +
-     "Usage: localFile dfsDir\n" +
-     "\n" +
-     "localFile - (string) local file to use in test\n" +
-     "dfsDir - (string) DFS directory for read/write tests\n"
+   val usage = s"""DFS Read-Write Test
--- End diff --

here you should use

```
"""
  |...
""".stripMargin
```

otherwise you introduce a lot of spaces
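The point can be shown concretely; without `stripMargin`, the source file's indentation leaks into the string (the text below is a shortened stand-in for the example's real usage message):

```scala
// Multi-line literal carrying leading source indentation: the problem.
val withSpaces =
  """DFS Read-Write Test
     Usage: localFile dfsDir"""

// With margin markers and stripMargin, the indentation is removed.
val clean =
  """|DFS Read-Write Test
     |Usage: localFile dfsDir""".stripMargin
```

`stripMargin` drops everything up to and including the `|` on each line, so the result is independent of how deeply the literal is indented in the source.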
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157810171

--- Diff: examples/src/main/scala/org/apache/spark/examples/LocalALS.scala ---
@@ -95,7 +95,7 @@ object LocalALS {
  def showWarning() {
    System.err.println(
-     """WARN: This is a naive implementation of ALS and is given as an example!
+     s"""WARN: This is a naive implementation of ALS and is given as an example!
--- End diff --

please revert the change here and do the same in all similar places, since there is no variable to interpolate
[GitHub] spark pull request #20021: [SPARK-22668][SQL] Ensure no global variables in ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20021#discussion_r157812984

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -930,6 +930,16 @@ class CodegenContext {
    // inline execution if only one block
    blocks.head
  } else {
+   if (Utils.isTesting) {
+     // Passing global variables to the split method is dangerous, as any mutating to it is
+     // ignored and may lead to unexpected behavior.
+     val mutableStateNames = mutableStates.map(_._2).toSet
--- End diff --

after finally merging SPARK-18016, this should be the union of `arrayCompactedMutableStates.flatMap(_.arrayNames)` and `inlinedMutableStates.map(_._2)`, I think
[GitHub] spark issue #19340: [SPARK-22119][ML] Add cosine distance to KMeans
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19340

hi @Kevin-Ferret, thanks for looking at this. Yes, you are right, I have not changed the method for updating the centroids. The current method seems to me the most widely adopted for cosine similarity too. Indeed, the same approach is used in RapidMiner (https://docs.rapidminer.com/latest/studio/operators/modeling/segmentation/k_means.html) and also in this paper (https://s3.amazonaws.com/academia.edu.documents/32952068/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf?AWSAccessKeyId=AKIAIWOWYYGZ2Y53UL3A&Expires=1513706450&Signature=MFPcahadw35IpP2o0v%2F51xW7KOM%3D&response-content-disposition=inline%3B%20filename%3DSimilarity_Measures_for_Text_Document_Cl.pdf). I think this is the right place to discuss it, since it is related to the implementation I am proposing in the PR. Thanks.
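For reference, the measure under discussion can be written down in a few lines. This is a plain illustration of cosine similarity on dense vectors, not the PR's actual implementation or Spark's `Vector` API:

```scala
// Cosine similarity between two dense vectors; cosine distance is 1 - similarity.
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}
```

Because the measure is invariant to vector magnitude, centroids computed as plain means still make sense, which is the point made above about keeping the existing centroid-update method.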
[GitHub] spark pull request #20021: [SPARK-22668][SQL] Ensure no global variables in ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20021#discussion_r157820471

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -930,6 +930,18 @@ class CodegenContext {
    // inline execution if only one block
    blocks.head
  } else {
+   if (Utils.isTesting) {
+     // Passing global variables to the split method is dangerous, as any mutating to it is
+     // ignored and may lead to unexpected behavior.
+     // We don't need to check `arrayCompactedMutableStates` here, as it results to array access
--- End diff --

what if we declare a variable with the same name as one of the `arrayCompactedMutableStates`? Let's say that we have:

```
public class Foo {
  private Object[] ourArray;

  private void ourMethod() {
    Object[] ourArray = new Object[1];
    ourSplitFunction(ourArray);
  }

  private void ourSplitFunction(Object[] ourArray) {
    ourArray[0] = null;
  }
}
```
[GitHub] spark pull request #20021: [SPARK-22668][SQL] Ensure no global variables in ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20021#discussion_r157822361

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -930,6 +930,18 @@ class CodegenContext {
    // inline execution if only one block
    blocks.head
  } else {
+   if (Utils.isTesting) {
+     // Passing global variables to the split method is dangerous, as any mutating to it is
+     // ignored and may lead to unexpected behavior.
+     // We don't need to check `arrayCompactedMutableStates` here, as it results to array access
--- End diff --

I don't think so, but currently there is also no place that creates the problem this assertion is being introduced for. Of course this case is very unlikely, but since we are introducing the check, the effort to also cover this very remote corner case is very low...
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19993 @hhbyyh I created a common infrastructure. Please let me know if I have to modify it. Meanwhile, I'd like to discuss where to put the common UTs: do you have any specific idea about the right place? Thanks.
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19993 @hhbyyh thanks for the comments. I have already addressed all of them, but I am waiting to push until we settle the question of the UT. Honestly, I don't think `checkParam` is the best place: checking the exception requires setting the parameters and invoking `transform`, while `checkParam` is very generic and currently sets no parameters. What do you think?
[GitHub] spark pull request #20021: [SPARK-22668][SQL] Ensure no global variables in ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20021#discussion_r157962547

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -930,6 +930,18 @@ class CodegenContext {
    // inline execution if only one block
    blocks.head
  } else {
+   if (Utils.isTesting) {
--- End diff --

As you said, it _may_ lead to unexpected behavior, but likely it doesn't. So I think the best option is to assert this only in testing, where it might help find _potential_ bugs. In production, IMHO it is overkill to throw an exception for a situation which most likely is not a problem.
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r157967667

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -158,11 +169,6 @@ object TypeCoercion {
    findTightestCommonType(t1, t2)
      .orElse(findWiderTypeForDecimal(t1, t2))
      .orElse(stringPromotion(t1, t2))
-     .orElse((t1, t2) match {
-       case (ArrayType(et1, containsNull1), ArrayType(et2, containsNull2)) =>
-         findWiderTypeForTwo(et1, et2).map(ArrayType(_, containsNull1 || containsNull2))
-       case _ => None
-     })
--- End diff --

I like your idea @gczsjdy!
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 @cloud-fan @dongjoon-hyun @gatorsmile @rxin @viirya I saw you worked on these files; maybe you can help review the PR. For further details about the reasons for this PR, please refer to the e-mail I sent to the dev mailing list. Thank you.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 @cloud-fan yes, Hive changed its behavior and, more importantly, we are currently not compliant with the SQL standard. So Spark is returning results which differ from Hive's and are not compliant with the SQL standard. This is why I proposed this change.
[GitHub] spark issue #20021: [SPARK-22668][SQL] Ensure no global variables in argumen...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20021 Honestly, I much preferred performing the check only in testing rather than throwing an exception in production. IMHO it is overkill to throw an exception in production; moreover, in the remote case that we forget one place where this check can throw (which is perfectly possible and not an issue per se), doing so would also cause a regression. Thus, honestly, I am strongly against this solution.
[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19929 kindly ping @cloud-fan @gatorsmile @HyukjinKwon @zero323
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19993 thanks @hhbyyh, I updated the PR according to your suggestion and previous comments.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 thanks for looking at this @hvanhovell. The reasons why I didn't introduce a configuration variable for this behavior are:

1. As far as I know, there is currently no way to reliably read a configuration in catalyst;
2. Also in Hive, the behavior was changed without introducing any configuration to switch back to the previous behavior;
3. Many people are complaining about the current Spark behavior in the JIRA, so the previous behavior seems to be neither desired by nor useful to users.

Let me know if you don't agree with these arguments. Thanks.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 @hvanhovell, as far as point 1 is concerned, I was referring to [this comment](https://github.com/apache/spark/pull/19449#pullrequestreview-67789784) and [this PR](https://github.com/apache/spark/pull/18568), where it is explicitly stated that using `SQLConf` here is not safe and shouldn't be done. Let me know if I am missing something. I am sorry, but I think I haven't fully understood what you meant by

> Or you can wire up the rules using the SessionStateBuilder

May I kindly ask you to elaborate this sentence a bit more? Thank you very much.
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158157137

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -1526,15 +1526,15 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
    checkAnswer(sql("select 10.30 * 3.00"),
      Row(BigDecimal("30.9000", new MathContext(38))))
    checkAnswer(sql("select 10.30 * 3.000"),
-     Row(null))
--- End diff --

The third case, i.e. when we go outside the representable range of values, is never checked in the current codebase. I haven't added a test for it because I was waiting for feedback from the community about how to handle it, and I focused this PR only on points 1 and 2. But I can add a test case for it and eventually change it in a future PR to address point 3 from the e-mail. Thanks.
[GitHub] spark pull request #19993: [SPARK-22799][ML] Bucketizer should throw excepti...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19993#discussion_r158158375

--- Diff: mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala ---
@@ -430,4 +433,49 @@ object ParamsSuite extends SparkFunSuite {
    require(copyReturnType === obj.getClass,
      s"${clazz.getName}.copy should return ${clazz.getName} instead of ${copyReturnType.getName}.")
  }
+
+ /**
+  * Checks that the class throws an exception in case both `inputCols` and `inputCol` are set and
+  * in case both `outputCols` and `outputCol` are set.
+  * These checks are performed only whether the class extends respectively both `HasInputCols` and
+  * `HasInputCol` and both `HasOutputCols` and `HasOutputCol`.
+  *
+  * @param paramsClass The Class to be checked
+  * @param spark A `SparkSession` instance to use
+  */
+ def checkMultiColumnParams(paramsClass: Class[_ <: Params], spark: SparkSession): Unit = {
+   import spark.implicits._
+   // create fake input Dataset
+   val feature1 = Array(-1.0, 0.0, 1.0)
+   val feature2 = Array(1.0, 0.0, -1.0)
+   val df = feature1.zip(feature2).toSeq.toDF("feature1", "feature2")
--- End diff --

The reason why I created the dataframe inside the method was to control the names of its columns; otherwise we cannot ensure that those columns exist. I think the type check is performed later, thus it is not a problem here. What do you think?

I preferred to use `paramsClass: Class[_ <: Params]` because I need a clean instance for each of the two checks: if an instance were passed, I could not enforce that it is clean, i.e. that no parameters were already set, and I would also need to copy it to create new instances, since otherwise the second check would be influenced by the first one. What do you think? Thanks.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 Thank you for your review @dongjoon-hyun. I think what we can do is add more tests to the whitelist in `HiveCompatibilitySuite`, updating them according to HIVE-15331. Were you thinking of this or something different? Thanks.
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158224551

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala ---
@@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType {
    case DoubleType => DoubleDecimal
  }

+ private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match {
+   case v: Short => fromBigDecimal(BigDecimal(v))
+   case v: Int => fromBigDecimal(BigDecimal(v))
+   case v: Long => fromBigDecimal(BigDecimal(v))
+   case _ => forType(literal.dataType)
+ }
+
+ private[sql] def fromBigDecimal(d: BigDecimal): DecimalType = {
+   DecimalType(Math.max(d.precision, d.scale), d.scale)
+ }
+
  private[sql] def bounded(precision: Int, scale: Int): DecimalType = {
    DecimalType(min(precision, MAX_PRECISION), min(scale, MAX_SCALE))
  }
+
+ // scalastyle:off line.size.limit
+ /**
+  * Decimal implementation is based on Hive's one, which is itself inspired to SQLServer's one.
+  * In particular, when a result precision is greater than {@link #MAX_PRECISION}, the
+  * corresponding scale is reduced to prevent the integral part of a result from being truncated.
+  *
+  * For further reference, please see
+  * https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/.
+  *
+  * @param precision
+  * @param scale
+  * @return
+  */
+ // scalastyle:on line.size.limit
+ private[sql] def adjustPrecisionScale(precision: Int, scale: Int): DecimalType = {
+   // Assumptions:
+   // precision >= scale
+   // scale >= 0
--- End diff --

I can add it even though it is not needed... there is no way we can violate those constraints. If you believe it is better to use assert, I will do that.
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158225279

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala ---
@@ -243,17 +248,43 @@ object DecimalPrecision extends TypeCoercionRule {
    // Promote integers inside a binary expression with fixed-precision decimals to decimals,
    // and fixed-precision decimals in an expression with floats / doubles to doubles
    case b @ BinaryOperator(left, right) if left.dataType != right.dataType =>
-     (left.dataType, right.dataType) match {
-       case (t: IntegralType, DecimalType.Fixed(p, s)) =>
-         b.makeCopy(Array(Cast(left, DecimalType.forType(t)), right))
-       case (DecimalType.Fixed(p, s), t: IntegralType) =>
-         b.makeCopy(Array(left, Cast(right, DecimalType.forType(t))))
-       case (t, DecimalType.Fixed(p, s)) if isFloat(t) =>
-         b.makeCopy(Array(left, Cast(right, DoubleType)))
-       case (DecimalType.Fixed(p, s), t) if isFloat(t) =>
-         b.makeCopy(Array(Cast(left, DoubleType), right))
-       case _ =>
-         b
-     }
+     nondecimalLiteralAndDecimal(b).lift((left, right)).getOrElse(
+       nondecimalNonliteralAndDecimal(b).applyOrElse((left.dataType, right.dataType),
+         (_: (DataType, DataType)) => b))
  }

+ /**
+  * Type coercion for BinaryOperator in which one side is a non-decimal literal numeric, and the
+  * other side is a decimal.
+  */
+ private def nondecimalLiteralAndDecimal(
--- End diff --

Yes, it is. If we don't introduce this, we have a failure in the Hive compatibility tests, because Hive uses the exact precision and scale needed by the literals, while before this change we were using conservative values for each type. For instance, if we have `select 123.12345*3`, before this change `3` would have been interpreted as `Decimal(10, 0)`, which is the type for integers. After the change, `3` becomes `Decimal(1, 0)`, as in Hive. This avoids requiring more precision than is actually needed.
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158225412 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match { --- End diff -- yes, please see my comment above for an example. Thanks.
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158225505 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match { +case v: Short => fromBigDecimal(BigDecimal(v)) --- End diff -- No, please see my comments above.
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158225632 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match { +case v: Short => fromBigDecimal(BigDecimal(v)) +case v: Int => fromBigDecimal(BigDecimal(v)) +case v: Long => fromBigDecimal(BigDecimal(v)) +case _ => forType(literal.dataType) + } + + private[sql] def fromBigDecimal(d: BigDecimal): DecimalType = { +DecimalType(Math.max(d.precision, d.scale), d.scale) + } + private[sql] def bounded(precision: Int, scale: Int): DecimalType = { DecimalType(min(precision, MAX_PRECISION), min(scale, MAX_SCALE)) } + // scalastyle:off line.size.limit + /** + * Decimal implementation is based on Hive's one, which is itself inspired to SQLServer's one. + * In particular, when a result precision is greater than {@link #MAX_PRECISION}, the + * corresponding scale is reduced to prevent the integral part of a result from being truncated. + * + * For further reference, please see + * https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/. + * + * @param precision + * @param scale + * @return + */ + // scalastyle:on line.size.limit + private[sql] def adjustPrecisionScale(precision: Int, scale: Int): DecimalType = { +// Assumptions: +// precision >= scale +// scale >= 0 +if (precision <= MAX_PRECISION) { + // Adjustment only needed when we exceed max precision + DecimalType(precision, scale) --- End diff -- this is prevented outside this function.
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158225832 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match { +case v: Short => fromBigDecimal(BigDecimal(v)) +case v: Int => fromBigDecimal(BigDecimal(v)) +case v: Long => fromBigDecimal(BigDecimal(v)) +case _ => forType(literal.dataType) + } + + private[sql] def fromBigDecimal(d: BigDecimal): DecimalType = { +DecimalType(Math.max(d.precision, d.scale), d.scale) + } + private[sql] def bounded(precision: Int, scale: Int): DecimalType = { DecimalType(min(precision, MAX_PRECISION), min(scale, MAX_SCALE)) } + // scalastyle:off line.size.limit + /** + * Decimal implementation is based on Hive's one, which is itself inspired to SQLServer's one. + * In particular, when a result precision is greater than {@link #MAX_PRECISION}, the + * corresponding scale is reduced to prevent the integral part of a result from being truncated. + * + * For further reference, please see + * https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/. + * + * @param precision + * @param scale + * @return + */ + // scalastyle:on line.size.limit + private[sql] def adjustPrecisionScale(precision: Int, scale: Int): DecimalType = { +// Assumptions: +// precision >= scale +// scale >= 0 +if (precision <= MAX_PRECISION) { + // Adjustment only needed when we exceed max precision + DecimalType(precision, scale) +} else { + // Precision/scale exceed maximum precision. Result must be adjusted to MAX_PRECISION. 
+ val intDigits = precision - scale + // If original scale less than MINIMUM_ADJUSTED_SCALE, use original scale value; otherwise + // preserve at least MINIMUM_ADJUSTED_SCALE fractional digits + val minScaleValue = Math.min(scale, MINIMUM_ADJUSTED_SCALE) --- End diff -- It is the `MINIMUM_ADJUSTED_SCALE`. We can't have a scale lower than that, even when avoiding a precision loss would require it.
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158226546 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match { +case v: Short => fromBigDecimal(BigDecimal(v)) +case v: Int => fromBigDecimal(BigDecimal(v)) +case v: Long => fromBigDecimal(BigDecimal(v)) +case _ => forType(literal.dataType) + } + + private[sql] def fromBigDecimal(d: BigDecimal): DecimalType = { +DecimalType(Math.max(d.precision, d.scale), d.scale) + } + private[sql] def bounded(precision: Int, scale: Int): DecimalType = { DecimalType(min(precision, MAX_PRECISION), min(scale, MAX_SCALE)) } + // scalastyle:off line.size.limit + /** + * Decimal implementation is based on Hive's one, which is itself inspired to SQLServer's one. + * In particular, when a result precision is greater than {@link #MAX_PRECISION}, the + * corresponding scale is reduced to prevent the integral part of a result from being truncated. + * + * For further reference, please see + * https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/. + * + * @param precision + * @param scale + * @return + */ + // scalastyle:on line.size.limit + private[sql] def adjustPrecisionScale(precision: Int, scale: Int): DecimalType = { +// Assumptions: +// precision >= scale +// scale >= 0 +if (precision <= MAX_PRECISION) { + // Adjustment only needed when we exceed max precision + DecimalType(precision, scale) +} else { + // Precision/scale exceed maximum precision. Result must be adjusted to MAX_PRECISION. 
+ val intDigits = precision - scale + // If original scale less than MINIMUM_ADJUSTED_SCALE, use original scale value; otherwise + // preserve at least MINIMUM_ADJUSTED_SCALE fractional digits + val minScaleValue = Math.min(scale, MINIMUM_ADJUSTED_SCALE) + val adjustedScale = Math.max(MAX_PRECISION - intDigits, minScaleValue) --- End diff -- It is `max` because we take either the scale that preserves enough room for `intDigits`, i.e. the part on the left of the dot, or `minScaleValue`, which is the minimum scale we guarantee to provide.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 @gatorsmile, please refer to the [e-mail to the dev mail list](https://mail-archives.apache.org/mod_mbox/spark-dev/201712.mbox/%3CCAEorWNAJ4TxJR9NBcgSFMD_VxTg8qVxusjP%2BAJP-x%2BJV9zH-yA%40mail.gmail.com%3E) for further details. I ran the script I added to the tests in this PR; the results are:
- Hive behaves exactly as Spark after this PR;
- SQLServer behaves the same, even though on additions and subtractions it seems to maintain one more precision digit in some cases (I am running SQLServer 2017; since Hive's implementation, and therefore this one too, is inspired by SQLServer 2005, there might have been a small behavior change in this case). Anyway, unlike Hive and Spark, it throws an exception in case 3 described in the e-mail (this is compliant with the SQL standard; point 3 of the e-mail is out of scope of this PR, and I will create another PR for it once we agree on how to handle that case);
- Oracle and Postgres have nearly infinite precision, so it is nearly impossible to provoke a rounding on them. If we force a precision loss on them (point 3 of the e-mail, out of scope of this PR), they throw an exception (compliant with the SQL standard and SQLServer).
Here are the outputs of the queries.
**Hive 2.3.0 (same as Spark after PR)** ``` 0: jdbc:hive2://localhost:1> create table decimals_test(id int, a decimal(38,18), b decimal(38,18)); No rows affected (2.085 seconds) 0: jdbc:hive2://localhost:1> insert into decimals_test values (1, 100.0, 999.0), (2, 12345.123, 12345.123), (3, 0.1234567891011, 1234.1), (4, 123456789123456789.0, 1.123456789123456789); No rows affected (14.054 seconds) 0: jdbc:hive2://localhost:1> select id, a+b, a-b, a*b, a/b from decimals_test order by id; +-+---+---+++ | id | _c1 | _c2 |_c3 |_c4 | +-+---+---+++ | 1 | 1099.0| -899.0 | 99900.00 | 0.100100 | | 2 | 24690.24600 | 0E-17 | 152402061.885129 | 1.00 | | 3 | 1234.2234567891011| -1233.9765432108989 | 152.358023 | 0.000100 | | 4 | 123456789123456790.12345678912345679 | 123456789123456787.87654321087654321 | 138698367904130467.515623 | 109890109097814272.043109 | +-+---+---+++ ``` **SQLServer 2017** ``` 1> create table decimals_test(id int, a decimal(38,18), b decimal(38,18)); 2> insert into decimals_test values (1, 100.0, 999.0), (2, 12345.123, 12345.123), (3, 0.1234567891011, 1234.1), (4, 123456789123456789.0, 1.123456789123456789); 3> select id, a+b, a-b, a*b, a/b from decimals_test order by id; 4> GO (4 rows affected) id --- 1 1099.00 -899.00 99900.00 .100100 2 24690.246000 .00 152402061.885129 1.00 3 1234.22345678910110 -1233.97654321089890 152.358023 .000100 4123456789123456790.123456789123456789 123456789123456787.876543210876543211138698367904130467.515623 109890109097814272.043109 ``` **Postgres and Oracle** ``` postgres=# create table decimals_test(id int, a decimal(38,18), b decimal(38,18)); CREATE TABLE postgres=# insert into decimals_test values (1, 100.0, 999.0), (2, 12345.123, 12345.123), (3, 0.1234567891011, 1234.1), (4, 123456789123456789.0, 1.123456789123456789); INSERT 0 4 postgres=# select id, a+
[GitHub] spark issue #20021: [SPARK-22668][SQL] Ensure no global variables in argumen...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20021 LGTM too, thanks!
[GitHub] spark issue #19940: [SPARK-22750][SQL] Reuse mutable states when possible
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19940 Jenkins, retest this please
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r158242030 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match { +case v: Short => fromBigDecimal(BigDecimal(v)) +case v: Int => fromBigDecimal(BigDecimal(v)) +case v: Long => fromBigDecimal(BigDecimal(v)) +case _ => forType(literal.dataType) + } + + private[sql] def fromBigDecimal(d: BigDecimal): DecimalType = { +DecimalType(Math.max(d.precision, d.scale), d.scale) + } + private[sql] def bounded(precision: Int, scale: Int): DecimalType = { DecimalType(min(precision, MAX_PRECISION), min(scale, MAX_SCALE)) } + // scalastyle:off line.size.limit + /** + * Decimal implementation is based on Hive's one, which is itself inspired to SQLServer's one. + * In particular, when a result precision is greater than {@link #MAX_PRECISION}, the + * corresponding scale is reduced to prevent the integral part of a result from being truncated. + * + * For further reference, please see + * https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/. + * + * @param precision + * @param scale + * @return + */ + // scalastyle:on line.size.limit + private[sql] def adjustPrecisionScale(precision: Int, scale: Int): DecimalType = { +// Assumptions: +// precision >= scale +// scale >= 0 +if (precision <= MAX_PRECISION) { + // Adjustment only needed when we exceed max precision + DecimalType(precision, scale) +} else { + // Precision/scale exceed maximum precision. Result must be adjusted to MAX_PRECISION. 
+ val intDigits = precision - scale + // If original scale less than MINIMUM_ADJUSTED_SCALE, use original scale value; otherwise + // preserve at least MINIMUM_ADJUSTED_SCALE fractional digits + val minScaleValue = Math.min(scale, MINIMUM_ADJUSTED_SCALE) --- End diff -- Yes, sorry, my answer was very poor; let me rephrase. `scale` is the scale we need to represent the values without any precision loss. What we are doing here is saying that the lower bound for the scale is either the scale needed to correctly represent the value or `MINIMUM_ADJUSTED_SCALE`, whichever is smaller. After this, in the line below, the scale we will actually use is the max between the precision digits left over after reserving the integral part and this `minScaleValue`: i.e. even when the number of digits needed on the left of the dot would force us to use a scale lower than `MINIMUM_ADJUSTED_SCALE`, we enforce that at least `MINIMUM_ADJUSTED_SCALE` is maintained. We can't let the scale drop below this threshold, even when that would be needed to guarantee we don't lose digits on the left of the dot. Please refer also to the blog post I linked in the comment above for a further (hopefully better) explanation.
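The interplay of the two bounds discussed in this thread is easier to follow as a small Python transliteration of the quoted Scala diff. This is a sketch, not the actual Spark code; the constants match the values named in the discussion.

```python
MAX_PRECISION = 38
MINIMUM_ADJUSTED_SCALE = 6

def adjust_precision_scale(precision, scale):
    """Sketch of adjustPrecisionScale; assumes precision >= scale >= 0."""
    if precision <= MAX_PRECISION:
        # No adjustment needed while we fit in the maximum precision.
        return (precision, scale)
    int_digits = precision - scale
    # Lower bound for the scale: the needed scale, capped at MINIMUM_ADJUSTED_SCALE.
    min_scale_value = min(scale, MINIMUM_ADJUSTED_SCALE)
    # Reserve room for the integral part, but never go below min_scale_value.
    adjusted_scale = max(MAX_PRECISION - int_digits, min_scale_value)
    return (MAX_PRECISION, adjusted_scale)
```

For example, a result needing (45, 10) keeps only 38 - 35 = 3 fractional digits by the integral-part rule, but the floor of 6 wins, yielding (38, 6).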
[GitHub] spark issue #19940: [SPARK-22750][SQL] Reuse mutable states when possible
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19940 the test errors are unrelated to this change. Any other comments @cloud-fan @kiszk @viirya ?
[GitHub] spark issue #19940: [SPARK-22750][SQL] Reuse mutable states when possible
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19940 Jenkins, retest this please
[GitHub] spark pull request #19940: [SPARK-22750][SQL] Reuse mutable states when poss...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19940#discussion_r158336014 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -1193,7 +1196,8 @@ case class ToUTCTimestamp(left: Expression, right: Expression) val dtu = DateTimeUtils.getClass.getName.stripSuffix("$") val tzTerm = ctx.addMutableState(tzClass, "tz", v => s"""$v = $dtu.getTimeZone("$tz");""") -val utcTerm = ctx.addMutableState(tzClass, "utc", +val utcTerm = "tzUTC" +ctx.addImmutableStateIfNotExists(tzClass, utcTerm, v => s"""$v = $dtu.getTimeZone("UTC");""") --- End diff -- yes, there is no difference between them in practice. But I think that being consistent would be better for readability
[GitHub] spark pull request #20043: [SPARK-22856][SQL] Add wrappers for codegen outpu...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20043#discussion_r158361160 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala --- @@ -321,7 +322,12 @@ case class IsNull(child: Expression) extends UnaryExpression with Predicate { override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { val eval = child.genCode(ctx) -ExprCode(code = eval.code, isNull = "false", value = eval.isNull) +val value = if ("true" == s"${eval.isNull}" || "false" == s"${eval.isNull}") { --- End diff -- or `eval.isNull == Literal("true")`? Or even better, we could create a `LiteralTrue = Literal("true")` and an equivalent for false, and use them?
[GitHub] spark pull request #20043: [SPARK-22856][SQL] Add wrappers for codegen outpu...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20043#discussion_r158361417 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -56,7 +56,36 @@ import org.apache.spark.util.{ParentClassLoader, Utils} * @param value A term for a (possibly primitive) value of the result of the evaluation. Not * valid if `isNull` is set to `true`. */ -case class ExprCode(var code: String, var isNull: String, var value: String) +case class ExprCode(var code: String, var isNull: ExprValue, var value: ExprValue) + + +// An abstraction that represents the evaluation result of [[ExprCode]]. +abstract class ExprValue --- End diff -- IMHO I prefer this approach because in the future we might need to distinguish these two cases, thus I think it is a good thing to keep them distinct.
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19993 thanks @hhbyyh, I removed the two overly strict checks.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 @gatorsmile I answered your comments about DB2 in the e-mail. @cloud-fan that would help, but not solve the problem; it would just shift the problem to bigger numbers. As you can see from the e-mail, DB2's behavior is actually in accordance with the SQL standard and the other DBs; it just has a smaller maximum precision. And the case of throwing an exception is point 3 of my e-mails and is out of scope of this PR, because I think we should first discuss which is the right approach in that case, and then I can create a PR for it. Therefore DB2's behavior is also aligned with the other SQL engines and the SQL standard, and Spark is the only one currently behaving differently.
[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19929 @gatorsmile I added the test, but I didn't get what needs to be updated in `registerPython`. Could you explain, please? Thanks.
[GitHub] spark pull request #19929: [SPARK-22629][PYTHON] Add deterministic flag to p...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19929#discussion_r158539355 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonUDF.scala --- @@ -29,9 +29,12 @@ case class PythonUDF( func: PythonFunction, dataType: DataType, children: Seq[Expression], -evalType: Int) +evalType: Int, +udfDeterministic: Boolean = true) --- End diff -- no, it is not needed
[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19929 thank you @cloud-fan, changed.
[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19929 Jenkins, retest this please
[GitHub] spark pull request #19929: [SPARK-22629][PYTHON] Add deterministic flag to p...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19929#discussion_r158579661 --- Diff: python/pyspark/sql/tests.py --- @@ -434,6 +434,16 @@ def test_udf_with_array_type(self): self.assertEqual(list(range(3)), l1) self.assertEqual(1, l2) +def test_nondeterministic_udf(self): +from pyspark.sql.functions import udf +import random +udf_random_col = udf(lambda: int(100 * random.random()), IntegerType()).asNondeterministic() +df = self.spark.createDataFrame([Row(1)]).select(udf_random_col().alias('RAND')) +random.seed(1234) +udf_add_ten = udf(lambda rand: rand + 10, IntegerType()) +[row] = df.withColumn('RAND_PLUS_TEN', udf_add_ten('RAND')).collect() +self.assertEqual(row[0] + 10, row[1]) --- End diff -- I am not sure why, but setting the seed doesn't seem to take effect. I will remove setting the seed.
[GitHub] spark pull request #19929: [SPARK-22629][PYTHON] Add deterministic flag to p...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19929#discussion_r158579754 --- Diff: python/pyspark/sql/functions.py --- @@ -2075,9 +2075,10 @@ class PandasUDFType(object): def udf(f=None, returnType=StringType()): --- End diff -- I followed what was done for the Scala UDF, where this parameter is not added but there is a method to set it. If we add a parameter here, I'd suggest adding it to the Scala API as well.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 @gatorsmile that is scenario 3. I will explain why, and afterwards I will post a correction of the summary in my last e-mail, because I made a mistake about how DB2 computes the result precision and scale; sorry for that. Anyway, what you showed is an example of point 3 because DB2 computes the result type as `DECIMAL( MIN(31, p1 + p2), MIN(31, s1 + s2) )`. Therefore, in your case the result type was `DECIMAL(31, 31)`. Since your result had digits to the left of the decimal point, it was out of the range of the representable values and an overflow exception was thrown. You can reproduce case 2 as follows:
```
db2 => create table decimals_test (id int, a decimal(31,31), b decimal(31,31))
DB20000I The SQL command completed successfully.
db2 => insert into decimals_test values(1, 0.12345678912345678912345689, 0.12345678912345678912345)
DB20000I The SQL command completed successfully.
db2 => select a*b from dd
1
-
.0152415787806736785461049526020
```
As you can see, a truncation occurred. Now, let me amend my table to summarize the behavior of the many DBs:
1. **Rules to determine precision and scale**
 - *Hive, SQLServer (and Spark after the PR)*: I won't include the exact formulas; the relevant part is that when the needed precision is higher than the maximum value, we use the maximum available value (38) as precision, and as scale the maximum between the needed scale (computed according to the relevant formula) and a minimum guaranteed value for the scale, which is 6.
 - *DB2*: computes the result type as `DECIMAL( MIN(31, p1 + p2), MIN(31, s1 + s2) )`.
 - *Postgres and Oracle*: NA
 - *SQL ANSI 2011*: no indication
 - *Spark now*: if the precision needed is more than 38, use 38 as precision; use the needed scale without any adjustment.
2. **Behavior in case of precision loss but result in the range of the representable values**
 - *Oracle, Hive, SQLServer (and Spark after the PR)*: round the result.
 - *DB2*: truncates the result (and sets a warning flag).
 - *Postgres*: NA, it has infinite precision...
 - *SQL ANSI 2011*: either truncate or round the value.
 - *Spark now*: returns NULL.
3. **Behavior in case of result out of the range of the representable values (i.e. overflow)**
 - *Oracle, DB2, SQLServer*: throw an exception.
 - *Postgres*: NA, it has nearly infinite precision...
 - *SQL ANSI 2011*: an exception should be raised.
 - *Spark now, Hive*: return NULL (for Hive, there is an open ticket to make it compliant with the SQL standard).
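For multiplication, the two rule families in point 1 can be compared side by side. The DB2 formula is the one quoted above; the Hive/SQLServer-style exact result type `(p1 + p2 + 1, s1 + s2)` for a multiply is an assumption of this sketch, taken from those systems' documented rules rather than from this thread:

```python
DB2_MAX = 31
SPARK_MAX = 38
MIN_ADJUSTED_SCALE = 6

def db2_multiply_type(p1, s1, p2, s2):
    # DB2: DECIMAL( MIN(31, p1 + p2), MIN(31, s1 + s2) )
    return (min(DB2_MAX, p1 + p2), min(DB2_MAX, s1 + s2))

def hive_multiply_type(p1, s1, p2, s2):
    # Exact result type first, then adjusted when it exceeds the max precision.
    precision, scale = p1 + p2 + 1, s1 + s2
    if precision <= SPARK_MAX:
        return (precision, scale)
    int_digits = precision - scale
    min_scale = min(scale, MIN_ADJUSTED_SCALE)
    return (SPARK_MAX, max(SPARK_MAX - int_digits, min_scale))
```

Multiplying two `decimal(38,18)` columns yields `DECIMAL(31, 31)` under the DB2 rule (the overflowing case above) and `DECIMAL(38, 6)` under the Hive/SQLServer-style rule.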
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 Thanks for your analysis @gatorsmile. Actually, the rule you specified for Oracle is what it uses when casting, rather than when doing arithmetic operations. Yes, DB2 has rather different rules to define the output type of operations. Anyway, we can have a behavior practically identical to DB2's by changing the value of `MINIMUM_ADJUSTED_SCALE` to 31. Therefore, instead of using the configuration you pointed out, I'd propose a configuration for `MINIMUM_ADJUSTED_SCALE`: by changing it we can obtain both the behavior of Hive and SQLServer and the one of DB2. What do you think? The reason I am suggesting this is that my first concern is not Hive compliance, but SQL standard compliance. Indeed, as you can see from the summary, on point 1 there is no uniform behavior (this is fine with respect to the SQL standard, since it gives freedom here). But on point 2 we are the only ones not compliant with the SQL standard, and having this behavior by default doesn't look like the right thing to do IMHO. On point 3, only we and Hive are not compliant; thus I think that should be changed too. But in that case we can't use the same flag, because it would be inconsistent. What do you think? I can understand and agree that losing precision looks scary. But to me returning `NULL` is even scarier: indeed, `NULL` is what should be returned if either of the two operands is `NULL`. Thus queries that rely on this when running on other DBs might return very bad results. For instance, think of a report where we join a prices table and a sold_product table per country. In this use case, we can assume that if the result is `NULL`, it means there was no sold product in that country, and then coalesce the output of the multiplication to 0. This would work well on any DB but Spark.
With my proposal of making `MINIMUM_ADJUSTED_SCALE` tunable, each customer can decide (query by query) how much precision loss they can tolerate. And if we agree to change the point 3 behavior to the SQL standard one, then when it is not possible to meet their requirements we throw an exception, giving them the choice of what to do: allow more precision loss, change their input data type, etc. This is the safer way IMHO. I would be happy to help improve the test cases. May I just kindly ask how you meant to do that? What would you like to see tested more? Would you like me to add more test cases in the scope of this PR, or to open a new one for that? Thank you for taking the time to read my long messages. I just want us to make the best choice, and to give you all the elements I have so we can decide for the best together. Thank you.
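The round-versus-raise alternatives discussed here (points 2 and 3) can both be observed with Python's standard `decimal` module, used purely as an analogy for the engine behaviors, not as Spark code: a context either silently rounds to its precision, or raises when the `Inexact` signal is trapped.

```python
from decimal import Context, Decimal, Inexact

# Rounding behavior (Oracle, Hive, SQLServer): the result is rounded to fit.
rounding_ctx = Context(prec=38)
rounded = rounding_ctx.divide(Decimal(1), Decimal(3))  # 38 significant digits

# Strict behavior (SQL standard on intolerable precision loss): raise instead.
strict_ctx = Context(prec=38, traps=[Inexact])
try:
    strict_ctx.divide(Decimal(1), Decimal(3))
    raised = False
except Inexact:
    raised = True
```

Returning `NULL`, Spark's current behavior, has no analogue in the `decimal` module at all, which is part of the argument above: it conflates precision loss with a null operand.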
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 Thanks @gatorsmile. Then should I create a follow-up PR for #20008 covering cases 2 and 3 before moving forward with this PR, or can we proceed with this PR and the test cases it adds?
[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19929 @gatorsmile, yes, the reason why the seed doesn't work lies in the way Python UDFs are executed, i.e. a new Python process is created for each partition to evaluate the Python UDF. Thus the seed is set only on the driver, but not in the process where the UDF is executed. What I am saying can be easily confirmed by this:
```
>>> from pyspark.sql.functions import udf
>>> import os
>>> pid_udf = udf(lambda: str(os.getpid()))
>>> spark.range(2).select(pid_udf()).show()
+--+
|()|
+--+
| 4132|
| 4130|
+--+
>>> os.getpid()
4070
```
Therefore there is no easy way to set the seed. If I set it inside the UDF, the UDF would become deterministic. Therefore I think that the best option is the current test.
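The driver/worker isolation described above can be reproduced with plain Python processes, with no Spark involved. The snippet below is a stand-alone illustration of why a seed set on the driver cannot influence a separately started worker process:

```python
import random
import subprocess
import sys

random.seed(1234)
driver_value = random.random()  # deterministic: the seed is set in this process

# A separate interpreter, like a pyspark UDF worker, starts with its own RNG
# state; the seed set above never reaches it.
worker_output = subprocess.run(
    [sys.executable, "-c", "import random; print(random.random())"],
    capture_output=True, text=True, check=True,
).stdout.strip()
worker_value = float(worker_output)
```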
[GitHub] spark issue #19929: [SPARK-22901][PYTHON] Add deterministic flag to pyspark ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19929 @gatorsmile done, thanks!
[GitHub] spark pull request #20084: [SPARK-22904][SQL] Add tests for decimal operatio...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/20084 [SPARK-22904][SQL] Add tests for decimal operations and string casts ## What changes were proposed in this pull request? Test coverage for arithmetic operations leading to:
1. Precision loss
2. Overflow
Moreover, tests for casting bad strings to other input types and for using bad strings as operands of some functions. ## How was this patch tested? Added tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mgaido91/spark SPARK-22904 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20084.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20084 commit 7a9650aa4737768a4428de65841050ca8b157528 Author: Marco Gaido Date: 2017-12-26T17:44:01Z [SPARK-22904][SQL] Add tests for decimal operations and string casts
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 @gatorsmile created #20084 for adding tests. Thanks.
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19993 @MLnick @viirya @WeichenXu123 any further comments on this?
[GitHub] spark issue #20084: [SPARK-22904][SQL] Add tests for decimal operations and ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20084 cc @gatorsmile
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r158843869 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala --- @@ -158,11 +213,8 @@ object TypeCoercion { findTightestCommonType(t1, t2) .orElse(findWiderTypeForDecimal(t1, t2)) .orElse(stringPromotion(t1, t2)) - .orElse((t1, t2) match { -case (ArrayType(et1, containsNull1), ArrayType(et2, containsNull2)) => - findWiderTypeForTwo(et1, et2).map(ArrayType(_, containsNull1 || containsNull2)) -case _ => None - }) + .orElse(findWiderTypeForTwoComplex(t1, t2, findWiderTypeForTwo)) + --- End diff -- this line should be removed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r158844162 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala --- @@ -389,6 +389,25 @@ class TypeCoercionSuite extends AnalysisTest { widenTest(StringType, MapType(IntegerType, StringType, true), None) widenTest(ArrayType(IntegerType), StructType(Seq()), None) +widenTest( + ArrayType(StringType, containsNull=true), --- End diff -- please add a whitespace before and after `=` and the same in the following lines --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r158844322 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala --- @@ -389,6 +389,25 @@ class TypeCoercionSuite extends AnalysisTest { widenTest(StringType, MapType(IntegerType, StringType, true), None) widenTest(ArrayType(IntegerType), StructType(Seq()), None) +widenTest( + ArrayType(StringType, containsNull=true), + ArrayType(StringType, containsNull=false), + Some(ArrayType(StringType, containsNull=true))) +widenTest( + MapType(StringType, StringType, valueContainsNull=true), + MapType(StringType, StringType, valueContainsNull=false), + Some(MapType(StringType, StringType, valueContainsNull=true))) --- End diff -- please add a test also for struct --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r158844437 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala --- @@ -389,6 +389,25 @@ class TypeCoercionSuite extends AnalysisTest { widenTest(StringType, MapType(IntegerType, StringType, true), None) widenTest(ArrayType(IntegerType), StructType(Seq()), None) +widenTest( + ArrayType(StringType, containsNull=true), + ArrayType(StringType, containsNull=false), + Some(ArrayType(StringType, containsNull=true))) +widenTest( + MapType(StringType, StringType, valueContainsNull=true), + MapType(StringType, StringType, valueContainsNull=false), + Some(MapType(StringType, StringType, valueContainsNull=true))) + +widenTest( + StructType(StructField("a", ArrayType(StringType, containsNull=true)) :: Nil), + StructType(StructField("a", ArrayType(StringType, containsNull=false)) :: Nil), + Some(StructType(StructField("a", ArrayType(StringType, containsNull=true)) :: Nil))) +widenTest( + StructType(StructField("a", MapType(StringType, StringType, valueContainsNull=true)) :: Nil), + StructType(StructField("a", MapType(StringType, StringType, valueContainsNull=false)) :: Nil), + Some(StructType( +StructField("a", MapType(StringType, StringType, valueContainsNull=true)) :: Nil))) --- End diff -- thanks, may you please add also two tests having Map and Array as outer types? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r158844477 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala --- @@ -469,12 +488,21 @@ class TypeCoercionSuite extends AnalysisTest { ArrayType(ArrayType(IntegerType), containsNull = false), ArrayType(ArrayType(LongType), containsNull = false), Some(ArrayType(ArrayType(LongType), containsNull = false))) +widenTestWithStringPromotion( + StructType(StructField("a", ArrayType(LongType)) :: Nil), + StructType(StructField("a", ArrayType(StringType)) :: Nil), + Some(StructType(StructField("a", ArrayType(StringType)) :: Nil))) --- End diff -- please add an analogous test also for map and array --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r158844501 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala --- @@ -469,12 +488,21 @@ class TypeCoercionSuite extends AnalysisTest { ArrayType(ArrayType(IntegerType), containsNull = false), ArrayType(ArrayType(LongType), containsNull = false), Some(ArrayType(ArrayType(LongType), containsNull = false))) +widenTestWithStringPromotion( + StructType(StructField("a", ArrayType(LongType)) :: Nil), + StructType(StructField("a", ArrayType(StringType)) :: Nil), + Some(StructType(StructField("a", ArrayType(StringType)) :: Nil))) + // Without string promotion widenTestWithoutStringPromotion(IntegerType, StringType, None) widenTestWithoutStringPromotion(StringType, TimestampType, None) widenTestWithoutStringPromotion(ArrayType(LongType), ArrayType(StringType), None) widenTestWithoutStringPromotion(ArrayType(StringType), ArrayType(TimestampType), None) +widenTestWithoutStringPromotion( + StructType(StructField("a", ArrayType(LongType)) :: Nil), + StructType(StructField("a", ArrayType(StringType)) :: Nil), + None) --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r158844539 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala --- @@ -469,12 +488,21 @@ class TypeCoercionSuite extends AnalysisTest { ArrayType(ArrayType(IntegerType), containsNull = false), ArrayType(ArrayType(LongType), containsNull = false), Some(ArrayType(ArrayType(LongType), containsNull = false))) +widenTestWithStringPromotion( + StructType(StructField("a", ArrayType(LongType)) :: Nil), + StructType(StructField("a", ArrayType(StringType)) :: Nil), + Some(StructType(StructField("a", ArrayType(StringType)) :: Nil))) + --- End diff -- this empty line is not needed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over Struct...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20010 @bdrillard yes, after addressing my last comments I think we can build this. My only concern is that we are not covering the case of nested complex structures in the wider type, but maybe we can address this later. I don't have permissions to trigger a build. You might check who the last committers working on this are and ask them. Thanks.
[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20029 From your analysis, what seems never to be closed is the client used to interact with the metastore. This might be a problem we are not aware of in normal SQL applications, since we have only one client in those cases. What you are doing in your fix is avoiding creating a client for each `HiveSessionBuilder`, thus: 1. this would mean that we are creating more than one `SessionBuilder`, i.e. more than one `SparkSession`, which is not true as far as I know; 2. every session would share the same client to connect to the metastore, which is wrong IMHO. Please let me know if I misunderstood or got something wrong.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 @cloud-fan thanks for your feedback. Honestly I don't have a strong opinion on how we should behave in that case. That's why I wanted some feedback from the community on that point, while this PR covers only the other case, i.e. rounding in case of precision loss, in accordance with Hive and SQLServer's behavior. Anyway, IIUC this PR is now blocked by #18853, which introduces `spark.sql.typeCoercion.mode`; that property would decide whether to keep the previous behavior or the one proposed here.
[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20029 > The hiveClient created for the resourceLoader is only used to addJar, which is, in turn, to add Jar to the shared IsolatedClientLoader. Then we can just use the shared hive client for this purpose. @liufengdb does it mean that we are creating more than one SparkSession in the thriftserver?
[GitHub] spark issue #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over Struct...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20010 @gczsjdy @bdrillard the test errors are valid. In some cases an exception is expected to be thrown, but it isn't, due to the fix. So they should be updated accordingly.
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19993 @jkbradley maybe you can also help reviewing this, thanks.
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19492 kindly ping @viirya @HyukjinKwon
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r159119696 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala --- @@ -148,6 +160,61 @@ object TypeCoercion { case (l, r) => None } + /** + * Case 2 type widening over complex types. `widerTypeFunc` is a function that finds the wider + * type over point types. The `widerTypeFunc` specifies behavior over whether types should be + * promoted to StringType. + */ + private def findWiderTypeForTwoComplex( + t1: DataType, + t2: DataType, + widerTypeFunc: (DataType, DataType) => Option[DataType]): Option[DataType] = { +(t1, t2) match { + case (_, _) if t1 == t2 => Some(t1) + case (NullType, _) => Some(t1) + case (_, NullType) => Some(t1) + + case (ArrayType(pointType1, nullable1), ArrayType(pointType2, nullable2)) => +val dataType = widerTypeFunc.apply(pointType1, pointType2) + +dataType.map(ArrayType(_, nullable1 || nullable2)) + + case (MapType(keyType1, valueType1, nullable1), MapType(keyType2, valueType2, nullable2)) => +val keyType = widerTypeFunc.apply(keyType1, keyType2) +val valueType = widerTypeFunc.apply(valueType1, valueType2) + +if (keyType.nonEmpty && valueType.nonEmpty) { + Some(MapType(keyType.get, valueType.get, nullable1 || nullable2)) +} else { + None +} + + case (StructType(fields1), StructType(fields2)) => +val fieldTypes = fields1.zip(fields2).map { case (f1, f2) => --- End diff -- this requires that fields with the same name must also be in the same position. Is this assumption correct? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
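The concern about `fields1.zip(fields2)` can be demonstrated with a small sketch (Python used here for brevity; the field names are illustrative): `zip` pairs elements purely by position, so two structs with the same fields in different orders get matched incorrectly.

```python
# Two struct schemas with identical fields, but in different order.
fields1 = [("a", "int"), ("b", "string")]
fields2 = [("b", "string"), ("a", "int")]

# zip pairs by position and ignores names: ("a", "int") from the first
# struct is matched with ("b", "string") from the second, not with the
# other struct's ("a", "int").
paired = list(zip(fields1, fields2))
print(paired[0])  # (('a', 'int'), ('b', 'string'))
```

So the widening logic is only correct under the assumption that fields with the same name also appear in the same position, which is the point of the review question.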
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 thanks @gatorsmile, do you have any suggestion for the conf name?
[GitHub] spark issue #19224: [SPARK-20990][SQL] Read all JSON documents in files when...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19224 yes, next week I will rebase it on top of the current master. Thanks @gatorsmile
[GitHub] spark pull request #20174: [SPARK-22951][SQL] aggregate should not produce e...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20174#discussion_r160114031 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala --- @@ -102,10 +102,12 @@ case class HashAggregateExec( val beforeAgg = System.nanoTime() val hasInput = iter.hasNext - val res = if (!hasInput && groupingExpressions.nonEmpty) { -// This is a grouped aggregate and the input iterator is empty, + val res = if (!hasInput && (groupingExpressions.nonEmpty || resultExpressions.isEmpty)) { +// The input iterator is empty, and this is a grouped aggregate or without result columns, // so return an empty iterator. Iterator.empty + } else if (!hasInput && resultExpressions.isEmpty) { --- End diff -- This is covered in the previous condition, isn't it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20010#discussion_r160115072 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala --- @@ -462,27 +507,139 @@ class TypeCoercionSuite extends AnalysisTest { ArrayType(DoubleType, containsNull = false), Some(ArrayType(DoubleType, containsNull = true))) widenTestWithStringPromotion( - ArrayType(TimestampType, containsNull = false), - ArrayType(StringType, containsNull = true), + ArrayType(ArrayType(IntegerType), containsNull = true), + ArrayType(ArrayType(LongType), containsNull = false), + Some(ArrayType(ArrayType(LongType), containsNull = true))) --- End diff -- I think this is more coherent with the behavior in other parts and this is the behavior I would expect. But I think that we should follow @gatorsmile's suggestion and check Hive's behavior first. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19993: [SPARK-22799][ML] Bucketizer should throw excepti...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19993#discussion_r160143590 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala --- @@ -249,6 +250,27 @@ object ParamValidators { def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = { (value: Array[T]) => value.length > lowerBound } + + /** + * Checks that either inputCols and outputCols are set or inputCol and outputCol are set. If + * this is not true, an `IllegalArgumentException` is raised. + * @param model + */ + private[spark] def checkMultiColumnParams(model: Params): Unit = { +model match { + case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) => +raiseIncompatibleParamsException("inputCols", "inputCol") + case m: HasOutputCols with HasOutputCol if m.isSet(m.outputCols) && m.isSet(m.outputCol) => +raiseIncompatibleParamsException("outputCols", "outputCol") + case _ => +} --- End diff -- I think we can use that method once merged, thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r160147366 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -117,6 +117,7 @@ object DecimalType extends AbstractDataType { val MAX_SCALE = 38 val SYSTEM_DEFAULT: DecimalType = DecimalType(MAX_PRECISION, 18) val USER_DEFAULT: DecimalType = DecimalType(10, 0) + val MINIMUM_ADJUSTED_SCALE = 6 --- End diff -- Yes, I followed Hive's implementation which works like this and applies this 6 digits minimum to all operations. This means that SQLServer allows to round more digits than us in those cases, ie. we ensure at least 6 digits for the scale, while SQLServer doesn't. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
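For reference, here is a sketch of the Hive-style adjustment being discussed (Python; the constant names mirror the diff, the function name `adjust_precision_scale` is hypothetical): when a result's precision exceeds the 38-digit maximum, the scale is reduced to make room for the integral digits, but at least `MINIMUM_ADJUSTED_SCALE` fractional digits are kept (or the original scale, if that is smaller).

```python
MAX_PRECISION = 38
MINIMUM_ADJUSTED_SCALE = 6

def adjust_precision_scale(precision, scale):
    # Result type fits within the maximum precision: keep it unchanged.
    if precision <= MAX_PRECISION:
        return precision, scale
    # Otherwise give the integral part the digits it needs, while
    # guaranteeing at least min(scale, MINIMUM_ADJUSTED_SCALE) digits
    # for the fractional part.
    int_digits = precision - scale
    min_scale = min(scale, MINIMUM_ADJUSTED_SCALE)
    adjusted_scale = max(MAX_PRECISION - int_digits, min_scale)
    return MAX_PRECISION, adjusted_scale

# e.g. Decimal(38,10) * Decimal(38,10): raw result type is (77, 20)
print(adjust_precision_scale(77, 20))  # -> (38, 6)
```

This shows the trade-off raised in the thread: the 6-digit floor means Spark (following Hive) may round away more integral headroom than SQLServer, which has no such minimum.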
[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19492 any more comments on this @viirya @HyukjinKwon?
[GitHub] spark issue #19224: [SPARK-20990][SQL] Read all JSON documents in files when...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19224 @gatorsmile I think it is now ready for review, thanks.
[GitHub] spark pull request #20189: [SPARK-22975] MetricsReporter should not throw ex...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/20189 [SPARK-22975] MetricsReporter should not throw exception when there was no progress reported
## What changes were proposed in this pull request?
`MetricsReporter` assumes that there has been some progress for the query, i.e. `lastProgress` is not null. If this is not true, as can happen under particular conditions, a `NullPointerException` can be thrown. The PR checks whether there is a `lastProgress`; if not, it returns a default value for the metrics.
## How was this patch tested?
added UT
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mgaido91/spark SPARK-22975
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20189.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #20189
commit 1185a513bd807df03a24b22370903d17301fd415
Author: Marco Gaido
Date: 2018-01-08T22:28:12Z
[SPARK-22975] MetricsReporter should not throw exception when there was no progress reported
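The shape of the fix, sketched outside Spark (Python; the class and attribute names are illustrative, not the actual `MetricsReporter` API): each gauge falls back to a default value when the query has not reported any progress yet.

```python
class MetricsReporterSketch:
    """Hypothetical reporter that reads gauges off a streaming query."""

    def __init__(self, query):
        self.query = query

    def gauge(self, extract, default=0.0):
        # last_progress is None until the first trigger completes;
        # without this guard, extracting a metric would fail on the
        # None access (the NPE described in the PR).
        progress = self.query.last_progress
        return default if progress is None else extract(progress)

class FakeQuery:
    last_progress = None  # no progress reported yet

reporter = MetricsReporterSketch(FakeQuery())
print(reporter.gauge(lambda p: p["inputRowsPerSecond"]))  # 0.0
```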
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r160361785 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -117,6 +117,7 @@ object DecimalType extends AbstractDataType { val MAX_SCALE = 38 val SYSTEM_DEFAULT: DecimalType = DecimalType(MAX_PRECISION, 18) val USER_DEFAULT: DecimalType = DecimalType(10, 0) + val MINIMUM_ADJUSTED_SCALE = 6 --- End diff -- @gatorsmile what about `spark.sql.decimalOperations.mode` which defaults to `native` and accepts also `hive` (and in future also `sql2011` for throwing exception instead of returning NULL)? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r160394589 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -117,6 +117,7 @@ object DecimalType extends AbstractDataType { val MAX_SCALE = 38 val SYSTEM_DEFAULT: DecimalType = DecimalType(MAX_PRECISION, 18) val USER_DEFAULT: DecimalType = DecimalType(10, 0) + val MINIMUM_ADJUSTED_SCALE = 6 --- End diff -- ok, I'll go with that, thanks @cloud-fan. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20174 @liancheng @liufengdb if the problem affects only the `dropDuplicates` method, can't we just check there if there are no columns and avoid doing anything at all in that case?
[GitHub] spark issue #20189: [SPARK-22975] MetricsReporter should not throw exception...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20189 @tdas @zsxwing may I kindly ask if you might take a look at this? Thanks!
[GitHub] spark pull request #20219: [SPARK-23025][SQL] Support Null type in scala ref...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/20219 [SPARK-23025][SQL] Support Null type in scala reflection
## What changes were proposed in this pull request?
Add support for `Null` type in the `schemaFor` method for Scala reflection.
## How was this patch tested?
Added UT
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mgaido91/spark SPARK-23025
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20219.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #20219
commit 98010236c013b24c0ecfa6357efaacba1a30ee84
Author: Marco Gaido
Date: 2018-01-10T12:08:37Z
[SPARK-23025][SQL] Support Null type in scala reflection
[GitHub] spark pull request #20219: [SPARK-23025][SQL] Support Null type in scala ref...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20219#discussion_r160687957 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala --- @@ -1441,6 +1441,13 @@ class DatasetSuite extends QueryTest with SharedSQLContext { assert(e.getCause.isInstanceOf[NullPointerException]) } } + + test("SPARK-23025: Add support for null type in scala reflection") { +val data = Seq(("a", null)) +checkDataset( + data.toDS(), + data: _*) --- End diff -- sure, since in this file in many places this syntax is used, should I change also other usages accordingly? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20174 I see, thanks for your answer @liancheng
[GitHub] spark pull request #20174: [SPARK-22951][SQL] fix aggregation after dropDupl...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20174#discussion_r160772523 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala --- @@ -198,6 +198,20 @@ class ReplaceOperatorSuite extends PlanTest { comparePlans(optimized, correctAnswer) } + test("add one grouping key if necessary when replace Deduplicate with Aggregate") { +val input = LocalRelation() +val query = Deduplicate(Seq.empty, input) // dropDuplicates() +val optimized = Optimize.execute(query.analyze) + +val correctAnswer = --- End diff -- nit: this can be all in one line --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20226: [SPARK-23034][SQL][UI] Display tablename for `HiveTableS...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20226 LGTM too
[GitHub] spark issue #20189: [SPARK-22975][SS] MetricsReporter should not throw excep...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20189 @zsxwing the test you suggested fails on Jenkins while it passes locally. I am not sure about the reason, since I cannot reproduce it. I had the same test failure in my previous attempts; back then, the problem was that running other test cases before this one triggered the issue. May I ask your advice on what to do? Can I revert to my previous test?
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 any more comments @gatorsmile @cloud-fan @dongjoon-hyun @viirya @hvanhovell ?
[GitHub] spark issue #19224: [SPARK-20990][SQL] Read all JSON documents in files when...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19224 kindly ping @gatorsmile