[GitHub] spark pull request #22638: [SPARK-25610][SQL][TEST] Improve execution time o...

2018-10-05 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22638#discussion_r222957505
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala ---
@@ -127,16 +127,16 @@ class DatasetCacheSuite extends QueryTest with SharedSQLContext with TimeLimits
   }
 
   test("cache UDF result correctly") {
-    val expensiveUDF = udf({x: Int => Thread.sleep(5000); x})
-    val df = spark.range(0, 10).toDF("a").withColumn("b", expensiveUDF($"a"))
+    val expensiveUDF = udf({x: Int => Thread.sleep(2000); x})
--- End diff --

@mgaido91 Thanks, Marco.


---




[GitHub] spark pull request #22638: [SPARK-25610][SQL][TEST] Improve execution time o...

2018-10-05 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22638#discussion_r222946564
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala ---
@@ -127,16 +127,16 @@ class DatasetCacheSuite extends QueryTest with SharedSQLContext with TimeLimits
   }
 
   test("cache UDF result correctly") {
-    val expensiveUDF = udf({x: Int => Thread.sleep(5000); x})
-    val df = spark.range(0, 10).toDF("a").withColumn("b", expensiveUDF($"a"))
+    val expensiveUDF = udf({x: Int => Thread.sleep(2000); x})
--- End diff --

@mgaido91 OK, please correct me on this one. We insert 2 rows, i.e. two invocations of the UDF, amounting to 2 * 2 sec = 4 sec of execution. So wouldn't a 2-sec fail time be OK?


---




[GitHub] spark issue #22638: [SPARK-25610][SQL][TEST] Improve execution time of Datas...

2018-10-05 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22638
  
cc @gatorsmile 


---




[GitHub] spark pull request #22638: [SPARK-25610][SQL][TEST] Improve execution time o...

2018-10-05 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/22638

[SPARK-25610][SQL][TEST] Improve execution time of DatasetCacheSuite: cache 
UDF result correctly

## What changes were proposed in this pull request?
In this test case, we verify that the result of a UDF is cached when the underlying DataFrame is cached, and that the UDF is not evaluated again when the cached DataFrame is used.

To reduce the runtime we:
1) Use a single-partition DataFrame, so the total execution time of the UDF is more deterministic.
2) Cut the size of the DataFrame down from 10 rows to 2.
3) Reduce the sleep time in the UDF from 5 secs to 2 secs.
4) Reduce the failAfter condition from 3 to 2.

With the above change, it takes about 4 secs to cache the first DataFrame, and the subsequent check takes a few hundred milliseconds.
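For reference, a minimal sketch of the tightened test body is below. It assumes the suite's existing setup (QueryTest with SharedSQLContext with TimeLimits, the `functions._`/`testImplicits._` imports, and an implicit Signaler); the failAfter unit is an assumption.
```
test("cache UDF result correctly") {
  // 2-second sleep instead of 5, and a single-partition, 2-row DataFrame,
  // so materializing the cache costs roughly 2 * 2 sec.
  val expensiveUDF = udf({x: Int => Thread.sleep(2000); x})
  val df = spark.range(0, 2, 1, 1).toDF("a").withColumn("b", expensiveUDF($"a"))
  val df2 = df.agg(sum(df("b")))

  df.cache()
  df.count() // evaluates the UDF once and populates the cache

  // The cached UDF result must be reused, so this should finish well within the bound.
  failAfter(2.seconds) {
    df2.collect()
  }
}
```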
The new runtime for 5 consecutive runs of this test is as follows : 
```
[info] - cache UDF result correctly (4 seconds, 906 milliseconds)
[info] - cache UDF result correctly (4 seconds, 281 milliseconds)
[info] - cache UDF result correctly (4 seconds, 288 milliseconds)
[info] - cache UDF result correctly (4 seconds, 355 milliseconds)
[info] - cache UDF result correctly (4 seconds, 280 milliseconds)
```
## How was this patch tested?
This is a test-only fix.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark SPARK-25610

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22638.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22638


commit 97d18feeb8713da42b9f97d2343063bac1cba4b6
Author: Dilip Biswal 
Date:   2018-10-05T07:51:35Z

[SPARK-25610][TEST] Improve execution time of DatasetCacheSuite: cache UDF 
result correctly




---




[GitHub] spark issue #22622: [SPARK-25635][SQL][BUILD] Support selective direct encod...

2018-10-05 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22622
  
retest this please


---




[GitHub] spark issue #22047: [SPARK-19851] Add support for EVERY and ANY (SOME) aggre...

2018-10-05 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22047
  
retest this please


---




[GitHub] spark issue #22619: [SPARK-25600][SQL][MINOR] Make use of TypeCoercion.findT...

2018-10-04 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22619
  
@HyukjinKwon Sure :-)


---




[GitHub] spark issue #22047: [SPARK-19851] Add support for EVERY and ANY (SOME) aggre...

2018-10-04 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22047
  
@gatorsmile First of all, thank you very much. Actually, the added aggregates weren't null-filtering. I have fixed the issue and added additional test cases. Thank you.


---




[GitHub] spark issue #22619: [SPARK-25600][SQL][MINOR] Make use of TypeCoercion.findT...

2018-10-04 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22619
  
@HyukjinKwon Does this look okay now?


---




[GitHub] spark pull request #20433: [SPARK-23264][SQL] Make INTERVAL keyword optional...

2018-10-04 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/20433#discussion_r222552685
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -335,6 +335,12 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val ANSI_SQL_PARSER =
--- End diff --

@maropu OK, cool. I am now wondering about a config I created a while back, `spark.sql.legacy.setopsPrecedence.enabled`. Should the set operation precedence also be controlled by this new config? cc @gatorsmile for his input.


---




[GitHub] spark pull request #20433: [SPARK-23264][SQL] Make INTERVAL keyword optional...

2018-10-04 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/20433#discussion_r222550361
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -335,6 +335,12 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val ANSI_SQL_PARSER =
--- End diff --

@maropu OK, maybe I am misunderstanding. Basically I was thinking from a user's perspective: when I set this config, what would be my expectation from the system?
I was asking since we only make use of this config while parsing an interval value. Is the expectation that we will eventually put more and more standard-compliant features under this config? So basically we are creating a generic config ahead of time.
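For context, a boolean entry in `object SQLConf` is declared roughly as below; only the val name and `.booleanConf` come from the quoted diff, while the key, doc text, and default are illustrative assumptions:
```
// Hypothetical sketch only; the key name, doc text and default are assumptions.
val ANSI_SQL_PARSER =
  buildConf("spark.sql.parser.ansi.enabled")
    .doc("When true, the SQL parser follows ANSI SQL syntax; for example, the " +
      "INTERVAL keyword may be omitted for interval literals.")
    .booleanConf
    .createWithDefault(false)
```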



---




[GitHub] spark pull request #20433: [SPARK-23264][SQL] Make INTERVAL keyword optional...

2018-10-03 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/20433#discussion_r222544245
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -335,6 +335,12 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val ANSI_SQL_PARSER =
--- End diff --

@maropu Isn't this config too generic? When it is set to true, can we confidently say that we follow the ANSI standard in all other rules? I don't know whether DDLs are part of the standard or not; if they are, then we may not be ANSI-compliant, right?


---




[GitHub] spark pull request #22619: [SQL][MINOR] Make use of TypeCoercion.findTightes...

2018-10-03 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22619#discussion_r222199535
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala
 ---
@@ -172,51 +172,35 @@ private[csv] object CSVInferSchema {
 StringType
   }
 
-  private val numericPrecedence: IndexedSeq[DataType] = 
TypeCoercion.numericPrecedence
-
   /**
-   * Copied from internal Spark api
-   * [[org.apache.spark.sql.catalyst.analysis.TypeCoercion]]
+   * Returns the common data type given two input data types so that the 
return type
+   * is compatible with both input data types.
*/
-  val findTightestCommonType: (DataType, DataType) => Option[DataType] = {
-case (t1, t2) if t1 == t2 => Some(t1)
-case (NullType, t1) => Some(t1)
-case (t1, NullType) => Some(t1)
-case (StringType, t2) => Some(StringType)
-case (t1, StringType) => Some(StringType)
-
-// Promote numeric types to the highest of the two and all numeric 
types to unlimited decimal
-case (t1, t2) if Seq(t1, t2).forall(numericPrecedence.contains) =>
-  val index = numericPrecedence.lastIndexWhere(t => t == t1 || t == t2)
-  Some(numericPrecedence(index))
-
-// These two cases below deal with when `DecimalType` is larger than 
`IntegralType`.
-case (t1: IntegralType, t2: DecimalType) if t2.isWiderThan(t1) =>
-  Some(t2)
-case (t1: DecimalType, t2: IntegralType) if t1.isWiderThan(t2) =>
-  Some(t1)
-
-// These two cases below deal with when `IntegralType` is larger than 
`DecimalType`.
-case (t1: IntegralType, t2: DecimalType) =>
-  findTightestCommonType(DecimalType.forType(t1), t2)
-case (t1: DecimalType, t2: IntegralType) =>
-  findTightestCommonType(t1, DecimalType.forType(t2))
-
-// Double support larger range than fixed decimal, DecimalType.Maximum 
should be enough
-// in most case, also have better precision.
-case (DoubleType, _: DecimalType) | (_: DecimalType, DoubleType) =>
-  Some(DoubleType)
-
-case (t1: DecimalType, t2: DecimalType) =>
-  val scale = math.max(t1.scale, t2.scale)
-  val range = math.max(t1.precision - t1.scale, t2.precision - 
t2.scale)
-  if (range + scale > 38) {
-// DecimalType can't support precision > 38
-Some(DoubleType)
-  } else {
-Some(DecimalType(range + scale, scale))
+  def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
+TypeCoercion.findTightestCommonType(t1, t2).orElse {
+  (t1, t2) match {
--- End diff --

@HyukjinKwon Did you have any preference or suggestion on the name of the val? `findCommonTypeExtended`?


---




[GitHub] spark pull request #22619: [SQL][MINOR] Make use of TypeCoercion.findTightes...

2018-10-03 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22619#discussion_r222198784
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala
 ---
@@ -172,51 +172,35 @@ private[csv] object CSVInferSchema {
 StringType
   }
 
-  private val numericPrecedence: IndexedSeq[DataType] = 
TypeCoercion.numericPrecedence
-
   /**
-   * Copied from internal Spark api
-   * [[org.apache.spark.sql.catalyst.analysis.TypeCoercion]]
+   * Returns the common data type given two input data types so that the 
return type
+   * is compatible with both input data types.
*/
-  val findTightestCommonType: (DataType, DataType) => Option[DataType] = {
-case (t1, t2) if t1 == t2 => Some(t1)
-case (NullType, t1) => Some(t1)
-case (t1, NullType) => Some(t1)
-case (StringType, t2) => Some(StringType)
-case (t1, StringType) => Some(StringType)
-
-// Promote numeric types to the highest of the two and all numeric 
types to unlimited decimal
-case (t1, t2) if Seq(t1, t2).forall(numericPrecedence.contains) =>
-  val index = numericPrecedence.lastIndexWhere(t => t == t1 || t == t2)
-  Some(numericPrecedence(index))
-
-// These two cases below deal with when `DecimalType` is larger than 
`IntegralType`.
-case (t1: IntegralType, t2: DecimalType) if t2.isWiderThan(t1) =>
-  Some(t2)
-case (t1: DecimalType, t2: IntegralType) if t1.isWiderThan(t2) =>
-  Some(t1)
-
-// These two cases below deal with when `IntegralType` is larger than 
`DecimalType`.
-case (t1: IntegralType, t2: DecimalType) =>
-  findTightestCommonType(DecimalType.forType(t1), t2)
-case (t1: DecimalType, t2: IntegralType) =>
-  findTightestCommonType(t1, DecimalType.forType(t2))
-
-// Double support larger range than fixed decimal, DecimalType.Maximum 
should be enough
-// in most case, also have better precision.
-case (DoubleType, _: DecimalType) | (_: DecimalType, DoubleType) =>
-  Some(DoubleType)
-
-case (t1: DecimalType, t2: DecimalType) =>
-  val scale = math.max(t1.scale, t2.scale)
-  val range = math.max(t1.precision - t1.scale, t2.precision - 
t2.scale)
-  if (range + scale > 38) {
-// DecimalType can't support precision > 38
-Some(DoubleType)
-  } else {
-Some(DecimalType(range + scale, scale))
+  def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
--- End diff --

@viirya I kept the same name used in JsonInferSchema. Should I change that as well, or only this one?


---




[GitHub] spark pull request #22619: [SQL][MINOR] Make use of TypeCoercion.findTightes...

2018-10-03 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22619#discussion_r222198591
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala
 ---
@@ -172,51 +172,35 @@ private[csv] object CSVInferSchema {
 StringType
   }
 
-  private val numericPrecedence: IndexedSeq[DataType] = 
TypeCoercion.numericPrecedence
-
   /**
-   * Copied from internal Spark api
-   * [[org.apache.spark.sql.catalyst.analysis.TypeCoercion]]
+   * Returns the common data type given two input data types so that the 
return type
+   * is compatible with both input data types.
*/
-  val findTightestCommonType: (DataType, DataType) => Option[DataType] = {
-case (t1, t2) if t1 == t2 => Some(t1)
-case (NullType, t1) => Some(t1)
-case (t1, NullType) => Some(t1)
-case (StringType, t2) => Some(StringType)
-case (t1, StringType) => Some(StringType)
-
-// Promote numeric types to the highest of the two and all numeric 
types to unlimited decimal
-case (t1, t2) if Seq(t1, t2).forall(numericPrecedence.contains) =>
-  val index = numericPrecedence.lastIndexWhere(t => t == t1 || t == t2)
-  Some(numericPrecedence(index))
-
-// These two cases below deal with when `DecimalType` is larger than 
`IntegralType`.
-case (t1: IntegralType, t2: DecimalType) if t2.isWiderThan(t1) =>
-  Some(t2)
-case (t1: DecimalType, t2: IntegralType) if t1.isWiderThan(t2) =>
-  Some(t1)
-
-// These two cases below deal with when `IntegralType` is larger than 
`DecimalType`.
-case (t1: IntegralType, t2: DecimalType) =>
-  findTightestCommonType(DecimalType.forType(t1), t2)
-case (t1: DecimalType, t2: IntegralType) =>
-  findTightestCommonType(t1, DecimalType.forType(t2))
-
-// Double support larger range than fixed decimal, DecimalType.Maximum 
should be enough
-// in most case, also have better precision.
--- End diff --

@viirya Yeah, we should keep it. Sorry, it got dropped inadvertently.


---




[GitHub] spark issue #22619: [SQL][MINOR] Make use of TypeCoercion.findTightestCommon...

2018-10-03 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22619
  
@HyukjinKwon Okay.


---




[GitHub] spark issue #22448: [SPARK-25417][SQL] Improve findTightestCommonType to coe...

2018-10-03 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22448
  
@HyukjinKwon Thanks for checking it out.


---




[GitHub] spark pull request #22619: [SQL][MINOR] Make use of TypeCoercion.findTightes...

2018-10-03 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22619#discussion_r222195632
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala
 ---
@@ -172,51 +172,35 @@ private[csv] object CSVInferSchema {
 StringType
   }
 
-  private val numericPrecedence: IndexedSeq[DataType] = 
TypeCoercion.numericPrecedence
-
   /**
-   * Copied from internal Spark api
-   * [[org.apache.spark.sql.catalyst.analysis.TypeCoercion]]
+   * Returns the common data type given two input data types so that the 
return type
+   * is compatible with both input data types.
*/
-  val findTightestCommonType: (DataType, DataType) => Option[DataType] = {
-case (t1, t2) if t1 == t2 => Some(t1)
-case (NullType, t1) => Some(t1)
-case (t1, NullType) => Some(t1)
-case (StringType, t2) => Some(StringType)
-case (t1, StringType) => Some(StringType)
-
-// Promote numeric types to the highest of the two and all numeric 
types to unlimited decimal
-case (t1, t2) if Seq(t1, t2).forall(numericPrecedence.contains) =>
-  val index = numericPrecedence.lastIndexWhere(t => t == t1 || t == t2)
-  Some(numericPrecedence(index))
-
-// These two cases below deal with when `DecimalType` is larger than 
`IntegralType`.
-case (t1: IntegralType, t2: DecimalType) if t2.isWiderThan(t1) =>
-  Some(t2)
-case (t1: DecimalType, t2: IntegralType) if t1.isWiderThan(t2) =>
-  Some(t1)
-
-// These two cases below deal with when `IntegralType` is larger than 
`DecimalType`.
-case (t1: IntegralType, t2: DecimalType) =>
-  findTightestCommonType(DecimalType.forType(t1), t2)
-case (t1: DecimalType, t2: IntegralType) =>
-  findTightestCommonType(t1, DecimalType.forType(t2))
-
-// Double support larger range than fixed decimal, DecimalType.Maximum 
should be enough
-// in most case, also have better precision.
-case (DoubleType, _: DecimalType) | (_: DecimalType, DoubleType) =>
-  Some(DoubleType)
-
-case (t1: DecimalType, t2: DecimalType) =>
-  val scale = math.max(t1.scale, t2.scale)
-  val range = math.max(t1.precision - t1.scale, t2.precision - 
t2.scale)
-  if (range + scale > 38) {
-// DecimalType can't support precision > 38
-Some(DoubleType)
-  } else {
-Some(DecimalType(range + scale, scale))
+  def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
+TypeCoercion.findTightestCommonType(t1, t2).orElse {
+  (t1, t2) match {
+case (StringType, t2) => Some(StringType)
+case (t1, StringType) => Some(StringType)
+
+case (t1: IntegralType, t2: DecimalType) =>
+  compatibleType(DecimalType.forType(t1), t2)
+case (t1: DecimalType, t2: IntegralType) =>
+  compatibleType(t1, DecimalType.forType(t2))
+
+case (DoubleType, _: DecimalType) | (_: DecimalType, DoubleType) =>
+  Some(DoubleType)
+
+case (t1: DecimalType, t2: DecimalType) =>
+  val scale = math.max(t1.scale, t2.scale)
+  val range = math.max(t1.precision - t1.scale, t2.precision - 
t2.scale)
+  if (range + scale > 38) {
+// DecimalType can't support precision > 38
+Some(DoubleType)
+  } else {
+Some(DecimalType(range + scale, scale))
+  }
+case (_, _) => None
--- End diff --

@HyukjinKwon Thanks. Will change.


---




[GitHub] spark issue #22619: [SQL][MINOR] Make use of TypeCoercion.findTightestCommon...

2018-10-03 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22619
  
@ueshin 
> Maybe this is related to #22448.
Yeah. Actually, @MaxGekk had pointed me to the presence of duplicate code in one of his comments. I was trying to address it here.


---




[GitHub] spark issue #22619: [SQL][MINOR] Make use of TypeCoercion.findTightestCommon...

2018-10-03 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22619
  
@gatorsmile There should not be any behaviour change; I was thinking that the existing test cases should suffice. Basically, we used to duplicate the code of TypeCoercion.findTightestCommonType here, and I am just reusing the common function now.
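For illustration, the reuse has roughly this shape (a trimmed sketch based on the cases quoted in the diff above, not the complete method in CSVInferSchema):
```
import org.apache.spark.sql.catalyst.analysis.TypeCoercion
import org.apache.spark.sql.types._

// Try the shared tightest-common-type rule first; fall back to the
// CSV-specific widenings only when it does not apply.
def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
  TypeCoercion.findTightestCommonType(t1, t2).orElse {
    (t1, t2) match {
      case (StringType, _) | (_, StringType) => Some(StringType)
      case (DoubleType, _: DecimalType) | (_: DecimalType, DoubleType) => Some(DoubleType)
      // ... integral/decimal widening cases elided; see the diff quoted above.
      case _ => None
    }
  }
}
```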


---




[GitHub] spark issue #22619: [SQL][MINOR] Make use of TypeCoercion.findTightestCommon...

2018-10-02 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22619
  
cc @HyukjinKwon @MaxGekk 


---




[GitHub] spark pull request #22619: [SQL][MINOR] Make use of TypeCoercion.findTightes...

2018-10-02 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/22619

[SQL][MINOR] Make use of TypeCoercion.findTightestCommonType while 
inferring CSV schema.

## What changes were proposed in this pull request?
Currently, the CSV schema inference code inlines `TypeCoercion.findTightestCommonType`. This is a minor refactor to make use of the common type coercion code where applicable, so we can take advantage of any improvements to the base method.

Thanks to @MaxGekk for finding this while reviewing another PR.

## How was this patch tested?
This is a minor refactor. Existing tests are used to verify the change.
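As a quick illustration of the behavior that must stay unchanged (an assumed example, not taken from the PR), mixing integer and decimal values in one CSV column should still infer a common numeric type:
```
import spark.implicits._

// Header "a" followed by an int and a decimal; the inferred type should be double.
val csvLines = Seq("a", "1", "2.0").toDS()
spark.read.option("header", "true").option("inferSchema", "true").csv(csvLines).printSchema()
```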

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark csv_minor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22619.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22619


commit d4e0bdb5a06f59ff1df3933cb43804e71cde8259
Author: Dilip Biswal 
Date:   2018-10-03T00:47:42Z

Make use of TypeCoercion.findTightestCommonType while inferring CSV schema




---




[GitHub] spark issue #22047: [SPARK-19851] Add support for EVERY and ANY (SOME) aggre...

2018-10-02 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22047
  
@gatorsmile Thanks.. I will check.


---




[GitHub] spark issue #22566: [SPARK-25458][SQL] Support FOR ALL COLUMNS in ANALYZE TA...

2018-09-28 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22566
  
Thank you very much @gatorsmile @dongjoon-hyun @juliuszsompolski 


---




[GitHub] spark issue #22566: [SPARK-25458][SQL] Support FOR ALL COLUMNS in ANALYZE TA...

2018-09-28 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22566
  
retest this please


---




[GitHub] spark pull request #22566: [SPARK-25458][SQL] Support FOR ALL COLUMNS in ANA...

2018-09-28 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22566#discussion_r221159987
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -653,6 +653,49 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
 }
   }
 
+  test("collecting statistics for all columns") {
+val table = "update_col_stats_table"
+withTable(table) {
+  sql(s"CREATE TABLE $table (c1 INT, c2 STRING, c3 DOUBLE)")
+  sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR ALL COLUMNS")
+  val fetchedStats0 =
+checkTableStats(table, hasSizeInBytes = true, expectedRowCounts = 
Some(0))
+  assert(fetchedStats0.get.colStats == Map(
+"c1" -> CatalogColumnStat(distinctCount = Some(0), min = None, max 
= None,
+  nullCount = Some(0), avgLen = Some(4), maxLen = Some(4)),
+"c3" -> CatalogColumnStat(distinctCount = Some(0), min = None, max 
= None,
+  nullCount = Some(0), avgLen = Some(8), maxLen = Some(8)),
+"c2" -> CatalogColumnStat(distinctCount = Some(0), min = None, max 
= None,
+  nullCount = Some(0), avgLen = Some(20), maxLen = Some(20
+
+  // Insert new data and analyze: have the latest column stats.
+  sql(s"INSERT INTO TABLE $table SELECT 1, 'a', 10.0")
+  sql(s"INSERT INTO TABLE $table SELECT 1, 'b', null")
+
+  sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR ALL COLUMNS")
+  val fetchedStats1 =
+checkTableStats(table, hasSizeInBytes = true, expectedRowCounts = 
Some(2))
+  assert(fetchedStats1.get.colStats == Map(
+"c1" -> CatalogColumnStat(distinctCount = Some(1), min = 
Some("1"), max = Some("1"),
+  nullCount = Some(0), avgLen = Some(4), maxLen = Some(4)),
+"c3" -> CatalogColumnStat(distinctCount = Some(1), min = 
Some("10.0"), max = Some("10.0"),
+  nullCount = Some(1), avgLen = Some(8), maxLen = Some(8)),
+"c2" -> CatalogColumnStat(distinctCount = Some(2), min = None, max 
= None,
+  nullCount = Some(0), avgLen = Some(1), maxLen = Some(1
+
+  val e1 = intercept[IllegalArgumentException] {
+AnalyzeColumnCommand(TableIdentifier(table), Option(Seq("c1")), 
true).run(spark)
+  }
+  assert(e1.getMessage.contains("Parameter `columnNames` or 
`allColumns` are" +
+" mutually exclusive"))
+  val e2 = intercept[IllegalArgumentException] {
+AnalyzeColumnCommand(TableIdentifier(table), None, 
false).run(spark)
+  }
+  assert(e1.getMessage.contains("Parameter `columnNames` or 
`allColumns` are" +
+" mutually exclusive"))
--- End diff --

@gatorsmile Sure.


---




[GitHub] spark pull request #22566: [SPARK-25458][SQL] Support FOR ALL COLUMNS in ANA...

2018-09-28 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22566#discussion_r221159433
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala
 ---
@@ -48,9 +52,13 @@ case class AnalyzeColumnCommand(
   throw new AnalysisException("ANALYZE TABLE is not supported on 
views.")
 }
 val sizeInBytes = CommandUtils.calculateTotalSize(sparkSession, 
tableMeta)
+val conf = sparkSession.sessionState.conf
--- End diff --

@gatorsmile Yeah... Thanks.. will remove.


---




[GitHub] spark issue #22544: [SPARK-25522][SQL] Improve type promotion for input argu...

2018-09-27 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22544
  
@cloud-fan Sounds good Wenchen. Thanks for the detailed analysis.


---




[GitHub] spark pull request #22566: [SPARK-25458][SQL] Support FOR ALL COLUMNS in ANA...

2018-09-27 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22566#discussion_r221029986
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala
 ---
@@ -50,7 +52,26 @@ case class AnalyzeColumnCommand(
 val sizeInBytes = CommandUtils.calculateTotalSize(sparkSession, 
tableMeta)
 
 // Compute stats for each column
-val (rowCount, newColStats) = computeColumnStats(sparkSession, 
tableIdentWithDB, columnNames)
+val conf = sparkSession.sessionState.conf
+val relation = sparkSession.table(tableIdent).logicalPlan
+val attributesToAnalyze = if (allColumns) {
+  relation.output
--- End diff --

@gatorsmile I think we don't allow it any more. Please see 
[test](https://github.com/dilipbiswal/spark/blob/7920b29f667de38ca5222e4de8f62eea688d67e4/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala#L111:L136).
Please let me know what you think.


---




[GitHub] spark pull request #22566: [SPARK-25458][SQL] Support FOR ALL COLUMNS in ANA...

2018-09-27 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22566#discussion_r220986964
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala
 ---
@@ -50,7 +52,26 @@ case class AnalyzeColumnCommand(
 val sizeInBytes = CommandUtils.calculateTotalSize(sparkSession, 
tableMeta)
 
 // Compute stats for each column
-val (rowCount, newColStats) = computeColumnStats(sparkSession, 
tableIdentWithDB, columnNames)
+val conf = sparkSession.sessionState.conf
+val relation = sparkSession.table(tableIdent).logicalPlan
+val attributesToAnalyze = if (allColumns) {
+  relation.output
+} else {
+  columnNames.get.map { col =>
+val exprOption = relation.output.find(attr => 
conf.resolver(attr.name, col))
+exprOption.getOrElse(throw new AnalysisException(s"Column $col 
does not exist."))
+  }
+}
+// Make sure the column types are supported for stats gathering.
+attributesToAnalyze.foreach { attr =>
+  if (!supportsType(attr.dataType)) {
+throw new AnalysisException(
+  s"Column ${attr.name} in table $tableIdent is of type 
${attr.dataType}, " +
+"and Spark does not support statistics collection on this 
column type.")
+  }
+}
--- End diff --

@gatorsmile 
> creating a new private function for the code between 55 and 72
Actually, we may not be able to refactor lines 55 to 72 wholesale, because we need to compute the `relation` only once. I will do the following outside and call the proposed private method:
```
 val conf = sparkSession.sessionState.conf
 val relation = sparkSession.table(tableIdent).logicalPlan
```
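Something along these lines (a rough sketch; the helper name and exact signature are assumptions, and `supportsType` is the existing private check in this command):
```
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.internal.SQLConf

// `conf` and `relation` are computed once in run() and passed in, as noted above.
private def getColumnsToAnalyze(
    tableIdent: TableIdentifier,
    conf: SQLConf,
    relation: LogicalPlan,
    columnNames: Option[Seq[String]],
    allColumns: Boolean): Seq[Attribute] = {
  val columnsToAnalyze = if (allColumns) {
    relation.output
  } else {
    columnNames.get.map { col =>
      relation.output.find(attr => conf.resolver(attr.name, col)).getOrElse(
        throw new AnalysisException(s"Column $col does not exist."))
    }
  }
  // Reject column types for which Spark cannot collect statistics.
  columnsToAnalyze.foreach { attr =>
    if (!supportsType(attr.dataType)) {
      throw new AnalysisException(
        s"Column ${attr.name} in table $tableIdent is of type ${attr.dataType}, " +
          "and Spark does not support statistics collection on this column type.")
    }
  }
  columnsToAnalyze
}
```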



---




[GitHub] spark pull request #22566: [SPARK-25458][SQL] Support FOR ALL COLUMNS in ANA...

2018-09-27 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22566#discussion_r220985607
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala
 ---
@@ -50,7 +52,26 @@ case class AnalyzeColumnCommand(
 val sizeInBytes = CommandUtils.calculateTotalSize(sparkSession, 
tableMeta)
 
 // Compute stats for each column
-val (rowCount, newColStats) = computeColumnStats(sparkSession, 
tableIdentWithDB, columnNames)
+val conf = sparkSession.sessionState.conf
+val relation = sparkSession.table(tableIdent).logicalPlan
+val attributesToAnalyze = if (allColumns) {
+  relation.output
+} else {
+  columnNames.get.map { col =>
+val exprOption = relation.output.find(attr => 
conf.resolver(attr.name, col))
+exprOption.getOrElse(throw new AnalysisException(s"Column $col 
does not exist."))
+  }
+}
+// Make sure the column types are supported for stats gathering.
+attributesToAnalyze.foreach { attr =>
+  if (!supportsType(attr.dataType)) {
+throw new AnalysisException(
+  s"Column ${attr.name} in table $tableIdent is of type 
${attr.dataType}, " +
+"and Spark does not support statistics collection on this 
column type.")
+  }
+}
--- End diff --

@gatorsmile OK


---




[GitHub] spark pull request #22566: [SPARK-25458][SQL] Support FOR ALL COLUMNS in ANA...

2018-09-27 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22566#discussion_r220985491
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala
 ---
@@ -33,11 +33,13 @@ import org.apache.spark.sql.types._
 
 /**
  * Analyzes the given columns of the given table to generate statistics, 
which will be used in
- * query optimizations.
+ * query optimizations. Parameter `allColumns` may be specified to 
generate statistics of all the
+ * columns of a given table.
  */
 case class AnalyzeColumnCommand(
 tableIdent: TableIdentifier,
-columnNames: Seq[String]) extends RunnableCommand {
+columnNames: Option[Seq[String]],
+allColumns: Boolean = false ) extends RunnableCommand {
--- End diff --

@gatorsmile ok.


---




[GitHub] spark pull request #22566: [SPARK-25458][SQL] Support FOR ALL COLUMNS in ANA...

2018-09-27 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/22566

[SPARK-25458][SQL] Support FOR ALL COLUMNS in ANALYZE TABLE

## What changes were proposed in this pull request?
**Description from the JIRA:**
Currently, to collect the statistics of all the columns, users need to specify the names of all the columns when calling the command "ANALYZE TABLE ... FOR COLUMNS ...". This is not user-friendly. Instead, we can introduce the following SQL command to achieve it without specifying the column names.

```
   ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR ALL COLUMNS;
```
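A usage sketch (the table and column names below are made up for illustration):
```
// Collect statistics for every column without listing them, then inspect one column.
spark.sql("ANALYZE TABLE mydb.sales COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("DESCRIBE EXTENDED mydb.sales amount").show(truncate = false)
```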

## How was this patch tested?
Added new tests in SparkSqlParserSuite and StatisticsSuite

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark SPARK-25458

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22566.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22566


commit d0da8bd88c351f3b679f539ae2e59bda8e6cbe71
Author: Dilip Biswal 
Date:   2018-09-19T04:07:40Z

[SPARK-25458] Support FOR ALL COLUMNS in ANALYZE TABLE




---




[GitHub] spark issue #22563: [SPARK-24341][SQL][followup] remove duplicated error che...

2018-09-27 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22563
  
@cloud-fan oops... sorry.


---




[GitHub] spark issue #22563: [SPARK-24341][SQL][followup] remove duplicated error che...

2018-09-27 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22563
  
retest this please


---




[GitHub] spark issue #22561: [SPARK-25548][SQL]In the PruneFileSourcePartitions optim...

2018-09-27 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22561
  
retest this please


---




[GitHub] spark issue #22562: [SPARK-25541][SQL][FOLLOWUP] Remove overriding filterKey...

2018-09-27 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22562
  
retest this please


---




[GitHub] spark issue #22558: [SPARK-25546][core] Don't cache value of EVENT_LOG_CALLS...

2018-09-27 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22558
  
retest this please


---




[GitHub] spark issue #22544: [SPARK-25522][SQL] Improve type promotion for input argu...

2018-09-27 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22544
  
@cloud-fan @ueshin @viirya Thank you very much.


---




[GitHub] spark issue #22544: [SPARK-25522][SQL] Improve type promotion for input argu...

2018-09-27 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22544
  
@cloud-fan Are we targeting this for 2.4? If so, I was wondering if we should make a restricted fix that affects only this function, by adding a case in `FunctionArgumentConversion` for 2.4, while keeping this fix for master. Please let me know what you think.


---




[GitHub] spark pull request #22544: [SPARK-25522][SQL] Improve type promotion for inp...

2018-09-26 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22544#discussion_r220766294
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -974,6 +974,25 @@ object TypeCoercion {
 if !Cast.forceNullable(fromType, toType) =>
   implicitCast(fromType, toType).map(ArrayType(_, false)).orNull
 
+// Implicit cast between Map types.
+// Follows the same semantics of implicit casting between two 
array types.
+// Refer to documentation above. Make sure that both key and values
+// can not by null after the implicit cast operation by calling 
forceNullable
--- End diff --

@viirya Thank you. 


---




[GitHub] spark pull request #22560: [SPARK-25547][Spark Core] Pluggable JDBC connecti...

2018-09-26 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22560#discussion_r220757697
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtilsSuite.scala
 ---
@@ -65,4 +67,60 @@ class JdbcUtilsSuite extends SparkFunSuite {
 }
 assert(mismatchedInput.getMessage.contains("mismatched input '.' 
expecting"))
   }
+
+  test("Custom ConnectionFactory falls back to default when no factory 
specified") {
+val options = new JDBCOptions(Map[String, String](
+  "url" -> "jdbc:mysql://foo.com/bar",
+  "dbtable" -> "table")
+)
+assert(options.connectionFactoryProvider eq 
DefaultConnectionFactoryProvider)
+  }
+
+  test("JdbcUtils uses the specified connection factory") {
+val options = new JDBCOptions(Map[String, String](
+  "url" -> "jdbc:mysql://foo.com/bar",
+  "dbtable" -> "table",
+  JDBCOptions.JDBC_CONNECTION_FACTORY_PROVIDER ->
+"org.apache.spark.sql.execution.datasources.jdbc.TestFactory")
+)
+val x = intercept[RuntimeException] {
+  JdbcUtils.createConnectionFactory(options)()
+}
+assert(x.getMessage == "This message will be tested in test")
+  }
+
+  test("invalid connection factory throws IllegalArgumentException") {
+
+val nonexisting = intercept[IllegalArgumentException] {
+  new JDBCOptions(Map[String, String](
+"url" -> "jdbc:mysql://foo.com/bar",
+"dbtable" -> "table",
+JDBCOptions.JDBC_CONNECTION_FACTORY_PROVIDER -> "notexistingclass")
+  )
+}
+assert(nonexisting.getMessage == "notexistingclass is not a valid 
ConnectionFactoryProvider")
+
+val missingTrait = intercept[IllegalArgumentException] {
+  new JDBCOptions(Map[String, String](
+"url" -> "jdbc:mysql://foo.com/bar",
+"dbtable" -> "table",
+JDBCOptions.JDBC_CONNECTION_FACTORY_PROVIDER ->
+  "org.apache.spark.sql.execution.datasources.jdbc.BadFactory")
+  )
+}
+assert(missingTrait.getMessage ==
+  "org.apache.spark.sql.execution.datasources.jdbc.BadFactory" +
+" is not a valid ConnectionFactoryProvider")
+
+  }
+}
+
+// this one does not implement ConnectionFactoryProvider
+class BadFactory {
+
+}
+
+class TestFactory extends ConnectionFactoryProvider {
+  override def createConnectionFactory(options: JDBCOptions): () => 
Connection =
--- End diff --

I see. OK. 


---




[GitHub] spark pull request #22560: [SPARK-25547][Spark Core] Pluggable JDBC connecti...

2018-09-26 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22560#discussion_r220751466
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtilsSuite.scala
 ---
@@ -65,4 +67,60 @@ class JdbcUtilsSuite extends SparkFunSuite {
 }
 assert(mismatchedInput.getMessage.contains("mismatched input '.' 
expecting"))
   }
+
+  test("Custom ConnectionFactory falls back to default when no factory 
specified") {
+val options = new JDBCOptions(Map[String, String](
+  "url" -> "jdbc:mysql://foo.com/bar",
+  "dbtable" -> "table")
+)
+assert(options.connectionFactoryProvider eq 
DefaultConnectionFactoryProvider)
+  }
+
+  test("JdbcUtils uses the specified connection factory") {
+val options = new JDBCOptions(Map[String, String](
+  "url" -> "jdbc:mysql://foo.com/bar",
+  "dbtable" -> "table",
+  JDBCOptions.JDBC_CONNECTION_FACTORY_PROVIDER ->
+"org.apache.spark.sql.execution.datasources.jdbc.TestFactory")
+)
+val x = intercept[RuntimeException] {
+  JdbcUtils.createConnectionFactory(options)()
+}
+assert(x.getMessage == "This message will be tested in test")
+  }
+
+  test("invalid connection factory throws IllegalArgumentException") {
+
+val nonexisting = intercept[IllegalArgumentException] {
+  new JDBCOptions(Map[String, String](
+"url" -> "jdbc:mysql://foo.com/bar",
+"dbtable" -> "table",
+JDBCOptions.JDBC_CONNECTION_FACTORY_PROVIDER -> "notexistingclass")
+  )
+}
+assert(nonexisting.getMessage == "notexistingclass is not a valid 
ConnectionFactoryProvider")
+
+val missingTrait = intercept[IllegalArgumentException] {
+  new JDBCOptions(Map[String, String](
+"url" -> "jdbc:mysql://foo.com/bar",
+"dbtable" -> "table",
+JDBCOptions.JDBC_CONNECTION_FACTORY_PROVIDER ->
+  "org.apache.spark.sql.execution.datasources.jdbc.BadFactory")
+  )
+}
+assert(missingTrait.getMessage ==
+  "org.apache.spark.sql.execution.datasources.jdbc.BadFactory" +
+" is not a valid ConnectionFactoryProvider")
+
+  }
+}
+
+// this one does not implement ConnectionFactoryProvider
+class BadFactory {
+
+}
+
+class TestFactory extends ConnectionFactoryProvider {
+  override def createConnectionFactory(options: JDBCOptions): () => 
Connection =
--- End diff --

> Without these changes we had to copy most of the spark jdbc package into our own codebase to allow us to create our own connection factory

Trying to understand: here we are passing `JDBCOptions`, which is a Spark class. I thought the intention was for the custom connection factory code not to depend on Spark classes?
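For reference, a minimal sketch of a custom provider under this proposal (the trait and the `JDBCOptions` parameter come from the quoted test code; this is not a released Spark API):
```
import java.sql.{Connection, DriverManager}
import org.apache.spark.sql.execution.datasources.jdbc.{ConnectionFactoryProvider, JDBCOptions}

// A provider could, for example, hand out pooled connections instead of opening
// a new one per call; here it just delegates to DriverManager.
class MyConnectionFactoryProvider extends ConnectionFactoryProvider {
  override def createConnectionFactory(options: JDBCOptions): () => Connection =
    () => DriverManager.getConnection(options.url, options.asConnectionProperties)
}
```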



---




[GitHub] spark pull request #22544: [SPARK-25522][SQL] Improve type promotion for inp...

2018-09-26 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22544#discussion_r220746708
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -974,6 +974,33 @@ object TypeCoercion {
 if !Cast.forceNullable(fromType, toType) =>
   implicitCast(fromType, toType).map(ArrayType(_, false)).orNull
 
+// Implicit cast between Map types.
+// Follows the same semantics of implicit casting between two 
array types.
+// Refer to documentation above.
+case (MapType(fromKeyType, fromValueType, fn), MapType(toKeyType, 
toValueType, true))
+if !Cast.forceNullable(fromKeyType, toKeyType) =>
+  val newKeyType = implicitCast(fromKeyType, toKeyType).orNull
+  val newValueType = implicitCast(fromValueType, 
toValueType).orNull
+  if (newKeyType != null && newValueType != null) {
+MapType(newKeyType, newValueType, true)
+  } else {
+null
+  }
+
+case (MapType(fromKeyType, fromValueType, true), 
MapType(toKeyType, toValueType, false)) =>
+ null
+
+case (MapType(fromKeyType, fromValueType, false), 
MapType(toKeyType, toValueType, false))
+if (!(Cast.forceNullable(fromKeyType, toKeyType) ||
+  Cast.forceNullable(fromValueType, toValueType))) =>
+  val newKeyType = implicitCast(fromKeyType, toKeyType).orNull
+  val newValueType = implicitCast(fromValueType, 
toValueType).orNull
+  if (newKeyType != null && newValueType != null) {
+MapType(newKeyType, newValueType, false)
+  } else {
+null
+  }
+
--- End diff --

@cloud-fan Done. Please check when you get a chance.


---




[GitHub] spark issue #22556: [MINOR] Remove useless InSubquery expression

2018-09-26 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22556
  
LGTM


---




[GitHub] spark pull request #22326: [SPARK-25314][SQL] Fix Python UDF accessing attri...

2018-09-26 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22326#discussion_r220463070
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -152,3 +153,56 @@ object EliminateOuterJoin extends Rule[LogicalPlan] 
with PredicateHelper {
   if (j.joinType == newJoinType) f else Filter(condition, 
j.copy(joinType = newJoinType))
   }
 }
+
+/**
+ * Correctly handle PythonUDF which need access both side of join side by 
changing the new join
+ * type to Cross.
+ */
+object HandlePythonUDFInJoinCondition extends Rule[LogicalPlan] with 
PredicateHelper {
+  def hasPythonUDF(expression: Expression): Boolean = {
+expression.collectFirst { case udf: PythonUDF => udf }.isDefined
+  }
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+case j @ Join(_, _, joinType, condition)
+  if 
condition.map(splitConjunctivePredicates).getOrElse(Nil).exists(hasPythonUDF) =>
+  if (!joinType.isInstanceOf[InnerLike] && joinType != LeftSemi) {
+// The current strategy only support InnerLike and LeftSemi join 
because for other type,
+// it breaks SQL semantic if we run the join condition as a filter 
after join. If we pass
+// the plan here, it'll still get a an invalid PythonUDF 
RuntimeException with message
+// `requires attributes from more than one child`, we throw 
firstly here for better
+// readable information.
+throw new AnalysisException("Using PythonUDF in join condition of 
join type" +
+  s" $joinType is not supported.")
+  }
+  if (SQLConf.get.crossJoinEnabled) {
--- End diff --

@mgaido91 It's probably because this suite only exercises one rule of the optimizer? :-)
```
object Optimize extends RuleExecutor[LogicalPlan] {
  val batches = Batch("Check Cartesian Products", Once, CheckCartesianProducts) :: Nil
}
```


---




[GitHub] spark pull request #22544: [SPARK-25522][SQL] Improve type promotion for inp...

2018-09-26 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22544#discussion_r220456613
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -971,9 +971,38 @@ object TypeCoercion {
 case (ArrayType(fromType, true), ArrayType(toType: DataType, 
false)) => null
 
 case (ArrayType(fromType, false), ArrayType(toType: DataType, 
false))
-if !Cast.forceNullable(fromType, toType) =>
+  if !Cast.forceNullable(fromType, toType) =>
   implicitCast(fromType, toType).map(ArrayType(_, false)).orNull
 
+// Implicit cast between Map types.
+// Follows the same semantics of implicit casting between two 
array types.
+// Refer to documentation above.
+case (MapType(fromKeyType, fromValueType, fn), MapType(toKeyType, 
toValueType, true))
+  if !Cast.forceNullable(fromKeyType, toKeyType) =>
+  val newKeyType = implicitCast(fromKeyType, toKeyType).orNull
+  val newValueType = implicitCast(fromValueType, 
toValueType).orNull
+  if (newKeyType != null && newValueType != null
+&& (!newKeyType.sameType(fromKeyType) || 
!newValueType.sameType(fromValueType))) {
--- End diff --

@ueshin You are right. Let me take it out and run the tests.


---




[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attributes f...

2018-09-26 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22326
  
retest this please


---




[GitHub] spark pull request #22326: [SPARK-25314][SQL] Fix Python UDF accessing attri...

2018-09-26 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22326#discussion_r220440612
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -152,3 +153,56 @@ object EliminateOuterJoin extends Rule[LogicalPlan] 
with PredicateHelper {
   if (j.joinType == newJoinType) f else Filter(condition, 
j.copy(joinType = newJoinType))
   }
 }
+
+/**
+ * Correctly handle PythonUDF which need access both side of join side by 
changing the new join
+ * type to Cross.
+ */
+object HandlePythonUDFInJoinCondition extends Rule[LogicalPlan] with 
PredicateHelper {
+  def hasPythonUDF(expression: Expression): Boolean = {
+expression.collectFirst { case udf: PythonUDF => udf }.isDefined
+  }
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+case j @ Join(_, _, joinType, condition)
+  if 
condition.map(splitConjunctivePredicates).getOrElse(Nil).exists(hasPythonUDF) =>
--- End diff --

@cloud-fan I just saw your comment, which validated my understanding that this rule relies on predicates being pushed down through the join. However, `PushPredicateThroughJoin` is not in the `nonExcludableRules` list, so can we rely on that rule always being fired here?


---




[GitHub] spark pull request #22326: [SPARK-25314][SQL] Fix Python UDF accessing attri...

2018-09-26 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22326#discussion_r220438880
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -152,3 +153,56 @@ object EliminateOuterJoin extends Rule[LogicalPlan] 
with PredicateHelper {
   if (j.joinType == newJoinType) f else Filter(condition, 
j.copy(joinType = newJoinType))
   }
 }
+
+/**
+ * Correctly handle PythonUDF which need access both side of join side by 
changing the new join
+ * type to Cross.
+ */
+object HandlePythonUDFInJoinCondition extends Rule[LogicalPlan] with 
PredicateHelper {
+  def hasPythonUDF(expression: Expression): Boolean = {
+expression.collectFirst { case udf: PythonUDF => udf }.isDefined
+  }
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+case j @ Join(_, _, joinType, condition)
+  if 
condition.map(splitConjunctivePredicates).getOrElse(Nil).exists(hasPythonUDF) =>
--- End diff --

@xuanyuanking Oh, is it because the UDFs referring to a single leg would have been pushed down, so we will only have UDFs referring to both legs in the join condition by the time we get here?


---




[GitHub] spark pull request #22326: [SPARK-25314][SQL] Fix Python UDF accessing attri...

2018-09-26 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22326#discussion_r220436983
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -152,3 +153,56 @@ object EliminateOuterJoin extends Rule[LogicalPlan] 
with PredicateHelper {
   if (j.joinType == newJoinType) f else Filter(condition, 
j.copy(joinType = newJoinType))
   }
 }
+
+/**
+ * Correctly handle PythonUDF which need access both side of join side by 
changing the new join
+ * type to Cross.
+ */
+object HandlePythonUDFInJoinCondition extends Rule[LogicalPlan] with 
PredicateHelper {
+  def hasPythonUDF(expression: Expression): Boolean = {
+expression.collectFirst { case udf: PythonUDF => udf }.isDefined
+  }
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+case j @ Join(_, _, joinType, condition)
+  if 
condition.map(splitConjunctivePredicates).getOrElse(Nil).exists(hasPythonUDF) =>
--- End diff --

@xuanyuanking So here we are finding out whether the join condition has a Python UDF. I am trying to understand where we determine that this Python UDF refers to attributes of both legs of the join. Can you please let me know? Thank you.


---




[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attributes f...

2018-09-25 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22326
  
Does it make sense to have a test case where the PythonUDF references attributes from both legs of the join in disjunction with a regular predicate? i.e. joinCond = pythonUDF(leftattr, rightattr) || leftattr > 5


---




[GitHub] spark pull request #22326: [SPARK-25314][SQL] Fix Python UDF accessing attri...

2018-09-25 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22326#discussion_r220428377
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -552,6 +552,92 @@ def test_udf_in_filter_on_top_of_join(self):
 df = left.crossJoin(right).filter(f("a", "b"))
 self.assertEqual(df.collect(), [Row(a=1, b=1)])
 
+def test_udf_in_join_condition(self):
+# regression test for SPARK-25314
+from pyspark.sql.functions import udf
+left = self.spark.createDataFrame([Row(a=1)])
+right = self.spark.createDataFrame([Row(b=1)])
+f = udf(lambda a, b: a == b, BooleanType())
+df = left.join(right, f("a", "b"))
+with self.sql_conf({"spark.sql.crossJoin.enabled": True}):
+self.assertEqual(df.collect(), [Row(a=1, b=1)])
+
+def test_udf_in_left_semi_join_condition(self):
+# regression test for SPARK-25314
+from pyspark.sql.functions import udf
+left = self.spark.createDataFrame([Row(a=1, a1=1, a2=1), Row(a=2, 
a1=2, a2=2)])
+right = self.spark.createDataFrame([Row(b=1, b1=1, b2=1)])
+f = udf(lambda a, b: a == b, BooleanType())
+df = left.join(right, f("a", "b"), "leftsemi")
+with self.sql_conf({"spark.sql.crossJoin.enabled": True}):
+self.assertEqual(df.collect(), [Row(a=1, a1=1, a2=1)])
+
+def test_udf_and_filter_in_join_condition(self):
+# regression test for SPARK-25314
+# test the complex scenario with both udf(non-deterministic)
+# and normal filter(deterministic)
+from pyspark.sql.functions import udf
+left = self.spark.createDataFrame([Row(a=1, a1=1, a2=1), Row(a=2, 
a1=2, a2=2)])
+right = self.spark.createDataFrame([Row(b=1, b1=1, b2=1), Row(b=2, 
b1=1, b2=2)])
+f = udf(lambda a, b: a == b, BooleanType())
+df = left.join(right, [f("a", "b1"), left.a == 1, right.b == 2])
+with self.sql_conf({"spark.sql.crossJoin.enabled": True}):
+self.assertEqual(df.collect(), [Row(a=1, a1=1, a2=1, b=2, 
b1=1, b2=2)])
+
+def test_udf_and_filter_in_left_semi_join_condition(self):
+# regression test for SPARK-25314
+# test the complex scenario with both udf(non-deterministic)
+# and normal filter(deterministic)
+from pyspark.sql.functions import udf
+left = self.spark.createDataFrame([Row(a=1, a1=1, a2=1), Row(a=2, 
a1=2, a2=2)])
+right = self.spark.createDataFrame([Row(b=1, b1=1, b2=1), Row(b=2, 
b1=1, b2=2)])
+f = udf(lambda a, b: a == b, BooleanType())
+df = left.join(right, [f("a", "b1"), left.a == 1, right.b == 2], 
"left_semi")
+with self.sql_conf({"spark.sql.crossJoin.enabled": True}):
+self.assertEqual(df.collect(), [Row(a=1, a1=1, a2=1)])
+
+def test_udf_and_common_filter_in_join_condition(self):
+# regression test for SPARK-25314
+# test the complex scenario with both udf(non-deterministic)
+# and common filter(deterministic)
+from pyspark.sql.functions import udf
+left = self.spark.createDataFrame([Row(a=1, a1=1, a2=1), Row(a=2, 
a1=2, a2=2)])
+right = self.spark.createDataFrame([Row(b=1, b1=1, b2=1), Row(b=1, 
b1=3, b2=1)])
+f = udf(lambda a, b: a == b, BooleanType())
--- End diff --

same question as above ..


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22326: [SPARK-25314][SQL] Fix Python UDF accessing attri...

2018-09-25 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22326#discussion_r220428290
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -552,6 +552,92 @@ def test_udf_in_filter_on_top_of_join(self):
 df = left.crossJoin(right).filter(f("a", "b"))
 self.assertEqual(df.collect(), [Row(a=1, b=1)])
 
+def test_udf_in_join_condition(self):
+# regression test for SPARK-25314
+from pyspark.sql.functions import udf
+left = self.spark.createDataFrame([Row(a=1)])
+right = self.spark.createDataFrame([Row(b=1)])
+f = udf(lambda a, b: a == b, BooleanType())
+df = left.join(right, f("a", "b"))
+with self.sql_conf({"spark.sql.crossJoin.enabled": True}):
+self.assertEqual(df.collect(), [Row(a=1, b=1)])
+
+def test_udf_in_left_semi_join_condition(self):
+# regression test for SPARK-25314
+from pyspark.sql.functions import udf
+left = self.spark.createDataFrame([Row(a=1, a1=1, a2=1), Row(a=2, 
a1=2, a2=2)])
+right = self.spark.createDataFrame([Row(b=1, b1=1, b2=1)])
+f = udf(lambda a, b: a == b, BooleanType())
+df = left.join(right, f("a", "b"), "leftsemi")
+with self.sql_conf({"spark.sql.crossJoin.enabled": True}):
+self.assertEqual(df.collect(), [Row(a=1, a1=1, a2=1)])
+
+def test_udf_and_filter_in_join_condition(self):
+# regression test for SPARK-25314
+# test the complex scenario with both udf(non-deterministic)
+# and normal filter(deterministic)
+from pyspark.sql.functions import udf
+left = self.spark.createDataFrame([Row(a=1, a1=1, a2=1), Row(a=2, 
a1=2, a2=2)])
+right = self.spark.createDataFrame([Row(b=1, b1=1, b2=1), Row(b=2, 
b1=1, b2=2)])
+f = udf(lambda a, b: a == b, BooleanType())
--- End diff --

is this udf non-deterministic ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22326: [SPARK-25314][SQL] Fix Python UDF accessing attri...

2018-09-25 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22326#discussion_r220428332
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -552,6 +552,92 @@ def test_udf_in_filter_on_top_of_join(self):
 df = left.crossJoin(right).filter(f("a", "b"))
 self.assertEqual(df.collect(), [Row(a=1, b=1)])
 
+def test_udf_in_join_condition(self):
+# regression test for SPARK-25314
+from pyspark.sql.functions import udf
+left = self.spark.createDataFrame([Row(a=1)])
+right = self.spark.createDataFrame([Row(b=1)])
+f = udf(lambda a, b: a == b, BooleanType())
+df = left.join(right, f("a", "b"))
+with self.sql_conf({"spark.sql.crossJoin.enabled": True}):
+self.assertEqual(df.collect(), [Row(a=1, b=1)])
+
+def test_udf_in_left_semi_join_condition(self):
+# regression test for SPARK-25314
+from pyspark.sql.functions import udf
+left = self.spark.createDataFrame([Row(a=1, a1=1, a2=1), Row(a=2, 
a1=2, a2=2)])
+right = self.spark.createDataFrame([Row(b=1, b1=1, b2=1)])
+f = udf(lambda a, b: a == b, BooleanType())
+df = left.join(right, f("a", "b"), "leftsemi")
+with self.sql_conf({"spark.sql.crossJoin.enabled": True}):
+self.assertEqual(df.collect(), [Row(a=1, a1=1, a2=1)])
+
+def test_udf_and_filter_in_join_condition(self):
+# regression test for SPARK-25314
+# test the complex scenario with both udf(non-deterministic)
+# and normal filter(deterministic)
+from pyspark.sql.functions import udf
+left = self.spark.createDataFrame([Row(a=1, a1=1, a2=1), Row(a=2, 
a1=2, a2=2)])
+right = self.spark.createDataFrame([Row(b=1, b1=1, b2=1), Row(b=2, 
b1=1, b2=2)])
+f = udf(lambda a, b: a == b, BooleanType())
+df = left.join(right, [f("a", "b1"), left.a == 1, right.b == 2])
+with self.sql_conf({"spark.sql.crossJoin.enabled": True}):
+self.assertEqual(df.collect(), [Row(a=1, a1=1, a2=1, b=2, 
b1=1, b2=2)])
+
+def test_udf_and_filter_in_left_semi_join_condition(self):
+# regression test for SPARK-25314
+# test the complex scenario with both udf(non-deterministic)
+# and normal filter(deterministic)
+from pyspark.sql.functions import udf
+left = self.spark.createDataFrame([Row(a=1, a1=1, a2=1), Row(a=2, 
a1=2, a2=2)])
+right = self.spark.createDataFrame([Row(b=1, b1=1, b2=1), Row(b=2, 
b1=1, b2=2)])
+f = udf(lambda a, b: a == b, BooleanType())
--- End diff --

is this udf non-deterministic ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22544: [SPARK-25522][SQL] Improve type promotion for inp...

2018-09-25 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22544#discussion_r220253485
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -971,9 +971,36 @@ object TypeCoercion {
 case (ArrayType(fromType, true), ArrayType(toType: DataType, 
false)) => null
 
 case (ArrayType(fromType, false), ArrayType(toType: DataType, 
false))
-if !Cast.forceNullable(fromType, toType) =>
+  if !Cast.forceNullable(fromType, toType) =>
   implicitCast(fromType, toType).map(ArrayType(_, false)).orNull
 
+// Implicit cast between Map types.
+// Follows the same semantics of implicit casting between two 
array types.
+// Refer to documentation above.
+case (MapType(fromKeyType, fromValueType, fn), MapType(toKeyType, 
toValueType, true)) =>
+  val newFromType = implicitCast(fromKeyType, toKeyType).orNull
--- End diff --

@viirya Will change.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22544: [SPARK-25522][SQL] Improve type promotion for inp...

2018-09-25 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22544#discussion_r220250631
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -971,9 +971,36 @@ object TypeCoercion {
 case (ArrayType(fromType, true), ArrayType(toType: DataType, 
false)) => null
 
 case (ArrayType(fromType, false), ArrayType(toType: DataType, 
false))
-if !Cast.forceNullable(fromType, toType) =>
+  if !Cast.forceNullable(fromType, toType) =>
   implicitCast(fromType, toType).map(ArrayType(_, false)).orNull
 
+// Implicit cast between Map types.
--- End diff --

@viirya Actually, I followed the ArrayType example, and I think it has decent 
coverage. If you can think of a case that's missing, I will add it. I 
will also go through it once again myself.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22544: [SPARK-25522][SQL] Improve type promotion for inp...

2018-09-25 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22544#discussion_r220250338
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 ---
@@ -2140,21 +2140,34 @@ case class ElementAt(left: Expression, right: 
Expression) extends GetMapValueUti
   }
 
   override def inputTypes: Seq[AbstractDataType] = {
-Seq(TypeCollection(ArrayType, MapType),
-  left.dataType match {
-case _: ArrayType => IntegerType
-case _: MapType => mapKeyType
-case _ => AnyDataType // no match for a wrong 'left' expression 
type
-  }
-)
+(left.dataType, right.dataType) match {
+  case (ArrayType(e1, hasNull), e2: IntegralType) if (e2 != LongType) 
=>
+Seq(ArrayType(e1, hasNull), IntegerType)
+  case (MapType(keyType, valueType, hasNull), e2) =>
+TypeCoercion.findTightestCommonType(keyType, e2) match {
+  case Some(dt) => Seq(MapType(dt, valueType, hasNull), dt)
+  case _ => Seq.empty
--- End diff --

@viirya Thanks.. I have updated the description. Please let me know if it 
looks okay to you or if anything is amiss.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22544: [SPARK-25522][SQL] Improve type promotion for inp...

2018-09-25 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22544#discussion_r220232666
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 ---
@@ -2140,21 +2140,34 @@ case class ElementAt(left: Expression, right: 
Expression) extends GetMapValueUti
   }
 
   override def inputTypes: Seq[AbstractDataType] = {
-Seq(TypeCollection(ArrayType, MapType),
-  left.dataType match {
-case _: ArrayType => IntegerType
-case _: MapType => mapKeyType
-case _ => AnyDataType // no match for a wrong 'left' expression 
type
-  }
-)
+(left.dataType, right.dataType) match {
+  case (ArrayType(e1, hasNull), e2: IntegralType) if (e2 != LongType) 
=>
+Seq(ArrayType(e1, hasNull), IntegerType)
+  case (MapType(keyType, valueType, hasNull), e2) =>
+TypeCoercion.findTightestCommonType(keyType, e2) match {
+  case Some(dt) => Seq(MapType(dt, valueType, hasNull), dt)
+  case _ => Seq.empty
--- End diff --

@viirya 
```select element_at(map(1,"one", 2, "two"), 2.2D);```
In this case, there is no key with value 2.2 in the map, right? Yet we 
return "two" as the answer. If we simply cast the right side to the key type of 
the map, we cast 2.2 down to 2 and find "two". By finding a 
common type instead, we compare both sides as double and correctly return "null" 
as the answer.

Here is presto's output : 
```SQL
presto:default> select element_at(MAP(array[1], array[1]), 1.1);
 _col0 
---
 NULL  
(1 row)

Query 20180921_051929_00018_iif98, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

presto:default> select element_at(MAP(array[1], array[1]), 1.0);
 _col0 
---
 1 
(1 row)
```
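
And the corresponding checks from spark-shell (the expected results in the 
comments assume the common-type comparison described above, not the old 
downcasting behaviour):

```scala
// No key equals 2.2 once both sides are compared as double -> null.
spark.sql("select element_at(map(1, 'one', 2, 'two'), 2.2D)").show()

// 2.0 does match the key 2 after promotion -> 'two'.
spark.sql("select element_at(map(1, 'one', 2, 'two'), 2.0D)").show()
```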



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22544: [SPARK-25522] Improve type promotion for input ar...

2018-09-25 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/22544

[SPARK-25522] Improve type promotion for input arguments of elementAt 
function

## What changes were proposed in this pull request?
In ElementAt, when the first argument is a MapType, we should coerce the key type 
and the second argument to a common type based on findTightestCommonType. This is 
not happening currently.

Also, when the first argument is an ArrayType, the second argument should be 
an integer type or a smaller integral type that can be safely cast to an 
integer type. Currently we may do an unsafe cast.

```SQL
spark-sql> select element_at(array(1,2), 1.24);

1
```
```SQL
spark-sql> select element_at(map(1,"one", 2, "two"), 2.2);

two
```
This PR also supports implicit cast between two MapTypes. I have followed 
similar logic that exists today to do implicit casts between two array types.
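
The MapType handling mirrors the existing ArrayType logic. A simplified sketch of 
the idea (not the exact code in this PR; it leaves out the value-nullability 
check the real rule needs):

```scala
import org.apache.spark.sql.types._

// Cast keys and values independently; if either side cannot be implicitly
// cast, the whole map cast is rejected.
def implicitMapCast(
    from: MapType,
    to: MapType,
    implicitCast: (DataType, DataType) => Option[DataType]): Option[MapType] = {
  for {
    newKeyType   <- implicitCast(from.keyType, to.keyType)
    newValueType <- implicitCast(from.valueType, to.valueType)
  } yield MapType(newKeyType, newValueType, from.valueContainsNull)
}
```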
## How was this patch tested?
Added new tests in DataFrameFunctionsSuite and TypeCoercionSuite.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark SPARK-25522

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22544.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22544


commit 4d32f2ce2ff4d5e13c693053041d3111526662cb
Author: Dilip Biswal 
Date:   2018-09-16T02:36:18Z

[SPARK-25522] Improve type promotion for input arguments of elementAt 
function.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22494: [SPARK-25454][SQL] add a new config for picking minimum ...

2018-09-25 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22494
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22541: [SPARK-23907][SQL] Revert regr_* functions entirely

2018-09-25 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22541
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22521: [SPARK-24519][CORE] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIG...

2018-09-25 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22521
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22198: [SPARK-25121][SQL] Supports multi-part table names for b...

2018-09-25 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22198
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22542: [SPARK-25519][SQL] ArrayRemove function may return incor...

2018-09-24 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22542
  
@cloud-fan Thanks a lot. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22542: [SPARK-25519][SQL] ArrayRemove function may return incor...

2018-09-24 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22542
  
@cloud-fan @ueshin  I would like to ask a question here. There is one more 
function I wanted to fix. Originally I wanted to do it as part of this PR, but 
then realized that it's not as straightforward and decided to do it in a separate 
PR. The function in question is element_at. It turns out that we don't 
currently handle implicit casting between two Map types if the key or value types 
differ (for example, from a map with a narrower key type to one with a wider key 
type). I was planning to enhance our implicitCast routine to handle casting 
between MapTypes just like we do for ArrayType today. Does that sound okay to you?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22542: [SPARK-25519][SQL] ArrayRemove function may return incor...

2018-09-24 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22542
  
cc @ueshin @cloud-fan 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22542: [SPARK-25519][SQL] ArrayRemove function may retur...

2018-09-24 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/22542

[SPARK-25519][SQL] ArrayRemove function may return incorrect result when 
right expression is implicitly downcasted.

## What changes were proposed in this pull request?
In ArrayRemove, we currently cast the right hand side expression to match 
the element type of the left hand side array. This may result in downcasting 
and may return a wrong or questionable result.

Example :
```SQL
spark-sql> select array_remove(array(1,2,3), 1.23D);
   [2,3]
```
```SQL
spark-sql> select array_remove(array(1,2,3), 'foo');
NULL
```
We should safely coerce both left and right hand side expressions.
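
A quick way to see the intended behaviour from spark-shell (the results noted 
in the comments are my reading of what safe coercion should give, i.e. both 
sides promoted to double):

```scala
// 1.23 matches no element once both sides are double, so nothing is removed.
spark.sql("select array_remove(array(1, 2, 3), 1.23D)").show()

// 1.0 does match the first element after promotion.
spark.sql("select array_remove(array(1, 2, 3), 1.0D)").show()
```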
## How was this patch tested?
Added tests in DataFrameFunctionsSuite

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark SPARK-25519

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22542.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22542


commit ac9cb1a2d2e5b73f6c6e89c510f84a1f10efb60e
Author: Dilip Biswal 
Date:   2018-09-16T02:36:18Z

[SPARK-25519] ArrayRemove function may return incorrect result when right 
expression is implicitly downcasted.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22407: [SPARK-25416][SQL] ArrayPosition function may return inc...

2018-09-24 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22407
  
@cloud-fan @ueshin Thank you very much !!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22198: [SPARK-25121][SQL] Supports multi-part table names for b...

2018-09-24 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22198
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22484: [SPARK-25476][TEST] Refactor AggregateBenchmark to use m...

2018-09-24 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22484
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attributes f...

2018-09-24 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22326
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22455: [SPARK-24572][SPARKR] "eager execution" for R shell, IDE

2018-09-21 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22455
  
@adrian555 The changes look fine to me. Thank you.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22494: [SPARK-25454][SQL] add a new config for picking minimum ...

2018-09-21 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22494
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22515: [SPARK-19724][SQL] allowCreatingManagedTableUsingNonempt...

2018-09-21 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22515
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22458: [SPARK-25459] Add viewOriginalText back to CatalogTable

2018-09-21 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22458
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22316: [SPARK-25048][SQL] Pivoting by multiple columns in Scala...

2018-09-21 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22316
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22494: [SPARK-22036][SQL][followup] DECIMAL_OPERATIONS_ALLOW_PR...

2018-09-21 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22494
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22407: [SPARK-25416][SQL] ArrayPosition function may return inc...

2018-09-20 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22407
  
@ueshin Wenchen thought it may be risky to backport the fix to 
tightestCommonType. Given this, can this be looked at now?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...

2018-09-20 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22455#discussion_r219388335
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -244,11 +245,15 @@ setMethod("showDF",
 #' @note show(SparkDataFrame) since 1.4.0
 setMethod("show", "SparkDataFrame",
   function(object) {
-cols <- lapply(dtypes(object), function(l) {
-  paste(l, collapse = ":")
-})
-s <- paste(cols, collapse = ", ")
-cat(paste(class(object), "[", s, "]\n", sep = ""))
+if (identical(sparkR.conf("spark.sql.repl.eagerEval.enabled", 
"false")[[1]], "true")) {
--- End diff --

@adrian555 Thanks for the explanation. 
> However, my second point is that I don't think these two configs matter 
much or that important/necessary. Since the eager execution is just to show a 
snippet data of the SparkDataFrame, our default numRows = 20 and truncate = 
TRUE are good enough iMO. If users want to see more or less number of rows, 
they should call showDF().

So I just wanted to make sure it's possible to have parity with how it 
works in Python. It seems to me that in Python, we just get the two configs 
and call the showString method.

> And if we think that showDF() can ignore the eager execution setting and 
still want the show() to observe eager execution config, we can certainly just 
grab the maxNumRows and truncate setting and pass to showDF() call.

What will happen if we grab these configs in show() when eager execution is 
enabled and then call showDF() passing these parameters?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...

2018-09-20 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22455#discussion_r219356224
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -244,11 +245,15 @@ setMethod("showDF",
 #' @note show(SparkDataFrame) since 1.4.0
 setMethod("show", "SparkDataFrame",
   function(object) {
-cols <- lapply(dtypes(object), function(l) {
-  paste(l, collapse = ":")
-})
-s <- paste(cols, collapse = ", ")
-cat(paste(class(object), "[", s, "]\n", sep = ""))
+if (identical(sparkR.conf("spark.sql.repl.eagerEval.enabled", 
"false")[[1]], "true")) {
--- End diff --

@adrian555 Please correct me on this one. Should we also respect 
`spark.sql.repl.eagerEval.maxNumRows` and `spark.sql.repl.eagerEval.truncate` ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22494: [SPARK-22036][SQL][followup] DECIMAL_OPERATIONS_ALLOW_PR...

2018-09-20 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22494
  
@cloud-fan The change looks fine to me. I looked at the failure. It 
correctly switches to the old behaviour when this config is set to off.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...

2018-09-20 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22455#discussion_r219330082
  
--- Diff: R/pkg/tests/fulltests/test_eager_execution.R ---
@@ -0,0 +1,58 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+library(testthat)
+
+context("Show SparkDataFrame when eager execution is enabled.")
+
+test_that("eager execution is not enabled", {
+  # Start Spark session without eager execution enabled
+  sparkSession <- if (windows_with_hadoop()) {
+sparkR.session(master = sparkRTestMaster)
+  } else {
+sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE)
+  }
+  
+  df <- createDataFrame(faithful)
+  expect_is(df, "SparkDataFrame")
+  expected <- "eruptions:double, waiting:double"
+  expect_output(show(df), expected)
+  
+  # Stop Spark session
+  sparkR.session.stop()
+})
+
+test_that("eager execution is enabled", {
+  # Start Spark session without eager execution enabled
--- End diff --

Here we are testing with eager execution enabled, right? Do we need to fix 
the comment here?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22450: [SPARK-25454][SQL] Avoid precision loss in division with...

2018-09-20 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22450
  
@mgaido91 
Again, this may be something to think about in the 3.0 timeframe. I just 
checked two databases, Presto and DB2. Both of them treat literals such as 
`1e26` as double.
db2

```
db2 => describe select 1e26 from cast

 Column Information

 Number of columns: 1

 SQL type              Type length  Column name  Name length
 --------------------  -----------  -----------  -----------
 480   DOUBLE                    8  1                      1

db2 => describe select 1.23 from cast

 Column Information

 Number of columns: 1

 SQL type              Type length  Column name  Name length
 --------------------  -----------  -----------  -----------
 484   DECIMAL                3, 2  1                      1
```

presto
=
```
presto:default> explain select 2.34E10;
                             Query Plan
--------------------------------------------------------------------
 - Output[_col0] => [expr:double]
     Cost: {rows: 1 (10B), cpu: 10.00, memory: 0.00, network: 0.00}
```
Should Spark do the same? What would be the repercussions if we did that?
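
For comparison, a quick spark-shell check of what Spark itself infers for the 
same literals (only printing the schema here, not asserting the output):

```scala
// Presto/DB2 type 1e26 as double and 1.23 as decimal(3,2); this shows what
// Spark infers for the same literals.
spark.sql("select 1e26 as c1, 1.23 as c2").printSchema()
```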
   


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22408: [SPARK-25417][SQL] ArrayContains function may return inc...

2018-09-20 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22408
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22481: Revert [SPARK-19355][SPARK-25352]

2018-09-20 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22481
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22408: [SPARK-25417][SQL] ArrayContains function may ret...

2018-09-19 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22408#discussion_r219039607
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -735,6 +735,60 @@ class DataFrameFunctionsSuite extends QueryTest with 
SharedSQLContext {
   df.selectExpr("array_contains(array(1, null), array(1, null)[0])"),
   Seq(Row(true), Row(true))
 )
+
+checkAnswer(
+  df.selectExpr("array_contains(array(1), 1.23D)"),
+  Seq(Row(false), Row(false))
+)
+
+checkAnswer(
+  df.selectExpr("array_contains(array(1), 1.0D)"),
+  Seq(Row(true), Row(true))
+)
+
+checkAnswer(
+  df.selectExpr("array_contains(array(1.0D), 1)"),
+  Seq(Row(true), Row(true))
+)
+
+checkAnswer(
+  df.selectExpr("array_contains(array(1.23D), 1)"),
+  Seq(Row(false), Row(false))
+)
+
+checkAnswer(
+  df.selectExpr("array_contains(array(array(1)), array(1.0D))"),
+  Seq(Row(true), Row(true))
+)
+
+checkAnswer(
+  df.selectExpr("array_contains(array(array(1)), array(1.23D))"),
+  Seq(Row(false), Row(false))
+)
+
+checkAnswer(
+  df.selectExpr("array_contains(array(array(1)), array(1.23))"),
--- End diff --

@cloud-fan Yes, it should :-) I think I had changed this test case to 
verify the fix to tightestCommonType.. and pushed it by mistake. Sorry about that.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22408: [SPARK-25417][SQL] ArrayContains function may ret...

2018-09-19 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22408#discussion_r219037732
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1879,6 +1879,66 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
 
 ## Upgrading From Spark SQL 2.3 to 2.4
 
+  - In Spark version 2.3 and earlier, the second parameter to 
array_contains function is implicitly promoted to the element type of first 
array type parameter. This type promotion can be lossy and may cause 
`array_contains` function to return wrong result. This problem has been 
addressed in 2.4 by employing a safer type promotion mechanism. This can cause 
some change in behavior and are illustrated in the table below.
+  
+
+  
+Query
+  
+  
+Result Spark 2.3 or Prior
+  
+  
+Result Spark 2.4
+  
+  
+Remarks
+  
+
+
+  
+SELECT  array_contains(array(1), 1.34D);
+  
+  
+true
+  
+  
+false
+  
+  
+In Spark 2.4, both left and right parameters are  promoted 
to array(double) and double type respectively.
+  
+
+
+  
+SELECT  array_contains(array(1), '1');
+  
+  
+true
+  
+  
+AnalysisException is thrown since integer type can not be 
promoted to string type in a loss-less manner.
--- End diff --

@cloud-fan Yeah, Presto gives an error. Please refer to my earlier comment 
showing the Presto output. Did you want anything changed in the description?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22408: [SPARK-25417][SQL] ArrayContains function may ret...

2018-09-19 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22408#discussion_r219028470
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1879,6 +1879,80 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
 
 ## Upgrading From Spark SQL 2.3 to 2.4
 
+  - In Spark version 2.3 and earlier, the second parameter to 
array_contains function is implicitly promoted to the element type of first 
array type parameter. This type promotion can be lossy and may cause 
`array_contains` function to return wrong result. This problem has been 
addressed in 2.4 by employing a safer type promotion mechanism. This can cause 
some change in behavior and are illustrated in the table below.
+  
+
+  
+Query
+  
+  
+Result Spark 2.3 or Prior
+  
+  
+Result Spark 2.4
+  
+  
+Remarks
+  
+
+
+  
+SELECT  array_contains(array(1), 1.34D);
+  
+  
+true
+  
+  
+false
+  
+  
+In Spark 2.4, both left and right parameters are  promoted 
to array(double) and double type respectively.
+  
+
+
+  
+SELECT  array_contains(array(1), 1.34);
--- End diff --

@cloud-fan OK.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22448: [SPARK-25417][SQL] Improve findTightestCommonType...

2018-09-19 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22448#discussion_r219019347
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala
 ---
@@ -366,7 +366,29 @@ class TypeCoercionSuite extends AnalysisTest {
 // No up-casting for fixed-precision decimal (this is handled by 
arithmetic rules)
 widenTest(DecimalType(2, 1), DecimalType(3, 2), None)
 widenTest(DecimalType(2, 1), DoubleType, None)
-widenTest(DecimalType(2, 1), IntegerType, None)
+widenTest(DecimalType(2, 1), IntegerType, Some(DecimalType(11, 1)))
--- End diff --

@cloud-fan OK.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22448: [SPARK-25417][SQL] Improve findTightestCommonType...

2018-09-19 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22448#discussion_r219019110
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -106,6 +107,22 @@ object TypeCoercion {
 case (t1, t2) => findTypeForComplex(t1, t2, findTightestCommonType)
   }
 
+  /**
+   * Finds a wider decimal type between the two supplied decimal types 
without
+   * any loss of precision.
+   */
+  def findWiderDecimalType(d1: DecimalType, d2: DecimalType): 
Option[DecimalType] = {
--- End diff --

@cloud-fan Sure.. will do.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22448: [SPARK-25417][SQL] Improve findTightestCommonType...

2018-09-19 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22448#discussion_r219018635
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -89,10 +90,10 @@ object TypeCoercion {
 case (NullType, t1) => Some(t1)
 case (t1, NullType) => Some(t1)
 
-case (t1: IntegralType, t2: DecimalType) if t2.isWiderThan(t1) =>
-  Some(t2)
-case (t1: DecimalType, t2: IntegralType) if t1.isWiderThan(t2) =>
-  Some(t1)
+case (t1: IntegralType, t2: DecimalType) =>
+  findWiderDecimalType(DecimalType.forType(t1), t2)
+case (t1: DecimalType, t2: IntegralType) =>
+  findWiderDecimalType(t1, DecimalType.forType(t2))
 
 // Promote numeric types to the highest of the two
 case (t1: NumericType, t2: NumericType)
--- End diff --

@cloud-fan Yeah.. Do you think it may conflict with the DecimalConversion rule 
in any way? Let me run the tests first and see how it goes..


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22448: [SPARK-25417][SQL] Improve findTightestCommonType to coe...

2018-09-19 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22448
  
@cloud-fan Does this look okay now?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22470: [SPARK-25454][SQL] should not generate negative scale as...

2018-09-19 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22470
  
@mgaido91 Makes sense. Actually @cloud-fan had asked me to write some test 
cases for decimal values with negative scale in another PR. While I was playing 
around, I found this issue. It seemed to me that the whole negative-scale handling 
is not thought through properly. That's the reason I wanted to share my findings 
here while you guys are discussing this :-).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22470: [SPARK-25454][SQL] should not generate negative scale as...

2018-09-19 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22470
  
@mgaido91 @cloud-fan On the other hand.. some use cases may work better :-), 
for example.
Before
```
scala> spark.sql("create table dec as select (1e36 * 1) as col1")
org.apache.spark.SparkException: Cannot recognize hive type string: decimal(3,-36)
  at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$getSparkSQLDataType(HiveClientImpl.scala:883)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType$1.apply(HiveClientImpl.scala:905)
```

with this PR
```
scala> spark.sql("create table dec as select (1e36 * 1) as col1")
18/09/19 14:29:29 WARN HiveMetaStore: Location: file:/user/hive/warehouse/dec specified for non-external table:dec
18/09/19 14:29:30 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("describe table dec").show
+--------+-------------+-------+
|col_name|    data_type|comment|
+--------+-------------+-------+
|    col1|decimal(38,0)|   null|
+--------+-------------+-------+
```
Perhaps we may have issues writing out data frames containing decimals with 
negative scale to file-based data sources as well. I have not verified this, though.
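
For anyone who wants to reproduce the type quickly without going through the 
metastore (a spark-shell sketch):

```scala
// The negative scale shows up directly in the schema of the select,
// before any Hive metastore interaction; it should match the decimal(3,-36)
// reported in the error above.
val dt = spark.sql("select (1e36 * 1) as col1").schema("col1").dataType
println(dt)
```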


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22448: [SPARK-25417][SQL] Improve findTightestCommonType to coe...

2018-09-19 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22448
  
@MaxGekk I was looking at CSVInferSchema. It seems like there is a copy of 
`findTightestCommonType` in this file? Do you know the reason for it? It seems 
we may need to refactor to avoid duplicating `findTightestCommonType` here. 
Can we take this up in a follow-up? If you think we should only focus on 
refactoring `findWiderDecimalType` and not worry about `findTightestCommonType`, 
please let me know and I can do it in this PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22470: [SPARK-25454][SQL] should not generate negative scale as...

2018-09-19 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22470
  
@cloud-fan Could you please check CSVInferSchema::tryParseDecimal()? There 
is a condition there that checks for negative scale.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision ...

2018-09-19 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/21632
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22462: [SPARK-25460][SS] DataSourceV2: SS sources do not respec...

2018-09-19 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22462
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22447: [SPARK-25450][SQL] PushProjectThroughUnion rule uses the...

2018-09-19 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22447
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


