[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-03-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16971





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-03-16 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r106569149
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala ---
@@ -55,7 +55,7 @@ class QuantileSummariesSuite extends SparkFunSuite {
   }

   private def checkQuantile(quant: Double, data: Seq[Double], summary: QuantileSummaries): Unit = {
-    val approx = summary.query(quant)
+    val approx = summary.query(quant).get
--- End diff --

OK, I will add a test on empty `data` here.





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-03-16 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r106569064
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala ---
@@ -245,7 +245,7 @@ object ApproximatePercentile {
     val result = new Array[Double](percentages.length)
     var i = 0
     while (i < percentages.length) {
-      result(i) = summaries.query(percentages(i))
+      result(i) = summaries.query(percentages(i)).get
--- End diff --

Thanks all. I will add a comment here.





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-03-15 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r106335040
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala ---
@@ -245,7 +245,7 @@ object ApproximatePercentile {
     val result = new Array[Double](percentages.length)
     var i = 0
     while (i < percentages.length) {
-      result(i) = summaries.query(percentages(i))
+      result(i) = summaries.query(percentages(i)).get
--- End diff --

Thank you!





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-03-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r106309333
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala ---
@@ -245,7 +245,7 @@ object ApproximatePercentile {
     val result = new Array[Double](percentages.length)
     var i = 0
     while (i < percentages.length) {
-      result(i) = summaries.query(percentages(i))
+      result(i) = summaries.query(percentages(i)).get
--- End diff --

Yes, this is correct. But you should leave a comment, since it is not 
obvious.
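
For reference, a possible shape for such a comment, inlined into the loop from the diff above (wording assumed; this is a sketch, not the text that was actually merged):

```scala
while (i < percentages.length) {
  // getPercentiles is only called on a non-empty digest, so summaries.query
  // always returns Some here and .get cannot throw.
  result(i) = summaries.query(percentages(i)).get
  i += 1
}
```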





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-03-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r105588823
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala ---
@@ -245,7 +245,7 @@ object ApproximatePercentile {
     val result = new Array[Double](percentages.length)
     var i = 0
     while (i < percentages.length) {
-      result(i) = summaries.query(percentages(i))
+      result(i) = summaries.query(percentages(i)).get
--- End diff --

cc @thunterdb to confirm.





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-03-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r105588798
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala ---
@@ -55,7 +55,7 @@ class QuantileSummariesSuite extends SparkFunSuite {
   }

   private def checkQuantile(quant: Double, data: Seq[Double], summary: QuantileSummaries): Unit = {
-    val approx = summary.query(quant)
+    val approx = summary.query(quant).get
--- End diff --

Add a test case with `summary.count == 0` and improve this helper function 
to cover it?





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-03-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r105588596
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala ---
@@ -245,7 +245,7 @@ object ApproximatePercentile {
     val result = new Array[Double](percentages.length)
     var i = 0
     while (i < percentages.length) {
-      result(i) = summaries.query(percentages(i))
+      result(i) = summaries.query(percentages(i)).get
--- End diff --

Yes, I think it is impossible to hit. We need some test cases to verify that; see my next comment.





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-03-02 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r103872828
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala ---
@@ -245,7 +245,7 @@ object ApproximatePercentile {
     val result = new Array[Double](percentages.length)
     var i = 0
     while (i < percentages.length) {
-      result(i) = summaries.query(percentages(i))
+      result(i) = summaries.query(percentages(i)).get
--- End diff --

It looks like it is impossible to return `None` here: since `summaries.count != 0`, `summaries.sampled` should not be empty, so `None` will never be returned. @thunterdb Is this correct?
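
A toy model of that invariant in plain Scala (not the real `QuantileSummaries` class, whose `query` interpolates ranks): once the caller has guarded on a non-empty summary, `.get` is safe.

```scala
// Stand-in for the Option-returning query introduced in this PR.
def query(sampled: IndexedSeq[Double], quantile: Double): Option[Double] =
  if (sampled.isEmpty) None
  else Some(sampled(((sampled.size - 1) * quantile).toInt))

val sampled = IndexedSeq(1.0, 2.0, 3.0)
assert(sampled.nonEmpty)                      // models the count != 0 guard
assert(query(sampled, 0.5).get == 2.0)        // .get is safe under the guard
assert(query(IndexedSeq.empty, 0.5).isEmpty)  // only an empty summary yields None
```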





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-03-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r103844548
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala ---
@@ -245,7 +245,7 @@ object ApproximatePercentile {
     val result = new Array[Double](percentages.length)
     var i = 0
     while (i < percentages.length) {
-      result(i) = summaries.query(percentages(i))
+      result(i) = summaries.query(percentages(i)).get
--- End diff --

Is it possible to return `None` here? If so, you will get a strange exception. Could you also add a test case for `getPercentiles` in `ApproximatePercentileQuerySuite`?
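
(For context, the "strange exception" would be the one plain Scala throws when `.get` is called on `None`, which says nothing about quantiles:)

```scala
val missing: Option[Double] = None
// missing.get  // throws java.util.NoSuchElementException: None.get
```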





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-03-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r103844105
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala ---
@@ -91,7 +99,9 @@ object StatFunctions extends Logging {
     }
     val summaries = df.select(columns: _*).rdd.aggregate(emptySummaries)(apply, merge)

-    summaries.map { summary => probabilities.map(summary.query) }
+    summaries.map { summary =>
+      probabilities.flatMap(summary.query)
+    }
--- End diff --

Nit: 
```
summaries.map { summary => probabilities.flatMap(summary.query) }
```





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-03-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r103844048
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala ---
@@ -78,7 +81,12 @@ object StatFunctions extends Logging {
     def apply(summaries: Array[QuantileSummaries], row: Row): Array[QuantileSummaries] = {
       var i = 0
       while (i < summaries.length) {
-        summaries(i) = summaries(i).insert(row.getDouble(i))
+        if (!row.isNullAt(i)) {
+          val v = row.getDouble(i)
+          if (!v.isNaN) {
+            summaries(i) = summaries(i).insert(v)
+          }
--- End diff --

Nit:
```
if (!v.isNaN) summaries(i) = summaries(i).insert(v)
```





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-03-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r103844007
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala ---
@@ -176,17 +176,21 @@ class QuantileSummaries(
    * @param quantile the target quantile
    * @return
    */
-  def query(quantile: Double): Double = {
+  def query(quantile: Double): Option[Double] = {
     require(quantile >= 0 && quantile <= 1.0, "quantile should be in the range [0.0, 1.0]")
     require(headSampled.isEmpty,
       "Cannot operate on an uncompressed summary, call compress() first")

+    if (sampled.isEmpty) {
+      return None
+    }
--- End diff --

Nit: 
```
if (sampled.isEmpty) return None
```





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-03-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r103843868
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -89,22 +88,17 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    *   Note that values greater than 1 are accepted but give the same result as 1.
    * @return the approximate quantiles at the given probabilities of each column
    *
-   * @note Rows containing any null or NaN values will be removed before calculation. If
-   *   the dataframe is empty or all rows contain null or NaN, null is returned.
+   * @note null and NaN values will be ignored in numerical columns before calculation. For
+   *   columns only containing null or NaN values, an empty array is returned.
    *
    * @since 2.2.0
    */
   def approxQuantile(
       cols: Array[String],
       probabilities: Array[Double],
       relativeError: Double): Array[Array[Double]] = {
-    // TODO: Update NaN/null handling to keep consistent with the single-column version
-    try {
-      StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols,
-        probabilities, relativeError).map(_.toArray).toArray
-    } catch {
-      case e: NoSuchElementException => null
-    }
+    StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*), cols,
+      probabilities, relativeError).map(_.toArray).toArray
--- End diff --

Nit: style issue
```
StatFunctions.multipleApproxQuantiles(
  df.select(cols.map(col): _*),
  cols,
  probabilities,
  relativeError).map(_.toArray).toArray
```





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r103270963
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala ---
@@ -91,7 +100,13 @@ object StatFunctions extends Logging {
     }
     val summaries = df.select(columns: _*).rdd.aggregate(emptySummaries)(apply, merge)

-    summaries.map { summary => probabilities.map(summary.query) }
+    summaries.map { summary =>
+      try {
+        probabilities.map(summary.query)
+      } catch {
+        case e: SparkException => Seq.empty[Double]
--- End diff --

Please do not use exception handling for this purpose. Instead, you can return `None`.
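
A minimal sketch of the `Option`-based alternative (plain Scala; `query` here is a stand-in that models an empty summary):

```scala
val probabilities = Seq(0.25, 0.5, 0.75)
def query(q: Double): Option[Double] = None  // an empty summary has no quantiles

// flatMap drops the Nones, so an empty summary yields an empty result
// without any try/catch:
val result: Seq[Double] = probabilities.flatMap(q => query(q))
assert(result.isEmpty)
```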





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-22 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r102601793
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -89,18 +89,17 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    *   Note that values greater than 1 are accepted but give the same result as 1.
    * @return the approximate quantiles at the given probabilities of each column
    *
-   * @note Rows containing any null or NaN values will be removed before calculation. If
-   *   the dataframe is empty or all rows contain null or NaN, null is returned.
+   * @note null and NaN values will be removed from the numerical column before calculation. If
+   *   the dataframe is empty, or all rows in some column contain null or NaN, null is returned.
    *
    * @since 2.2.0
    */
   def approxQuantile(
       cols: Array[String],
       probabilities: Array[Double],
       relativeError: Double): Array[Array[Double]] = {
-    // TODO: Update NaN/null handling to keep consistent with the single-column version
     try {
-      StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols,
+      StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*), cols,
         probabilities, relativeError).map(_.toArray).toArray
     } catch {
       case e: NoSuchElementException => null
--- End diff --

In Spark SQL, none of the other built-in functions throw an exception if the input data set is empty. An empty input data set is pretty normal. Returning either `null` or an empty `Array` looks OK to me.





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-22 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r102592174
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala ---
@@ -78,7 +80,12 @@ object StatFunctions extends Logging {
     def apply(summaries: Array[QuantileSummaries], row: Row): Array[QuantileSummaries] = {
       var i = 0
       while (i < summaries.length) {
-        summaries(i) = summaries(i).insert(row.getDouble(i))
+        if (!row.isNullAt(i)) {
--- End diff --

Thank you for fixing this issue; it was an oversight in my original implementation.

The current exception being thrown depends on an implementation detail (calling `sampled.head`). Can you modify `def query` below to explicitly throw an exception if `sampled` is empty, and document this behavior in that function? This way, we will not forget it if we decide to change the semantics of that class.

As @MLnick was mentioning above, it would be preferable to eventually return an `Option`, `null`, or `NaN`, but this can wait for more consensus.
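
A sketch of what that explicit, documented behavior could look like, as a simplified standalone function (the rank lookup is simplified; the change the PR ultimately made, returning `Option`, appears in the 2017-03-01 diff earlier in this thread):

```scala
/**
 * Runs a query for a given quantile.
 * Throws NoSuchElementException on an empty summary by design, rather than
 * relying on sampled.head failing as an implementation detail.
 */
def query(sampled: IndexedSeq[Double], quantile: Double): Double = {
  require(quantile >= 0 && quantile <= 1.0, "quantile should be in the range [0.0, 1.0]")
  if (sampled.isEmpty) {
    throw new NoSuchElementException("Cannot query an empty summary")
  }
  sampled(((sampled.size - 1) * quantile).toInt)
}
```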





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-22 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r102589719
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -89,18 +89,17 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    *   Note that values greater than 1 are accepted but give the same result as 1.
    * @return the approximate quantiles at the given probabilities of each column
    *
-   * @note Rows containing any null or NaN values will be removed before calculation. If
-   *   the dataframe is empty or all rows contain null or NaN, null is returned.
+   * @note null and NaN values will be removed from the numerical column before calculation. If
+   *   the dataframe is empty, or all rows in some column contain null or NaN, null is returned.
    *
    * @since 2.2.0
    */
   def approxQuantile(
       cols: Array[String],
       probabilities: Array[Double],
       relativeError: Double): Array[Array[Double]] = {
-    // TODO: Update NaN/null handling to keep consistent with the single-column version
     try {
-      StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols,
+      StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*), cols,
         probabilities, relativeError).map(_.toArray).toArray
     } catch {
       case e: NoSuchElementException => null
--- End diff --

+1. I tend to think that the result should be `NaN` (following the IEEE convention) or `null` (following the Scala `Option` convention). But pending a resolution, I am fine with throwing an exception, because it is the most conservative behavior (it stops the computation).





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-21 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r102167055
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
@@ -214,20 +214,29 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
     val q1 = 0.5
     val q2 = 0.8
     val epsilon = 0.1
-    val rows = spark.sparkContext.parallelize(Seq(Row(Double.NaN, 1.0), Row(1.0, 1.0),
+    val rows = spark.sparkContext.parallelize(Seq(Row(Double.NaN, 1.0), Row(1.0, -1.0),
       Row(-1.0, Double.NaN), Row(Double.NaN, Double.NaN), Row(null, null), Row(null, 1.0),
       Row(-1.0, null), Row(Double.NaN, null)))
     val schema = StructType(Seq(StructField("input1", DoubleType, nullable = true),
       StructField("input2", DoubleType, nullable = true)))
     val dfNaN = spark.createDataFrame(rows, schema)
-    val resNaN = dfNaN.stat.approxQuantile("input1", Array(q1, q2), epsilon)
-    assert(resNaN.count(_.isNaN) === 0)
-    assert(resNaN.count(_ == null) === 0)
+    val resNaN1 = dfNaN.stat.approxQuantile("input1", Array(q1, q2), epsilon)
+    assert(resNaN1.count(_.isNaN) === 0)
+    assert(resNaN1.count(_ == null) === 0)

-    val resNaN2 = dfNaN.stat.approxQuantile(Array("input1", "input2"),
+    val resNaN2 = dfNaN.stat.approxQuantile("input2", Array(q1, q2), epsilon)
+    assert(resNaN2.count(_.isNaN) === 0)
+    assert(resNaN2.count(_ == null) === 0)
+
+    val resNaNAll = dfNaN.stat.approxQuantile(Array("input1", "input2"),
       Array(q1, q2), epsilon)
-    assert(resNaN2.flatten.count(_.isNaN) === 0)
-    assert(resNaN2.flatten.count(_ == null) === 0)
+    assert(resNaNAll.flatten.count(_.isNaN) === 0)
+    assert(resNaNAll.flatten.count(_ == null) === 0)
+
+    assert(resNaN1(0) === resNaNAll(0)(0))
+    assert(resNaN1(1) === resNaNAll(0)(1))
+    assert(resNaN2(0) === resNaNAll(1)(0))
+    assert(resNaN2(1) === resNaNAll(1)(1))
--- End diff --

Yes, I created a new column containing only NaN/null values.





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-20 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r102146260
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala ---
@@ -78,7 +80,13 @@ object StatFunctions extends Logging {
     def apply(summaries: Array[QuantileSummaries], row: Row): Array[QuantileSummaries] = {
       var i = 0
       while (i < summaries.length) {
-        summaries(i) = summaries(i).insert(row.getDouble(i))
+        val item = row(i)
--- End diff --

This works, though perhaps we can do:

```scala
if (!row.isNullAt(i)) {
  val v = row.getDouble(i)
  if (!v.isNaN) {
summaries(i) = summaries(i).insert(v)
  }
}
```
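
The same skip-logic can be exercised on plain boxed values, independent of Spark (a self-contained sketch):

```scala
// Boxed doubles model a nullable DoubleType column.
val column: Seq[java.lang.Double] = Seq(1.0, null, Double.NaN, -1.0)

// Mirror the guard above: skip nulls first, then NaNs.
val inserted = column.collect {
  case v if v != null && !v.isNaN => v.doubleValue
}
assert(inserted == Seq(1.0, -1.0))
```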





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-20 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r102145908
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -89,18 +89,17 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    *   Note that values greater than 1 are accepted but give the same result as 1.
    * @return the approximate quantiles at the given probabilities of each column
    *
-   * @note Rows containing any null or NaN values will be removed before calculation. If
-   *   the dataframe is empty or all rows contain null or NaN, null is returned.
+   * @note null and NaN values will be removed from the numerical column before calculation. If
+   *   the dataframe is empty, or all rows in some column contain null or NaN, null is returned.
    *
    * @since 2.2.0
    */
   def approxQuantile(
       cols: Array[String],
       probabilities: Array[Double],
       relativeError: Double): Array[Array[Double]] = {
-    // TODO: Update NaN/null handling to keep consistent with the single-column version
     try {
-      StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols,
+      StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*), cols,
         probabilities, relativeError).map(_.toArray).toArray
     } catch {
       case e: NoSuchElementException => null
--- End diff --

This went in with the other PR, but I still question whether we should be returning `null` here. Is this standard in Spark SQL? What about returning an empty `Array`? cc @gatorsmile





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-20 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r102145412
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala ---
@@ -54,6 +54,8 @@ object StatFunctions extends Logging {
    *   Note that values greater than 1 are accepted but give the same result as 1.
    *
    * @return for each column, returns the requested approximations
+   *
+   * @note null and NaN values will be removed from the numerical column before calculation.
--- End diff --

I think "will be ignored" is more accurate than "will be removed"





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-20 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r102145538
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -89,18 +89,17 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    *   Note that values greater than 1 are accepted but give the same result as 1.
    * @return the approximate quantiles at the given probabilities of each column
    *
-   * @note Rows containing any null or NaN values will be removed before calculation. If
-   *   the dataframe is empty or all rows contain null or NaN, null is returned.
+   * @note null and NaN values will be removed from the numerical column before calculation. If
--- End diff --

Again, "ignored" is slightly better than "removed from"





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-20 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r102146144
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
@@ -214,20 +214,29 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
     val q1 = 0.5
     val q2 = 0.8
     val epsilon = 0.1
-    val rows = spark.sparkContext.parallelize(Seq(Row(Double.NaN, 1.0), Row(1.0, 1.0),
+    val rows = spark.sparkContext.parallelize(Seq(Row(Double.NaN, 1.0), Row(1.0, -1.0),
      Row(-1.0, Double.NaN), Row(Double.NaN, Double.NaN), Row(null, null), Row(null, 1.0),
      Row(-1.0, null), Row(Double.NaN, null)))
     val schema = StructType(Seq(StructField("input1", DoubleType, nullable = true),
      StructField("input2", DoubleType, nullable = true)))
     val dfNaN = spark.createDataFrame(rows, schema)
-    val resNaN = dfNaN.stat.approxQuantile("input1", Array(q1, q2), epsilon)
-    assert(resNaN.count(_.isNaN) === 0)
-    assert(resNaN.count(_ == null) === 0)
+    val resNaN1 = dfNaN.stat.approxQuantile("input1", Array(q1, q2), epsilon)
+    assert(resNaN1.count(_.isNaN) === 0)
+    assert(resNaN1.count(_ == null) === 0)

-    val resNaN2 = dfNaN.stat.approxQuantile(Array("input1", "input2"),
+    val resNaN2 = dfNaN.stat.approxQuantile("input2", Array(q1, q2), epsilon)
+    assert(resNaN2.count(_.isNaN) === 0)
+    assert(resNaN2.count(_ == null) === 0)
+
+    val resNaNAll = dfNaN.stat.approxQuantile(Array("input1", "input2"),
      Array(q1, q2), epsilon)
-    assert(resNaN2.flatten.count(_.isNaN) === 0)
-    assert(resNaN2.flatten.count(_ == null) === 0)
+    assert(resNaNAll.flatten.count(_.isNaN) === 0)
+    assert(resNaNAll.flatten.count(_ == null) === 0)
+
+    assert(resNaN1(0) === resNaNAll(0)(0))
+    assert(resNaN1(1) === resNaNAll(0)(1))
+    assert(resNaN2(0) === resNaNAll(1)(0))
+    assert(resNaN2(1) === resNaNAll(1)(1))
--- End diff --

Do we need a test for a column of all nulls (checking that it returns null)?
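
Such a test could look roughly like this (hypothetical: it assumes an `input3` column containing only null/NaN is added to `dfNaN` from the diff above, and that the empty-array semantics proposed elsewhere in this thread win out):

```scala
val resNaN3 = dfNaN.stat.approxQuantile("input3", Array(q1, q2), epsilon)
assert(resNaN3.isEmpty)
```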





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-16 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request:

https://github.com/apache/spark/pull/16971

[SPARK-19573][SQL] Make NaN/null handling consistent in approxQuantile

## What changes were proposed in this pull request?
Update `StatFunctions.multipleApproxQuantiles` to handle NaN/null values.

## How was this patch tested?
Existing tests and added tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhengruifeng/spark quantiles_nan

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16971.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16971


commit d5e79a809b1edd91a7e0c1d8046bb8bfec2ba4c9
Author: Zheng RuiFeng 
Date:   2017-02-17T03:18:43Z

create pr



