[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset

2018-05-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20800


---




[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset

2018-05-14 Thread goungoun
Github user goungoun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20800#discussion_r188019259
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,14 @@ class Dataset[T] private[sql](
*/
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
 
+  /**
+   * Returns true if the `Dataset` is empty.
+   *
+   * @group basic
+   * @since 2.4.0
+   */
+  def isEmpty: Boolean = rdd.isEmpty()
--- End diff --

In my understanding, @maropu's suggestion is to implement isEmpty in exactly 
the same way as the count() method, except for the return type.

```
def count(): Long = withAction("count", groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0)
}

def isEmpty: Boolean = withAction("isEmpty", groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0) == 0
}
```
And @cloud-fan's idea is to add limit(1) before doing the count. Is that right?
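
For reference, a minimal user-level sketch of the two ideas using only the 
public Dataset API (the helper names and the external-function form are 
illustrative only; the actual patch would live inside `Dataset.scala` and use 
`withAction`):

```
import org.apache.spark.sql.Dataset

// Count-based check: runs a full global aggregate, exactly like count().
def isEmptyViaCount[T](ds: Dataset[T]): Boolean =
  ds.groupBy().count().head.getLong(0) == 0

// limit(1) before the aggregate: the scan can stop after producing one record.
def isEmptyViaLimit[T](ds: Dataset[T]): Boolean =
  ds.limit(1).groupBy().count().head.getLong(0) == 0
```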


---




[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset

2018-05-14 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/20800#discussion_r187864972
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,14 @@ class Dataset[T] private[sql](
*/
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
 
+  /**
+   * Returns true if the `Dataset` is empty.
+   *
+   * @group basic
+   * @since 2.4.0
+   */
+  def isEmpty: Boolean = rdd.isEmpty()
--- End diff --

+1 for the limit


---




[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset

2018-05-13 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20800#discussion_r187831971
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,14 @@ class Dataset[T] private[sql](
*/
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
 
+  /**
+   * Returns true if the `Dataset` is empty.
+   *
+   * @group basic
+   * @since 2.4.0
+   */
+  def isEmpty: Boolean = rdd.isEmpty()
--- End diff --

`RDD#isEmpty` is pretty efficient: it just checks whether all the partitions are 
empty, without loading the data. The problem is how to build an RDD from a 
`Dataset` while minimizing the cost of building the `Iterator`.

It seems `Dataset#rdd` is not good enough, e.g., if we have a `Filter` in 
the query, we may do a full scan (no column pruning) of the underlying files.

Doing a count is not perfect either. Ideally we should stop as soon as we see 
one record.

I'd suggest doing a `limit 1` first and then counting.
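
To illustrate the concern, a small sketch (the path and column name are made 
up; assumes an existing SparkSession `spark`):

```
import spark.implicits._

val df = spark.read.parquet("/tmp/events").filter($"status" === "active")

// Dataset#rdd deserializes full rows, so the file scan may read every column.
val emptyViaRdd = df.rdd.isEmpty()

// limit(1) before counting lets the query stop as soon as one record is produced.
val emptyViaLimit = df.limit(1).groupBy().count().head.getLong(0) == 0
```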


---




[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset

2018-05-13 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/20800#discussion_r187820731
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,14 @@ class Dataset[T] private[sql](
*/
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
 
+  /**
+   * Returns true if the `Dataset` is empty.
+   *
+   * @group basic
+   * @since 2.4.0
+   */
+  def isEmpty: Boolean = rdd.isEmpty()
--- End diff --

If it's worth doing, I'll take it (if nobody else does...)


---




[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset

2018-05-13 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20800#discussion_r187804796
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,14 @@ class Dataset[T] private[sql](
*/
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
 
+  /**
+   * Returns true if the `Dataset` is empty.
+   *
+   * @group basic
+   * @since 2.4.0
+   */
+  def isEmpty: Boolean = rdd.isEmpty()
--- End diff --

@goungoun maybe; it depends on the Dataset. I'd follow @maropu's lead or 
close this.


---




[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset

2018-03-23 Thread goungoun
Github user goungoun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20800#discussion_r176728379
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,14 @@ class Dataset[T] private[sql](
*/
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
 
+  /**
+   * Returns true if the `Dataset` is empty.
+   *
+   * @group basic
+   * @since 2.4.0
+   */
+  def isEmpty: Boolean = rdd.isEmpty()
--- End diff --

@gatorsmile, simply running df.rdd.isEmpty in spark-shell was quite 
responsive even on terabyte-sized tables.


---




[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset

2018-03-16 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/20800#discussion_r175039379
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,14 @@ class Dataset[T] private[sql](
*/
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
 
+  /**
+   * Returns true if the `Dataset` is empty.
+   *
+   * @group basic
+   * @since 2.4.0
+   */
+  def isEmpty: Boolean = rdd.isEmpty()
--- End diff --

How about this?
```
def isEmpty: Boolean = withAction("isEmpty", groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0) == 0
}
```
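
For intuition, `groupBy()` with no grouping columns builds a single global 
aggregate, so the snippet above is roughly the internal form of this 
public-API check (assumes an existing SparkSession `spark`; values are 
illustrative):

```
val ds = spark.range(0)  // an empty Dataset
// groupBy().count() always yields exactly one row, even on empty input,
// so head never throws and getLong(0) gives the row count.
val isEmpty = ds.groupBy().count().head.getLong(0) == 0  // true here
```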


---




[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset

2018-03-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20800#discussion_r175018339
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,14 @@ class Dataset[T] private[sql](
*/
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
 
+  /**
+   * Returns true if the `Dataset` is empty.
+   *
+   * @group basic
+   * @since 2.4.0
+   */
+  def isEmpty: Boolean = rdd.isEmpty()
--- End diff --

Building an RDD is not cheap. The current implementation does not perform well.


---




[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset

2018-03-14 Thread goungoun
Github user goungoun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20800#discussion_r174673621
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,14 @@ class Dataset[T] private[sql](
*/
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
 
+  /**
+   * Returns true if the `Dataset` is empty
--- End diff --

@HyukjinKwon, thanks. I added the period.


---




[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset

2018-03-14 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20800#discussion_r174665402
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,14 @@ class Dataset[T] private[sql](
*/
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
 
+  /**
+   * Returns true if the `Dataset` is empty
--- End diff --

(dot at the end)


---




[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in DataSet

2018-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20800#discussion_r174016939
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,14 @@ class Dataset[T] private[sql](
*/
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
 
+  /**
+   * Returns true if the `DataSet` is empty
--- End diff --

Dataset


---




[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in DataSet

2018-03-12 Thread goungoun
Github user goungoun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20800#discussion_r174002184
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,12 @@ class Dataset[T] private[sql](
*/
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
 
+  /**
+   * Returns true if the `DataSet` is empty
+   *
--- End diff --

@mgaido91 Thanks, I modified the comment.


---




[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in DataSet

2018-03-12 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20800#discussion_r173812290
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,12 @@ class Dataset[T] private[sql](
*/
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
 
+  /**
+   * Returns true if the `DataSet` is empty
+   *
--- End diff --

this would need a 
```
@since 2.4.0
```


---
