It should at most collect 1 item to the driver, which means evaluating
at least 1 element of 1 partition. I can imagine pathological cases
where that's slow, but do you have any more info? How slow is slow,
and which part of the job is actually the slow one?
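
For illustration, here's a minimal sketch of what the check costs (the
input and expensiveStep below are placeholders, not from your job):
isEmpty is roughly take(1), so it has to run the upstream lineage far
enough to produce one element.

import org.apache.spark.{SparkConf, SparkContext}

object IsEmptyCost {
  // Hypothetical stand-in for whatever the real lineage computes.
  def expensiveStep(i: Int): Int = i * 2

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("isEmpty-cost").setMaster("local[*]"))

    val rdd = sc.parallelize(1 to 1000000).map(expensiveStep)

    // isEmpty is roughly equivalent to take(1).isEmpty: it evaluates at
    // least one element of one partition, which forces everything
    // upstream of this RDD to execute at least that far.
    val viaTake = rdd.take(1).isEmpty
    val viaIsEmpty = rdd.isEmpty()

    println(s"take(1).isEmpty=$viaTake, isEmpty()=$viaIsEmpty")
    sc.stop()
  }
}

If the lineage above the check is expensive, that single take(1) pays
for all of it, and a later action that recomputes the lineage pays
again. Caching the RDD before the check, or reusing the result of a
take(1) you need anyway, can avoid computing the same lineage twice.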

On Wed, Dec 9, 2015 at 4:41 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> I’m seeing *huge* execution times on a moderately sized dataset during
> RDD.isEmpty. Everything else in the calculation is fast; only the
> RDD.isEmpty step is slow. I’m using Spark 1.5.1, and from what I’ve
> read I would expect this call to be at worst linear in the number of
> partitions, which should take a trivial amount of time, but this
> single phase is taking many minutes to hours to complete.
>
> I know there has been a small amount of discussion about using this,
> so I would love to hear the current thinking on the subject. Is there
> a better way to find out whether an RDD has data? Can someone explain
> why this is happening?
>
> reference PR
> https://github.com/apache/spark/pull/4534

