It should collect at most 1 item to the driver, which means evaluating at least 1 element of 1 partition. I can imagine pathological cases where that's slow, but do you have any more info? How slow is slow, and what exactly is slow?
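To illustrate the point above (this is not Spark code, just a minimal Python analogy of lazy evaluation): `RDD.isEmpty` is implemented roughly as "take one element and check whether you got any," so even though only one element is returned, all the upstream transformations needed to *produce* that first element must actually run. If those transformations are expensive, "checking emptiness" inherits that cost.

```python
import itertools
import time

def expensive_transform(xs):
    """Stand-in for a chain of lazy RDD transformations:
    each element is costly to materialize."""
    for x in xs:
        time.sleep(0.001)  # simulate per-element work
        yield x * 2

# Lazy "dataset": nothing has been computed yet.
data = expensive_transform(range(1000))

# The isEmpty analogue: pull at most one element.
# Only the work for that single element runs here --
# but ALL upstream per-element work for it runs too.
first = list(itertools.islice(data, 1))
is_empty = len(first) == 0

print(is_empty)   # prints False: the dataset had at least one element
```

The takeaway for the thread: if `isEmpty` is slow, the cost is almost certainly in computing the first element of the first non-empty partition (i.e., the lineage feeding the RDD), not in `isEmpty` itself.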
On Wed, Dec 9, 2015 at 4:41 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> I’m getting *huge* execution times on a moderate sized dataset during the
> RDD.isEmpty. Everything in the calculation is fast except an RDD.isEmpty
> calculation. I’m using Spark 1.5.1, and from researching I would expect this
> calculation to be linearly proportional to the number of partitions as a
> worst case, which should take a trivial amount of time, but it is taking
> many minutes to hours to complete this single phase.
>
> I know there has been a small amount of discussion about using this, so I
> would love to hear what the current thinking on the subject is. Is there a
> better way to find out whether an RDD has data? Can someone explain why
> this is happening?
>
> Reference PR:
> https://github.com/apache/spark/pull/4534