Sure, I thought this might be a known issue. I have a 122M dataset, which is the trust and rating data from Epinions. The data is split into two RDDs, and there is an item-properties RDD. The code is just trying to remove any empty RDDs from the list.
val esRDDs: List[RDD[(String, Map[String, Any])]] =
  (correlators ::: properties).filterNot(c => c.isEmpty())

On my 16G MBP with 4g per executor and 4 executors, the isEmpty takes over a hundred minutes (going from memory; I can supply the timeline given a few hours to recalculate it). Running a different version of the code that does a .count for debugging and a .take(1) instead of .isEmpty, the .count of one Epinions RDD takes 8 minutes and the .take(1) takes 3 minutes. Other users have seen a total runtime on a 13G dataset of 700 minutes, with the execution time mostly spent in isEmpty.

On Dec 9, 2015, at 8:50 AM, Sean Owen <so...@cloudera.com> wrote:

It should at best collect 1 item to the driver. This means evaluating at least 1 element of 1 partition. I can imagine pathological cases where that's slow, but do you have any more info? How slow is slow, and what is slow?

On Wed, Dec 9, 2015 at 4:41 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> I'm getting *huge* execution times on a moderately sized dataset during the
> RDD.isEmpty. Everything in the calculation is fast except the RDD.isEmpty
> calculation. I'm using Spark 1.5.1, and from researching I would expect this
> calculation to be linearly proportional to the number of partitions as a
> worst case, which should be a trivial amount of time, but it is taking many
> minutes to hours to complete this single phase.
>
> I know there has been a small amount of discussion about using this, so I would
> love to hear what the current thinking on the subject is. Is there a better
> way to find if an RDD has data? Can someone explain why this is happening?
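For what it's worth, since isEmpty is an action, calling it on an RDD with an expensive lineage forces that lineage to be evaluated; if the RDD is then used again later, the work may be repeated. A minimal sketch of one possible mitigation, caching each RDD before the emptiness check (the names correlators and properties are from the snippet above; this is illustrative, not a confirmed fix for the slowdown):

```scala
import org.apache.spark.rdd.RDD

// Cache each RDD so its lineage is computed at most once, then filter.
// take(1).isEmpty is essentially what RDD.isEmpty does internally, so
// any difference here comes from caching, not from the check itself.
val esRDDs: List[RDD[(String, Map[String, Any])]] =
  (correlators ::: properties)
    .map(_.cache())                         // avoid recomputing per action
    .filterNot(rdd => rdd.take(1).isEmpty)  // cheap non-emptiness check
```

If the RDDs are not reused afterwards, the cache() calls add memory pressure for no benefit, so this only helps when the same RDDs feed later stages.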
> reference PR
> https://github.com/apache/spark/pull/4534

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org