Sure, I thought this might be a known issue.

I have a 122M dataset: the trust and rating data from Epinions. The data is 
split into two RDDs, plus an item-properties RDD. The code just tries to 
remove any empty RDD from the list.

val esRDDs: List[RDD[(String, Map[String, Any])]] =
  (correlators ::: properties).filterNot( c => c.isEmpty())

On my 16G MBP with 4 executors and 4g per executor, the isEmpty takes over a 
hundred minutes (going from memory; I can supply the timeline given a few 
hours to recalculate it).

Running a different version of the code that does a .count (for debugging) and 
a .take(1) instead of the .isEmpty, the count of one Epinions RDD takes 8 
minutes and the .take(1) takes 3 minutes.
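The timings above match the intuition that a .take(1)-style check should force at most one element, while .count forces everything. As a minimal illustration of that contract (on plain Scala collections, not Spark RDDs), a lazy view with a side-effect counter shows take(1) evaluating exactly one element:

```scala
// Illustration on plain Scala collections (not Spark) of why an
// emptiness check should be cheap: take(1) over a lazy sequence
// forces at most one element, whereas counting forces them all.
object EmptinessProbe {
  // Returns (nonEmpty, numberOfElementsActuallyEvaluated).
  def probe(n: Int): (Boolean, Int) = {
    var evaluated = 0
    val lazySeq = (1 to n).view.map { x => evaluated += 1; x }
    val nonEmpty = lazySeq.take(1).toList.nonEmpty // forces at most one element
    (nonEmpty, evaluated)
  }

  def main(args: Array[String]): Unit = {
    println(probe(1000000)) // non-empty, and only 1 element was evaluated
    println(probe(0))       // empty, and 0 elements were evaluated
  }
}
```

If an RDD emptiness check behaves this way per partition, it should cost roughly one small job, not a full pass over the data.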

Other users have seen a total runtime of 700 minutes on a 13G dataset, with 
the execution time mostly spent in isEmpty.


On Dec 9, 2015, at 8:50 AM, Sean Owen <so...@cloudera.com> wrote:

It should at best collect 1 item to the driver. That means evaluating
at least 1 element of 1 partition. I can imagine pathological cases
where that's slow, but do you have any more info? How slow is slow,
and what exactly is slow?
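For context, RDD.isEmpty in Spark is essentially "no partitions, or take(1) returns nothing", so its cost should be whatever one take(1) job costs. A minimal sketch of that logic over a hypothetical partitioned collection (a stand-in for an RDD, not Spark's actual code):

```scala
// Hypothetical stand-in for an RDD: a sequence of partitions.
// isEmpty mirrors the take(1)-based logic: either there are no
// partitions at all, or pulling one element finds nothing.
object IsEmptySketch {
  type Partitioned[A] = Seq[Seq[A]]

  // Pull at most one element, scanning partitions lazily.
  def take1[A](p: Partitioned[A]): Seq[A] =
    p.iterator.flatMap(_.iterator).take(1).toSeq

  def isEmpty[A](p: Partitioned[A]): Boolean =
    p.isEmpty || take1(p).isEmpty
}
```

On this sketch, a dataset with many empty partitions before the first non-empty one would still scan partitions until it finds an element, which is one plausible source of a slow emptiness check.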

On Wed, Dec 9, 2015 at 4:41 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> I’m getting *huge* execution times on a moderate sized dataset during the
> RDD.isEmpty. Everything in the calculation is fast except an RDD.isEmpty
> calculation. I’m using Spark 1.5.1 and from researching I would expect this
> calculation to be linearly proportional to the number of partitions as a
> worst case, which should be a trivial amount of time but it is taking many
> minutes to hours to complete this single phase.
> 
> I know that has been a small amount of discussion about using this so would
> love to hear what the current thinking on the subject is. Is there a better
> way to find if an RDD has data? Can someone explain why this is happening?
> 
> reference PR
> https://github.com/apache/spark/pull/4534

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


