On Wed, Dec 9, 2015 at 7:49 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> The “Any” is required by the code it is being passed to, which is the
> Elasticsearch Spark index writing code. The values are actually RDD[(String,
> Map[String, String])]

(Is it frequently a big big map by any chance?)

> No shuffle that I know of. RDDs are created from the output of Mahout
> SimilarityAnalysis.cooccurrence and are turned into RDD[(String, Map[String,
> String])]. Since all actual values are simple, there is no serialization
> beyond standard Java/Scala, so no custom serializers or use of Kryo.

It's still worth looking at the stages in the job.


> I understand that the driver can’t know, I was suggesting that isEmpty could
> be backed by a boolean RDD member variable calculated for every RDD at
> creation time in Spark. This is really the only way to solve generally since
> sometimes you get an RDD from a lib, so wrapping it as I suggested is not
> practical, it would have to be in Spark. BTW the same approach could be used
> for count, an accumulator per RDD, then returned as a pre-calculated RDD
> state value.

What would the boolean do? You can't cache the size in general even if
you know it, and you don't know it at the time the RDD is created
(before it's evaluated).
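To illustrate the point with a plain-Scala sketch (no Spark; `LazyData` is a made-up stand-in for an RDD, not anything in Spark's API): at construction time the computation hasn't run, so there is no size or emptiness flag to record, and any count/isEmpty necessarily forces evaluation.

```scala
// Hypothetical model of a lazily evaluated dataset: construction records
// only the recipe, so no size can be stored at creation time.
class LazyData[T](compute: () => Seq[T]) {
  // Evaluation happens on first access, not at construction.
  lazy val materialized: Seq[T] = compute()
  def count: Long = materialized.size        // forces evaluation
  def isEmpty: Boolean = materialized.isEmpty // also forces evaluation
}

val data = new LazyData(() => (1 to 5).map(_ * 2))
// Nothing has run yet; only calling count/isEmpty evaluates the recipe.
println(data.count)   // evaluates, then prints the size
println(data.isEmpty)
```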


> Are you suggesting that both take(1) and isEmpty are unusable for some
> reason in my case? I can pass around this information if I have to, I just
> thought the worst case was O(n), where n is the number of partitions, and
> therefore always trivial to calculate.

No, just trying to guess at reasons you observe what you do. There's
no difference between isEmpty and take(1) if there are > 0 partitions,
so if they behave very differently it's something to do with what
you're measuring and how.
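A rough plain-Scala model of that point (this is a sketch, not Spark's actual source): isEmpty amounts to asking take(1) for a single element and checking whether it got one, so with at least one partition the two should cost about the same.

```scala
// Sketch: a "dataset" as a sequence of partitions. takeN pulls at most n
// elements, scanning partitions only as needed; isEmptyLike is just a
// thin wrapper over a take of one element.
def takeN[T](parts: Seq[Seq[T]], n: Int): Seq[T] =
  parts.iterator.flatten.take(n).toSeq

def isEmptyLike[T](parts: Seq[Seq[T]]): Boolean =
  parts.isEmpty || takeN(parts, 1).isEmpty

val partitions = Seq(Seq.empty[Int], Seq(7, 8), Seq(9))
println(takeN(partitions, 1))      // first element found
println(isEmptyLike(partitions))   // false: an element exists
```

If the two calls behave very differently in practice, the difference is in what is being measured (e.g. one triggers the first real evaluation of the lineage), not in the operations themselves.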

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
