There are indications that joins in Spark are implemented on top of the
cogroup primitive. So let me focus first on cogroup: it returns an RDD
consisting of essentially ALL elements of the cogrouped RDDs. Put another
way, the result contains an entry for every key that appears in at least one
of the cogrouped RDDs, regardless of whether the other RDDs have any
elements for that key.
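To make that behavior concrete, here is a minimal sketch of cogroup semantics in plain Python, with no Spark involved. The `cogroup` helper is a hypothetical stand-in that mirrors what `RDD.cogroup` returns: per key, a tuple of value lists, with an empty list for any side that lacks the key.

```python
from collections import defaultdict

def cogroup(left, right):
    """Mimic RDD.cogroup on two lists of (key, value) pairs.

    Every key present in EITHER input appears in the output,
    paired with (possibly empty) lists of values from each side.
    """
    lmap, rmap = defaultdict(list), defaultdict(list)
    for k, v in left:
        lmap[k].append(v)
    for k, v in right:
        rmap[k].append(v)
    return {k: (lmap[k], rmap[k]) for k in set(lmap) | set(rmap)}

big = [("a", 1), ("a", 2), ("b", 3)]
small = [("b", 30)]
# key "a" appears in the result even though `small` has no match for it
print(cogroup(big, small))
```

Note how the result carries the groups for every key of `big`, matched or not — which is the "essentially ALL elements" behavior described above.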

That would mean that when smaller, moreover streaming, RDDs (e.g. the RDDs
behind a JavaPairDStream) keep getting joined with a much larger batch RDD,
RAM gets allocated over and over for instances of the resulting (cogrouped)
RDD, i.e. essentially the large batch RDD and then some. Obviously the RAM
gets returned when the DStream RDDs are discarded, and they are on a regular
basis, but that still seems like an unnecessary spike in RAM consumption.

I have two questions: 

1. Is there any way to control the cogroup process more "precisely", e.g.
tell it to include in the cogrouped RDD only those keys for which there is
at least one element from EACH of the cogrouped RDDs? Based on the current
cogroup API this does not seem possible.


2. If cogroup really is such a sledgehammer, and the joins are based on
cogroup, then even though they present a prettier picture in terms of the
end result visible to the end user, does that mean that under the hood the
same atrocious RAM consumption is still going on?
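For context on question 2: join is commonly described as cogroup followed by a per-key cross product of the two value lists, so the full per-key groups are built before unmatched keys drop out. A minimal sketch of that composition in plain Python (no Spark; `cogroup` and `join` here are hypothetical stand-ins, not the actual Spark implementation):

```python
from collections import defaultdict
from itertools import product

def cogroup(left, right):
    # group values from both sides under every key seen in either input
    lmap, rmap = defaultdict(list), defaultdict(list)
    for k, v in left:
        lmap[k].append(v)
    for k, v in right:
        rmap[k].append(v)
    return {k: (lmap[k], rmap[k]) for k in set(lmap) | set(rmap)}

def join(left, right):
    # inner join = cogroup, then a cross product of the per-key value
    # lists; keys missing on one side yield an empty product and drop
    # out of the result, but their groups were still built in memory
    return [(k, (lv, rv))
            for k, (ls, rs) in cogroup(left, right).items()
            for lv, rv in product(ls, rs)]

print(sorted(join([("a", 1), ("b", 2), ("b", 3)], [("b", 30)])))
# → [('b', (2, 30)), ('b', (3, 30))]
```

If Spark's join really follows this shape, the intermediate cogrouped groups exist in memory even for keys that never appear in the final joined result.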




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RAM-management-during-cogroup-and-join-tp22505.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
