That has been done, Sir, and represents further optimization. The objective here was to confirm whether cogroup always results in the previously described “greedy” explosion in the number of elements included in the result RDD and in the RAM allocated for it.
The optimizations mentioned still don’t change the total number of elements included in the result RDD or the RAM allocated for it, right?

From: Tathagata Das [mailto:t...@databricks.com]
Sent: Wednesday, April 15, 2015 9:25 PM
To: Evo Eftimov
Cc: user
Subject: Re: RAM management during cogroup and join

Significant optimizations can be made by doing the join/cogroup in a smart way. If you have to join streaming RDDs with the same batch RDD, you can first partition the batch RDD using a partitioner and cache it, and then use the same partitioner on the streaming RDDs. That makes sure the large batch RDD is not partitioned repeatedly for the cogroup; only the small streaming RDDs are partitioned.

HTH

TD

On Wed, Apr 15, 2015 at 1:11 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:

There are indications that joins in Spark are implemented with / based on the cogroup function/primitive/transform.

So let me focus first on cogroup: it returns a result which is an RDD consisting of essentially ALL elements of the cogrouped RDDs. Said another way, for every key in any of the cogrouped RDDs, the result holds at least one element from at least one of the cogrouped RDDs.

That would mean that when smaller, moreover streaming, RDDs (e.g. JavaPairDstreamRDDs) keep getting joined with a much larger batch RDD, RAM gets allocated for multiple instances of the result (cogrouped) RDD, a.k.a. essentially the large batch RDD and some more ...

Obviously the RAM will get returned when the DStream RDDs get discarded, and they do on a regular basis, but that still seems like an unnecessary spike in RAM consumption.

I have two questions:

1. Is there any way to control the cogroup process more "precisely", e.g. tell it to include in the cogrouped RDD only elements where there is at least one element from EACH of the cogrouped RDDs per given key?
Based on the current cogroup API this is not possible.

2. If cogroup is really such a sledgehammer, and the joins are based on cogroup, then even though joins can present a prettier picture in terms of the end result visible to the end user, does that mean that under the hood there is still the same atrocious RAM consumption going on?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RAM-management-during-cogroup-and-join-tp22505.html
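
The two questions above can be made concrete with a small sketch. This is plain Python, not Spark code: it only imitates the key-set semantics of cogroup on in-memory lists and says nothing about how Spark actually allocates or spills memory. The function names (`cogroup`, `keep_keys_present_in_all`, `join_via_cogroup`) and the sample data are hypothetical and chosen for illustration. It shows that a cogroup keeps every key from either input, that the "only keys present in EACH input" behaviour asked for in question 1 can be expressed as a filter applied after the cogroup, and that an inner join built on top of cogroup still passes through the full cogrouped result first.

```python
from collections import defaultdict
from itertools import product


def cogroup(left, right):
    """Group two lists of (key, value) pairs by key, keeping EVERY key."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)


def keep_keys_present_in_all(cogrouped):
    """Keep only keys that have at least one value from EACH input."""
    return {k: (l, r) for k, (l, r) in cogrouped.items() if l and r}


def join_via_cogroup(left, right):
    """Inner join expressed as cogroup followed by a per-key cross product."""
    joined = []
    for key, (left_vals, right_vals) in cogroup(left, right).items():
        # An empty side yields an empty product, so the key drops out here.
        joined.extend((key, pair) for pair in product(left_vals, right_vals))
    return joined


# Hypothetical data: a "large" batch side and a "small" streaming side.
batch = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
stream = [("b", 20), ("e", 50)]

cogrouped = cogroup(batch, stream)
filtered = keep_keys_present_in_all(cogrouped)
joined = join_via_cogroup(batch, stream)

print(len(cogrouped), "keys after cogroup")
print(len(filtered), "keys after filtering")
print(joined)
```

In this sketch the filter runs after the cogrouped structure has been built, so it trims the final result but not the intermediate one, which is the concern raised in question 2.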