Re: Compact RDD representation

2015-07-20 Thread Juan Rodríguez Hortalá
Hi, I'm not an authority in the Spark community, but what I would do is add the project to Spark Packages (http://spark-packages.org/). In fact, I think this case is similar to IndexedRDD, which is also on Spark Packages: http://spark-packages.org/package/amplab/spark-indexedrdd

Re: Compact RDD representation

2015-07-19 Thread Сергей Лихоман
Hi Juan, that's exactly what I meant. If we have a high load with many repetitions, it can significantly reduce the RDD size and improve performance. In real use cases an application frequently needs to enrich data from a cache or an external system, so we would save time on each repetition. I will also do some
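The enrichment point above can be sketched with plain Scala collections (not the Spark API) to show why a grouped representation saves work; `enrich` is a hypothetical stand-in for a cache or external-system lookup:

```scala
// Sketch: with a grouped representation, an enrichment lookup runs once
// per distinct value rather than once per record.
var lookups = 0
def enrich(v: String): String = { lookups += 1; v.toUpperCase }

val records = List("spark", "spark", "spark", "flink")

// Per-record enrichment: one lookup per element.
records.map(enrich)
assert(lookups == 4)

lookups = 0
// Grouped enrichment: one lookup per distinct value, counts preserved.
val grouped = records.groupBy(identity).map { case (v, vs) => (v, vs.size) }
val enriched = grouped.map { case (v, n) => (enrich(v), n) }
assert(lookups == 2)
assert(enriched == Map("SPARK" -> 3, "FLINK" -> 1))
```

With three repetitions of "spark", the grouped form does half the lookups; the saving grows with the repetition rate.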

Re: Compact RDD representation

2015-07-19 Thread Juan Rodríguez Hortalá
Hi, my two cents is that this could be interesting if all RDD and pair-RDD operations were lifted to work on grouped RDDs. For example, as suggested, a map on grouped RDDs would be more efficient if the original RDD had lots of duplicate entries, but for RDDs with few repetitions I guess you in

Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
In the Spark model, constructing an RDD does not mean storing all its contents in memory. Rather, an RDD is a description of a dataset that enables iterating over its contents, record by record (in parallel). The only time the full contents of an RDD are stored in memory is when a user explicitly
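The lazy-evaluation point above can be illustrated with a plain-Scala analogy (not Spark itself, and assuming Scala 2.13+ for `LazyList`): like an RDD, a `LazyList` is a description of a dataset, and nothing is computed until the contents are actually iterated or materialized.

```scala
// Merely describing a transformed dataset computes nothing.
var evaluated = 0
val described = LazyList.from(1).take(5).map { x => evaluated += 1; x * 2 }
assert(evaluated == 0)

// Forcing it (roughly analogous to collecting or caching an RDD)
// is what actually evaluates the records.
val materialized = described.toList
assert(evaluated == 5)
assert(materialized == List(2, 4, 6, 8, 10))
```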

Re: Compact RDD representation

2015-07-19 Thread Сергей Лихоман
Sorry, maybe I am saying something completely wrong... We have a stream, and we discretize it to create an RDD; the RDD in this case will be just an Array[Any]. Then we apply a transformation to create a new grouped RDD, and GC should remove the original RDD from memory (if we don't persist it). Will we have a GC step in

Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
The user gets to choose what they want to reside in memory. If they call rdd.cache() on the original RDD, it will be in memory. If they call cache() on the compact RDD, it will be in memory. If cache() is called on both, they'll both be in memory. -Sandy

Re: Compact RDD representation

2015-07-19 Thread Сергей Лихоман
Thanks for the answer! Could you please answer one more question? Will we have both the original RDD and the grouped RDD in memory at the same time?

Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
This functionality already basically exists in Spark. To create the "grouped RDD", one can run: val groupedRdd = rdd.reduceByKey(_ + _) To get it back into the original form: groupedRdd.flatMap(x => List.fill(x._1)(x._2)) -Sandy

Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
Edit: the first line should read: val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
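The round trip above can be checked with plain Scala collections (not the Spark API, so the logic runs standalone). Note that with the corrected first line the pairs are (value, count), so the flatMap in the earlier message appears to need its tuple fields swapped: `List.fill(x._2)(x._1)`, i.e. repeat the value count times.

```scala
val rdd = List("Spark", "Spark", "Spark", "Flink")

// Compact form: map to (value, 1), then sum the 1s per key
// (the collections analogue of reduceByKey(_ + _)).
val groupedRdd: Map[String, Int] =
  rdd.map((_, 1)).groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
assert(groupedRdd == Map("Spark" -> 3, "Flink" -> 1))

// Back to the original multiset: the value is _1, the count is _2.
val restored = groupedRdd.toList.flatMap(x => List.fill(x._2)(x._1))
assert(restored.sorted == rdd.sorted)
```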

Compact RDD representation

2015-07-19 Thread Сергей Лихоман
Hi, I am looking for a suitable issue for a Master's degree project (it sounds like scalability problems and improvements for Spark Streaming), and it seems like the introduction of a grouped RDD (for example: don't store "Spark", "Spark", "Spark"; instead store ("Spark", 3)) can: 1. Reduce memory needed for the RDD (
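The proposed compact representation can be sketched in plain Scala (a collections sketch, not the Spark API): the number of stored entries drops from the record count to the number of distinct values.

```scala
// Instead of storing "Spark" four times, store the pair ("Spark", 4).
val raw = List("Spark", "Spark", "Spark", "Spark", "Flink")
val compact = raw.groupBy(identity).map { case (v, vs) => (v, vs.size) }

assert(compact == Map("Spark" -> 4, "Flink" -> 1))
assert(raw.size == 5 && compact.size == 2) // fewer entries to hold in memory
```

How much memory this actually saves depends on the repetition rate and on per-entry overhead, which is what the rest of the thread discusses.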