ssimanta wrote:
>> Solution 2 is to map the objects into a pair RDD where the key is the
>> number of the day in the interval, then group by key, collect, and
>> parallelize the resulting grouped data. However, I worry collecting
>> large data sets is going to be a serious performance bottleneck.

> Why do you have to do a "collect"? You can do a groupBy and then write
> the grouped data to disk again.
I want to process the resulting data sets as RDDs, but groupBy returns each group's values as a Seq, not an RDD. Thanks for the idea of writing the grouped data back to disk. I think my best option is to partition my data into directories by day before running my Spark application, and then have the application load an RDD from each day's directory when I want a date range. How does this sound?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-separate-a-subset-of-an-RDD-by-day-tp9454p9459.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
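For what it's worth, a minimal sketch of the directory-per-day idea might look like the following. All names here (Event, basePath, the HDFS paths) are hypothetical placeholders, and this only collects the list of distinct days to the driver, never the records themselves:

```scala
import java.time.LocalDate
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical record type with an extractable date; adjust to your schema.
case class Event(date: LocalDate, payload: String)

object DailyPartitions {
  val basePath = "hdfs:///data/events"  // hypothetical base directory

  // One-time pre-partitioning step: write each day's records to its own
  // directory. Collecting just the distinct days is cheap; the data itself
  // never leaves the cluster.
  def writeByDay(events: RDD[Event]): Unit = {
    val days = events.map(_.date).distinct().collect()
    for (day <- days) {
      events.filter(_.date == day)
            .saveAsTextFile(s"$basePath/$day")  // e.g. .../2014-07-14
    }
  }

  // Load a date range as one RDD: textFile accepts a comma-separated
  // list of paths, so we build one path per day in the range.
  def loadRange(sc: SparkContext, from: LocalDate, to: LocalDate): RDD[String] = {
    val paths = Iterator.iterate(from)(_.plusDays(1))
      .takeWhile(!_.isAfter(to))
      .map(d => s"$basePath/$d")
      .mkString(",")
    sc.textFile(paths)
  }
}
```

The filter-per-day loop does re-scan the input once per day, so it's only reasonable as an offline pre-partitioning job (caching `events` first helps); after that, date-range loads touch only the directories they need.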