ssimanta wrote:
>> Solution 2 is to map the objects into a pair RDD where the key is the
>> number of the day in the interval, then group by key, collect, and
>> parallelize the resulting grouped data. However, I worry collecting
>> large data sets is going to be a serious performance bottleneck.

> Why do you have to do a "collect"? You can do a groupBy and then write
> the grouped data to disk again.
I want to process the resulting data sets as RDDs, but groupBy returns each group's values as a Seq, not an RDD. Thanks for the idea of writing the grouped data back to disk. I think my best option is to partition my data into directories by day before running my Spark application, and then have the application load an RDD from each day's directory when I want a date range. How does this sound?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-separate-a-subset-of-an-RDD-by-day-tp9454p9459.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
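For what it's worth, a minimal sketch of the directory-per-day idea might look like the following. All names here (Event, basePath, the HDFS paths) are hypothetical placeholders, and this only collects the list of distinct days to the driver, never the records themselves:

```scala
import java.time.LocalDate
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical record type with an extractable date; adjust to your schema.
case class Event(date: LocalDate, payload: String)

object DailyPartitions {
  val basePath = "hdfs:///data/events"  // hypothetical base directory

  // One-time pre-partitioning step: write each day's records to its own
  // directory. Collecting just the distinct days is cheap; the data itself
  // never leaves the cluster.
  def writeByDay(events: RDD[Event]): Unit = {
    val days = events.map(_.date).distinct().collect()
    for (day <- days) {
      events.filter(_.date == day)
            .saveAsTextFile(s"$basePath/$day")  // e.g. .../2014-07-14
    }
  }

  // Load a date range as one RDD: textFile accepts a comma-separated
  // list of paths, so we build one path per day in the range.
  def loadRange(sc: SparkContext, from: LocalDate, to: LocalDate): RDD[String] = {
    val paths = Iterator.iterate(from)(_.plusDays(1))
      .takeWhile(!_.isAfter(to))
      .map(d => s"$basePath/$d")
      .mkString(",")
    sc.textFile(paths)
  }
}
```

The filter-per-day loop does re-scan the input once per day, so it's only reasonable as an offline pre-partitioning job (caching `events` first helps); after that, date-range loads touch only the directories they need.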