Re: How to separate a subset of an RDD by day?

2014-07-11 Thread Sean Owen
On Fri, Jul 11, 2014 at 10:53 PM, bdamos wrote:
> I didn't make it clear in my first message that I want to obtain an RDD
> instead of an Iterable, and will be doing map-reduce like operations on
> the data by day. My problem is that groupBy returns an RDD[(K, Iterable[T])],
> but I really want

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread bdamos
limits my system to cleanly work for a single timezone. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-separate-a-subset-of-an-RDD-by-day-tp9454p9464.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread Soumya Simanta
> I think my best option is to partition my data in directories by day > before running my Spark application, and then direct > my Spark application to load RDD's from each directory when > I want to load a date range. How does this sound? > > If your upstream system can write data by day then it m
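
A minimal sketch of the directory-per-day idea quoted above, assuming a hypothetical layout of one subdirectory per ISO date under a base directory (the layout and the name `daily_paths` are illustrative, not from the thread). It only builds the list of paths for a date range; loading them is left to the application.

```python
from datetime import date, timedelta


def daily_paths(base_dir, first_day, last_day):
    """Return one path per calendar day in [first_day, last_day], inclusive.

    Assumes a hypothetical layout of <base_dir>/<YYYY-MM-DD>.
    An empty list is returned when last_day is before first_day.
    """
    n_days = (last_day - first_day).days + 1
    return [
        f"{base_dir}/{(first_day + timedelta(days=i)).isoformat()}"
        for i in range(n_days)
    ]


if __name__ == "__main__":
    for path in daily_paths("/data/events", date(2014, 7, 10), date(2014, 7, 12)):
        print(path)
```

Each path could then be loaded as its own RDD, or the paths could be joined with commas and passed to a single `textFile` call if one combined RDD for the range is enough.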

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread Sean Owen
et of data by day? > > Thanks, > Brandon.

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread bdamos
he-spark-user-list.1001560.n3.nabble.com/How-to-separate-a-subset-of-an-RDD-by-day-tp9454p9459.html

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread Soumya Simanta
If you are on the 1.0.0 release you can also try converting your RDD to a SchemaRDD and running a groupBy there. The SparkSQL optimizer "may" yield better results. It's worth a try at least.

On Fri, Jul 11, 2014 at 5:24 PM, Soumya Simanta wrote:
>> Solution 2 is to map the objects into a pai

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread Soumya Simanta
> > Solution 2 is to map the objects into a pair RDD where the > key is the number of the day in the interval, then group by > key, collect, and parallelize the resulting grouped data. > However, I worry collecting large data sets is going to be > a serious performance bottleneck. > > Why do you ha
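
The keying step in "Solution 2" can be sketched in plain Python, without Spark, to show how the day number in the interval might be computed. The names `day_index` and `group_by_day`, and the assumption that each record is an `(epoch_seconds, payload)` pair, are illustrative and not from the thread. The timezone is an explicit argument, since which calendar day a record falls on depends on it.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def day_index(epoch_seconds, interval_start, tz):
    """Days between interval_start (a date) and the record's local date in tz."""
    local_date = datetime.fromtimestamp(epoch_seconds, tz).date()
    return (local_date - interval_start).days


def group_by_day(records, interval_start, tz):
    """Group (epoch_seconds, payload) records by day index.

    This is a local stand-in for a pair-RDD map followed by a group-by-key.
    """
    groups = defaultdict(list)
    for epoch_seconds, payload in records:
        groups[day_index(epoch_seconds, interval_start, tz)].append(payload)
    return dict(groups)


if __name__ == "__main__":
    start = datetime(2014, 7, 11).date()
    ts = datetime(2014, 7, 11, 23, 30, tzinfo=timezone.utc).timestamp()
    print(group_by_day([(ts, "a"), (ts + 3600, "b")], start, timezone.utc))
```

In a Spark job the same `day_index` function could serve as the key in a `map` to a pair RDD. The grouping, and any later per-day reduction, would then stay distributed rather than being collected to the driver.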

How to separate a subset of an RDD by day?

2014-07-11 Thread bdamos
can be further improved. Does anybody have any suggestions on the best way to separate a subset of data by day? Thanks, Brandon.