On Fri, Jul 11, 2014 at 10:53 PM, bdamos wrote:
> I didn't make it clear in my first message that I want to obtain an RDD
> instead of an Iterable, and will be doing map-reduce-like operations on
> the data by day. My problem is that groupBy returns an
> RDD[(K, Iterable[T])], but I really want an RDD of each day's data.
> Another concern is that keying on calendar-day boundaries limits my
> system to cleanly work for a single timezone.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-separate-a-subset-of-an-RDD-by-day-tp9454p9464.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
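The grouping described above can be sketched with plain Scala collections (no Spark dependency); Spark's RDD.groupBy has the same shape, returning RDD[(K, Iterable[T])] rather than one RDD per day. The Event type, its fields, and the epoch-millisecond timestamps are assumptions for illustration only.

```scala
// Plain-Scala sketch of grouping records by day. Event and its fields
// are hypothetical; timestamps are assumed to be epoch millis (UTC).
final case class Event(ts: Long, payload: String)

val msPerDay: Long = 24L * 60 * 60 * 1000

// Day number since the epoch, computed in UTC. Fixing the zone here is
// exactly what ties the day boundaries to a single timezone.
def dayKey(e: Event): Long = e.ts / msPerDay

// groupBy yields (key, collection) pairs -- the same shape Spark's
// RDD.groupBy produces as RDD[(K, Iterable[T])].
def byDay(events: Seq[Event]): Map[Long, Seq[Event]] =
  events.groupBy(dayKey)
```

Computing the key in a different timezone shifts every day boundary, which is the single-timezone limitation mentioned above.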
> I think my best option is to partition my data in directories by day
> before running my Spark application, and then direct my Spark
> application to load RDDs from each directory when I want to load a
> date range. How does this sound?
>
> Does anybody have any suggestions on the best way to separate a subset
> of data by day?
>
> Thanks,
> Brandon.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-separate-a-subset-of-an-RDD-by-day-tp9454.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

If your upstream system can write data by day then it may be the simplest approach.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-separate-a-subset-of-an-RDD-by-day-tp9454p9459.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
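If data is laid out one directory per day, loading a date range reduces to generating the per-day paths. A sketch, assuming a hypothetical basePath and ISO yyyy-MM-dd directory names (neither is specified in the thread):

```scala
import java.time.LocalDate

// Build one path per day in [start, endInclusive]. LocalDate.toString
// renders ISO yyyy-MM-dd, matching the assumed directory naming.
def dayPaths(basePath: String, start: LocalDate, endInclusive: LocalDate): Seq[String] =
  Iterator.iterate(start)(_.plusDays(1))
    .takeWhile(!_.isAfter(endInclusive))
    .map(d => s"$basePath/$d")
    .toSeq
```

SparkContext.textFile accepts a comma-separated list of paths, so something like sc.textFile(dayPaths(...).mkString(",")) would load the whole range as a single RDD.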
If you are on the 1.0.0 release, you can also try converting your RDD to a
SchemaRDD and running the groupBy there. The Spark SQL optimizer may yield
better results; it's worth a try at least.
On Fri, Jul 11, 2014 at 5:24 PM, Soumya Simanta wrote:
>
> Solution 2 is to map the objects into a pair RDD where the
> key is the number of the day in the interval, then group by
> key, collect, and parallelize the resulting grouped data.
> However, I worry collecting large data sets is going to be
> a serious performance bottleneck.
>
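The keying step of "Solution 2" above can be sketched as follows; the interval start and epoch-millisecond timestamps are assumptions for illustration:

```scala
// Key each timestamp by its day index within the query interval -- the
// pair-RDD key described in "Solution 2", shown here on a plain Seq.
val millisPerDay: Long = 24L * 60 * 60 * 1000

def keyByDayInInterval(timestamps: Seq[Long], intervalStart: Long): Seq[(Long, Long)] =
  timestamps.map(ts => ((ts - intervalStart) / millisPerDay, ts))
```

In Spark, the analogous rdd.map(t => (dayIndex(t), t)).groupByKey() already returns a distributed RDD[(K, Iterable[T])], so the collect-then-parallelize round-trip should not be necessary.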
Why do you have to collect and then parallelize? I think that part
can be further improved.
Does anybody have any suggestions on the best way to separate
a subset of data by day?
Thanks,
Brandon.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-separate-a-subset-of-an-RDD-by-day-tp9454.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.