Thank you! That helps.

A follow up question on this. How can I apply a function only on a subset
of this RDD. Lets say, I need all strings starting in the range 'A' - 'M'
be applied toUpperCase and not touch the remaining. Is that possible
without running an 'if' condition on all the partitions in the cluster?


On Tue, Jan 28, 2014 at 1:20 PM, Mark Hamstra <[email protected]>wrote:

> scala> import org.apache.spark.RangePartitioner
>
> scala> val rdd = sc.parallelize(List("apple", "Ball", "cat", "dog",
> "Elephant", "fox", "gas", "horse", "index", "jet", "kitsch", "long",
> "moon", "Neptune", "ooze", "Pen", "quiet", "rose", "sun", "talk",
> "umbrella", "voice", "Walrus", "xeon", "Yam", "zebra"))
>
> scala> rdd.keyBy(s => s(0).toUpper)
> res0: org.apache.spark.rdd.RDD[(Char, String)] = MappedRDD[1] at keyBy at
> <console>:15
>
> scala> res0.partitionBy(new RangePartitioner[Char, String](26,
> res0)).values
> res2: org.apache.spark.rdd.RDD[String] = MappedRDD[5] at values at
> <console>:18
>
> scala> res2.mapPartitionsWithIndex((idx, itr) => itr.map(s => (idx,
> s))).collect.foreach(println)
>
> (0,apple)
> (1,Ball)
> (2,cat)
> (3,dog)
> (4,Elephant)
> (5,fox)
> (6,gas)
> (7,horse)
> (8,index)
> (9,jet)
> (10,kitsch)
> (11,long)
> (12,moon)
> (13,Neptune)
> (14,ooze)
> (15,Pen)
> (16,quiet)
> (17,rose)
> (18,sun)
> (19,talk)
> (20,umbrella)
> (21,voice)
> (22,Walrus)
> (23,xeon)
> (24,Yam)
> (25,zebra)
>
>
>
> On Tue, Jan 28, 2014 at 11:48 AM, Nick Pentreath <[email protected]
> > wrote:
>
>> If you do something like:
>>
>> rdd.map{ str => (str.take(1), str) }
>>
>> you will have an RDD[(String, String)] where the key is the first
>> character of the string. Now when you perform an operation that uses
>> partitioning (e.g. reduceByKey) you will end up with the 1st reduce task
>> receiving all the strings with A, the 2nd all the strings with B etc. Note
>> that you may not be able to enforce that each *machine* gets a different
>> letter, but in most cases that doesn't particularly matter as long as you
>> get "all values for a given key go to the same reducer" behaviour.
>>
>> Perhaps if you expand on your use case we can provide more detailed
>> assistance.
>>
>>
>> On Tue, Jan 28, 2014 at 9:35 PM, David Thomas <[email protected]>wrote:
>>
>>> Lets say I have an RDD of Strings and there are 26 machines in the
>>> cluster. How can I repartition the RDD in such a way that all strings
>>> starting with A gets collected on machine1, B on machine2 and so on.
>>>
>>>
>>
>

Reply via email to