Thanks Daniel :-). That makes sense and is what I was hoping for. I
will proceed with this assumption and report back if I see any anomalies.
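
For reference, here is a minimal sketch of the behaviour described in the quoted reply, written with plain Scala collections rather than Spark. foldPartitionByKey is a hypothetical helper, not Spark's Aggregator. It only assumes that one partition is consumed once as an iterator, in order, and it does not model how results from several partitions are merged after a shuffle.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the per-partition combine step (not Spark code):
// walk the iterator once, in order, folding each value into its key's slot.
def foldPartitionByKey[K, V](partition: Iterator[(K, V)], zero: V)(f: (V, V) => V): Map[K, V] = {
  val acc = mutable.HashMap.empty[K, V]
  for ((k, v) <- partition) {
    acc(k) = f(acc.getOrElse(k, zero), v)
  }
  acc.toMap
}

// Same shape as the spark-shell example: one key, values 1 to 8 in sorted order.
val sortedPartition = (1 to 8).map(i => (1, i))

// "Keep the latest value" returns whatever was read last from the iterator.
val lastSeen = foldPartitionByKey(sortedPartition.iterator, 0)((a, b) => b)

// Reading the same data in the opposite order changes the result.
val reversed = foldPartitionByKey(sortedPartition.reverseIterator, 0)((a, b) => b)
```

Under that assumption the result depends only on the order in which the iterator yields values, so a sorted partition makes (a, b) => b return the largest value for each key.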

On Wed Nov 19 2014 at 19:30:02 Daniel Darabos <
daniel.dara...@lynxanalytics.com> wrote:

> Ah, so I misunderstood you too :).
>
> My reading of org/apache/spark/Aggregator.scala is that your function
> will always see the items in the order that they are in the input RDD. An
> RDD partition is always accessed as an iterator, so it will not be read out
> of order.
>
> On Wed, Nov 19, 2014 at 2:28 PM, Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
>> Thanks Daniel. I understand that the keys will not be in sorted order,
>> but what I am trying to understand is whether the functions are passed
>> values in sorted order within a given partition.
>>
>> For example:
>>
>> sc.parallelize(1 to 8).map(i => (1, i)).sortBy(t =>
>> t._2).foldByKey(0)((a, b) => b).collect
>> res0: Array[(Int, Int)] = Array((1,8))
>>
>> The fold always gives me 8, the last value, which suggests that the values
>> keep the sort order defined in an earlier stage of the DAG?
>>
>> On Wed Nov 19 2014 at 18:10:11 Daniel Darabos <
>> daniel.dara...@lynxanalytics.com> wrote:
>>
>>> Akhil, I think Aniket uses the word "persisted" in a different way than
>>> what you mean. I.e. not in the RDD.persist() way. Aniket asks if running
>>> combineByKey on a sorted RDD will result in a sorted RDD. (I.e. the sorting
>>> is preserved.)
>>>
>>> I think the answer is no. combineByKey uses AppendOnlyMap, which is a
>>> hash map. That will scramble the order of your keys. You can quickly
>>> verify it in spark-shell:
>>>
>>> scala> sc.parallelize(7 to 8).map(_ -> 1).reduceByKey(_ + _).collect
>>> res0: Array[(Int, Int)] = Array((8,1), (7,1))
>>>
>>> (The initial size of the AppendOnlyMap seems to be 8, so 8 is the first
>>> number that demonstrates this.)
>>>
>>> On Wed, Nov 19, 2014 at 9:05 AM, Akhil Das <ak...@sigmoidanalytics.com>
>>> wrote:
>>>
>>>> If something is persisted, you can easily see it under the Storage tab
>>>> in the web UI.
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Tue, Nov 18, 2014 at 7:26 PM, Aniket Bhatnagar <
>>>> aniket.bhatna...@gmail.com> wrote:
>>>>
>>>>> I am trying to figure out if sorting is persisted after applying Pair
>>>>> RDD transformations and I am not able to decisively tell after reading the
>>>>> documentation.
>>>>>
>>>>> For example:
>>>>> val numbers = .. // RDD of numbers
>>>>> val pairedNumbers = numbers.map(number => (number % 100, number))
>>>>> val sortedPairedNumbers = pairedNumbers.sortBy(pairedNumber =>
>>>>> pairedNumber._2) // Sort by values in the pair
>>>>> val aggregates = sortedPairedNumbers.combineByKey(..)
>>>>>
>>>>> In this example, will the combine functions see values in sorted
>>>>> order? What if I had done groupByKey and then combineByKey? What
>>>>> transformations can unsort already sorted data?
>>>>>
>>>>
>>>>
>>>
>
