There aren't any guarantees on the order in which partitions are combined by the 'saveAsTextFile' method. Generally the output will be written in per-partition blocks, but there's no notion of an order among the partitions. If order matters to you, you can do a sortByKey at load time.
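To illustrate the point, here is a minimal sketch (assuming a SparkContext `sc`, e.g. in the Spark shell, and illustrative output paths): saveAsTextFile writes one part-NNNNN file per partition, so any global ordering has to be imposed, for example with sortByKey, before saving.

// Illustrative only: assumes an existing SparkContext `sc`.
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)), 3)

// Each partition becomes its own part-NNNNN file; there is no guarantee
// about which records land in which partition for a general transformation.
pairs.saveAsTextFile("/tmp/unsorted-output")

// sortByKey range-partitions the data, so reading the part files in
// filename order yields globally sorted records.
pairs.sortByKey().saveAsTextFile("/tmp/sorted-output")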
Can you provide a reproducible example of the behavior you're seeing (say, from the Spark shell)? It's difficult to provide guidance based on the code you sent.

On Thu, Jan 30, 2014 at 10:24 AM, Archit Thakur <[email protected]> wrote:

> Yes, I do that. But if I go to my worker node and check the list it
> has printed:
>
> MyRdd.flatmap(func(_))
> MyRdd.saveAsTextFile(..)
>
> func(Tuple2[Key, Value]): List[Tuple2[MyCustomKey, MyCustomValue]] = {
>   //
>   *println(list)*
>   list
> }
>
> The records differ (only the counts match).
>
> On Thu, Jan 30, 2014 at 11:48 PM, Evan R. Sparks <[email protected]> wrote:
>
>> Actually - looking at your use case, you may simply be saving the
>> original RDD. Doing something like:
>>
>> val newRdd = MyRdd.flatMap(func)
>> newRdd.saveAsTextFile(...)
>>
>> may solve your issue.
>>
>> On Thu, Jan 30, 2014 at 10:17 AM, Evan R. Sparks <[email protected]> wrote:
>>
>>> Could it be that you have the same records that you get back from
>>> flatMap, just in a different order?
>>>
>>> On Thu, Jan 30, 2014 at 1:05 AM, Archit Thakur <[email protected]> wrote:
>>>
>>>> Needless to say, it works fine with int/string (primitive) types.
>>>>
>>>> On Wed, Jan 29, 2014 at 2:04 PM, Archit Thakur <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am facing a general problem with the flatMap operation on an RDD.
>>>>>
>>>>> I am doing:
>>>>>
>>>>> MyRdd.flatmap(func(_))
>>>>> MyRdd.saveAsTextFile(..)
>>>>>
>>>>> func(Tuple2[Key, Value]): List[Tuple2[MyCustomKey, MyCustomValue]] = {
>>>>>   //
>>>>>   println(list)
>>>>>   list
>>>>> }
>>>>>
>>>>> Now, if I compare the list printed in the worker logs with the text file
>>>>> it has created, they differ.
>>>>>
>>>>> Only the number of records is the same; the actual records in the file
>>>>> differ from the ones in the logs.
>>>>>
>>>>> Does Spark modify keys/values in between? What other operations does
>>>>> it perform on the Key or Value?
>>>>>
>>>>> Thanks and Regards,
>>>>> Archit Thakur.
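For reference, a self-contained sketch of the fix discussed in the quoted thread (hypothetical key/value types, a hypothetical func, and an illustrative output path, assuming a SparkContext `sc`): flatMap does not modify the original RDD in place, it returns a new RDD, so the transformed result has to be assigned and saved explicitly.

// Illustrative only: hypothetical types, function, and output path.
case class MyCustomKey(k: String)
case class MyCustomValue(v: Int)

def func(kv: (String, Int)): List[(MyCustomKey, MyCustomValue)] =
  List((MyCustomKey(kv._1), MyCustomValue(kv._2)))

val myRdd = sc.parallelize(Seq(("a", 1), ("b", 2)))

// RDDs are immutable: flatMap returns a *new* RDD. Calling
// myRdd.saveAsTextFile here would write the original, untransformed records.
val newRdd = myRdd.flatMap(func)
newRdd.saveAsTextFile("/tmp/flatmapped-output")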
