I did a quick test as I was curious about it too. I created a file with numbers from 0 to 999, in order, line by line. Then I did:
scala> val numbers = sc.textFile("./numbers.txt")
scala> val zipped = numbers.zipWithUniqueId
scala> zipped.foreach(i => println(i))

The expected result, if the order were preserved, would be something like (0,0), (1,1) etc. Unfortunately, the output looks like this:

(126,1)
(223,2)
(320,3)
(1,0)
(127,11)
(2,10)
(...)

The workaround that works for my specific use case (relatively small input files) is to explicitly set the number of partitions to 1 when reading a single *text* file:

scala> val numbers_sp = sc.textFile("./numbers.txt", 1)

Then the output is exactly as I would expect.

I didn't dive into the code too much, but from a very quick look I figured out - correct me if I missed something, it's Friday afternoon! ;-) - that this workaround should work fine for all input formats inheriting from org.apache.hadoop.mapred.FileInputFormat, including TextInputFormat of course - see the implementation of the getSplits() method there ( http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-2/org/apache/hadoop/mapred/FileInputFormat.java#FileInputFormat.getSplits%28org.apache.hadoop.mapred.JobConf%2Cint%29 ). The numSplits variable passed there is exactly the value you provide as the second argument to textFile, which is minPartitions. While *min* suggests that we can only define a minimum number of partitions and have no control over the maximum, from what I can see in the code that value specifies the *exact* number of splits in the FileInputFormat.getSplits implementation. Of course it can differ for other input formats, but in this case it should work just fine.

Kind regards,
Michał Michalski,
michal.michal...@boxever.com

On 24 April 2015 at 14:05, Spico Florin <spicoflo...@gmail.com> wrote:
> Hello!
> I know that HadoopRDD partitions are built based on the number of splits
> in HDFS. I'm wondering if these partitions preserve the initial order of
> data in the file.
> As an example, if I have an HDFS file (myTextFile) that has these splits:
>
> split 0 -> line 1, ..., line k
> split 1 -> line k+1, ..., line k+n
> split 2 -> line k+n+1, ..., line k+n+m
>
> and the code
>
> val lines = sc.textFile("hdfs://mytextFile")
> lines.zipWithIndex()
>
> will the order of lines be preserved?
> (line 1, zipIndex 1), ..., (line k, zipIndex k), and so on.
>
> I found this question on Stack Overflow (
> http://stackoverflow.com/questions/26046410/how-can-i-obtain-an-element-position-in-sparks-rdd)
> whose answer intrigued me:
> "Essentially, RDD's zipWithIndex() method seems to do this, but it won't
> preserve the original ordering of the data the RDD was created from"
>
> Can you please confirm that this is the correct answer?
>
> Thanks,
> Florin
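P.S. If it helps to see why the pairs above come out interleaved: per the Spark API docs, zipWithUniqueId gives the i-th item of the k-th partition (out of n partitions) the id i*n + k - no shuffle, but the ids jump across partitions. Here is a plain-Python model of that assignment (zip_with_unique_id is my own hypothetical helper, not Spark code):

```python
def zip_with_unique_id(partitions):
    """Model of Spark's RDD.zipWithUniqueId: the i-th item of
    partition k, out of n partitions, gets id i * n + k."""
    n = len(partitions)
    return [
        [(item, i * n + k) for i, item in enumerate(part)]
        for k, part in enumerate(partitions)
    ]

# Two partitions whose contents stay in order, but whose ids interleave:
print(zip_with_unique_id([["a", "b"], ["c", "d"]]))
# [[('a', 0), ('b', 2)], [('c', 1), ('d', 3)]]
```

So within each partition the data order is untouched; it's only the ids that are assigned round-robin across partitions, which is exactly the (126,1), (223,2), (1,0) pattern I saw.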
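P.P.S. And on why textFile(path, 1) yields a single split: FileInputFormat.getSplits computes a goal size of totalSize / numSplits and then caps each split at the block size, roughly splitSize = max(minSize, min(goalSize, blockSize)). A rough Python model of that formula (split_size is a hypothetical name; this assumes a single file and ignores the small "slop" factor in the real code):

```python
def split_size(total_size, num_splits, block_size, min_size=1):
    """Rough model of org.apache.hadoop.mapred.FileInputFormat.getSplits:
    target size of each split for a single input file."""
    goal_size = total_size // max(num_splits, 1)
    return max(min_size, min(goal_size, block_size))

# With numSplits=1 the goal size is the whole file, so a file smaller
# than one block (e.g. ~3.9 KB vs. a 128 MB block) becomes one split:
print(split_size(3890, 1, 128 * 1024 * 1024))  # 3890 -> a single split
```

Note the blockSize cap in the formula: for a file larger than one block you'd still get multiple splits even with numSplits=1, which is why this workaround is really only safe for small files, as I mentioned above.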