Of course, after you do it you'll probably want to call repartition(somevalue) on your RDD to "get your parallelism back".
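A minimal sketch of that combination (assuming a live SparkContext `sc` and a local `numbers.txt`; the file name and the partition count of 8 are just illustrative):

```scala
// Read with exactly one partition so line order is preserved end-to-end,
// attach stable indices, then repartition to restore parallelism.
val numbers  = sc.textFile("./numbers.txt", 1)  // single split for FileInputFormat
val indexed  = numbers.zipWithIndex()           // (line, 0L), (line, 1L), ... in file order
val parallel = indexed.repartition(8)           // 8 is an arbitrary example value
// Note: repartition shuffles, so downstream iteration order is no longer
// the file order - but the attached indices keep the original positions.
```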
Kind regards,
Michał Michalski,
michal.michal...@boxever.com

On 24 April 2015 at 15:28, Michal Michalski <michal.michal...@boxever.com> wrote:

> I did a quick test as I was curious about it too. I created a file with
> numbers from 0 to 999, in order, line by line. Then I did:
>
> scala> val numbers = sc.textFile("./numbers.txt")
> scala> val zipped = numbers.zipWithUniqueId
> scala> zipped.foreach(i => println(i))
>
> Expected result if the order was preserved would be something like:
> (0,0), (1,1) etc.
> Unfortunately, the output looks like this:
>
> (126,1)
> (223,2)
> (320,3)
> (1,0)
> (127,11)
> (2,10)
> (...)
>
> The workaround I found that works for my specific use case (relatively
> small input files) is setting the number of partitions explicitly to 1
> when reading a single *text* file:
>
> scala> val numbers_sp = sc.textFile("./numbers.txt", 1)
>
> Then the output is exactly as I would expect.
>
> I didn't dive into the code too much, but from a very quick look at it
> - correct me if I missed something, it's Friday afternoon! ;-) - this
> workaround should work fine for all input formats inheriting from
> org.apache.hadoop.mapred.FileInputFormat, including TextInputFormat of
> course - see the implementation of the getSplits() method there (
> http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-2/org/apache/hadoop/mapred/FileInputFormat.java#FileInputFormat.getSplits%28org.apache.hadoop.mapred.JobConf%2Cint%29
> ).
> The numSplits variable passed there is exactly the value you provide as
> the second argument to textFile, which is minPartitions. However, while
> *min* suggests that we can only define a minimal number of partitions
> with no control over the maximum, from what I can see in the code that
> value specifies the *exact* number of partitions in the
> FileInputFormat.getSplits implementation.
> Of course it can differ for other input formats, but in this case it
> should work just fine.
>
> Kind regards,
> Michał Michalski,
> michal.michal...@boxever.com
>
> On 24 April 2015 at 14:05, Spico Florin <spicoflo...@gmail.com> wrote:
>
>> Hello!
>> I know that HadoopRDD partitions are built based on the number of
>> splits in HDFS. I'm wondering if these partitions preserve the initial
>> order of data in the file.
>> As an example, if I have an HDFS file (myTextFile) that has these splits:
>>
>> split 0 -> line 1, ..., line k
>> split 1 -> line k+1, ..., line k+n
>> split 2 -> line k+n+1, ..., line k+n+m
>>
>> and the code
>> val lines = sc.textFile("hdfs://mytextFile")
>> lines.zipWithIndex()
>>
>> will the order of lines be preserved?
>> (line 1, zipIndex 1), ..., (line k, zipIndex k), and so on.
>>
>> I found this question on Stack Overflow (
>> http://stackoverflow.com/questions/26046410/how-can-i-obtain-an-element-position-in-sparks-rdd
>> ) whose answer intrigued me:
>> "Essentially, RDD's zipWithIndex() method seems to do this, but it won't
>> preserve the original ordering of the data the RDD was created from"
>>
>> Can you please confirm that this is the correct answer?
>>
>> Thanks.
>> Florin
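For reference, the interleaved ids in the quoted experiment match zipWithUniqueId's documented scheme: the k-th item of partition p gets id k * numPartitions + p, so ids are unique but not sequential across partitions. A small pure-Scala sketch of that scheme (no Spark needed; this is an illustration of the id assignment, not Spark's actual implementation):

```scala
// Sketch of zipWithUniqueId's documented id scheme:
// the k-th item of partition p gets id k * numPartitions + p.
object UniqueIdSketch {
  def uniqueIds[A](partitions: Seq[Seq[A]]): Seq[(A, Long)] = {
    val n = partitions.length
    partitions.zipWithIndex.flatMap { case (part, p) =>
      part.zipWithIndex.map { case (item, k) => (item, k.toLong * n + p) }
    }
  }

  def main(args: Array[String]): Unit = {
    // A 6-"line" file split into two partitions of three lines each.
    val parts = Seq(Seq("0", "1", "2"), Seq("3", "4", "5"))
    uniqueIds(parts).foreach(println)
    // Partition 0 gets ids 0, 2, 4 and partition 1 gets ids 1, 3, 5 -
    // unique, but not the sequential 0..5 that zipWithIndex would give.
  }
}
```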