Thanks for the information (to all who responded)! The code below *seems* to work. Are there any hidden gotchas that anyone sees? And still, in "terasort", how did they check that the data was actually sorted? :-)

-Mike

    class MyInputFormat[T] extends parquet.hadoop.ParquetInputFormat[T] {
      override def getSplits(jobContext: org.apache.hadoop.mapreduce.JobContext)
          : java.util.List[org.apache.hadoop.mapreduce.InputSplit] = {
        val splits = super.getSplits(jobContext)
        import scala.collection.JavaConversions._
        // Sort splits by file name, then by start offset within the file, so
        // that partitions come back in the order the sorted data was written.
        splits.sortBy {
          case fileSplit: org.apache.hadoop.mapreduce.lib.input.FileSplit =>
            (fileSplit.getPath.getName, fileSplit.getStart)
          case _ => ("", -1L)
        }
      }
    }
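For reference, a minimal (untested) sketch of how such an input format might be wired up on the read side. MyRecord is a placeholder for an Avro-generated record class and the path is a placeholder too; this assumes parquet-avro's read support and Spark's newAPIHadoopFile:

    import org.apache.hadoop.mapreduce.Job

    // Placeholder: substitute your Avro-generated record class for MyRecord.
    val job = Job.getInstance(sc.hadoopConfiguration)
    parquet.hadoop.ParquetInputFormat.setReadSupportClass(
      job, classOf[parquet.avro.AvroReadSupport[MyRecord]])

    val records = sc.newAPIHadoopFile(
        "hdfs:///data/sorted-output",        // placeholder path
        classOf[MyInputFormat[MyRecord]],    // the split-sorting format above
        classOf[Void],
        classOf[MyRecord],
        job.getConfiguration)
      .map(_._2)                             // drop the Void keys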
From: Sean Owen <so...@cloudera.com>
To: Michael Albert <m_albert...@yahoo.com>
Cc: User <user@spark.apache.org>
Sent: Monday, March 23, 2015 7:31 AM
Subject: Re: How to check that a dataset is sorted after it has been written out?

Data is not (necessarily) sorted when read from disk, no. A file might even have many blocks, and while a block generally yields a partition, the order in which those partitions appear in the RDD is not defined. This is why you sort if you need the data sorted.

I think you could conceivably make some custom RDD or InputFormat that reads blocks in a well-defined order and, assuming the data on disk is sorted in some knowable way, would then yield them sorted. I think that's even been brought up.

Deciding whether the data is sorted is quite different. You'd have to decide what ordering you expect (is part 0 before part 1? should each part file be sorted internally?) and then verify that externally; one possible sketch follows the quoted message below.

On Fri, Mar 20, 2015 at 10:41 PM, Michael Albert <m_albert...@yahoo.com.invalid> wrote:
> Greetings!
>
> I sorted a dataset in Spark and then wrote it out in avro/parquet.
>
> Then I wanted to check that it was sorted.
>
> It looks like each partition has been sorted, but when reading the data back
> in, the first "partition" (i.e., as seen in the partition index of
> mapPartitionsWithIndex) is not the one implied by the names of the parquet
> files (even when the number of partitions in the RDD that was read matches
> the number on disk).
>
> If I take() a few hundred values, they are sorted, but they are *not* the
> same as if I explicitly open "part-r-00000.parquet" and take values from
> that.
>
> It seems that when opening the RDD, its partitions are not in the order
> implied by the data on disk (i.e., part-r-00000.parquet,
> part-r-00001.parquet, etc.).
>
> So, how might one read the data so that the sort order is maintained?
>
> And while on the subject: after the "terasort", how did they check that the
> data was actually sorted correctly? (or did they? :-) )
>
> Is there any way to read the data back in so as to preserve the sort, or do
> I need to zipWithIndex before writing it out, and write the index at that
> time? (I haven't tried the latter yet.)
>
> Thanks!
> -Mike
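As a follow-up, a minimal sketch of the external check mentioned above. It assumes the sort keys can be read back as an RDD[Long] in partition order; it collects each partition's first and last key with mapPartitionsWithIndex, then confirms that every partition is internally ordered and that adjacent partitions do not overlap:

    import org.apache.spark.rdd.RDD

    def isGloballySorted(keys: RDD[Long]): Boolean = {
      // One record per non-empty partition:
      // (partition index, first key, last key, internally sorted?)
      val stats = keys.mapPartitionsWithIndex { (idx, it) =>
        if (it.isEmpty) Iterator.empty
        else {
          val first = it.next()
          var last = first
          var inOrder = true
          while (it.hasNext) {
            val cur = it.next()
            if (cur < last) inOrder = false
            last = cur
          }
          Iterator((idx, first, last, inOrder))
        }
      }.collect().sortBy(_._1)

      // Every partition sorted internally, and no overlap across boundaries.
      stats.forall(_._4) && stats.sliding(2).forall {
        case Array((_, _, lastA, _), (_, firstB, _, _)) => lastA <= firstB
        case _ => true // zero or one partition in the window
      }
    }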