When a DataFrame is saved as Parquet, Spark writes one part file per partition, and the order of rows across those files is not defined when they are read back. Ordering is a property of a query, not of the stored files, so you need to re-apply the sort after reading (or write out a single sorted file).
On Sat, May 7, 2016 at 11:48 PM, Buntu Dev <buntu...@gmail.com> wrote:
> I'm using the pyspark DataFrame API to sort by a specific column and then
> save the DataFrame as a parquet file, but the resulting parquet file
> doesn't seem to be sorted.
>
> Applying the sort and calling head() on the result shows the rows correctly
> sorted by the 'value' column in descending order, as shown below:
>
> ~~~~~
> >>> df = sqlContext.read.parquet("/some/file.parquet")
> >>> df.printSchema()
>
> root
>  |-- c1: string (nullable = true)
>  |-- c2: string (nullable = true)
>  |-- value: double (nullable = true)
>
> >>> df.sort(df.value.desc()).head(3)
>
> [Row(c1=u'546', c2=u'234', value=1020.25), Row(c1=u'3212', c2=u'6785',
> value=890.6111111111111), Row(c1=u'546', c2=u'234', value=776.45)]
> ~~~~~
>
> But saving the sorted DataFrame as parquet and fetching the first N rows
> using head() doesn't return the results ordered by the 'value' column:
>
> ~~~~~
> >>> df = sqlContext.read.parquet("/some/file.parquet")
> >>> df.sort(df.value.desc()).write.parquet("/sorted/file.parquet")
> ...
> >>> df2 = sqlContext.read.parquet("/sorted/file.parquet")
> >>> df2.head(3)
>
> [Row(c1=u'444', c2=u'233', value=0.024120907), Row(c1=u'5672', c2=u'9098',
> value=0.024120906), Row(c1=u'546', c2=u'234', value=0.024120905)]
> ~~~~~
>
> How do I go about sorting and saving a sorted DataFrame?
>
> Thanks!

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org