It's because you did a repartition, which rearranges all the data. Parquet uses several compression techniques, such as dictionary encoding and run-length encoding, whose effectiveness depends on how the rows are ordered, so shuffling the data into a different order changes the resulting file sizes.
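If you want the rewritten copy to compress about as well as the original, one option is to sort the DataFrame before writing so similar values sit next to each other again. Below is a rough sketch against the Spark 1.3-era API used in your code; "adId" and the "-sorted" output path are just placeholders, pick whatever column your data was originally clustered by:

    // Keep the shuffle output at 36 partitions so the sort still produces 36 files.
    sqlContext.setConf("spark.sql.shuffle.partitions", "36")

    val tfs = sqlContext.parquetFile(
      "tachyon://datanode8.bitauto.dmp:19998/apps/tachyon/adClick")

    // A global sort puts repeated/similar values next to each other, which lets
    // Parquet's dictionary and run-length encodings shrink the files again.
    // "adId" is a hypothetical column name; the output path is also a placeholder.
    tfs.sort("adId")
       .saveAsParquetFile("/user/zhangxf/adClick-parquet-tachyon-sorted")

A plain repartition() shuffles rows round-robin into effectively random order, which is why the files grew even though the data is identical.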
On Fri, Apr 17, 2015 at 4:51 AM, zhangxiongfei <zhangxiongfei0...@163.com> wrote:
> Hi,
> I did some tests on Parquet files with the Spark SQL DataFrame API.
> I generated 36 gzip-compressed Parquet files with Spark SQL and stored them
> on Tachyon. The size of each file is about 222M. Then I read them with the
> code below.
> val tfs = sqlContext.parquetFile("tachyon://datanode8.bitauto.dmp:19998/apps/tachyon/adClick");
> Next, I saved this DataFrame to HDFS with the code below. It also generates
> 36 Parquet files, but the size of each file is about 265M.
> tfs.repartition(36).saveAsParquetFile("/user/zhangxf/adClick-parquet-tachyon");
> My question is: why do the files on HDFS have a different size from those on
> Tachyon even though they come from the same original data?
>
> Thanks
> Zhang Xiongfei