It's because you did a repartition, which rearranges all the data. Parquet uses several compression techniques, such as dictionary encoding and run-length encoding, whose effectiveness depends on how the rows are ordered, so shuffling the data into a different order changes the resulting file sizes.
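If you want the rewritten copy to compress about as well as the original, one option is to sort the DataFrame before writing so similar values sit next to each other again. Below is a rough sketch against the Spark 1.3-era API used in your code; "adId" and the "-sorted" output path are just placeholders, pick whatever column your data was originally clustered by:

    // Keep the shuffle output at 36 partitions so the sort still produces 36 files.
    sqlContext.setConf("spark.sql.shuffle.partitions", "36")

    val tfs = sqlContext.parquetFile(
      "tachyon://datanode8.bitauto.dmp:19998/apps/tachyon/adClick")

    // A global sort puts repeated/similar values next to each other, which lets
    // Parquet's dictionary and run-length encodings shrink the files again.
    // "adId" is a hypothetical column name; the output path is also a placeholder.
    tfs.sort("adId")
       .saveAsParquetFile("/user/zhangxf/adClick-parquet-tachyon-sorted")

A plain repartition() shuffles rows round-robin into effectively random order, which is why the files grew even though the data is identical.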
On Fri, Apr 17, 2015 at 4:51 AM, zhangxiongfei <zhangxiongfei0...@163.com> wrote:
> Hi,
> I did some tests on Parquet files with the Spark SQL DataFrame API.
> I generated 36 gzip-compressed Parquet files with Spark SQL and stored them
> on Tachyon. The size of each file is about 222M. Then I read them with the
> code below.
> val tfs = sqlContext.parquetFile("tachyon://datanode8.bitauto.dmp:19998/apps/tachyon/adClick");
> Next, I saved this DataFrame to HDFS with the code below. It also generates
> 36 Parquet files, but the size of each file is about 265M.
> tfs.repartition(36).saveAsParquetFile("/user/zhangxf/adClick-parquet-tachyon");
> My question is: why do the files on HDFS have a different size from those on
> Tachyon even though they come from the same original data?
>
> Thanks
> Zhang Xiongfei