It depends a lot on the file. Parquet's encoding strategy is quite
different from a gzipped CSV's: there are cases where a Parquet file
will be 10x smaller than a csv.gz and others where the Parquet file
will be larger. The Parquet file's metadata can tell you how large
each compressed column chunk is, which may show which columns are
compressing poorly. For example:
In [1]: import pyarrow.parquet as pq

In [2]: pf = pq.ParquetFile('/home/wesm/code/arrow/cpp/submodules/parquet-testing/data/alltypes_plain.parquet')

In [3]: for i in range(pf.metadata.num_row_groups):
   ...:     for j in range(pf.metadata.num_columns):
   ...:         col = pf.metadata.row_group(i).column(j)
   ...:         print("row group {} column {} compressed size {}".format(i, j, col.total_compressed_size))
   ...:
row group 0 column 0 compressed size 73
row group 0 column 1 compressed size 24
row group 0 column 2 compressed size 47
row group 0 column 3 compressed size 47
row group 0 column 4 compressed size 47
row group 0 column 5 compressed size 55
row group 0 column 6 compressed size 47
row group 0 column 7 compressed size 55
row group 0 column 8 compressed size 88
row group 0 column 9 compressed size 49
row group 0 column 10 compressed size 13
On Tue, Feb 25, 2020 at 4:33 PM Samrat Batth <[email protected]> wrote:
>
> I am a new pyarrow/parquet user.
>
> I ran the following test:
> - 18mb zipped csv file (approx 1.5 mil rows) which has data for one month
> - saved it as parquet file partitioned on date with default compression and
> see the parquet file size at ~45mb. If I don’t partition on date then the
> file size is ~30mb.
>
> My expectation was that the parquet file size would be less than zipped csv
> file - any comments?
> Thx
>