You're right that the distinctions are murky, including in my own comments
here. Anyway, zipping Parquet files would be like zipping JPEGs or
PDFs. Zip acts like tar in these cases but I guess a tarball of JPEGs
is not unheard of *shrug*.
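To illustrate the JPEG analogy above, here is a minimal Python sketch using only the standard library. It does not touch Drill or Parquet; the vocabulary and sizes are made-up values, and zlib's DEFLATE stands in for "a codec already applied inside the file". It compresses some text once, then compresses the result again.

```python
import random
import zlib

# Hypothetical sample data: random words from a small, made-up vocabulary.
random.seed(0)
vocabulary = ["drill", "parquet", "codec", "zip", "gzip", "snappy",
              "zstd", "deflate", "archive", "column", "page", "query"]
raw = " ".join(random.choice(vocabulary) for _ in range(200_000)).encode()

# First pass: stands in for the codec already used inside the file.
once = zlib.compress(raw)

# Second pass: stands in for zipping the already-compressed file.
twice = zlib.compress(once)

print("raw bytes:           ", len(raw))
print("compressed once:     ", len(once))
print("compressed twice:    ", len(twice))
```

The second pass should leave the size roughly unchanged, which is the sense in which Zip behaves like tar for files that are already compressed.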
Re. your last question, there is work that must be done in Drill to
support new codecs, even though they are already standardised, and
possibly even implemented in an upstream version of parquet-mr etc.
On 2021/06/18 16:15, Leyne, Sean wrote:
James,
-----Original Message-----
From: James Turton <[email protected]>
Zip is a file format, not a codec. Various codecs are employed in Zip archives,
most commonly DEFLATE. The set of codecs supported in the Parquet file
format is described in
https://github.com/apache/parquet-format/blob/master/Compression.md.
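The container-versus-codec distinction can be seen with Python's standard zipfile module. This is only a sketch: the entry names and payload are hypothetical, and it says nothing about what Drill itself supports. One archive holds two entries, each written with a different compression method.

```python
import io
import zipfile

# Hypothetical payload and entry names, chosen only for illustration.
payload = b"drill parquet zip codec " * 1000

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    # Each entry picks its own codec; the archive format stays the same.
    archive.writestr("stored.txt", payload, compress_type=zipfile.ZIP_STORED)
    archive.writestr("deflated.txt", payload, compress_type=zipfile.ZIP_DEFLATED)

with zipfile.ZipFile(buffer) as archive:
    # Map each entry name to the codec id and compressed size recorded for it.
    codecs = {info.filename: info.compress_type for info in archive.infolist()}
    sizes = {info.filename: info.compress_size for info in archive.infolist()}

print(codecs)
print(sizes)
```

The same module also defines ZIP_BZIP2 and ZIP_LZMA, so "zip compression" on its own does not say which codec was applied to a given entry.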
Thanks for the link, the problem is that often the codec and the file format
are synonymous, so people like myself don't make the distinction.
Not helping is the Drill use of the ambiguous "Compression Type" terminology rather than
"codec" in the Drill options.
Since, then, Zip is not sensible or possible inside a Parquet file, the only
way to effect what you describe would be to embed a Parquet file inside a Zip
archive. This would be perverse and misguided but possibly still queryable
since Drill might transparently do the right things to decode it anyway. Using
a supported codec within the Parquet file format and forgetting about Zip is
certainly a better approach.
It might seem perverse to you; however, given that "zip compression" support
for text files was added in v1.17.0 (DRILL-5674)*, I think it is a reasonable question to
ask about support for Parquet files.
*There were no details on which of the codecs are supported.
If you want compression ratios comparable to
those found in Zip files then you would choose GZip and pay with CPU
cycles. When Drill gains support for Zstandard there will be little reason to
choose anything else.
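The gzip format and the usual Zip method both use DEFLATE, which is why their ratios are comparable. A rough Python sketch using only the standard library (not Drill or parquet-mr; the sample rows and the compression level of 6 are arbitrary assumptions) compresses the same bytes both ways and compares sizes.

```python
import gzip
import io
import zipfile

# Hypothetical sample data: a few thousand CSV-like rows.
data = "\n".join(f"row {i},value {i * i % 97}" for i in range(20000)).encode()

# gzip container around a DEFLATE stream.
gz_bytes = gzip.compress(data, compresslevel=6)

# Zip archive with a single DEFLATE-compressed entry at the same level.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED,
                     compresslevel=6) as archive:
    archive.writestr("data.csv", data)
with zipfile.ZipFile(buffer) as archive:
    zip_deflate_size = archive.getinfo("data.csv").compress_size

print("original:      ", len(data))
print("gzip:          ", len(gz_bytes))
print("zip (deflate): ", zip_deflate_size)
```

The two compressed sizes should differ only by a few bytes of container overhead, since the underlying algorithm is the same.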
This is another area of confusion: if Parquet provides support for ZSTD (as
well as other codecs), why doesn't Drill?
Isn't there a standard "Parquet Library" that is available which enables Parquet file
support with all "features", which any project implementing Parquet file support would
use?
On 2021/06/17 18:59, Leyne, Sean wrote:
Luoc,
Could you please tell me first which case you are talking about?
Only write(CTAS syntax) or read(SELECT)?
Really both, since you need a mechanism to create the zip'd parquet file to
begin with. Having to create a special/side process to zip the file outside of
drill would be ... awkward.
Sean
On 2021/06/16 02:26, Leyne, Sean
<[email protected]> wrote:
All,
The documentation describes gzip/gz compression as supported for
text files, and snappy and gzip as supported for parquet files.
I have also read that zip compression was added (though not
documented) for text files.
But is zip also supported for parquet files?
What about support for other compression algorithms/methods? LZ4?
Bzip2? zstd??
Sean