Yes. Ouch, so there's a 4/3 size hit there for base64. (Is that always the case, or does it use plaintext when possible?)
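Quick sanity check of that 4/3 ratio with the standard library (every 3 raw bytes become 4 base64 bytes):

    import base64
    assert len(base64.b64encode(b"\x00" * 300)) == 400  # 300 raw bytes -> 400 encoded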
I'm trying to figure out what kind of request to file in the issue tracker to support my use case (data logging). I have enough stuff I want to put in metadata that compression matters to me. For any one file it's not so bad, but we generate so many data log files with different metadata that in aggregate it does matter. An alternative I might need to pursue is a .zip file containing one or more Parquet files along with some metadata files; then the barrier to using compression is fairly low... but I'd like to avoid the complexity overhead of that.

If it does make sense to keep compression a manual step, would it be reasonable to ask for Parquet's compression mechanism as a user-exposed feature? It is a fairly nice interface (at least from the Python bindings): as a user, all I care about on the compression side is specifying the compression method and the compression level, and the Parquet library takes care of using the correct algorithm; then on the decompression side it does everything based on what it stored in the file. In other words:

    binary COMPRESSED_BLOB = compress(binary BLOB, string compression, int compression_level)
    binary BLOB = uncompress(binary COMPRESSED_BLOB)

I can't seem to find an equivalent in Python for standalone usage. (A rough sketch of the kind of wrapper I mean is at the bottom of this message.)

On 2020/11/04 16:41:00, Wes McKinney <[email protected]> wrote:
> You mean the key-value metadata at the schema/field-level? That can
> be binary (it gets base64-encoded when written to Parquet)
>
> On Wed, Nov 4, 2020 at 10:22 AM Jason Sachs <[email protected]> wrote:
> >
> > OK. If I take the manual approach, do parquet / arrow care whether metadata
> > is binary or not?
> >
> > On 2020/11/04 14:16:37, Wes McKinney <[email protected]> wrote:
> > > There is not to my knowledge.
> > >
> > > On Tue, Nov 3, 2020 at 5:55 PM Jason Sachs <[email protected]> wrote:
> > > >
> > > > Is there any built-in method to compress parquet metadata? From what I
> > > > can tell, the main table columns are compressed, but not the metadata.
> > > >
> > > > I have metadata which includes 100-200KB of text (JSON format) that is
> > > > easily compressible... is there any alternative to doing it myself?
> > > >
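
For reference, here is a rough sketch of the self-describing compress/uncompress pair I have in mind, written against Python's standard-library codecs rather than Parquet's internals (the header layout and the _CODECS table here are just made up for illustration; as far as I can tell Parquet's own compression machinery isn't exposed this way):

    import zlib
    import lzma

    # Hypothetical self-describing blob: an ASCII codec tag, a NUL
    # separator, then the compressed payload, so the decompression
    # side needs no extra arguments.
    _CODECS = {
        "zlib": (lambda data, level: zlib.compress(data, level), zlib.decompress),
        "lzma": (lambda data, level: lzma.compress(data, preset=level), lzma.decompress),
    }

    def compress(blob: bytes, compression: str, compression_level: int = 6) -> bytes:
        compress_fn, _ = _CODECS[compression]
        return compression.encode("ascii") + b"\x00" + compress_fn(blob, compression_level)

    def uncompress(compressed_blob: bytes) -> bytes:
        # The codec name is read back out of the blob itself.
        codec, _, payload = compressed_blob.partition(b"\x00")
        _, decompress_fn = _CODECS[codec.decode("ascii")]
        return decompress_fn(payload)

Usage would be something like compress(json_bytes, "zlib", 9) stored as a binary metadata value, and a single uncompress() call on the way back out, which is exactly the property I like about Parquet's column compression.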
