Yes. Ouch, so there's a 4/3 size hit there for base64. (Is that always the case, or does it use plaintext when possible?)
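Quick sanity check of that 4/3 ratio with the standard library (every 3 raw bytes become 4 base64 bytes):

    import base64
    assert len(base64.b64encode(b"\x00" * 300)) == 400  # 300 raw bytes -> 400 encoded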
I'm trying to figure out what kind of request to file in the issue tracker to support my use case (data logging). I have enough stuff I want to put in metadata that compression matters to me. For any one file it's not so bad, but we generate so many data log files with different metadata that in aggregate it does matter. An alternative I might need to pursue is a .zip file containing one or more Parquet files along with some metadata files; then the barrier to using compression is fairly low... but I'd like to avoid the complexity overhead of that.

If it does make sense to keep compression a manual step, would it be reasonable to ask for Parquet's compression mechanism as a user-exposed feature? It is a fairly nice interface (at least from the Python bindings): as a user, all I care about on the compression side is specifying the compression method and the compression level, and the Parquet library takes care of using the correct algorithm; then on the decompression side it does everything based on what it stored in the file. In other words:

    binary COMPRESSED_BLOB = compress(binary BLOB, string compression, int compression_level)
    binary BLOB = uncompress(binary COMPRESSED_BLOB)

I can't seem to find an equivalent in Python for standalone usage. (A rough sketch of the kind of wrapper I mean is at the bottom of this message.)

On 2020/11/04 16:41:00, Wes McKinney <[email protected]> wrote:
> You mean the key-value metadata at the schema/field-level? That can
> be binary (it gets base64-encoded when written to Parquet)
>
> On Wed, Nov 4, 2020 at 10:22 AM Jason Sachs <[email protected]> wrote:
> >
> > OK. If I take the manual approach, do parquet / arrow care whether metadata
> > is binary or not?
> >
> > On 2020/11/04 14:16:37, Wes McKinney <[email protected]> wrote:
> > > There is not to my knowledge.
> > >
> > > On Tue, Nov 3, 2020 at 5:55 PM Jason Sachs <[email protected]> wrote:
> > > >
> > > > Is there any built-in method to compress parquet metadata? From what I
> > > > can tell, the main table columns are compressed, but not the metadata.
> > > >
> > > > I have metadata which includes 100-200KB of text (JSON format) that is
> > > > easily compressible... is there any alternative to doing it myself?
> > > >
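
For reference, here is a rough sketch of the self-describing compress/uncompress pair I have in mind, written against Python's standard-library codecs rather than Parquet's internals (the header layout and the _CODECS table here are just made up for illustration; as far as I can tell Parquet's own compression machinery isn't exposed this way):

    import zlib
    import lzma

    # Hypothetical self-describing blob: an ASCII codec tag, a NUL
    # separator, then the compressed payload, so the decompression
    # side needs no extra arguments.
    _CODECS = {
        "zlib": (lambda data, level: zlib.compress(data, level), zlib.decompress),
        "lzma": (lambda data, level: lzma.compress(data, preset=level), lzma.decompress),
    }

    def compress(blob: bytes, compression: str, compression_level: int = 6) -> bytes:
        compress_fn, _ = _CODECS[compression]
        return compression.encode("ascii") + b"\x00" + compress_fn(blob, compression_level)

    def uncompress(compressed_blob: bytes) -> bytes:
        # The codec name is read back out of the blob itself.
        codec, _, payload = compressed_blob.partition(b"\x00")
        _, decompress_fn = _CODECS[codec.decode("ascii")]
        return decompress_fn(payload)

Usage would be something like compress(json_bytes, "zlib", 9) stored as a binary metadata value, and a single uncompress() call on the way back out, which is exactly the property I like about Parquet's column compression.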
