I did a quick search in Parquet-MR and found at least one place where
different files are explicitly forbidden [1]. I don't know whether this blocks
all reading or only a specific case (I'm also not sure whether writing
multiple column files is allowed).
Like I said, it makes sense, but is potentially a big
I believe the formal Parquet standard already allows a file per column. At
least I remember it being discussed when the spec was first implemented. If
you look at the thrift spec it actually allows for this:
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L771
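For reference, the relevant field is ColumnChunk.file_path; paraphrasing the spec, it looks roughly like this:

struct ColumnChunk {
  /** File where column data is stored. If not set, assumed to be the same
   * file as the metadata. This path is relative to the current file. **/
  1: optional string file_path
  ...
}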
> I'd suggest a new write pattern. Write the columns a page at a time to
> separate files, then use a second process to concatenate the columns and
> append the footer. Odds are you would do better than OS swapping and take
> memory requirements down to page size times field count.
This is exactly what a
This is an interesting idea. For S3 multipart uploads one might run into
limitations pretty quickly (only 10k parts appear to be supported, and all but
the last part are expected to be at least 5 MB, if I read their docs correctly
[1]). Since data pages default to roughly 1 MiB, a part per page would fall
below that minimum, so some buffering per column would still be needed.
[1] https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
I'd suggest a new write pattern. Write the columns a page at a time to
separate files, then use a second process to concatenate the columns and
append the footer. Odds are you would do better than OS swapping and take
memory requirements down to page size times field count.
In S3 I believe you could
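To make the pattern concrete, here is a rough sketch of the concatenation step in C++ (the file names, the ColumnExtent struct and write_footer() are made up for illustration; a real version would have to serialize the thrift FileMetaData with the column chunk offsets rebased to their positions in the concatenated file):

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct ColumnExtent {
  std::string name;
  int64_t offset;  // byte offset of this column's pages in the final file
  int64_t length;  // total bytes of this column's pages
};

// Hypothetical placeholder: a real implementation would write the serialized
// FileMetaData built from 'extents', its 4-byte little-endian length, and the
// trailing "PAR1" magic here.
void write_footer(std::ofstream& out, const std::vector<ColumnExtent>& extents) {
  (void)out;
  (void)extents;
}

int main() {
  const std::vector<std::string> columns = {"a", "b", "c"};
  std::ofstream out("final.parquet", std::ios::binary);

  out.write("PAR1", 4);  // leading magic
  int64_t pos = 4;

  std::vector<ColumnExtent> extents;
  for (const auto& name : columns) {
    // Each per-column file holds that column's already-serialized pages.
    std::ifstream in("columns/" + name + ".bin", std::ios::binary);
    ColumnExtent ext{name, pos, 0};
    char buf[1 << 16];
    while (in.read(buf, sizeof(buf)) || in.gcount() > 0) {
      out.write(buf, in.gcount());
      ext.length += in.gcount();
    }
    pos += ext.length;
    extents.push_back(ext);
  }

  write_footer(out, extents);
  return 0;
}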
Hi,
I wasn't aware that jemalloc uses mmap automatically for larger
allocations, and I haven't tested this yet.
The approach could differ in that we would know which parts of the
buffers are going to be used next (the buffers are append-only) and which
parts won't be needed until
+parquet-dev, as this seems more concerned with the non-Arrow pieces of
Parquet.
Hi Roman,
Answers inline.
> One way to solve that problem would be to use memory mapped files instead
> of plain memory buffers. That way, the amount of required memory can be
> limited by the number of columns times
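For what it's worth, a minimal sketch of that idea using arrow::io::MemoryMappedFile as the backing store (purely illustrative; this is not something parquet-cpp does today):

#include <memory>
#include <string>

#include <arrow/io/file.h>
#include <parquet/exception.h>

int main() {
  // Pre-size the backing file; it could be grown later with Resize(). The OS
  // is then free to page out the parts of the buffer that are not being
  // appended to.
  std::shared_ptr<arrow::io::MemoryMappedFile> buf;
  PARQUET_ASSIGN_OR_THROW(
      buf, arrow::io::MemoryMappedFile::Create("column_0.buf", 64 * 1024 * 1024));

  const std::string page = "...serialized page bytes...";
  PARQUET_THROW_NOT_OK(buf->Write(page.data(), static_cast<int64_t>(page.size())));
  PARQUET_THROW_NOT_OK(buf->Close());
  return 0;
}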
Hi everyone,
for some time now, parquet::ParquetFileWriter has had the option to create
buffered row groups with AppendBufferedRowGroup(), which basically lets you
write to the columns in any order you like (in contrast to the previously
only possible way of writing one column after the other).
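For readers who haven't tried it yet, a minimal sketch of the buffered write path against a recent parquet-cpp (the schema and values here are made up):

#include <cstdint>
#include <memory>

#include <arrow/io/file.h>
#include <parquet/api/writer.h>
#include <parquet/exception.h>

int main() {
  using parquet::schema::GroupNode;
  using parquet::schema::PrimitiveNode;

  // Two-column INT64 schema, purely for illustration.
  parquet::schema::NodeVector fields;
  fields.push_back(PrimitiveNode::Make("a", parquet::Repetition::REQUIRED, parquet::Type::INT64));
  fields.push_back(PrimitiveNode::Make("b", parquet::Repetition::REQUIRED, parquet::Type::INT64));
  auto schema = std::static_pointer_cast<GroupNode>(
      GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));

  std::shared_ptr<arrow::io::FileOutputStream> sink;
  PARQUET_ASSIGN_OR_THROW(sink, arrow::io::FileOutputStream::Open("buffered.parquet"));
  auto writer = parquet::ParquetFileWriter::Open(sink, schema);

  // With a buffered row group the columns can be written in any order; the
  // cost is that every column chunk of the row group is kept in memory until
  // the row group (or the file) is closed.
  parquet::RowGroupWriter* rg = writer->AppendBufferedRowGroup();
  auto* col_b = static_cast<parquet::Int64Writer*>(rg->column(1));
  auto* col_a = static_cast<parquet::Int64Writer*>(rg->column(0));

  for (int64_t i = 0; i < 1000; ++i) {
    int64_t v = 2 * i;
    col_b->WriteBatch(1, nullptr, nullptr, &v);  // write column b first
  }
  for (int64_t i = 0; i < 1000; ++i) {
    col_a->WriteBatch(1, nullptr, nullptr, &i);  // then column a
  }

  writer->Close();
  PARQUET_THROW_NOT_OK(sink->Close());
  return 0;
}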