Splitting a parquet file into 5 row groups yields the same benefit as creating
5 parquet files of 1 row group each.
Also, the latter allows more parallelism for writes.
Am I missing something?
On July 20, 2023 12:38:54 PM UTC, sagar sumit wrote:
>Good questions! The idea is to be able to
Good questions! The idea is to be able to skip row groups based on the index.
But if we have to do a full snapshot load, then our wrapper should actually
be doing a batch GET on S3; why incur 5x more calls?
As for the update, I think this is in the context of COW, so the footer will
be recomputed.
Hi,
Multiple independent initiatives for fast copy-on-write have emerged
(correct me if I am wrong):
1. https://github.com/apache/hudi/blob/f1afb1bf04abdc94a26d61dc302f36ec2bbeb15b/rfc/rfc-68/rfc-68.md
2. https://www.uber.com/en-FR/blog/fast-copy-on-write-within-apache-parquet/
The idea is to