I definitely can't see a benefit to using 30MB row groups over just creating 30MB
parquet files.
I would add that stats indexes are at the file level, so that argues in favor of
using row group size = file size.
The only context where it would help is when clustering is set up and targets 1GB
files with 128MB row groups.
Splitting a parquet file into 5 row groups gives the same benefit as creating 5
parquet files with 1 row group each.
Also, the latter allows more parallelism for writes.
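As a minimal sketch (Spark + Hudi in Java, not from this thread), this is roughly
what "row group size = file size" could look like at write time; the exact config
keys and values are assumptions from memory and may differ per Hudi version:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class RowGroupEqualsFileSize {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("row-group-sizing-sketch")
            .getOrCreate();

        // hypothetical input path
        Dataset<Row> df = spark.read().parquet("s3a://bucket/input/");

        long targetBytes = 128L * 1024 * 1024; // 128MB target base file size

        df.write()
          .format("hudi")
          // assumed config keys: cap the base file size and make the Parquet
          // row group (block) size match it, so each file holds one row group
          .option("hoodie.parquet.max.file.size", String.valueOf(targetBytes))
          .option("hoodie.parquet.block.size", String.valueOf(targetBytes))
          // illustrative table settings only
          .option("hoodie.table.name", "demo_table")
          .option("hoodie.datasource.write.recordkey.field", "uuid")
          .mode(SaveMode.Append)
          .save("s3a://bucket/hudi/demo_table");
      }
    }

With one row group per file, the file-level stats index already carries the same
pruning information as row-group stats would, and five such files can be written
by five tasks in parallel.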
Am I missing something?
On July 20, 2023 12:38:54 PM UTC, sagar sumit wrote:
Good questions! The idea is to be able to skip row groups based on the index.
But, if we have to do a full snapshot load, then our wrapper should actually
be doing batch GET on S3. Why incur 5x more calls?
As for the update, I think this is in the context of COW. So, the footer will be
recomputed.
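(For illustration only, a minimal parquet-hadoop sketch of what row-group skipping
from footer statistics looks like; the file path and the surrounding scaffolding
are assumptions, not the wrapper discussed above:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.column.statistics.Statistics;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    public class RowGroupSkipSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hypothetical base file of a Hudi table
        Path file = new Path("s3a://bucket/hudi/demo_table/part-0001.parquet");

        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
          int i = 0;
          for (BlockMetaData rowGroup : reader.getFooter().getBlocks()) {
            System.out.println("row group " + (i++) + " rows=" + rowGroup.getRowCount());
            for (ColumnChunkMetaData col : rowGroup.getColumns()) {
              Statistics<?> stats = col.getStatistics();
              if (stats != null && stats.hasNonNullValue()) {
                // a reader compares these per-row-group min/max bounds against
                // the query predicate and skips row groups that cannot match
                System.out.println("  " + col.getPath().toDotString()
                    + " min=" + stats.genericGetMin()
                    + " max=" + stats.genericGetMax());
              }
            }
          }
        }
      }
    }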
Hi,
Multiple independent initiatives for fast copy-on-write have emerged
(correct me if I am wrong):
1. https://github.com/apache/hudi/blob/f1afb1bf04abdc94a26d61dc302f36ec2bbeb15b/rfc/rfc-68/rfc-68.md
2. https://www.uber.com/en-FR/blog/fast-copy-on-write-within-apache-parquet/
The idea is to