When reading in Gzip files, I’ve always read them into a DataFrame and then written them out to Parquet/Delta in more or less their raw form, and then used those files for my transformations, since the workloads become parallelisable across the split files. When reading in Gzips, each file will be read by a single task, as Gzip is not splittable.
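A minimal sketch of that staging pattern in PySpark (paths and option values here are hypothetical, not from the thread):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gzip-to-parquet").getOrCreate()

    # Hypothetical input path. A .csv.gz file is not splittable, so
    # this read is handled by one task per file.
    raw = spark.read.option("header", True).csv("s3://bucket/raw/events.csv.gz")

    # Stage the data once as Parquet (snappy-compressed by default).
    # The output is a set of splittable files.
    raw.write.mode("overwrite").parquet("s3://bucket/staged/events")

    # Downstream transformations read the splittable Parquet copy and
    # parallelise across files and row groups.
    events = spark.read.parquet("s3://bucket/staged/events")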
Cool. Thanks, everyone, for the reply.
On Sat, Sep 17, 2022 at 9:50 PM Enrico Minack wrote:
If with "won't affect the performance" you mean "parquet is splittable
though it uses snappy", then yes. Splittable files allow for optimal
parallelization, which "won't affect performance".
Spark writing data will split the data into multiple files already (here
parquet files). Even if each
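A rough illustration of that point, reusing the spark session from the earlier sketch (the paths and partition count are hypothetical):

    # Spark writes one output file per partition of the DataFrame.
    df = spark.read.parquet("s3://bucket/staged/events")
    df.repartition(8).write.mode("overwrite").parquet("/tmp/events_out")
    # /tmp/events_out now holds 8 part-*.snappy.parquet files, each of
    # which can be read by a separate task.

    # Sanity check: a Parquet read comes back as multiple partitions.
    print(spark.read.parquet("/tmp/events_out").rdd.getNumPartitions())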
Okay, so you mean to say that Parquet compresses the denormalized data using Snappy, so it won't affect the performance, whereas using Snappy on its own would affect the performance.
Am I correct?
On Thu, 15 Sep 2022, 01:08 Amit Joshi wrote:
Hi Sid,
Snappy itself is not splittable. But the container format that holds the actual data, such as Parquet (which is divided into row groups), can be compressed using Snappy.
This works because the blocks (the pages of the Parquet format) inside the file are compressed with Snappy independently, so readers can still split the file at row-group boundaries.
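A minimal sketch of choosing the codec at write time, assuming the hypothetical df from the sketches above; the codec applies per page inside each row group, so the output stays splittable either way:

    # The compression codec is applied inside the Parquet structure
    # (per page within each row group), not to the file as a whole.
    df.write.option("compression", "snappy").parquet("/tmp/events_snappy")

    # The same holds for other codecs Parquet supports, e.g. gzip:
    df.write.option("compression", "gzip").parquet("/tmp/events_gzip")

This is why a gzip codec inside Parquet behaves differently from a .csv.gz file: in Parquet, gzip only wraps individual pages, whereas .csv.gz wraps the whole file and so cannot be split.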