Re: Splittable or not?

2022-09-19 Thread Jack Goodson
When reading in Gzip files, I’ve always read them into a data frame, written them out to Parquet/Delta more or less in their raw form, and then used those files for my transformations, since the workloads are then parallelisable across the split files. When reading in Gzips, these will be read by

Re: Splittable or not?

2022-09-19 Thread Sid
Cool. Thanks, everyone, for the replies.
On Sat, Sep 17, 2022 at 9:50 PM Enrico Minack wrote:
> If with "won't affect the performance" you mean "parquet is splittable
> though it uses snappy", then yes. Splittable files allow for optimal
> parallelization, which "won't affect performance".

Re: Splittable or not?

2022-09-17 Thread Enrico Minack
If with "won't affect the performance" you mean "parquet is splittable though it uses snappy", then yes. Splittable files allow for optimal parallelization, which "won't affect performance". When Spark writes data, it already splits the data into multiple files (here, Parquet files). Even if each

Re: Splittable or not?

2022-09-14 Thread Sid
Okay, so you mean to say that Parquet compresses the denormalized data using Snappy, so it won't affect the performance. Using only Snappy will affect the performance. Am I correct?
On Thu, 15 Sep 2022, 01:08 Amit Joshi wrote:
> Hi Sid,
>
> Snappy itself is not splittable. But the format that

Re: Splittable or not?

2022-09-14 Thread Amit Joshi
Hi Sid, Snappy itself is not splittable. But a format that contains the actual data, like Parquet (which is basically divided into row groups), can be compressed using Snappy. This works because the blocks (pages, in the Parquet format) inside a Parquet file can be compressed independently using Snappy.
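
The idea of independently compressed blocks can be illustrated with a small sketch. This is not how Parquet or Spark is implemented: it uses `zlib` from the Python standard library as a stand-in for Snappy, and the helper names `compress_in_blocks` and `read_block` are hypothetical. It only shows that when each block is its own compressed stream, any one block can be decoded without touching the others, which is what lets a container format be split across readers even though the codec itself is not splittable.

```python
import zlib


def compress_in_blocks(rows, rows_per_block):
    """Compress each group of rows as its own independent zlib stream.

    zlib stands in for Snappy here (an assumption made so the sketch
    needs only the standard library).
    """
    blocks = []
    for start in range(0, len(rows), rows_per_block):
        chunk = "\n".join(rows[start:start + rows_per_block]).encode("utf-8")
        blocks.append(zlib.compress(chunk))
    return blocks


def read_block(blocks, index):
    """Decode a single block without reading any of the other blocks."""
    return zlib.decompress(blocks[index]).decode("utf-8").split("\n")


rows = [f"row-{i}" for i in range(10)]
blocks = compress_in_blocks(rows, rows_per_block=4)

# Each block could be handed to a different worker; here only the last
# block is decoded, and the first two are never decompressed.
last_rows = read_block(blocks, 2)
```

By contrast, compressing all ten rows as one stream would force a reader to start from the first byte, so only one worker could process the file.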