Cool. Thanks, everyone, for the replies.

On Sat, Sep 17, 2022 at 9:50 PM Enrico Minack <i...@enrico.minack.dev> wrote:

> If by "won't affect the performance" you mean "parquet is splittable
> though it uses snappy", then yes. Splittable files allow for optimal
> parallelization, which "won't affect performance".
>
> Spark writing data will split the data into multiple files already (here
> parquet files). Even if each file were not splittable, your data have
> been split already. Splittable parquet files allow for more granularity
> (more splitting of your data), in case those files are big.
>
> Enrico
>
>
> On 14.09.22 at 21:57, Sid wrote:
>
> Okay, so you mean to say that parquet compresses the denormalized data
> using snappy, so it won't affect the performance.
>
> Only using snappy will affect the performance.
>
> Am I correct?
>
> On Thu, 15 Sep 2022, 01:08 Amit Joshi, <mailtojoshia...@gmail.com> wrote:
>
>> Hi Sid,
>>
>> Snappy itself is not splittable. But the format that contains the actual
>> data, like parquet (which is basically divided into row groups), can be
>> compressed using snappy.
>> This works because blocks (pages of parquet format) inside the parquet can
>> be independently compressed using snappy.
>>
>> Thanks
>> Amit
>>
>> On Wed, Sep 14, 2022 at 8:14 PM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hello experts,
>>>
>>> I know that Gzip and snappy files are not splittable, i.e. data won't be
>>> distributed into multiple blocks; rather, it would try to load the data in a
>>> single partition/block.
>>>
>>> So, my question is: when I write the parquet data via spark, it gets
>>> stored at the destination with something like *part*.snappy.parquet*
>>>
>>> So, when I read this data, will it affect my performance?
>>>
>>> Please help me if there is any understanding gap.
>>>
>>> Thanks,
>>> Sid
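
For anyone finding this thread later, here is a minimal sketch of the idea Amit describes in the quoted thread: blocks that are compressed independently can be read independently. This is plain Python, not Parquet and not Spark. zlib from the standard library stands in for snappy, and the container layout and the names `write_blocks` and `read_block` are made up for illustration only.

```python
import zlib


def write_blocks(rows, rows_per_block):
    """Compress each block of rows on its own and record where it lives."""
    payload = bytearray()
    index = []  # (offset, length) per block; loosely plays the role of a footer
    for start in range(0, len(rows), rows_per_block):
        chunk = "\n".join(rows[start:start + rows_per_block]).encode("utf-8")
        compressed = zlib.compress(chunk)  # stand-in for snappy (assumption)
        index.append((len(payload), len(compressed)))
        payload.extend(compressed)
    return bytes(payload), index


def read_block(payload, index, block_no):
    """Decompress a single block without touching any other block."""
    offset, length = index[block_no]
    raw = zlib.decompress(payload[offset:offset + length])
    return raw.decode("utf-8").split("\n")


rows = [f"row-{i}" for i in range(10)]
payload, index = write_blocks(rows, rows_per_block=4)

# Each block could be handed to a different reader (think: a different task).
blocks = [read_block(payload, index, n) for n in range(len(index))]
```

Because every block is a complete compressed unit with a known offset, a reader can jump straight to the last block without decompressing the earlier ones. A file compressed as one single stream offers no such entry points, which is the case Sid was worried about.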