Re: Splittable or not?

Sid Mon, 19 Sep 2022 02:45:44 -0700

Cool. Thanks, everyone, for the replies.

On Sat, Sep 17, 2022 at 9:50 PM Enrico Minack <i...@enrico.minack.dev>
wrote:


> If by "won't affect the performance" you mean "parquet is splittable
> even though it uses snappy", then yes. Splittable files allow for optimal
> parallelization, so the compression "won't affect performance".
>
> When Spark writes data, it already splits the data into multiple files (here
> parquet files). Even if each file were not splittable, your data has
> already been split. Splittable parquet files allow for more granularity
> (more splitting of your data) in case those files are big.
>
> Enrico
>
>
> On 14.09.22 at 21:57, Sid wrote:
>
> Okay, so you mean that parquet compresses the denormalized data
> using snappy, so it won't affect the performance.
>
> Using snappy alone (without parquet) would affect the performance.
>
> Am I correct?
>
> On Thu, 15 Sep 2022, 01:08 Amit Joshi, <mailtojoshia...@gmail.com> wrote:
>
>> Hi Sid,
>>
>> Snappy itself is not splittable. But a format that contains the actual
>> data, like parquet (which is basically divided into row groups), can be
>> compressed using snappy.
>> This works because the blocks (pages in the parquet format) inside a
>> parquet file can be independently compressed using snappy.
>>
>> Thanks
>> Amit
>>
>> On Wed, Sep 14, 2022 at 8:14 PM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hello experts,
>>>
>>> I know that Gzip and snappy files are not splittable, i.e. the data won't be
>>> distributed into multiple blocks; rather, it would all be loaded into a
>>> single partition/block.
>>>
>>> So, my question is: when I write parquet data via Spark, it gets
>>> stored at the destination as something like *part*.snappy.parquet*
>>>
>>> So, when I read this data, will it affect my performance?
>>>
>>> Please correct me if there is any gap in my understanding.
>>>
>>> Thanks,
>>> Sid
>>>
>>
>
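
Amit's point above (a compressed stream is not splittable, but a container that compresses its blocks independently is) can be illustrated with a small sketch. This is not how parquet is implemented: zlib stands in for snappy because it ships with Python, and BLOCK_SIZE, compress_whole, compress_blocks and read_block are hypothetical names chosen only for this example.

```python
import zlib
from typing import List

# Assumption: zlib is used in place of snappy, which is not in the standard library.
# Assumption: a tiny block size, far smaller than real parquet pages or row groups.
BLOCK_SIZE = 1000


def compress_whole(data: bytes) -> bytes:
    """One compressed stream: a reader has to start decompressing at byte 0."""
    return zlib.compress(data)


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Compress each block on its own, the way a container format can."""
    return [
        zlib.compress(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]


def read_block(blocks: List[bytes], index: int) -> bytes:
    """Decompress a single block without touching any of the others."""
    return zlib.decompress(blocks[index])


# 1000 rows of 11 bytes each, i.e. 11000 bytes of sample data.
data = b"".join(b"row-%06d;" % i for i in range(1000))

whole = compress_whole(data)
blocks = compress_blocks(data)

# A worker can be handed block 2 alone and recover bytes 2000..2999.
third_block = read_block(blocks, 2)
```

With the single stream in `whole`, a second worker cannot begin in the middle; with `blocks`, each worker can take a different index and decompress in parallel, which is the property the thread calls "splittable".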

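Enrico's point about granularity can be put into rough numbers. The sketch below is a simplification, not Spark's actual split planning: it ignores file open cost, the packing of small files into one task, and row-group boundaries. estimate_tasks is a hypothetical helper, and the 128 MB split size is an assumption that I believe matches the default of spark.sql.files.maxPartitionBytes.

```python
import math
from typing import List

MB = 1024 * 1024
# Assumption: 128 MB split size (believed to be Spark's default for
# spark.sql.files.maxPartitionBytes; check your own configuration).
MAX_SPLIT_BYTES = 128 * MB


def estimate_tasks(file_sizes: List[int], splittable: bool,
                   max_split_bytes: int = MAX_SPLIT_BYTES) -> int:
    """Rough count of read tasks for a set of files.

    Non-splittable files yield one task each; splittable files yield
    one task per max_split_bytes chunk (at least one per file).
    """
    if not splittable:
        return len(file_sizes)
    return sum(max(1, math.ceil(size / max_split_bytes)) for size in file_sizes)


# Three hypothetical part files written by one job.
files = [512 * MB, 512 * MB, 64 * MB]

tasks_if_not_splittable = estimate_tasks(files, splittable=False)
tasks_if_splittable = estimate_tasks(files, splittable=True)
```

Even in the non-splittable case the data is already spread over three files, so three tasks can run in parallel; splittability only matters for the two large files, which can be divided further.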