Re: Will it lead to OOM error?

2022-06-22 Thread Sid
Thanks all for your answers. Much appreciated.

On Thu, Jun 23, 2022 at 6:07 AM Yong Walt  wrote:

> We have many cases like this. It won't cause an OOM error.
>
> Thanks
>
> On Wed, Jun 22, 2022 at 8:28 PM Sid  wrote:
>
>> I have a 150 TB CSV file.
>>
>> I have a total of 100 TB of RAM and 100 TB of disk. So if I do something
>> like this
>>
>> spark.read.option("header","true").csv(filepath).show(false)
>>
>> Will it lead to an OOM error since there isn't enough memory, or will it
>> spill the data onto the disk and process it?
>>
>> Thanks,
>> Sid
>>
>


Re: Will it lead to OOM error?

2022-06-22 Thread Yong Walt
We have many cases like this. It won't cause an OOM error.

Thanks

On Wed, Jun 22, 2022 at 8:28 PM Sid  wrote:

> I have a 150 TB CSV file.
>
> I have a total of 100 TB of RAM and 100 TB of disk. So if I do something like this
>
> spark.read.option("header","true").csv(filepath).show(false)
>
> Will it lead to an OOM error since there isn't enough memory, or will it
> spill the data onto the disk and process it?
>
> Thanks,
> Sid
>


Re: Will it lead to OOM error?

2022-06-22 Thread Enrico Minack
Yes, a single file compressed with a non-splittable compression (e.g.
gzip) would have to be read by a single executor. That takes forever.


You should consider recompressing the file with a splittable compression
first. You will not want to read that file more than once, so you should
decompress it only once, in order to recompress it.
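
For example, a rough one-time recompression could look like this (a sketch
only; gzPath and outPath are just placeholder names):

// Reading the single .gz file runs as one task (gzip is not splittable),
// so this read is slow, but it only has to happen once.
val df = spark.read
  .option("header","true")
  .csv(gzPath)

// Write the data back out with a splittable codec (or uncompressed) so that
// later reads can be parallelized across many partitions.
df.write
  .option("header","true")
  .option("compression", "bzip2")  // bzip2 is splittable; "none" also works
  .csv(outPath)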


Enrico


On 22.06.22 at 20:17, Sid wrote:

Hi Enrico,

Thanks for the insights.

Could you please help me understand, with an example, which compressed
file formats wouldn't be split into partitions, would put the load on a
single partition, and might lead to an OOM error?


Thanks,
Sid

On Wed, Jun 22, 2022 at 6:40 PM Enrico Minack  
wrote:


The RAM and disk memory consumption depends on what you do with the
data after reading it.

Your particular action will read 20 lines from the first partition
and show them. So it will not use any RAM or disk, no matter how
large the CSV is.

If you do a count instead of show, it will iterate over each
partition and return a count per partition, so no RAM is needed
here either.

If you do some real processing of the data, the required RAM
and disk again depend on the shuffles involved and the intermediate
results that need to be stored in RAM or on disk.

Enrico


On 22.06.22 at 14:54, Deepak Sharma wrote:

It will spill to disk if everything can’t be loaded in memory.


On Wed, 22 Jun 2022 at 5:58 PM, Sid  wrote:

I have a 150 TB CSV file.

I have a total of 100 TB of RAM and 100 TB of disk. So if I do
something like this

spark.read.option("header","true").csv(filepath).show(false)

Will it lead to an OOM error since there isn't enough memory,
or will it spill the data onto the disk and process it?

Thanks,
Sid

-- 
Thanks

Deepak
www.bigdatabig.com 
www.keosha.net 





Re: Will it lead to OOM error?

2022-06-22 Thread Sid
Hi Enrico,

Thanks for the insights.

Could you please help me understand, with an example, which compressed file
formats wouldn't be split into partitions, would put the load on a single
partition, and might lead to an OOM error?

Thanks,
Sid

On Wed, Jun 22, 2022 at 6:40 PM Enrico Minack 
wrote:

> The RAM and disk memory consumption depends on what you do with the data
> after reading it.
>
> Your particular action will read 20 lines from the first partition and
> show them. So it will not use any RAM or disk, no matter how large the CSV
> is.
>
> If you do a count instead of show, it will iterate over each partition
> and return a count per partition, so no RAM is needed here either.
>
> If you do some real processing of the data, the required RAM and disk
> again depend on the shuffles involved and intermediate results that need
> to be stored in RAM or on disk.
>
> Enrico
>
>
> On 22.06.22 at 14:54, Deepak Sharma wrote:
>
> It will spill to disk if everything can’t be loaded in memory.
>
>
> On Wed, 22 Jun 2022 at 5:58 PM, Sid  wrote:
>
>> I have a 150 TB CSV file.
>>
>> I have a total of 100 TB of RAM and 100 TB of disk. So if I do something
>> like this
>>
>> spark.read.option("header","true").csv(filepath).show(false)
>>
>> Will it lead to an OOM error since there isn't enough memory, or will it
>> spill the data onto the disk and process it?
>>
>> Thanks,
>> Sid
>>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>
>
>


Re: Will it lead to OOM error?

2022-06-22 Thread Enrico Minack
The RAM and disk memory consumption depends on what you do with the data
after reading it.


Your particular action will read 20 lines from the first partition and
show them. So it will not use any RAM or disk, no matter how large the
CSV is.


If you do a count instead of show, it will iterate over each
partition and return a count per partition, so no RAM is needed here either.


If you do some real processing of the data, the required RAM and disk
again depend on the shuffles involved and the intermediate results that
need to be stored in RAM or on disk.
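
To make that concrete, a small illustrative sketch (filepath and someColumn
are just placeholders):

val df = spark.read.option("header","true").csv(filepath)

// show(false) only materializes the first ~20 rows of the first partition,
// so the 150 TB file is never scanned in full.
df.show(false)

// count() scans every partition, but each task only keeps a running count,
// so very little memory is held per executor.
df.count()

// Memory and disk pressure only shows up once shuffles or wide aggregations
// are involved, e.g. grouping on a high-cardinality column:
// df.groupBy("someColumn").count().show(false)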


Enrico


On 22.06.22 at 14:54, Deepak Sharma wrote:

It will spill to disk if everything can’t be loaded in memory.


On Wed, 22 Jun 2022 at 5:58 PM, Sid  wrote:

I have a 150 TB CSV file.

I have a total of 100 TB of RAM and 100 TB of disk. So if I do something
like this

spark.read.option("header","true").csv(filepath).show(false)

Will it lead to an OOM error since there isn't enough memory,
or will it spill the data onto the disk and process it?

Thanks,
Sid

--
Thanks
Deepak
www.bigdatabig.com 
www.keosha.net 




Re: Will it lead to OOM error?

2022-06-22 Thread Deepak Sharma
It will spill to disk if everything can’t be loaded in memory.
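
As a rough illustration (filepath is just a placeholder), caching with a
storage level that allows spilling looks like this:

import org.apache.spark.storage.StorageLevel

val df = spark.read.option("header","true").csv(filepath)

// Partitions that do not fit in executor memory are written to local disk
// instead of failing the job with an OOM; shuffle data spills to local disk
// in a similar way when execution memory fills up.
df.persist(StorageLevel.MEMORY_AND_DISK)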


On Wed, 22 Jun 2022 at 5:58 PM, Sid  wrote:

> I have a 150 TB CSV file.
>
> I have a total of 100 TB of RAM and 100 TB of disk. So if I do something like this
>
> spark.read.option("header","true").csv(filepath).show(false)
>
> Will it lead to an OOM error since there isn't enough memory, or will it
> spill the data onto the disk and process it?
>
> Thanks,
> Sid
>
-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net