Re: After calling persist, why the size in sparkui is not matching with the actual file size

Denis Bolshakov Mon, 29 Aug 2016 08:33:38 -0700

Hello,

Spark uses Snappy compression by default. Is your original file compressed?
Also, Spark keeps the data in its own representation format (column-based),
which is not the same as plain text.
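
To illustrate the point that persisted data is not stored as text, here is a
minimal pure-Python sketch. It does not use Spark at all. It assumes the
records are pickled in a batch and optionally compressed, and it uses zlib
from the standard library as a stand-in for Snappy. The log lines and the
helper names (text_size, serialized_size) are made up for the example, so the
numbers will not reproduce what the Spark UI reports for the real file.

```python
import pickle
import zlib

# Hypothetical log lines standing in for application2.log (not the real file).
lines = ["2016-08-29 16:52:%02d INFO Request handled in 12 ms" % (i % 60)
         for i in range(500)]
lines += ["2016-08-29 16:53:%02d ERROR Connection refused" % (i % 60)
          for i in range(100)]


def text_size(records):
    # Size as a plain text file: each record plus one newline byte.
    return sum(len(r.encode("utf-8")) + 1 for r in records)


def serialized_size(records, compress=False):
    # Size of the whole list pickled as one batch, optionally compressed.
    # zlib is only a stand-in here; it is not the codec Spark uses.
    blob = pickle.dumps(records, protocol=2)
    return len(zlib.compress(blob)) if compress else len(blob)


# Same filter as in the original snippet, applied to a plain Python list.
errors = [r for r in lines if "ERROR" in r]

for name, records in [("logData", lines), ("errors", errors)]:
    print(name,
          "text:", text_size(records),
          "pickled:", serialized_size(records),
          "pickled+compressed:", serialized_size(records, compress=True))
```

The three sizes differ for the same records, so the size of a cached
representation should not be expected to equal the size of the text file.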


Best regards,
Denis

On 29 August 2016 at 16:52, Rohit Kumar Prusty <rohit_pru...@infosys.com>
wrote:

> Hi Team,
>
> I am new to Spark and have a basic question. After calling persist, why
> does the size shown in the Spark UI not match the actual file size?
>
>
>
> Actual File Size for “/user/rohit_prusty/application2.log” – *39 KB*
>
>
>
> Code snippet:
>
> ===========
>
> logData = sc.textFile("/user/rohit_prusty/application2.log")
>
> logData.persist()
>
> logData.count()
>
> errors = logData.filter(lambda line: "ERROR" in line)
>
> errors.persist()
>
> errors.count()
>
>
>
> Output in SparkUI
>
> ==============
>
> logData RDD takes *2.1 KB*
>
> errors RDD takes *1.3 KB*
>
>
>
> Regards
>
> Rohit Kumar Prusty
>
> +91-9884070075
>
>
>



-- 
//with Best Regards
--Denis Bolshakov
e-mail: bolshakov.de...@gmail.com
