Padova Apache Spark Meetup

2018-09-05 Thread Matteo Durighetto
Hello,
we are starting a new meetup for enthusiastic Apache Spark users
in Padova, Italy:


https://www.meetup.com/Padova-Apache-Spark-Meetup/

Is it possible to add the meetup link to the web page
https://spark.apache.org/community.html ?

Also, is it possible to announce future events on this mailing list?


Kind Regards

Matteo Durighetto
e-mail: m.durighe...@miriade.it



Re: Writing a DataFrame is taking too long and huge space

2018-03-09 Thread Matteo Durighetto
Hello, try to use the Parquet format with compression (like Snappy or LZ4), so
the produced files will be smaller and generate less I/O. Moreover, Parquet is
normally much faster than CSV to read back for further operations.
Another possible format is ORC.

Kind Regards

Matteo


2018-03-09 11:23 GMT+01:00 Md. Rezaul Karim :

> Dear All,
>
> I have a tiny CSV file, around 250MB, with only 30 columns in the
> DataFrame. Now I'm trying to save the pre-processed DataFrame as
> another CSV file on disk for later use.
>
> However, writing the resultant DataFrame is taking far too long, about
> 4 to 5 hours. Worse, the size of the file written to disk is about 58GB!
>
> Here's the sample code that I tried:
>
> # Using repartition()
> myDF.repartition(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>
> # Using coalesce()
> myDF.coalesce(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>
>
> Any better suggestion?
>
>
>
> 
> Md. Rezaul Karim, BSc, MSc
> Research Scientist, Fraunhofer FIT, Germany
>
> Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany
>
> eMail: rezaul.ka...@fit.fraunhofer.de 
> Tel: +49 241 80-21527 <+49%20241%208021527>
>