Padova Apache Spark Meetup
Hello, we are creating a new meetup for enthusiastic Apache Spark users in Padova, Italy: https://www.meetup.com/Padova-Apache-Spark-Meetup/

Would it be possible to add the meetup link to the web page https://spark.apache.org/community.html? Also, is it possible to announce future events on this mailing list?

Kind Regards
Matteo Durighetto
e-mail: m.durighe...@miriade.it
Re: Writing a DataFrame is taking too long and huge space
Hello,

try using the Parquet format with compression (such as snappy or lz4): the produced files will be smaller and will generate less I/O. Moreover, Parquet is normally much faster than CSV to read back for further operations. Another possible format is ORC.

Kind Regards
Matteo

2018-03-09 11:23 GMT+01:00 Md. Rezaul Karim:
> Dear All,
>
> I have a tiny CSV file, which is around 250MB. There are only 30 columns
> in the DataFrame. Now I'm trying to save the pre-processed DataFrame as
> another CSV file on disk for later usage.
>
> However, I'm getting frustrated because writing the resultant DataFrame
> is taking too long, about 4 to 5 hours. Moreover, the size of the file
> written to disk is about 58GB!
>
> Here's the sample code that I tried:
>
> # Using repartition()
> myDF.repartition(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>
> # Using coalesce()
> myDF.coalesce(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>
> Any better suggestion?
>
> Md. Rezaul Karim, BSc, MSc
> Research Scientist, Fraunhofer FIT, Germany
> Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany
> eMail: rezaul.ka...@fit.fraunhofer.de
> Tel: +49 241 80-21527