Re: writing a small csv to HDFS is super slow

2019-03-27 Thread Gezim Sejdiu
Hi Lian, many thanks for the detailed information and for sharing the solution with us. I will forward this to a student, and hopefully it will resolve the issue. Best regards, On Wed, Mar 27, 2019 at 1:55 AM Lian Jiang wrote: > Hi Gezim, > > My execution plan of the data frame to write into HDFS is a

Re: writing a small csv to HDFS is super slow

2019-03-26 Thread Lian Jiang
Hi Gezim, My execution plan of the data frame to write into HDFS is a union of 140 child dataframes. None of these child dataframes is materialized before the write to HDFS. Saving the file is not what takes the time; materializing the dataframes is. My solution is to materializ
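One way to read the fix described above, as a minimal sketch only: force each child DataFrame to be computed before the union, so the final write does not have to evaluate all 140 children in one go. The excerpt is cut off before Lian's actual method, so the use of cache() plus count() here is an assumption (checkpointing or writing intermediate files would be alternatives), and the helper names materialize and union_materialized are hypothetical.

```python
from functools import reduce


def materialize(df):
    """Cache df and run an action so its rows are computed right away."""
    cached = df.cache()   # marks the DataFrame for caching (lazy on its own)
    cached.count()        # action: triggers the computation and fills the cache
    return cached


def union_materialized(children):
    """Materialize every child DataFrame first, then union them in order.

    Assumes `children` is a non-empty list of DataFrames with the same schema.
    """
    ready = [materialize(child) for child in children]
    return reduce(lambda left, right: left.union(right), ready)
```

With a real SparkSession, children would be the list of DataFrames that feed the union, and the returned DataFrame would be passed to the existing write call.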

Re: writing a small csv to HDFS is super slow

2019-03-26 Thread Gezim Sejdiu
Hi Lian, I have been following the thread because one of my students had the same issue. The problem occurred when trying to save a larger XML dataset to HDFS: because of a connectivity timeout between Spark and HDFS, the output could not be displayed. I also suggested that he do the same as @Apostolos

Re: writing a small csv to HDFS is super slow

2019-03-25 Thread Lian Jiang
Thanks for the replies, everyone. The execution plan shows a giant query. After dividing and conquering, saving is quick. On Fri, Mar 22, 2019 at 4:01 PM kathy Harayama wrote: > Hi Lian, > Since you are using repartition(1), do you want to decrease the number of > partitions? If so, have you tried to use coalesce
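To see the kind of giant plan mentioned above, the query plan can be printed before writing. This is a minimal sketch, assuming df is the DataFrame about to be saved; the wrapper name show_plan is hypothetical, while explain is the standard DataFrame method.

```python
def show_plan(df):
    """Print the logical and physical plans for df without running the job."""
    # Passing True asks for the extended output: the parsed, analyzed and
    # optimized logical plans in addition to the physical plan.
    df.explain(True)
```

A plan that runs to many screens of nested unions is a hint that the write itself is probably not where the time goes.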

Re: writing a small csv to HDFS is super slow

2019-03-22 Thread kathy Harayama
Hi Lian, Since you are using repartition(1), do you want to decrease the number of partitions? If so, have you tried using coalesce instead? Kathleen On Fri, Mar 22, 2019 at 2:43 PM Lian Jiang wrote: > Hi, > > Writing a csv to HDFS takes about 1 hour: > > > df.repartition(1).write.format('com.data
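The coalesce suggestion above could look roughly like this. It is a sketch that reuses the write chain from the original post and only swaps repartition(1) for coalesce(1); the function name write_single_csv and the path argument are hypothetical.

```python
def write_single_csv(df, path):
    """Write df as a single CSV part file, using coalesce(1) to avoid a shuffle."""
    (
        df.coalesce(1)  # merges existing partitions without a full shuffle
        # Format string copied from the original post; on Spark 2.x and later
        # the built-in "csv" source would be the usual choice (assumption
        # about the poster's Spark version).
        .write.format("com.databricks.spark.csv")
        .mode("overwrite")
        .options(header="true")
        .save(path)
    )
```

One caveat worth keeping in mind: coalesce(1) can pull the upstream computation into a single task, so it saves the shuffle but may reduce parallelism. It would not by itself help if the time is spent evaluating the plan rather than writing.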

Re: writing a small csv to HDFS is super slow

2019-03-22 Thread Apostolos N. Papadopoulos
Is it also slow when you do not repartition (i.e., when you get multiple output files)? Also, did you try simply saveAsTextFile? And how many partitions are there before the repartition? a. On 22/3/19 23:34, Lian Jiang wrote: Hi, Writing a csv to HDFS takes about 1 hour: df.repartition(1).write.f
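Two of the checks asked about above can be sketched as small helpers: counting the partitions before any repartition, and timing a write that keeps the existing partitioning. The helper names are hypothetical, and the write chain is the one from the original post minus repartition(1).

```python
import time


def count_partitions(df):
    """Return how many partitions df has before any repartition() call."""
    return df.rdd.getNumPartitions()


def time_write_without_repartition(df, path):
    """Write df with its current partitioning and return the elapsed seconds.

    Produces one part file per partition instead of a single CSV file.
    """
    start = time.perf_counter()
    (
        df.write.format("com.databricks.spark.csv")
        .mode("overwrite")
        .options(header="true")
        .save(path)
    )
    return time.perf_counter() - start
```

If the write is still slow without repartition(1), the partitioning step is unlikely to be the bottleneck.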

writing a small csv to HDFS is super slow

2019-03-22 Thread Lian Jiang
Hi, Writing a csv to HDFS takes about 1 hour: df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').options(header='true').save(csv) The generated csv file is only about 150 KB. The job uses 3 containers (13 cores, 23 GB memory). Other people have similar issues but I don't see