Hi Lian,
many thanks for the detailed information and for sharing the solution with us.
I will forward this to a student; hopefully it will resolve the issue.
Best regards,
On Wed, Mar 27, 2019 at 1:55 AM Lian Jiang wrote:
Hi Gezim,
The execution plan of the dataframe I write to HDFS is a union of 140
children dataframes. None of these children dataframes are materialized
before the write. It is not the file save that takes time; it is
materializing the dataframes. My solution is to materialize the children
dataframes first (divide and conquer), after which the save is quick.
Hi Lian,
I was following the thread since one of my students had the same issue. The
problem occurred when trying to save a larger XML dataset into HDFS: due to
a connectivity timeout between Spark and HDFS, the output could not be
written.
I also suggested that he do the same as @Apostolos.
Thanks guys for the replies.
The execution plan shows a giant query. After applying divide and conquer,
saving is quick.
On Fri, Mar 22, 2019 at 4:01 PM kathy Harayama wrote:
Hi Lian,
Since you are using repartition(1), do you want to decrease the number of
partitions? If so, have you tried coalesce instead?
Kathleen
On Fri, Mar 22, 2019 at 2:43 PM Lian Jiang wrote:
Is it also slow when you do not repartition? (i.e., to get multiple
output files)
Also, did you try simply saveAsTextFile?
Also, before repartition, how many partitions are there?
a.
On 22/3/19 23:34, Lian Jiang wrote:
Hi,
Writing a csv to HDFS takes about 1 hour:
df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').options(header='true').save(csv)
The generated csv file is only about 150kb. The job uses 3 containers (13
cores, 23g mem).
Other people have similar issues but I don't see