Re: how to merge dataframe write output files

2016-11-10 Thread Jorge Sánchez
Do you have the logs of the containers? This looks like a memory issue.

2016-11-10 7:28 GMT+00:00 lk_spark :

> hi, all:
> When I call df.write.parquet, it produces a lot of small files. How
> can I merge them into one file? I tried df.coalesce(1).write.parquet,
> but it sometimes fails with:
>
> Container exited with a non-zero exit code 143
>
> (long file listing snipped; see the original message below)
> 2016-11-10
> --
> lk_spark
>


RE: how to merge dataframe write output files

2016-11-09 Thread Shreya Agarwal
Is there a reason you want to merge the files? The errors (afaik) happen
because coalescing before the write forces all the content onto a single
executor, and the size of the data exceeds the storage memory you have on that
executor, so the container gets killed. We can confirm this if you provide the
specs of your cluster. The whole purpose of multiple files is that each
executor can write its partition out in parallel, without having to collect
the data in one place.
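The usual alternative to coalesce(1) is to pick a partition count from the total data size, so each output file lands near a target size while the write stays parallel. A minimal sketch of that sizing calculation (the 128 MB target and the helper name are illustrative, not from this thread):

```python
def partitions_for(total_bytes: int, target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Output partition count so each file is roughly target_file_bytes."""
    # ceiling division, with a floor of one partition
    return max(1, -(-total_bytes // target_file_bytes))

# e.g. ~1 GB of data at a 128 MB target -> 8 output files
print(partitions_for(1024 * 1024 * 1024))  # 8
```

The result would feed something like df.repartition(n).write.parquet(path), which shuffles the data but keeps the write spread across executors instead of collecting everything on one.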

Not to mention that it will make your write incredibly slow, and it also
throws away all the speed of reading the data back from Parquet, as there
won't be any parallelism at input time (if you ever read this Parquet output
again).

Again, the important question is: why do you need it to be one file? Are you
planning to use it externally? If so, can you not use the fragmented files
there? If the data is too big for a Spark executor, it will almost certainly
be too much for the JRE or any other runtime to load into memory on a single
box.

From: lk_spark [mailto:lk_sp...@163.com]
Sent: Wednesday, November 9, 2016 11:29 PM
To: user.spark 
Subject: how to merge dataframe write output files

hi, all:
When I call df.write.parquet, it produces a lot of small files. How can I
merge them into one file? I tried df.coalesce(1).write.parquet, but it
sometimes fails with:

Container exited with a non-zero exit code 143
more and more...
-rw-r--r--   2 hadoop supergroup 14.5 K 2016-11-10 15:11 /parquetdata/weixin/biztags/biztag2/part-r-00165-0f61afe4-23e8-40bb-b30b-09652ca677bc.snappy.parquet
-rw-r--r--   2 hadoop supergroup 16.4 K 2016-11-10 15:11 /parquetdata/weixin/biztags/biztag2/part-r-00166-0f61afe4-23e8-40bb-b30b-09652ca677bc.snappy.parquet
-rw-r--r--   2 hadoop supergroup 17.1 K 2016-11-10 15:11 /parquetdata/weixin/biztags/biztag2/part-r-00167-0f61afe4-23e8-40bb-b30b-09652ca677bc.snappy.parquet
-rw-r--r--   2 hadoop supergroup 14.2 K 2016-11-10 15:11 /parquetdata/weixin/biztags/biztag2/part-r-00168-0f61afe4-23e8-40bb-b30b-09652ca677bc.snappy.parquet
-rw-r--r--   2 hadoop supergroup 15.7 K 2016-11-10 15:11 /parquetdata/weixin/biztags/biztag2/part-r-00169-0f61afe4-23e8-40bb-b30b-09652ca677bc.snappy.parquet
-rw-r--r--   2 hadoop supergroup 14.4 K 2016-11-10 15:11 /parquetdata/weixin/biztags/biztag2/part-r-00170-0f61afe4-23e8-40bb-b30b-09652ca677bc.snappy.parquet
-rw-r--r--   2 hadoop supergroup 17.1 K 2016-11-10 15:11 /parquetdata/weixin/biztags/biztag2/part-r-00171-0f61afe4-23e8-40bb-b30b-09652ca677bc.snappy.parquet
-rw-r--r--   2 hadoop supergroup 15.7 K 2016-11-10 15:11 /parquetdata/weixin/biztags/biztag2/part-r-00172-0f61afe4-23e8-40bb-b30b-09652ca677bc.snappy.parquet
-rw-r--r--   2 hadoop supergroup 16.0 K 2016-11-10 15:11 /parquetdata/weixin/biztags/biztag2/part-r-00173-0f61afe4-23e8-40bb-b30b-09652ca677bc.snappy.parquet
-rw-r--r--   2 hadoop supergroup 17.1 K 2016-11-10 15:11 /parquetdata/weixin/biztags/biztag2/part-r-00174-0f61afe4-23e8-40bb-b30b-09652ca677bc.snappy.parquet
-rw-r--r--   2 hadoop supergroup 14.0 K 2016-11-10 15:11 /parquetdata/weixin/biztags/biztag2/part-r-00175-0f61afe4-23e8-40bb-b30b-09652ca677bc.snappy.parquet
-rw-r--r--   2 hadoop supergroup 15.7 K 2016-11-10 15:11 /parquetdata/weixin/biztags/biztag2/part-r-00176-0f61afe4-23e8-40bb-b30b-09652ca677bc
more and more...
2016-11-10

lk_spark