Re: How to write a RDD into One Local Existing File?

2014-10-20 Thread Akhil Das
If you don't need part-xxx files in the output but a single file, then you can
repartition (or coalesce) the RDD to 1 partition. This will be a bottleneck,
since you are disabling parallelism - it's like giving everything to one
machine to process. You are better off letting Spark write the part-xxx files
to HDFS and merging them afterwards (use hadoop fs -getmerge).
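
A minimal sketch of both options, with illustrative paths (not from the
original job):

  // Option 1: force a single output file (single-writer bottleneck)
  val rdd = sc.textFile("hdfs:///data/input")
  rdd.coalesce(1).saveAsTextFile("hdfs:///data/output-single")  // one part-00000

  // Option 2: keep the parallel write, then merge the part files afterwards:
  //   hadoop fs -getmerge hdfs:///data/output /tmp/merged.txt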

Thanks
Best Regards

On Mon, Oct 20, 2014 at 10:01 AM, Rishi Yadav ri...@infoobjects.com wrote:

 Write to HDFS and then get one file locally by using hdfs dfs -getmerge...


 On Friday, October 17, 2014, Sean Owen so...@cloudera.com wrote:

 You can save to a local file. What are you trying and what doesn't work?

 You can output one file by repartitioning to 1 partition but this is
 probably not a good idea as you are bottlenecking the output and some
 upstream computation by disabling parallelism.

 How about just combining the files on HDFS afterwards? Or just reading all
 the files instead of one? You can hdfs dfs -cat a bunch of files at once.

 On Fri, Oct 17, 2014 at 6:46 PM, Parthus peng.wei@gmail.com wrote:
  Hi,

  I have a Spark MapReduce task which requires me to write the final RDD to an
  existing local file (appending to this file). I tried two ways, but neither
  works well:

  1. Use the saveAsTextFile() API. Spark 1.1.0 claims that this API can write
  to the local file system, but I never got it to work. Moreover, the result is
  not one file but a series of part-x files, which is not what I hope to get.

  2. Collect the RDD to an array and write it to the driver node using Java's
  file I/O. There are two problems with this: 1) my RDD is huge (1 TB) and
  cannot fit into the memory of one driver node, so I have to split the task
  into small pieces and collect and write them part by part; 2) while the Java
  I/O write is running, the Spark MapReduce task has to wait, which is not
  efficient.

  Could anybody provide an efficient way to solve this problem? I wish the
  solution could be something like: appending a huge RDD to a local file
  without pausing the MapReduce job during the write.
 
 
 
 
 
 
 
 




 --
 - Rishi



Re: How to write a RDD into One Local Existing File?

2014-10-19 Thread Rishi Yadav
Write to HDFS and then get one file locally by using hdfs dfs -getmerge...
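
For example, the merge step might look like this (paths are illustrative):

  hdfs dfs -getmerge /user/you/spark-output /tmp/merged.txt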

On Friday, October 17, 2014, Sean Owen so...@cloudera.com wrote:

 You can save to a local file. What are you trying and what doesn't work?

 You can output one file by repartitioning to 1 partition but this is
 probably not a good idea as you are bottlenecking the output and some
 upstream computation by disabling parallelism.

 How about just combining the files on HDFS afterwards? Or just reading all
 the files instead of one? You can hdfs dfs -cat a bunch of files at once.

 On Fri, Oct 17, 2014 at 6:46 PM, Parthus peng.wei@gmail.com wrote:
  Hi,

  I have a Spark MapReduce task which requires me to write the final RDD to an
  existing local file (appending to this file). I tried two ways, but neither
  works well:

  1. Use the saveAsTextFile() API. Spark 1.1.0 claims that this API can write
  to the local file system, but I never got it to work. Moreover, the result is
  not one file but a series of part-x files, which is not what I hope to get.

  2. Collect the RDD to an array and write it to the driver node using Java's
  file I/O. There are two problems with this: 1) my RDD is huge (1 TB) and
  cannot fit into the memory of one driver node, so I have to split the task
  into small pieces and collect and write them part by part; 2) while the Java
  I/O write is running, the Spark MapReduce task has to wait, which is not
  efficient.

  Could anybody provide an efficient way to solve this problem? I wish the
  solution could be something like: appending a huge RDD to a local file
  without pausing the MapReduce job during the write.
 
 
 
 
 
 
 




-- 
- Rishi


Re: How to write a RDD into One Local Existing File?

2014-10-17 Thread Sean Owen
You can save to a local file. What are you trying and what doesn't work?

You can output one file by repartitioning to 1 partition but this is
probably not a good idea as you are bottlenecking the output and some
upstream computation by disabling parallelism.

How about just combining the files on HDFS afterwards? Or just reading all
the files instead of one? You can hdfs dfs -cat a bunch of files at once.
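
For example, with illustrative paths, the part files can be concatenated into
one local file without repartitioning:

  # stream every part file to stdout and redirect into a single local file
  hdfs dfs -cat /user/you/spark-output/part-* > /tmp/merged.txt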

On Fri, Oct 17, 2014 at 6:46 PM, Parthus peng.wei@gmail.com wrote:
 Hi,

 I have a Spark MapReduce task which requires me to write the final RDD to an
 existing local file (appending to this file). I tried two ways, but neither
 works well:

 1. Use the saveAsTextFile() API. Spark 1.1.0 claims that this API can write
 to the local file system, but I never got it to work. Moreover, the result is
 not one file but a series of part-x files, which is not what I hope to get.

 2. Collect the RDD to an array and write it to the driver node using Java's
 file I/O. There are two problems with this: 1) my RDD is huge (1 TB) and
 cannot fit into the memory of one driver node, so I have to split the task
 into small pieces and collect and write them part by part; 2) while the Java
 I/O write is running, the Spark MapReduce task has to wait, which is not
 efficient.

 Could anybody provide an efficient way to solve this problem? I wish the
 solution could be something like: appending a huge RDD to a local file
 without pausing the MapReduce job during the write.






 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-tp16720.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
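
As for the second approach in the quoted question (collecting the RDD on the
driver), one way to avoid holding the whole 1 TB in driver memory is
RDD.toLocalIterator, which pulls one partition at a time to the driver. A
minimal, untested sketch with illustrative paths:

  import java.io.{BufferedWriter, FileWriter}

  val rdd = sc.textFile("hdfs:///data/input")  // the large RDD[String]
  // open the existing local file in append mode
  val writer = new BufferedWriter(new FileWriter("/local/path/existing.txt", true))
  try {
    // toLocalIterator runs one job per partition and only keeps the current
    // partition's data on the driver
    rdd.toLocalIterator.foreach { line =>
      writer.write(line)
      writer.newLine()
    }
  } finally {
    writer.close()
  }

Note that this still funnels every record through the driver, so the write is
serialized there; it only removes the need to collect the full RDD at once.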


