Re: How to write a RDD into One Local Existing File?

2014-10-20 Thread Akhil Das
If you don't need part-xxx files in the output but a single file, then you can
repartition (or coalesce) the RDD to 1 partition. This will be a bottleneck,
since you are disabling parallelism - it's like giving everything to one
machine to process. You are better off letting Spark write the part-xxx files
to HDFS and merging them afterwards (use hadoop fs -getmerge).
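
A minimal sketch of both options, with illustrative paths (not from the
original job):

  // Option 1: force a single output file (single-writer bottleneck)
  val rdd = sc.textFile("hdfs:///data/input")
  rdd.coalesce(1).saveAsTextFile("hdfs:///data/output-single")  // one part-00000

  // Option 2: keep the parallel write, then merge the part files afterwards:
  //   hadoop fs -getmerge hdfs:///data/output /tmp/merged.txt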

Thanks
Best Regards

On Mon, Oct 20, 2014 at 10:01 AM, Rishi Yadav ri...@infoobjects.com wrote:

 Write to HDFS and then get one file locally by using hdfs dfs -getmerge...


 On Friday, October 17, 2014, Sean Owen so...@cloudera.com wrote:

 You can save to a local file. What are you trying and what doesn't work?

 You can output one file by repartitioning to 1 partition but this is
 probably not a good idea as you are bottlenecking the output and some
 upstream computation by disabling parallelism.

 How about just combining the files on HDFS afterwards? Or just reading all
 the files instead of one? You can hdfs dfs -cat a bunch of files at once.

 On Fri, Oct 17, 2014 at 6:46 PM, Parthus peng.wei@gmail.com wrote:
  Hi,

  I have a Spark MapReduce task which requires me to write the final RDD to an
  existing local file (appending to this file). I tried two ways, but neither
  works well:

  1. Use the saveAsTextFile() API. Spark 1.1.0 claims that this API can write
  to the local file system, but I never got it to work. Moreover, the result is
  not one file but a series of part-x files, which is not what I hope to get.

  2. Collect the RDD to an array and write it to the driver node using Java's
  file I/O. There are two problems with this: 1) my RDD is huge (1 TB) and
  cannot fit into the memory of one driver node, so I have to split the task
  into small pieces and collect and write them part by part; 2) while the Java
  I/O write is running, the Spark MapReduce task has to wait, which is not
  efficient.

  Could anybody provide an efficient way to solve this problem? I wish the
  solution could be something like: appending a huge RDD to a local file
  without pausing the MapReduce job during the write.
 
 
 
 
 
 
 
 




 --
 - Rishi



Re: How to write a RDD into One Local Existing File?

2014-10-19 Thread Rishi Yadav
Write to HDFS and then get one file locally by using hdfs dfs -getmerge...
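
For example, the merge step might look like this (paths are illustrative):

  hdfs dfs -getmerge /user/you/spark-output /tmp/merged.txt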

On Friday, October 17, 2014, Sean Owen so...@cloudera.com wrote:

 You can save to a local file. What are you trying and what doesn't work?

 You can output one file by repartitioning to 1 partition but this is
 probably not a good idea as you are bottlenecking the output and some
 upstream computation by disabling parallelism.

 How about just combining the files on HDFS afterwards? Or just reading all
 the files instead of one? You can hdfs dfs -cat a bunch of files at once.

 On Fri, Oct 17, 2014 at 6:46 PM, Parthus peng.wei@gmail.com wrote:
  Hi,

  I have a Spark MapReduce task which requires me to write the final RDD to an
  existing local file (appending to this file). I tried two ways, but neither
  works well:

  1. Use the saveAsTextFile() API. Spark 1.1.0 claims that this API can write
  to the local file system, but I never got it to work. Moreover, the result is
  not one file but a series of part-x files, which is not what I hope to get.

  2. Collect the RDD to an array and write it to the driver node using Java's
  file I/O. There are two problems with this: 1) my RDD is huge (1 TB) and
  cannot fit into the memory of one driver node, so I have to split the task
  into small pieces and collect and write them part by part; 2) while the Java
  I/O write is running, the Spark MapReduce task has to wait, which is not
  efficient.

  Could anybody provide an efficient way to solve this problem? I wish the
  solution could be something like: appending a huge RDD to a local file
  without pausing the MapReduce job during the write.
 
 
 
 
 
 
 




-- 
- Rishi


Re: How to write a RDD into One Local Existing File?

2014-10-17 Thread Sean Owen
You can save to a local file. What are you trying and what doesn't work?

You can output one file by repartitioning to 1 partition but this is
probably not a good idea as you are bottlenecking the output and some
upstream computation by disabling parallelism.

How about just combining the files on HDFS afterwards? Or just reading all
the files instead of one? You can hdfs dfs -cat a bunch of files at once.
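
For example, with illustrative paths, the part files can be concatenated into
one local file without repartitioning:

  # stream every part file to stdout and redirect into a single local file
  hdfs dfs -cat /user/you/spark-output/part-* > /tmp/merged.txt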

On Fri, Oct 17, 2014 at 6:46 PM, Parthus peng.wei@gmail.com wrote:
 Hi,

 I have a Spark MapReduce task which requires me to write the final RDD to an
 existing local file (appending to this file). I tried two ways, but neither
 works well:

 1. Use the saveAsTextFile() API. Spark 1.1.0 claims that this API can write
 to the local file system, but I never got it to work. Moreover, the result is
 not one file but a series of part-x files, which is not what I hope to get.

 2. Collect the RDD to an array and write it to the driver node using Java's
 file I/O. There are two problems with this: 1) my RDD is huge (1 TB) and
 cannot fit into the memory of one driver node, so I have to split the task
 into small pieces and collect and write them part by part; 2) while the Java
 I/O write is running, the Spark MapReduce task has to wait, which is not
 efficient.

 Could anybody provide an efficient way to solve this problem? I wish the
 solution could be something like: appending a huge RDD to a local file
 without pausing the MapReduce job during the write.






 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-tp16720.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
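
As for the second approach in the quoted question (collecting the RDD on the
driver), one way to avoid holding the whole 1 TB in driver memory is
RDD.toLocalIterator, which pulls one partition at a time to the driver. A
minimal, untested sketch with illustrative paths:

  import java.io.{BufferedWriter, FileWriter}

  val rdd = sc.textFile("hdfs:///data/input")  // the large RDD[String]
  // open the existing local file in append mode
  val writer = new BufferedWriter(new FileWriter("/local/path/existing.txt", true))
  try {
    // toLocalIterator runs one job per partition and only keeps the current
    // partition's data on the driver
    rdd.toLocalIterator.foreach { line =>
      writer.write(line)
      writer.newLine()
    }
  } finally {
    writer.close()
  }

Note that this still funnels every record through the driver, so the write is
serialized there; it only removes the need to collect the full RDD at once.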


