Re: How to write a RDD into One Local Existing File?
If you don't need part-xxx files in the output but a single file, then you should repartition (or coalesce) the RDD into 1 partition. This will be a bottleneck, since you are disabling parallelism - it's like giving everything to one machine to process. You are better off merging those part-xxx files afterwards in HDFS (use hadoop fs -getmerge).

Thanks
Best Regards

On Mon, Oct 20, 2014 at 10:01 AM, Rishi Yadav ri...@infoobjects.com wrote:

> Write to hdfs and then get one file locally by using hdfs dfs -getmerge...
>
> On Friday, October 17, 2014, Sean Owen so...@cloudera.com wrote:
>
>> You can save to a local file. What are you trying, and what doesn't work?
>>
>> You can output one file by repartitioning to 1 partition, but this is probably not a good idea, as you are bottlenecking the output and some upstream computation by disabling parallelism. How about just combining the files on HDFS afterwards? Or just reading all the files instead of one? You can hdfs dfs -cat a bunch of files at once.
>>
>> On Fri, Oct 17, 2014 at 6:46 PM, Parthus peng.wei@gmail.com wrote:
>>
>>> Hi,
>>>
>>> I have a Spark MapReduce task which requires me to write the final RDD to an existing local file (appending to this file). I tried two ways, but neither works well:
>>>
>>> 1. Use the saveAsTextFile() API. Spark 1.1.0 claims that this API can write to local files, but I never made it work. Moreover, the result is not one file but a series of part-x files, which is not what I hope to get.
>>>
>>> 2. Collect the RDD to an array and write it to the driver node using Java's file IO. There are also two problems: 1) my RDD is huge (1 TB), which cannot fit into the memory of one driver node, so I have to split the task into small pieces, collect them part by part, and write them out; 2) during the writing via Java IO, the Spark MapReduce task has to wait, which is not efficient.
>>>
>>> Could anybody provide an efficient way to solve this problem? I wish the solution could be: appending a huge RDD to a local file without pausing the MapReduce job during the write.
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-tp16720.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> --
> - Rishi
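The getmerge step suggested above is just an ordered concatenation of the part files. A minimal local sketch of that behavior, not from the thread itself (Python for illustration; the function name and paths are hypothetical, and the output directory is assumed to have already been copied to the local filesystem):

```python
# Sketch: merge Spark "part-xxxxx" output files into one local file,
# mimicking what `hadoop fs -getmerge` does for a local directory.
# merge_part_files and its paths are hypothetical placeholders.
import glob
import os
import shutil

def merge_part_files(output_dir, dest_path):
    """Append every part-* file (in filename order) to dest_path."""
    parts = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    # "ab" opens in append mode, so an existing destination file is
    # extended rather than truncated - matching the original poster's
    # requirement of appending to an existing file.
    with open(dest_path, "ab") as dest:
        for part in parts:
            with open(part, "rb") as src:
                # copyfileobj streams in buffered chunks; the part files
                # are never loaded into memory whole.
                shutil.copyfileobj(src, dest)
    return parts
```

Because the copy is streamed, this works for output far larger than driver memory; the ordering relies only on the part files sorting lexicographically, which Spark's zero-padded part numbers guarantee.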
Re: How to write a RDD into One Local Existing File?
Write to hdfs and then get one file locally by using hdfs dfs -getmerge...

On Friday, October 17, 2014, Sean Owen so...@cloudera.com wrote:

> You can save to a local file. What are you trying, and what doesn't work?
>
> You can output one file by repartitioning to 1 partition, but this is probably not a good idea, as you are bottlenecking the output and some upstream computation by disabling parallelism. How about just combining the files on HDFS afterwards? Or just reading all the files instead of one? You can hdfs dfs -cat a bunch of files at once.
>
> On Fri, Oct 17, 2014 at 6:46 PM, Parthus peng.wei@gmail.com wrote:
>
>> Hi,
>>
>> I have a Spark MapReduce task which requires me to write the final RDD to an existing local file (appending to this file). I tried two ways, but neither works well:
>>
>> 1. Use the saveAsTextFile() API. Spark 1.1.0 claims that this API can write to local files, but I never made it work. Moreover, the result is not one file but a series of part-x files, which is not what I hope to get.
>>
>> 2. Collect the RDD to an array and write it to the driver node using Java's file IO. There are also two problems: 1) my RDD is huge (1 TB), which cannot fit into the memory of one driver node, so I have to split the task into small pieces, collect them part by part, and write them out; 2) during the writing via Java IO, the Spark MapReduce task has to wait, which is not efficient.
>>
>> Could anybody provide an efficient way to solve this problem? I wish the solution could be: appending a huge RDD to a local file without pausing the MapReduce job during the write.

--
- Rishi
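The "collect part by part and write" workaround described in the quoted question boils down to appending one bounded chunk at a time, so only a single chunk is ever in memory. A local sketch of that pattern, not taken from the thread (Python for illustration; the chunk iterator stands in for results fetched piece by piece from the cluster, and all names are hypothetical):

```python
# Sketch: append data to an existing local file chunk by chunk, so only
# one chunk is held in memory at a time - the piecewise-collect pattern
# the original poster describes for a dataset too large to collect at once.
# append_chunks and its arguments are hypothetical placeholders.

def append_chunks(chunks, dest_path):
    """Append each chunk of lines to dest_path; return total lines written."""
    written = 0
    # "a" opens in append mode, so the existing file is extended, not truncated.
    with open(dest_path, "a", encoding="utf-8") as dest:
        for chunk in chunks:          # one chunk in memory at a time
            for line in chunk:
                dest.write(line + "\n")
                written += 1
    return written
```

This still serializes the write through one process, which is the efficiency complaint in the question; it only bounds memory, it does not restore parallelism.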
Re: How to write a RDD into One Local Existing File?
You can save to a local file. What are you trying, and what doesn't work?

You can output one file by repartitioning to 1 partition, but this is probably not a good idea, as you are bottlenecking the output and some upstream computation by disabling parallelism. How about just combining the files on HDFS afterwards? Or just reading all the files instead of one? You can hdfs dfs -cat a bunch of files at once.

On Fri, Oct 17, 2014 at 6:46 PM, Parthus peng.wei@gmail.com wrote:

> Hi,
>
> I have a Spark MapReduce task which requires me to write the final RDD to an existing local file (appending to this file). I tried two ways, but neither works well:
>
> 1. Use the saveAsTextFile() API. Spark 1.1.0 claims that this API can write to local files, but I never made it work. Moreover, the result is not one file but a series of part-x files, which is not what I hope to get.
>
> 2. Collect the RDD to an array and write it to the driver node using Java's file IO. There are also two problems: 1) my RDD is huge (1 TB), which cannot fit into the memory of one driver node, so I have to split the task into small pieces, collect them part by part, and write them out; 2) during the writing via Java IO, the Spark MapReduce task has to wait, which is not efficient.
>
> Could anybody provide an efficient way to solve this problem? I wish the solution could be: appending a huge RDD to a local file without pausing the MapReduce job during the write.
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-tp16720.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
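Sean's hdfs dfs -cat suggestion has a simple local analogue: treat the part files as one logical stream of lines and never materialize a merged file at all. A sketch of that idea, not from the thread (Python for illustration; the function name is hypothetical):

```python
# Sketch: read many part files as one lazy stream of lines - the
# local-filesystem analogue of `hdfs dfs -cat part-*`, and an
# alternative to physically merging the output into a single file.
# cat_parts is a hypothetical placeholder name.
import glob
import itertools
import os

def cat_parts(output_dir):
    """Lazily yield lines from every part-* file, in filename order."""
    def lines(path):
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n")
    parts = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    # chain.from_iterable opens each file only when its turn comes,
    # so at most one file is open at a time.
    return itertools.chain.from_iterable(lines(p) for p in parts)
```

Since downstream consumers usually just need the records in order, reading the parts as one stream avoids both the single-partition bottleneck and the extra merge pass.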