Re: save spark streaming output to single file on hdfs
Hi,

You can use the FileUtil.copyMerge API, pointing it at the folder where saveAsTextFile wrote the part files. Suppose your directory is /a/b/c/; call

  FileUtil.copyMerge(srcFS, new Path("/a/b/c"), dstFS, new Path("/a/b/c.txt"), true, conf)

where srcFS and dstFS are the source and destination FileSystems, the fourth argument is the path to the merged output file (say /a/b/c.txt), true deletes the original directory after the merge, and conf is the Hadoop Configuration.

Thanks.

On Tue, Jan 13, 2015 at 11:34 PM, jamborta [via Apache Spark User List] wrote:
> Hi all,
>
> Is there a way to save dstream RDDs to a single file so that another
> process can pick it up as a single RDD?
>
> It seems that each slice is saved to a separate folder, using the
> saveAsTextFiles method.
>
> I'm using spark 1.2 with pyspark.
>
> thanks,

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-tp21124p21167.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
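FileUtil.copyMerge is part of the Hadoop Java API rather than pyspark itself, so from Python you would normally reach it through the JVM gateway or a small Scala/Java job. As a rough illustration of what the call does, here is a minimal pure-Python sketch of the same semantics against a local filesystem; the helper name copy_merge is ours, and a real HDFS merge should still go through the Hadoop API:

```python
import os
import shutil


def copy_merge(src_dir, dst_file, delete_source=False):
    """Concatenate the part-* files in src_dir into dst_file,
    mimicking what Hadoop's FileUtil.copyMerge does (local FS only)."""
    # saveAsTextFile writes part-00000, part-00001, ...; merge them in order.
    parts = sorted(f for f in os.listdir(src_dir) if f.startswith("part-"))
    with open(dst_file, "wb") as out:
        for name in parts:
            with open(os.path.join(src_dir, name), "rb") as part:
                shutil.copyfileobj(part, out)
    # copyMerge's deleteSource flag removes the original directory.
    if delete_source:
        shutil.rmtree(src_dir)
```

On a real cluster the equivalent is the Hadoop call shown above, with the source and destination FileSystems obtained from the job's Configuration.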
Re: save spark streaming output to single file on hdfs
Thanks for the replies. Very useful.
Re: save spark streaming output to single file on hdfs
Thanks. The problem is that we'd like it to be picked up by Hive.

On Tue Jan 13 2015 at 18:15:15, Davies Liu wrote:
> On Tue, Jan 13, 2015 at 10:04 AM, jamborta wrote:
>> Hi all,
>>
>> Is there a way to save dstream RDDs to a single file so that another
>> process can pick it up as a single RDD?
>
> It does not need to be a single file; Spark can pick up any directory as
> a single RDD. Also, it's easy to union multiple RDDs into a single one.
>
>> It seems that each slice is saved to a separate folder, using the
>> saveAsTextFiles method.
>>
>> I'm using spark 1.2 with pyspark.
>>
>> thanks,
Re: save spark streaming output to single file on hdfs
On Tue, Jan 13, 2015 at 10:04 AM, jamborta wrote:
> Hi all,
>
> Is there a way to save dstream RDDs to a single file so that another
> process can pick it up as a single RDD?

It does not need to be a single file; Spark can pick up any directory as a single RDD. Also, it's easy to union multiple RDDs into a single one.

> It seems that each slice is saved to a separate folder, using the
> saveAsTextFiles method.
>
> I'm using spark 1.2 with pyspark.
>
> thanks,
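To illustrate the point above: sc.textFile("/a/b/c") reads every part file under a directory as one RDD, and sc.union combines several RDDs into one. A minimal local-filesystem sketch of both behaviors (plain Python standing in for Spark; the helper names are ours):

```python
import glob
import os
from itertools import chain


def read_dir_as_one(dir_path):
    """Read every part-* file under dir_path as a single list of lines,
    the way sc.textFile("/a/b/c") treats a saveAsTextFiles output
    directory as one RDD."""
    lines = []
    for name in sorted(glob.glob(os.path.join(dir_path, "part-*"))):
        with open(name) as f:
            lines.extend(line.rstrip("\n") for line in f)
    return lines


def union(*datasets):
    """Concatenate several datasets into one, like sc.union(rdds)."""
    return list(chain.from_iterable(datasets))
```

In actual pyspark the same two ideas are one-liners: sc.textFile(output_dir) and sc.union([rdd1, rdd2]).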
Re: save spark streaming output to single file on hdfs
Right now, you can't. You could load each file as a partition into Hive, or pack the files together with other tools or a Spark job.

On Tue, Jan 13, 2015 at 10:35 AM, Tamas Jambor wrote:
> Thanks. The problem is that we'd like it to be picked up by Hive.
>
> On Tue Jan 13 2015 at 18:15:15, Davies Liu wrote:
>> It does not need to be a single file; Spark can pick up any directory
>> as a single RDD. Also, it's easy to union multiple RDDs into a single
>> one.
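One common way to let Hive pick up each streaming batch directory without merging is to register the directory as a partition of an external table. A small sketch that only builds the HiveQL statement (the table name, partition column, and date format are hypothetical; you would run the result through the Hive CLI or a HiveServer2 client):

```python
def add_partition_ddl(table, dt, location):
    """Build the HiveQL that registers one batch output directory
    as a partition of an external Hive table."""
    return (
        "ALTER TABLE {t} ADD IF NOT EXISTS "
        "PARTITION (dt='{d}') LOCATION '{l}'"
    ).format(t=table, d=dt, l=location)
```

For example, add_partition_ddl("events", "2015-01-13", "/a/b/c") yields a statement that makes Hive scan every part file under /a/b/c as one partition, which matches the "directory as a single dataset" behavior described above.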