I am trying to implement an application that requires the output to be aggregated and stored as a single txt file on HDFS (instead of, for instance, having 4 different txt files coming from my 4 workers). The solution I used does the trick, but I can't tell whether it's OK to regularly stress one of the workers with the writing. That is why I thought about having the driver collect and store the data.
Thanks for your patience and for your help, anyway. :)

2015-09-11 19:00 GMT+02:00 Holden Karau <hol...@pigscanfly.ca>:

> Having the driver write the data instead of a worker probably won't speed
> it up; you still need to copy all of the data to a single node. Is there
> something which forces you to only write from a single node?
>
> On Friday, September 11, 2015, Luca <luke1...@gmail.com> wrote:
>
>> Hi,
>> thanks for answering.
>>
>> With the *coalesce()* transformation a single worker is in charge of
>> writing to HDFS, but I noticed that the single write operation usually
>> takes too much time, slowing down the whole computation (this is
>> particularly true when 'unified' is made of several partitions). Besides,
>> 'coalesce' forces me to perform a further repartitioning (the 'true' flag),
>> in order not to lose upstream parallelism (by the way, did I get this part
>> right?).
>> Am I wrong in thinking that having the driver do the writing will speed
>> things up, without the need of repartitioning the data?
>>
>> Hope I have been clear; I am pretty new to Spark. :)
>>
>> 2015-09-11 18:19 GMT+02:00 Holden Karau <hol...@pigscanfly.ca>:
>>
>>> A common practice to do this is to use foreachRDD with a local var to
>>> accumulate the data (you can see it in the Spark Streaming test code).
>>>
>>> That being said, I am a little curious why you want the driver to create
>>> the file specifically.
>>>
>>> On Friday, September 11, 2015, allonsy <luke1...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I have a JavaPairDStream<Integer, String> object and I'd like the
>>>> Driver to create a txt file (on HDFS) containing all of its elements.
>>>>
>>>> At the moment, I use the coalesce(1, true) method:
>>>>
>>>> JavaPairDStream<Integer, String> unified = [partitioned stuff]
>>>> unified.foreachRDD(new Function<JavaPairRDD<Integer, String>, Void>() {
>>>>     public Void call(JavaPairRDD<Integer, String> arg0) throws Exception {
>>>>         arg0.coalesce(1, true).saveAsTextFile(<HDFS path>);
>>>>         return null;
>>>>     }
>>>> });
>>>>
>>>> but this implies that a single worker is taking all the data and
>>>> writing to HDFS, and that could be a major bottleneck.
>>>>
>>>> How could I replace the worker with the Driver? I read that collect()
>>>> might do this, but I don't have the slightest idea of how to implement it.
>>>>
>>>> Can anybody help me?
>>>>
>>>> Thanks in advance.
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-collect-in-Spark-Streaming-tp24659.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>> Linked In: https://www.linkedin.com/in/holdenkarau
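[Editor's note: the driver-side idea discussed in this thread — call collect() on the RDD so the driver receives all records, then perform a single write — cannot be shown runnable here without a Spark cluster. The following minimal, self-contained Java sketch only simulates the pattern with the JDK standard library: each inner list stands in for one partition's records as collect() would return them, and the driver merges them into one file. The class name DriverMerge, the method mergePartitions, and the sample records are hypothetical; a real Spark Streaming job would do this inside foreachRDD and write through Hadoop's FileSystem API to HDFS rather than to the local filesystem.]

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class DriverMerge {

    // Concatenate every partition's records, in partition order,
    // into a single newline-terminated string (the "one txt file" body).
    static String mergePartitions(List<List<String>> partitions) {
        StringBuilder sb = new StringBuilder();
        for (List<String> partition : partitions) {
            for (String record : partition) {
                sb.append(record).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for collect(): what 4 workers' partitions might hand the driver.
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("1\talpha", "2\tbeta"),
                Arrays.asList("3\tgamma"),
                Collections.<String>emptyList(),
                Arrays.asList("4\tdelta"));

        // One write from the driver instead of four part-files from workers.
        Path out = Files.createTempFile("unified", ".txt");
        Files.write(out, mergePartitions(partitions).getBytes());
        System.out.println("Wrote " + Files.size(out) + " bytes to " + out);
    }
}
```

Note the trade-off Holden points out above still applies: whether a single worker (coalesce(1)) or the driver (collect()) does the writing, all the data must funnel through one JVM, so this removes the extra shuffle but not the single-node bottleneck, and collect() additionally requires the whole RDD to fit in driver memory.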