A common practice to do this is to use foreachRDD with a local var to accumulate the data (you can see it in the Spark Streaming test code).
That being said, I am a little curious why you want the driver to create the file specifically. On Friday, September 11, 2015, allonsy <[email protected] <javascript:_e(%7B%7D,'cvml','[email protected]');>> wrote: > Hi everyone, > > I have a JavaPairDStream<Integer, String> object and I'd like the Driver to > create a txt file (on HDFS) containing all of its elements. > > At the moment, I use the /coalesce(1, true)/ method: > > > JavaPairDStream<Integer, String> unified = [partitioned stuff] > unified.foreachRDD(new Function<JavaPairRDD<Integer, String>, Void>() { > public Void call(JavaPairRDD<Integer, > String> arg0) throws Exception { > arg0.coalesce(1, > true).saveAsTextFile(<HDFS path>); > return null; > } > }); > > > but this implies that a /single worker/ is taking all the data and writing > to HDFS, and that could be a major bottleneck. > > How could I replace the worker with the Driver? I read that /collect()/ > might do this, but I haven't the slightest idea on how to implement it. > > Can anybody help me? > > Thanks in advance. > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-collect-in-Spark-Streaming-tp24659.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > -- Cell : 425-233-8271 Twitter: https://twitter.com/holdenkarau Linked In: https://www.linkedin.com/in/holdenkarau
