Help with collect() in Spark Streaming

Holden Karau Fri, 11 Sep 2015 09:20:13 -0700

A common way to do this is to use foreachRDD with a local variable on the
driver to accumulate the data (you can see this in the Spark Streaming test
code).
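
As a rough illustration of that pattern, here is a minimal sketch of the
driver-side half only. DriverBatchSink is a hypothetical helper, not a Spark
class, and the block uses only the Java standard library. The foreachRDD
wiring appears as a comment and assumes the Spark 1.x Java API used in the
quoted code below. It writes to a local path; writing to HDFS from the driver
would need the Hadoop FileSystem API instead, which is not shown.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical driver-side sink (not part of Spark). It only holds elements
// that have already been brought back to the driver.
class DriverBatchSink {
    // The "local var" that accumulates elements across batches.
    private final List<String> lines = new ArrayList<>();

    // Record one (key, value) element as a tab-separated line.
    // Synchronized so other driver threads can read a consistent snapshot.
    synchronized void add(int key, String value) {
        lines.add(key + "\t" + value);
    }

    // Copy of everything accumulated so far.
    synchronized List<String> snapshot() {
        return new ArrayList<>(lines);
    }

    // All accumulated lines joined with newlines.
    synchronized String render() {
        return String.join("\n", lines);
    }

    // Write the accumulated lines to a file on the driver's local file system.
    synchronized void writeTo(Path target) throws IOException {
        Files.write(target, lines, StandardCharsets.UTF_8);
    }
}

// Assumed wiring (Spark 1.x Java API, not compiled as part of this sketch):
//
//   final DriverBatchSink sink = new DriverBatchSink();
//   unified.foreachRDD(new Function<JavaPairRDD<Integer, String>, Void>() {
//       public Void call(JavaPairRDD<Integer, String> rdd) {
//           // collect() ships the whole batch to the driver.
//           for (Tuple2<Integer, String> t : rdd.collect()) {
//               sink.add(t._1(), t._2());
//           }
//           return null;
//       }
//   });
```

Note that collect() pulls every element of each batch into driver memory, so
this only makes sense when the per-batch data is small.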


That being said, I am a little curious why you want the driver to create
the file specifically.

On Friday, September 11, 2015, allonsy <[email protected]> wrote:

> Hi everyone,
>
> I have a JavaPairDStream<Integer, String> object and I'd like the Driver to
> create a txt file (on HDFS) containing all of its elements.
>
> At the moment, I use the /coalesce(1, true)/ method:
>
>
> JavaPairDStream<Integer, String> unified = [partitioned stuff]
> unified.foreachRDD(new Function<JavaPairRDD<Integer, String>, Void>() {
>     public Void call(JavaPairRDD<Integer, String> arg0) throws Exception {
>         arg0.coalesce(1, true).saveAsTextFile(<HDFS path>);
>         return null;
>     }
> });
>
>
> but this implies that a /single worker/ is taking all the data and writing
> to HDFS, and that could be a major bottleneck.
>
> How could I replace the worker with the Driver? I read that /collect()/
> might do this, but I haven't the slightest idea on how to implement it.
>
> Can anybody help me?
>
> Thanks in advance.

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau
