Hi,
thanks for answering.

With the *coalesce()* transformation a single worker is in charge of
writing to HDFS, but I noticed that this single write operation usually
takes too long, slowing down the whole computation (this is
particularly true when 'unified' consists of many partitions). Besides,
'coalesce' forces me to perform a further shuffle (the 'true' flag) so as
not to lose upstream parallelism (by the way, did I get that part
right?).
Am I wrong in thinking that having the driver do the writing would speed
things up, without the need to repartition the data?
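To make the idea concrete, here is a rough, untested sketch of what I have in mind. The `DriverWrite` class and `toTextFileBody` helper are just illustrative names I made up; the Spark/Hadoop part is shown in comments and assumes the 1.x Java API plus the Hadoop `FileSystem` client:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DriverWrite {
    // Pure helper: turn the collected (key, value) pairs into the text
    // body that would go into the HDFS file, one "key<TAB>value" per line.
    static String toTextFileBody(List<Map.Entry<Integer, String>> pairs) {
        return pairs.stream()
                .map(e -> e.getKey() + "\t" + e.getValue())
                .collect(Collectors.joining("\n"));
    }

    // Driver-side write sketch (needs Spark + Hadoop on the classpath):
    //
    // unified.foreachRDD(rdd -> {
    //     // collect() brings the whole batch into driver memory
    //     List<Tuple2<Integer, String>> elems = rdd.collect();
    //     FileSystem fs = FileSystem.get(new Configuration());
    //     try (FSDataOutputStream out = fs.create(new Path("<HDFS path>"))) {
    //         for (Tuple2<Integer, String> t : elems) {
    //             out.writeBytes(t._1() + "\t" + t._2() + "\n");
    //         }
    //     }
    //     return null;
    // });
}
```

The obvious caveat is that collect() ships the entire batch to the driver, so this only works if each batch comfortably fits in driver memory.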

I hope I have been clear; I am pretty new to Spark. :)

2015-09-11 18:19 GMT+02:00 Holden Karau <hol...@pigscanfly.ca>:

> A common practice to do this is to use foreachRDD with a local var to
> accumulate the data (you can see it in the Spark Streaming test code).
>
> That being said, I am a little curious why you want the driver to create
> the file specifically.
>
> On Friday, September 11, 2015, allonsy <luke1...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I have a JavaPairDStream<Integer, String> object and I'd like the Driver
>> to
>> create a txt file (on HDFS) containing all of its elements.
>>
>> At the moment, I use the /coalesce(1, true)/ method:
>>
>>
>> JavaPairDStream<Integer, String> unified = [partitioned stuff]
>> unified.foreachRDD(new Function<JavaPairRDD<Integer, String>, Void>() {
>>     public Void call(JavaPairRDD<Integer, String> arg0) throws Exception {
>>         arg0.coalesce(1, true).saveAsTextFile(<HDFS path>);
>>         return null;
>>     }
>> });
>>
>>
>> but this implies that a /single worker/ is taking all the data and writing
>> to HDFS, and that could be a major bottleneck.
>>
>> How could I replace the worker with the Driver? I read that /collect()/
>> might do this, but I haven't the slightest idea of how to implement it.
>>
>> Can anybody help me?
>>
>> Thanks in advance.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-collect-in-Spark-Streaming-tp24659.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
> Linked In: https://www.linkedin.com/in/holdenkarau
>
>
