I am trying to implement an application that requires the output to be
aggregated and stored as a single txt file on HDFS (instead of, for
instance, having 4 different txt files coming from my 4 workers).
The solution I used does the trick, but I can't tell whether it's OK to
regularly stress a single worker with all the writing. That is why I
thought about having the driver collect and store the data.
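
For reference, the driver-side version I have in mind is roughly the
following (untested sketch: it reuses the 'unified' stream from my snippet
below, the output path is a placeholder, and it assumes every batch fits
comfortably in driver memory):

import java.io.PrintWriter;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;

import scala.Tuple2;

// ...

unified.foreachRDD(new Function<JavaPairRDD<Integer, String>, Void>() {
    public Void call(JavaPairRDD<Integer, String> rdd) throws Exception {
        // collect() ships the whole batch back to the driver, so no single
        // worker does the write, but the driver must hold all the data.
        List<Tuple2<Integer, String>> elements = rdd.collect();

        // Write one plain text file per batch to HDFS from the driver.
        FileSystem fs = FileSystem.get(new Configuration());
        PrintWriter out = new PrintWriter(
            fs.create(new Path("/output/batch-" + System.currentTimeMillis() + ".txt")));
        try {
            for (Tuple2<Integer, String> t : elements) {
                out.println(t._1() + "\t" + t._2());
            }
        } finally {
            out.close();
        }
        return null;
    }
});

This writes a separate file per batch; if I really need one single file for
the whole stream, the driver would instead have to keep appending to the
same open HDFS file across batches.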

Thanks for your patience and for your help, anyway. :)

2015-09-11 19:00 GMT+02:00 Holden Karau <hol...@pigscanfly.ca>:

> Having the driver write the data instead of a worker probably won't speed
> things up; you still need to copy all of the data to a single node. Is there
> something which forces you to only write from a single node?
>
>
> On Friday, September 11, 2015, Luca <luke1...@gmail.com> wrote:
>
>> Hi,
>> thanks for answering.
>>
>> With the *coalesce()* transformation a single worker is in charge of
>> writing to HDFS, but I noticed that this single write operation usually
>> takes too much time and slows down the whole computation (this is
>> particularly true when 'unified' is made of several partitions). Besides,
>> 'coalesce' forces me to pass the 'true' (shuffle) flag, so that I do not
>> lose upstream parallelism (by the way, did I get this part right?).
>> Am I wrong in thinking that having the driver do the writing would speed
>> things up, without the need to repartition the data?
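>>
>> In other words, my understanding is roughly this (just a sketch, assuming
>> 'rdd' is one of the JavaPairRDD<Integer, String> batches from foreachRDD):
>>
>> // Without shuffle: the single output partition drags the upstream
>> // transformations into one task, so parallelism is lost.
>> JavaPairRDD<Integer, String> narrow = rdd.coalesce(1);
>> // With shuffle (same as repartition(1)): upstream work keeps its
>> // parallelism and only the final write runs as a single task.
>> JavaPairRDD<Integer, String> single = rdd.coalesce(1, true);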
>>
>> Hope I have been clear, I am pretty new to Spark. :)
>>
>> 2015-09-11 18:19 GMT+02:00 Holden Karau <hol...@pigscanfly.ca>:
>>
>>> A common practice to do this is to use foreachRDD with a local var to
>>> accumulate the data (you can see it in the Spark Streaming test code).
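>>>
>>> Roughly something like this (untested sketch, not the exact test code; it
>>> needs java.util.ArrayList/List and scala.Tuple2 on top of the imports you
>>> already have, and it only works while the batches stay small):
>>>
>>> final List<Tuple2<Integer, String>> accumulated =
>>>     new ArrayList<Tuple2<Integer, String>>();
>>> unified.foreachRDD(new Function<JavaPairRDD<Integer, String>, Void>() {
>>>   public Void call(JavaPairRDD<Integer, String> rdd) throws Exception {
>>>     // collect() brings each batch back to the driver, where the
>>>     // local list keeps accumulating it across batches.
>>>     accumulated.addAll(rdd.collect());
>>>     return null;
>>>   }
>>> });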
>>>
>>> That being said, I am a little curious why you want the driver to create
>>> the file specifically.
>>>
>>> On Friday, September 11, 2015, allonsy <luke1...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I have a JavaPairDStream<Integer, String> object and I'd like the
>>>> Driver to
>>>> create a txt file (on HDFS) containing all of its elements.
>>>>
>>>> At the moment, I use the coalesce(1, true) method:
>>>>
>>>>
>>>> JavaPairDStream<Integer, String> unified = [partitioned stuff]
>>>> unified.foreachRDD(new Function<JavaPairRDD<Integer, String>, Void>() {
>>>>     public Void call(JavaPairRDD<Integer, String> arg0) throws Exception {
>>>>         arg0.coalesce(1, true).saveAsTextFile(<HDFS path>);
>>>>         return null;
>>>>     }
>>>> });
>>>>
>>>>
>>>> but this implies that a single worker is taking all the data and
>>>> writing it to HDFS, and that could be a major bottleneck.
>>>>
>>>> How could I replace the worker with the Driver? I read that collect()
>>>> might do this, but I haven't the slightest idea how to implement it.
>>>>
>>>> Can anybody help me?
>>>>
>>>> Thanks in advance.
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-collect-in-Spark-Streaming-tp24659.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>> Linked In: https://www.linkedin.com/in/holdenkarau
>>>
>>>
>>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
> Linked In: https://www.linkedin.com/in/holdenkarau
>
>
