Re: Help with collect() in Spark Streaming

2015-09-12 Thread Luca
I am trying to implement an application that requires the output to be
aggregated and stored as a single txt file to HDFS (instead of, for
instance, having 4 different txt files coming from my 4 workers).
The solution I used does the trick, but I can't tell whether it is OK to
regularly put that write load on a single worker. That is why I thought
about having the driver collect and store the data.
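For what it's worth, the driver-side pattern in question — gather every
partition's records in one place, then perform a single write — can be
sketched without any Spark machinery as follows. All names here are
hypothetical and the local file I/O stands in for the HDFS client; a real
job would call JavaPairRDD.collect() on the driver instead of the
hand-rolled collect() below:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DriverSideWrite {
    // Stand-in for rdd.collect(): flatten the per-partition record lists
    // into one driver-local list.
    static List<String> collect(List<List<String>> partitions) {
        List<String> all = new ArrayList<>();
        for (List<String> p : partitions) {
            all.addAll(p);
        }
        return all;
    }

    public static void main(String[] args) throws IOException {
        // Four "workers", each holding a slice of the output.
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a\t1", "b\t2"),
                Arrays.asList("c\t3"),
                Arrays.asList("d\t4", "e\t5"),
                Arrays.asList("f\t6"));

        // The driver gathers everything and performs the single write.
        Path out = Files.createTempFile("unified", ".txt");
        Files.write(out, collect(partitions));

        System.out.println(Files.readAllLines(out).size()); // 6
    }
}
```

The caveat from the reply below still applies: all the data funnels
through one JVM either way, and with collect() it must additionally fit
in driver memory.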

Thanks for your patience and for your help, anyway. :)

2015-09-11 19:00 GMT+02:00 Holden Karau :

> Having the driver write the data instead of a worker probably won't speed
> it up; you still need to copy all of the data to a single node. Is there
> something which forces you to only write from a single node?
>
>
> On Friday, September 11, 2015, Luca  wrote:
>
>> Hi,
>> thanks for answering.
>>
>> With the *coalesce()* transformation a single worker is in charge of
>> writing to HDFS, but I noticed that the single write operation usually
>> takes too much time, slowing down the whole computation (this is
>> particularly true when 'unified' is made of several partitions). Besides,
>> 'coalesce' forces me to perform a further repartitioning ('true' flag), in
>> order not to lose upstream parallelism (by the way, did I get this part
>> right?).
>> Am I wrong in thinking that having the driver do the writing will speed
>> things up, without the need of repartitioning data?
>>
>> Hope I have been clear, I am pretty new to Spark. :)
>>
>> 2015-09-11 18:19 GMT+02:00 Holden Karau :
>>
>>> A common practice to do this is to use foreachRDD with a local var to
>>> accumulate the data (you can see it in the Spark Streaming test code).
>>>
>>> That being said, I am a little curious why you want the driver to create
>>> the file specifically.
>>>
>>> On Friday, September 11, 2015, allonsy  wrote:
>>>
 Hi everyone,

 I have a JavaPairDStream<String, String> object and I'd like the
 Driver to create a txt file (on HDFS) containing all of its elements.

 At the moment, I use the /coalesce(1, true)/ method:


 JavaPairDStream<String, String> unified = [partitioned stuff]
 unified.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
     public Void call(JavaPairRDD<String, String> arg0) throws Exception {
         arg0.coalesce(1, true).saveAsTextFile([output path]);
         return null;
     }
 });


 but this implies that a /single worker/ is taking all the data and
 writing
 to HDFS, and that could be a major bottleneck.

 How could I replace the worker with the Driver? I read that /collect()/
 might do this, but I haven't the slightest idea on how to implement it.

 Can anybody help me?

 Thanks in advance.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-collect-in-Spark-Streaming-tp24659.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>> Linked In: https://www.linkedin.com/in/holdenkarau
>>>
>>>
>>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
> Linked In: https://www.linkedin.com/in/holdenkarau
>
>

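Holden's "foreachRDD with a local var" suggestion above amounts to
accumulating each micro-batch's (collected) records into a driver-local
buffer inside the per-batch callback, then writing once. A minimal
Spark-free simulation of that control flow — batch contents, class and
method names are all hypothetical, and foreachBatch stands in for
DStream.foreachRDD:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class LocalVarAccumulator {
    // Stand-in for DStream.foreachRDD: invoke the callback once per batch.
    static void foreachBatch(List<List<String>> batches,
                             Consumer<List<String>> action) {
        for (List<String> batch : batches) {
            action.accept(batch);
        }
    }

    public static void main(String[] args) {
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("k1\tv1", "k2\tv2"),
                Arrays.asList("k3\tv3"));

        // Driver-local accumulator, closed over by the per-batch callback.
        List<String> accumulated = new ArrayList<>();
        foreachBatch(batches, batch -> accumulated.addAll(batch));

        // After the stream ends, one write of the whole buffer would follow.
        System.out.println(accumulated.size()); // 3
    }
}
```

In real Spark Streaming code the callback body would be something like
rdd.collect() feeding the buffer, with the same memory caveat as any
collect()-based approach.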