Re: Help with collect() in Spark Streaming

2015-09-12 Thread Luca
I am trying to implement an application that requires the output to be aggregated and stored in HDFS as a single txt file (instead of, for instance, 4 different txt files coming from my 4 workers). The solution I used does the trick, but I can't tell if it's ok to regularly stress one of
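A minimal sketch of the single-file pattern under discussion, assuming the aggregated stream is a `DStream[String]` named `unified` (the name used later in the thread); the host, port, and HDFS output path are illustrative placeholders, not from the original messages:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SingleFileOutput {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SingleFileOutput")
    val ssc = new StreamingContext(conf, Seconds(10))

    // 'unified' stands in for the aggregated DStream from the thread;
    // the socket source here is only a placeholder input.
    val unified = ssc.socketTextStream("localhost", 9999)

    unified.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        // coalesce(1) funnels every partition through one worker,
        // which then writes a single part file for the batch.
        rdd.coalesce(1).saveAsTextFile(s"hdfs:///output/batch-${time.milliseconds}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The cost of this pattern is exactly what the thread debates: one executor does all of the writing for each batch.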

Re: Help with collect() in Spark Streaming

2015-09-11 Thread Luca
Hi, thanks for answering. With the coalesce() transformation a single worker is in charge of writing to HDFS, but I noticed that the single write operation usually takes too much time, slowing down the whole computation (this is particularly true when 'unified' is made of several partitions).
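One alternative that avoids funneling the write through a single worker is to let every partition write its own part file in parallel and then merge them on HDFS afterwards with Hadoop's `FileUtil.copyMerge`. This is not suggested in the thread itself; it is a common workaround at the time (note `copyMerge` was later removed in Hadoop 3), sketched here with hypothetical paths:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.rdd.RDD

// Write part files in parallel, then concatenate them into one HDFS file.
// 'dir' and 'merged' are illustrative HDFS paths.
def writeAndMerge(rdd: RDD[String], dir: String, merged: String): Unit = {
  rdd.saveAsTextFile(dir) // parallel write: one part file per partition

  val hadoopConf = new Configuration()
  val fs = FileSystem.get(hadoopConf)
  // copyMerge concatenates the part files into a single destination file;
  // deleteSource = true removes the intermediate directory afterwards.
  FileUtil.copyMerge(fs, new Path(dir), fs, new Path(merged),
    /* deleteSource = */ true, hadoopConf, null)
}
```

The merge step is still sequential, but the expensive serialization and write of the data itself happens on all workers at once.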

Re: Help with collect() in Spark Streaming

2015-09-11 Thread Holden Karau
Having the driver write the data instead of a worker probably won't speed it up; you still need to copy all of the data to a single node. Is there something which forces you to only write from a single node?
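For concreteness, this is roughly what the driver-side variant referred to by the thread's subject would look like: `collect()` pulls every record over the network into driver memory, and the driver then writes one file through the HDFS API. As the reply notes, the same volume of data still converges on a single node, so there is no reason to expect it to beat `coalesce(1)`. A sketch with an illustrative destination path:

```scala
import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

// Collect everything to the driver, then write a single HDFS file.
// Dangerous for large batches: the whole dataset must fit in driver memory.
def writeFromDriver(rdd: RDD[String], dest: String): Unit = {
  val records = rdd.collect() // all data lands on the driver
  val fs = FileSystem.get(new Configuration())
  val out = new PrintWriter(fs.create(new Path(dest)))
  try records.foreach(out.println) finally out.close()
}
```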