From my understanding, as soon as I use YARN I don't need to manage parallelism 
myself (at least for the RDD processing).
I don't want to use the direct stream because I would then have to manage the 
offset positioning myself (in order to restart from the last processed offset 
after a Spark job failure).


----- Original Message -----
From: "Cody Koeninger" <[email protected]>
To: "Nicolas Biau" <[email protected]>
Cc: "user" <[email protected]>
Sent: Friday, October 2, 2015 17:43:41
Subject: Re: Spark Streaming over YARN


If you're using the receiver based implementation, and want more parallelism, 
you have to create multiple streams and union them together. 
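
For illustration, a minimal sketch of the "multiple streams plus union" approach. It assumes the Spark 1.x spark-streaming-kafka artifact is on the classpath; the ZooKeeper address, group id, topic name, receiver count and repartition factor are placeholders that do not come from this thread.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object UnionedReceivers {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-union-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholders (assumptions): adjust to your environment.
    val zkQuorum = "zk-host:2181"
    val groupId = "my-consumer-group"
    val topic = "my-topic"
    val numReceivers = 4 // e.g. one receiver per Kafka partition

    // Each createStream call starts its own receiver, and each receiver
    // occupies one executor core for the lifetime of the job.
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, groupId, Map(topic -> 1))
    }

    // Merge the receiver streams into one DStream of (key, message) pairs.
    val unified = ssc.union(streams)

    // Repartitioning spreads the processing over more tasks than receivers.
    unified.repartition(8).foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // println is a stand-in for the real sink (MongoDB in the question).
        records.foreach { case (_, value) => println(value) }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

On YARN the job also has to request enough executors and cores to host both the receivers and the processing tasks, for example with spark-submit's --num-executors and --executor-cores options; cores held by receivers are not available for processing.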


Or use the direct stream. 
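
A hedged sketch of the direct stream with explicit starting offsets, again assuming the Spark 1.x Kafka integration. The loadOffsets and saveOffset helpers are hypothetical stand-ins for whatever store holds the offsets; the broker address and topic name are placeholders.

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object DirectWithOffsets {
  // Hypothetical helper: read the last saved offset per partition from your store.
  // Here it simply starts the 4 partitions of a placeholder topic at offset 0.
  def loadOffsets(): Map[TopicAndPartition, Long] =
    (0 until 4).map(p => TopicAndPartition("my-topic", p) -> 0L).toMap

  // Hypothetical helper: persist the offset reached for one partition (no-op here).
  def saveOffset(topic: String, partition: Int, untilOffset: Long): Unit = ()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-direct-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> "broker-host:9092")

    // No receivers: each batch RDD has one partition per Kafka partition.
    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, loadOffsets(),
      (mmd: MessageAndMetadata[String, String]) => (mmd.key(), mmd.message()))

    stream.foreachRDD { rdd =>
      // The cast only works on the RDD produced directly by the direct stream.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      rdd.foreachPartition { records =>
        // println is a stand-in for the real sink.
        records.foreach { case (_, value) => println(value) }
      }

      // Save offsets only after the batch has been processed, so a restart
      // resumes from the last completed batch.
      ranges.foreach(r => saveOffset(r.topic, r.partition, r.untilOffset))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Saving offsets after the output step gives at-least-once delivery: a failure between the write and the offset save replays that batch on restart, so the sink should tolerate duplicates.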


On Fri, Oct 2, 2015 at 10:40 AM, < [email protected] > wrote: 


Hello, 
I have a job receiving data from Kafka (4 partitions) and persisting the data 
in MongoDB. 
It works fine, but when I deploy it on a YARN cluster (4 nodes with 2 cores each), 
only one node receives all the Kafka partitions and only one node runs 
my RDD processing (the foreach function). 
How can I force YARN to use all the available nodes and cores to process the 
data (receiver and RDD processing)? 

Thanks a lot 
Nicolas 

--------------------------------------------------------------------- 
To unsubscribe, e-mail: [email protected] 
For additional commands, e-mail: [email protected] 



