subject:"Spark Streaming over YARN"

Re: Spark Streaming over YARN

2015-10-04 Thread nibiau

4 partitions.

----- Original Message -----
From: "Dibyendu Bhattacharya" <dibyendu.bhattach...@gmail.com>
To: "Nicolas Biau" <nib...@free.fr>
Cc: "Cody Koeninger" <c...@koeninger.org>, "user" <user@spark.apache.org>
Sent: Sunday, 4 October 2015 16:51:38
Subject: Re: Spark Streaming over YARN

How many partitions are there in your Kafka topic?

Regards,
Dibyendu


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Spark Streaming over YARN

2015-10-04 Thread nibiau

Hello,
I am using https://github.com/dibbhatt/kafka-spark-consumer.
I specify 4 receivers in the ReceiverLauncher, but in the YARN console I can
see only one node receiving the Kafka flow.
(I use Spark 1.3.1.)

Thanks,
Nicolas

----- Original Message -----
From: "Dibyendu Bhattacharya" <dibyendu.bhattach...@gmail.com>
To: nib...@free.fr
Cc: "Cody Koeninger" <c...@koeninger.org>, "user" <user@spark.apache.org>
Sent: Friday, 2 October 2015 18:21:35
Subject: Re: Spark Streaming over YARN

If your Kafka topic has 4 partitions and you specify 4 receivers, the messages
from each partition are received by a dedicated receiver, so your receiving
parallelism is defined by the number of partitions in your topic. Every
receiver task will be scheduled evenly among the nodes in your cluster. There
was a JIRA fixed in Spark 1.5 that distributes receivers evenly.

RDD parallelism (i.e., parallelism while processing your RDD) is controlled
by your block interval and batch interval.

If your block interval is 200 ms, there will be 5 blocks per second. If your
batch interval is 3 seconds, there will be 15 blocks per batch. Every batch is
one RDD, so your RDD will have 15 partitions, which will be honored while
processing the RDD.

Regards, 
Dibyendu 


Re: Spark Streaming over YARN

2015-10-04 Thread Dibyendu Bhattacharya

How many partitions are there in your Kafka topic?

Regards,
Dibyendu

On Sun, Oct 4, 2015 at 8:19 PM, <nib...@free.fr> wrote:

> Hello,
> I am using https://github.com/dibbhatt/kafka-spark-consumer.
> I specify 4 receivers in the ReceiverLauncher, but in the YARN console I can
> see only one node receiving the Kafka flow.
> (I use Spark 1.3.1.)
>
> Thanks,
> Nicolas

Re: Spark Streaming over YARN

2015-10-02 Thread Dibyendu Bhattacharya

Hi,

If you need to use the receiver-based approach, you can try this one:
https://github.com/dibbhatt/kafka-spark-consumer

It is also available as a Spark package:
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer

You just need to specify the number of receivers you want for the desired
parallelism while receiving, and the rest is taken care of by
ReceiverLauncher.

This low-level receiver gives better parallelism both when receiving and
when processing the RDD.

The default receiver-based API (KafkaUtils.createStream) uses the Kafka
high-level API, and the Kafka high-level API has serious issues when used in
production.

Regards,
Dibyendu

On Fri, Oct 2, 2015 at 9:22 PM, <nib...@free.fr> wrote:

> From my understanding, as soon as I use YARN I don't need to manage
> parallelism (at least for RDD processing).
> I don't want to use the direct stream, as I have to manage offset
> positioning (in order to be able to restart from the last offset processed
> after a Spark job failure).

Re: Spark Streaming over YARN

2015-10-02 Thread Cody Koeninger

Neither of those statements is true.
You need more receivers if you want more parallelism.
You don't have to manage offset positioning with the direct stream if you
don't want to, as long as you can accept the limitations of Spark
checkpointing.

On Fri, Oct 2, 2015 at 10:52 AM, <nib...@free.fr> wrote:

> From my understanding, as soon as I use YARN I don't need to manage
> parallelism (at least for RDD processing).
> I don't want to use the direct stream, as I have to manage offset
> positioning (in order to be able to restart from the last offset processed
> after a Spark job failure).

Spark Streaming over YARN

2015-10-02 Thread nibiau

Hello,
I have a job receiving data from Kafka (4 partitions) and persisting the data
in MongoDB.
It works fine, but when I deploy it on a YARN cluster (4 nodes with 2 cores),
only one node receives all the Kafka partitions and only one node
runs my RDD processing (foreach function).
How can I force YARN to use all the resources (nodes and cores) to process the
data (receiver and RDD processing)?

Thanks a lot,
Nicolas


Re: Spark Streaming over YARN

2015-10-02 Thread Cody Koeninger

If you're using the receiver-based implementation and want more
parallelism, you have to create multiple streams and union them together.

Or use the direct stream.
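
A minimal sketch of the "multiple streams, then union" suggestion, written in Python for illustration only. It assumes the `pyspark.streaming.kafka` module that shipped with older Spark releases (the 1.x line discussed in this thread) and its `KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)` call; the helper name `build_unioned_stream` and its `create_stream` parameter are hypothetical and exist only so the wiring can be checked without a cluster.

```python
def build_unioned_stream(ssc, zk_quorum, group_id, topic, num_receivers,
                         create_stream=None):
    """Create num_receivers receiver-based Kafka streams and union them.

    create_stream is injectable (hypothetical parameter) so the wiring can
    be exercised without Spark; by default it is KafkaUtils.createStream.
    """
    if num_receivers < 1:
        raise ValueError("num_receivers must be at least 1")
    if create_stream is None:
        # Imported lazily: this module only exists in older Spark releases.
        from pyspark.streaming.kafka import KafkaUtils
        create_stream = KafkaUtils.createStream
    # One createStream call per receiver; {topic: 1} requests one consumer
    # thread inside each receiver.
    streams = [create_stream(ssc, zk_quorum, group_id, {topic: 1})
               for _ in range(num_receivers)]
    # StreamingContext.union merges the DStreams into a single DStream.
    return ssc.union(*streams)
```

With a real `StreamingContext`, something like `build_unioned_stream(ssc, "zkhost:2181", "my-group", "my-topic", 4)` would start four receivers (host, group, and topic names here are placeholders). Each receiver is a long-running task, so the cluster needs enough cores for the receivers plus the processing tasks.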

On Fri, Oct 2, 2015 at 10:40 AM, <nib...@free.fr> wrote:

> Hello,
> I have a job receiving data from Kafka (4 partitions) and persisting the data
> in MongoDB.
> It works fine, but when I deploy it on a YARN cluster (4 nodes with 2
> cores), only one node receives all the Kafka partitions and only one node
> runs my RDD processing (foreach function).
> How can I force YARN to use all the resources (nodes and cores) to process
> the data (receiver and RDD processing)?
>
> Thanks a lot,
> Nicolas

Re: Spark Streaming over YARN

2015-10-02 Thread nibiau

From my understanding, as soon as I use YARN I don't need to manage parallelism
(at least for RDD processing).
I don't want to use the direct stream, as I have to manage offset positioning
(in order to be able to restart from the last offset processed after a Spark job
failure).

----- Original Message -----
From: "Cody Koeninger" <c...@koeninger.org>
To: "Nicolas Biau" <nib...@free.fr>
Cc: "user" <user@spark.apache.org>
Sent: Friday, 2 October 2015 17:43:41
Subject: Re: Spark Streaming over YARN

If you're using the receiver-based implementation and want more parallelism,
you have to create multiple streams and union them together.

Or use the direct stream.


Re: Spark Streaming over YARN

2015-10-02 Thread nibiau

OK, so if I set, for example, 4 receivers (the number of nodes), how will the
RDD be distributed over the nodes/cores?
In my case I have 4 nodes (with 2 cores).

Thanks,
Nicolas

----- Original Message -----
From: "Dibyendu Bhattacharya" <dibyendu.bhattach...@gmail.com>
To: nib...@free.fr
Cc: "Cody Koeninger" <c...@koeninger.org>, "user" <user@spark.apache.org>
Sent: Friday, 2 October 2015 18:01:59
Subject: Re: Spark Streaming over YARN

Hi,

If you need to use the receiver-based approach, you can try this one:
https://github.com/dibbhatt/kafka-spark-consumer

It is also available as a Spark package:
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer

You just need to specify the number of receivers you want for the desired
parallelism while receiving, and the rest is taken care of by
ReceiverLauncher.

This low-level receiver gives better parallelism both when receiving and
when processing the RDD.

The default receiver-based API (KafkaUtils.createStream) uses the Kafka
high-level API, and the Kafka high-level API has serious issues when used in
production.

Regards,

Dibyendu


Re: Spark Streaming over YARN

2015-10-02 Thread Dibyendu Bhattacharya

If your Kafka topic has 4 partitions and you specify 4 receivers, the
messages from each partition are received by a dedicated receiver, so your
receiving parallelism is defined by the number of partitions in your topic.
Every receiver task will be scheduled evenly among the nodes in your
cluster. There was a JIRA fixed in Spark 1.5 that distributes receivers
evenly.
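
To put rough numbers on the 4-node question, here is a small worked example. It relies on one general Spark Streaming fact (each receiver holds one core for the lifetime of the application) and on two assumptions that the thread does not confirm: that "4 nodes with 2 cores" means 2 cores per node, and that YARN actually grants executors covering all of those cores, which depends on how the job is submitted (for example the `--num-executors` and `--executor-cores` options of spark-submit). The function name is hypothetical.

```python
def cores_left_for_processing(nodes, cores_per_node, num_receivers):
    """Cores remaining for batch processing once every receiver holds one core.

    A result of zero or less means the receivers would occupy every core
    and no batch could be processed.
    """
    total_cores = nodes * cores_per_node
    return total_cores - num_receivers

# 4 nodes x 2 cores = 8 cores; 4 receivers leave 4 cores for RDD processing.
remaining = cores_left_for_processing(nodes=4, cores_per_node=2, num_receivers=4)
```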


RDD parallelism (i.e., parallelism while processing your RDD) is
controlled by your block interval and batch interval.

If your block interval is 200 ms, there will be 5 blocks per second. If
your batch interval is 3 seconds, there will be 15 blocks per batch. Every
batch is one RDD, so your RDD will have 15 partitions, which will be
honored while processing the RDD.
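
The arithmetic above can be sketched as a short calculation. The helper name `blocks_per_batch` is hypothetical, and the count is approximate, since a block is only produced when data actually arrives during a block interval. The `num_receivers` multiplier is an assumption about how several receivers add up, not something stated in this message. In stock Spark the block interval is set through `spark.streaming.blockInterval`; whether kafka-spark-consumer uses the same setting is not stated here.

```python
def blocks_per_batch(batch_interval_ms, block_interval_ms, num_receivers=1):
    """Approximate number of blocks (RDD partitions) in one batch.

    Each receiver cuts one block per block interval; num_receivers scales
    the per-receiver figure (an assumption, see the text above).
    """
    if batch_interval_ms <= 0 or block_interval_ms <= 0:
        raise ValueError("intervals must be positive")
    return (batch_interval_ms // block_interval_ms) * num_receivers

blocks_per_second = 1000 // 200                        # 5 blocks per second
per_receiver = blocks_per_batch(3000, 200)             # 15 partitions per batch
with_four_receivers = blocks_per_batch(3000, 200, 4)   # 60 under the assumption
```

A shorter block interval or a longer batch interval therefore gives more partitions per batch, and so more tasks that can run in parallel.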


Regards,
Dibyendu


On Fri, Oct 2, 2015 at 9:40 PM, <nib...@free.fr> wrote:

> OK, so if I set, for example, 4 receivers (the number of nodes), how will the
> RDD be distributed over the nodes/cores?
> In my case I have 4 nodes (with 2 cores).
>
> Thanks,
> Nicolas

Re: Spark Streaming over YARN

2015-10-02 Thread Cody Koeninger

The direct stream has nothing to do with ZooKeeper.

The direct stream can start at the offsets you specify. If you're not
storing offsets in checkpoints, how and where you store them is up to you.
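
As a rough illustration of keeping your own offsets, the sketch below uses a plain dict as a stand-in for whatever durable store you choose (a database table, for instance). The function names, the tuple layout, and the default of 0 for partitions never seen are all assumptions made for this example, not Spark API. In a real job the stored offsets would be updated together with the batch results and then handed to the direct stream as its starting offsets.

```python
def starting_offsets(store, topic, partitions):
    """Offsets to resume from: the stored value, or 0 if none was recorded."""
    return {(topic, p): store.get((topic, p), 0) for p in partitions}

def record_batch(store, offset_ranges):
    """Advance stored offsets once a batch's results have been saved.

    offset_ranges: iterable of (topic, partition, from_offset, until_offset).
    Raises ValueError if a range does not start where the store left off.
    """
    for topic, partition, from_offset, until_offset in offset_ranges:
        key = (topic, partition)
        expected = store.get(key)
        if expected is not None and from_offset != expected:
            raise ValueError("gap or replay detected for %s/%d" % key)
        store[key] = until_offset

store = {}
record_batch(store, [("events", 0, 0, 120), ("events", 1, 0, 95)])
resume_from = starting_offsets(store, "events", range(4))
```

After a job failure, `resume_from` holds 120 and 95 for the two partitions that were processed and 0 for the other two, so a restarted job can continue where the last saved batch ended.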

Have you read or watched the information linked from the following page?

https://github.com/koeninger/kafka-exactly-once


On Fri, Oct 2, 2015 at 11:36 AM, <nib...@free.fr> wrote:

> Sorry, I just said that I NEED to manage offsets, so in the case of the Kafka
> direct stream, how can I handle this?
> Should I update ZooKeeper manually? Why not, but are there any other solutions?

Re: Spark Streaming over YARN

2015-10-02 Thread nibiau

Sorry, I just said that I NEED to manage offsets, so in the case of the Kafka
direct stream, how can I handle this?
Should I update ZooKeeper manually? Why not, but are there any other solutions?

----- Original Message -----
From: "Cody Koeninger" <c...@koeninger.org>
To: "Nicolas Biau" <nib...@free.fr>
Cc: "user" <user@spark.apache.org>
Sent: Friday, 2 October 2015 18:29:09
Subject: Re: Spark Streaming over YARN

Neither of those statements is true.
You need more receivers if you want more parallelism.
You don't have to manage offset positioning with the direct stream if you don't
want to, as long as you can accept the limitations of Spark checkpointing.

