Re: Spark Streaming Kafka best practices?
Hi,

On Thu, Dec 18, 2014 at 3:08 AM, Patrick Wendell wrote:
> On Wed, Dec 17, 2014 at 5:43 AM, Gerard Maas wrote:
> > I was wondering why one would choose rdd.map vs rdd.foreach to
> > execute a side-effecting function on an RDD.

Personally, I like to get a count of the processed items, so I do something like

    rdd.map(item => processItem(item)).count()

instead of

    rdd.foreach(item => processItem(item))

but I would be happy to learn about a better way.

Tobias
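Tobias's counting pattern might look like the following inside a streaming job. This is only a sketch: `processItem` is a hypothetical per-event handler, and the code assumes a running Spark Streaming context with a `kafkaStream` DStream as in the original question.

```scala
// Sketch of the map + count pattern; `processItem` is hypothetical.
kafkaStream.foreachRDD { rdd =>
  // map() is lazy, so count() is the action that actually forces
  // processItem to run on the executors; its return value is the
  // number of items processed in this batch.
  val processed = rdd.map(item => processItem(item)).count()
  println(s"Processed $processed events in this batch")
}
```

Note that relying on side effects inside map() works, but it hides the intent; the count is the only reason to prefer it over foreach here.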
Re: Spark Streaming Kafka best practices?
foreach is slightly more efficient because Spark doesn't try to collect results from each task, since it's understood there will be no return value. I think the difference is very marginal, though; it's mostly stylistic. Typically you use foreach for something that is intended to produce a side effect, and map for something that will return a new dataset.

On Wed, Dec 17, 2014 at 5:43 AM, Gerard Maas wrote:
> Patrick,
>
> I was wondering why one would choose rdd.map vs rdd.foreach to
> execute a side-effecting function on an RDD.
>
> -kr, Gerard.
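The stylistic distinction Patrick describes can be sketched as follows. This assumes hypothetical `process` (side-effecting) and `transform` (value-returning) functions and an existing `rdd`:

```scala
// Side effect only: foreach is an action. It runs immediately on the
// executors, returns Unit, and ships nothing back to the driver.
rdd.foreach(event => process(event))

// Transformation: map is lazy and returns a new RDD. Nothing runs until
// an action (count, collect, saveAsTextFile, ...) forces evaluation.
val transformed = rdd.map(event => transform(event))
transformed.count()  // this action triggers the map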
Re: Spark Streaming Kafka best practices?
Patrick,

I was wondering why one would choose rdd.map vs rdd.foreach to execute a side-effecting function on an RDD.

-kr, Gerard.

On Sat, Dec 6, 2014 at 12:57 AM, Patrick Wendell wrote:
> The second choice is better. Once you call collect() you are pulling
> all of the data onto a single node; you want to do most of the
> processing in parallel on the cluster, which is what map() will do.
> Ideally you'd try to summarize or reduce the data before calling
> collect().
Re: Spark Streaming Kafka best practices?
The second choice is better. Once you call collect() you are pulling all of the data onto a single node; you want to do most of the processing in parallel on the cluster, which is what map() will do. Ideally you'd try to summarize or reduce the data before calling collect().

On Fri, Dec 5, 2014 at 5:26 AM, david wrote:
> hi,
>
> What is the best way to process a batch window in Spark Streaming:
>
>     kafkaStream.foreachRDD(rdd => {
>       rdd.collect().foreach(event => {
>         // process the event
>         process(event)
>       })
>     })
>
> or
>
>     kafkaStream.foreachRDD(rdd => {
>       rdd.map(event => {
>         // process the event
>         process(event)
>       }).collect()
>     })
>
> thanks
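Patrick's "summarize or reduce before collect()" advice could look like this sketch, which aggregates on the cluster and pulls back only a small summary. The error-log example and `parseErrorCode` helper are hypothetical:

```scala
// Instead of collect()-ing raw events to the driver, aggregate on the
// executors first; only the per-key totals cross the network.
val errorCounts = rdd
  .filter(line => line.contains("ERROR"))
  .map(line => (parseErrorCode(line), 1L))  // parseErrorCode is hypothetical
  .reduceByKey(_ + _)                       // runs in parallel on the cluster
  .collect()                                // small: one entry per error code
```

The key point is that reduceByKey runs distributed, so collect() only ever sees the already-summarized result.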
Spark Streaming Kafka best practices?
hi,

What is the best way to process a batch window in Spark Streaming:

    kafkaStream.foreachRDD(rdd => {
      rdd.collect().foreach(event => {
        // process the event
        process(event)
      })
    })

or

    kafkaStream.foreachRDD(rdd => {
      rdd.map(event => {
        // process the event
        process(event)
      }).collect()
    })

thanks
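Combining the advice in the replies above, a third option avoids collect() entirely and processes events on the executors. This is a sketch, not a definitive implementation: `process` and `createConnection` are hypothetical, and it assumes `process` can run on worker nodes (i.e., it and anything it closes over are serializable).

```scala
// Process events in parallel on the executors; never pull raw events
// to the driver with collect().
kafkaStream.foreachRDD { rdd =>
  rdd.foreach(event => process(event))
}

// If processing needs a heavyweight resource (e.g. a database
// connection), set it up once per partition rather than once per event:
kafkaStream.foreachRDD { rdd =>
  rdd.foreachPartition { events =>
    val conn = createConnection()  // hypothetical connection factory
    events.foreach(event => process(conn, event))
    conn.close()
  }
}
```

foreachPartition amortizes per-record setup cost and is the shape usually recommended for pushing stream output to external systems.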