Re: Use Spark Streaming to update result whenever data come

2014-07-10 Thread Bill Jay
Tobias,

Your help with the problems I have run into has been very valuable. Thanks a lot!

Bill


On Wed, Jul 9, 2014 at 6:04 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 good to know you found your bottleneck. Unfortunately, I don't know how to
 solve this; until now, I have used Spark only with embarrassingly parallel
 operations such as map or filter. I hope someone else might provide more
 insight here.

 Tobias


 On Thu, Jul 10, 2014 at 9:57 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi Tobias,

 Now I have done the repartitioning and run the program again, and I have
 found a bottleneck in the whole program. In the streaming job, there is a
 stage marked as "combineByKey at ShuffledDStream.scala:42" in the Spark UI.
 This stage is executed repeatedly. However, during some batches, the number
 of executors allocated to this step is only 2, although I used 300 workers
 and specified the partition number as 300. In this case, the program is
 very slow even though the data being processed are not big.

 Do you know how to solve this issue?

 Thanks!


 On Wed, Jul 9, 2014 at 5:51 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 I haven't worked with YARN, but I would try adding a repartition() call
 after you receive your data from Kafka. I would be surprised if that didn't
 help.


 On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi Tobias,

 I was using Spark 0.9 before, and the master I used was yarn-standalone.
 In Spark 1.0, the master is either yarn-cluster or yarn-client. I am not
 sure whether that is the reason why more machines do not provide better
 scalability. What is the difference between these two modes in terms of
 efficiency? Thanks!


 On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 do the additional 100 nodes receive any tasks at all? (I don't know which
 cluster manager you use, but with Mesos you could check the client logs in
 the web interface.) You might want to try something like repartition(N) or
 repartition(N*2) (where N is the number of your nodes) after you receive
 your data.

 Tobias


 On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi Tobias,

 Thanks for the suggestion. I have tried adding more nodes, going from 300
 to 400. It seems the running time did not improve.


 On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 can't you just add more nodes in order to speed up the processing?

 Tobias


 On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com
  wrote:

 Hi all,

 I have a problem using Spark Streaming to accept input data and update
 a result.

 The input data comes from Kafka, and the output is a map that is updated
 with historical data every minute. My current method is to set the batch
 size to 1 minute and use foreachRDD to update this map, outputting the
 map at the end of the foreachRDD function. However, the current issue is
 that the processing cannot be finished within one minute.
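
 Schematically, my current method looks like this (a minimal sketch with
 placeholder names, not the actual job):

   import org.apache.spark.SparkConf
   import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.0
   import org.apache.spark.streaming.{Seconds, StreamingContext}
   import org.apache.spark.streaming.kafka.KafkaUtils

   val conf = new SparkConf().setAppName("MinuteMapUpdate")
   val ssc = new StreamingContext(conf, Seconds(60))   // 1-minute batches

   // placeholder ZooKeeper quorum, consumer group, and topic
   val stream = KafkaUtils.createStream(ssc, "zk:2181", "group", Map("topic" -> 1))

   // the map holding the historical result, kept on the driver
   val result = scala.collection.mutable.Map[String, Long]()

   stream.foreachRDD { rdd =>
     // aggregate the batch on the cluster, then merge into the driver-side map
     val counts = rdd.map { case (_, msg) => (msg, 1L) }.reduceByKey(_ + _).collect()
     counts.foreach { case (k, v) => result(k) = result.getOrElse(k, 0L) + v }
     println(result)   // output the map at the end of each batch
   }

   ssc.start()
   ssc.awaitTermination()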

 I am thinking of updating the map whenever new data comes in instead of
 doing the update when the whole RDD arrives. Does anyone have an idea of
 how to achieve this with a better running time? Thanks!

 Bill

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Bill Jay
Hi Tobias,

I was using Spark 0.9 before, and the master I used was yarn-standalone. In
Spark 1.0, the master is either yarn-cluster or yarn-client. I am not sure
whether that is the reason why more machines do not provide better
scalability. What is the difference between these two modes in terms of
efficiency? Thanks!
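
For reference, the two submission variants as I understand them in Spark 1.0
(the class and jar names below are placeholders):

  # yarn-cluster: the driver runs inside a YARN ApplicationMaster on the cluster
  ./bin/spark-submit --class com.example.MyApp --master yarn-cluster myapp.jar

  # yarn-client: the driver runs in the local process that submits the job
  ./bin/spark-submit --class com.example.MyApp --master yarn-client myapp.jar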


On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 do the additional 100 nodes receive any tasks at all? (I don't know which
 cluster manager you use, but with Mesos you could check the client logs in
 the web interface.) You might want to try something like repartition(N) or
 repartition(N*2) (where N is the number of your nodes) after you receive
 your data.

 Tobias


 On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi Tobias,

 Thanks for the suggestion. I have tried adding more nodes, going from 300
 to 400. It seems the running time did not improve.


 On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 can't you just add more nodes in order to speed up the processing?

 Tobias


 On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi all,

 I have a problem using Spark Streaming to accept input data and update
 a result.

 The input data comes from Kafka, and the output is a map that is updated
 with historical data every minute. My current method is to set the batch
 size to 1 minute and use foreachRDD to update this map, outputting the
 map at the end of the foreachRDD function. However, the current issue is
 that the processing cannot be finished within one minute.

 I am thinking of updating the map whenever new data comes in instead of
 doing the update when the whole RDD arrives. Does anyone have an idea of
 how to achieve this with a better running time? Thanks!

 Bill

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Tobias Pfeiffer
Bill,

I haven't worked with YARN, but I would try adding a repartition() call
after you receive your data from Kafka. I would be surprised if that didn't
help.
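
For what it's worth, a minimal sketch of what I mean, assuming the Spark 1.0
Kafka receiver API (the ZooKeeper quorum, consumer group, topic, and the
partition count below are placeholders):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val conf = new SparkConf().setAppName("RepartitionSketch")
  val ssc = new StreamingContext(conf, Seconds(60))

  // a single receiver stores all incoming blocks on one executor
  val kafkaStream = KafkaUtils.createStream(ssc, "zk:2181", "group", Map("topic" -> 1))

  // spread the received data across the cluster before any shuffle-heavy work
  val repartitioned = kafkaStream.repartition(300)

The idea is that one receiver concentrates the data on very few machines, so
downstream stages stay narrow unless the data is explicitly spread out.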


On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:

 Hi Tobias,

 I was using Spark 0.9 before, and the master I used was yarn-standalone.
 In Spark 1.0, the master is either yarn-cluster or yarn-client. I am not
 sure whether that is the reason why more machines do not provide better
 scalability. What is the difference between these two modes in terms of
 efficiency? Thanks!


 On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 do the additional 100 nodes receive any tasks at all? (I don't know which
 cluster manager you use, but with Mesos you could check the client logs in
 the web interface.) You might want to try something like repartition(N) or
 repartition(N*2) (where N is the number of your nodes) after you receive
 your data.

 Tobias


 On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi Tobias,

 Thanks for the suggestion. I have tried adding more nodes, going from 300
 to 400. It seems the running time did not improve.


 On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 can't you just add more nodes in order to speed up the processing?

 Tobias


 On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi all,

 I have a problem using Spark Streaming to accept input data and update
 a result.

 The input data comes from Kafka, and the output is a map that is updated
 with historical data every minute. My current method is to set the batch
 size to 1 minute and use foreachRDD to update this map, outputting the
 map at the end of the foreachRDD function. However, the current issue is
 that the processing cannot be finished within one minute.

 I am thinking of updating the map whenever new data comes in instead of
 doing the update when the whole RDD arrives. Does anyone have an idea of
 how to achieve this with a better running time? Thanks!

 Bill

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Bill Jay
Hi Tobias,

Now I have done the repartitioning and run the program again, and I have
found a bottleneck in the whole program. In the streaming job, there is a
stage marked as "combineByKey at ShuffledDStream.scala:42" in the Spark UI.
This stage is executed repeatedly. However, during some batches, the number
of executors allocated to this step is only 2, although I used 300 workers
and specified the partition number as 300. In this case, the program is
very slow even though the data being processed are not big.
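
For context, the stage comes from an aggregation specified roughly like this
(a hypothetical sketch, not the actual job, where lines stands for the
repartitioned DStream from Kafka):

  import org.apache.spark.streaming.StreamingContext._   // pair-DStream implicits in Spark 1.0
  import org.apache.spark.streaming.dstream.DStream

  // reduceByKey on a pair DStream is backed by combineByKey on a
  // ShuffledDStream; the second argument requests 300 shuffle partitions
  def aggregate(lines: DStream[(String, String)]): DStream[(String, Long)] =
    lines.map { case (_, msg) => (msg, 1L) }.reduceByKey(_ + _, 300)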

Do you know how to solve this issue?

Thanks!


On Wed, Jul 9, 2014 at 5:51 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 I haven't worked with YARN, but I would try adding a repartition() call
 after you receive your data from Kafka. I would be surprised if that didn't
 help.


 On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi Tobias,

 I was using Spark 0.9 before, and the master I used was yarn-standalone.
 In Spark 1.0, the master is either yarn-cluster or yarn-client. I am not
 sure whether that is the reason why more machines do not provide better
 scalability. What is the difference between these two modes in terms of
 efficiency? Thanks!


 On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 do the additional 100 nodes receive any tasks at all? (I don't know which
 cluster manager you use, but with Mesos you could check the client logs in
 the web interface.) You might want to try something like repartition(N) or
 repartition(N*2) (where N is the number of your nodes) after you receive
 your data.

 Tobias


 On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi Tobias,

 Thanks for the suggestion. I have tried adding more nodes, going from 300
 to 400. It seems the running time did not improve.


 On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 can't you just add more nodes in order to speed up the processing?

 Tobias


 On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi all,

 I have a problem using Spark Streaming to accept input data and update
 a result.

 The input data comes from Kafka, and the output is a map that is updated
 with historical data every minute. My current method is to set the batch
 size to 1 minute and use foreachRDD to update this map, outputting the
 map at the end of the foreachRDD function. However, the current issue is
 that the processing cannot be finished within one minute.

 I am thinking of updating the map whenever new data comes in instead of
 doing the update when the whole RDD arrives. Does anyone have an idea of
 how to achieve this with a better running time? Thanks!

 Bill

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Tobias Pfeiffer
Bill,

good to know you found your bottleneck. Unfortunately, I don't know how to
solve this; until now, I have used Spark only with embarrassingly parallel
operations such as map or filter. I hope someone else might provide more
insight here.

Tobias


On Thu, Jul 10, 2014 at 9:57 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:

 Hi Tobias,

 Now I have done the repartitioning and run the program again, and I have
 found a bottleneck in the whole program. In the streaming job, there is a
 stage marked as "combineByKey at ShuffledDStream.scala:42" in the Spark UI.
 This stage is executed repeatedly. However, during some batches, the number
 of executors allocated to this step is only 2, although I used 300 workers
 and specified the partition number as 300. In this case, the program is
 very slow even though the data being processed are not big.

 Do you know how to solve this issue?

 Thanks!


 On Wed, Jul 9, 2014 at 5:51 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 I haven't worked with YARN, but I would try adding a repartition() call
 after you receive your data from Kafka. I would be surprised if that didn't
 help.


 On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi Tobias,

 I was using Spark 0.9 before, and the master I used was yarn-standalone.
 In Spark 1.0, the master is either yarn-cluster or yarn-client. I am not
 sure whether that is the reason why more machines do not provide better
 scalability. What is the difference between these two modes in terms of
 efficiency? Thanks!


 On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 do the additional 100 nodes receive any tasks at all? (I don't know which
 cluster manager you use, but with Mesos you could check the client logs in
 the web interface.) You might want to try something like repartition(N) or
 repartition(N*2) (where N is the number of your nodes) after you receive
 your data.

 Tobias


 On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi Tobias,

 Thanks for the suggestion. I have tried adding more nodes, going from 300
 to 400. It seems the running time did not improve.


 On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 can't you just add more nodes in order to speed up the processing?

 Tobias


 On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi all,

 I have a problem using Spark Streaming to accept input data and update
 a result.

 The input data comes from Kafka, and the output is a map that is updated
 with historical data every minute. My current method is to set the batch
 size to 1 minute and use foreachRDD to update this map, outputting the
 map at the end of the foreachRDD function. However, the current issue is
 that the processing cannot be finished within one minute.

 I am thinking of updating the map whenever new data comes in instead of
 doing the update when the whole RDD arrives. Does anyone have an idea of
 how to achieve this with a better running time? Thanks!

 Bill

Re: Use Spark Streaming to update result whenever data come

2014-07-08 Thread Bill Jay
Hi Tobias,

Thanks for the suggestion. I have tried adding more nodes, going from 300 to
400. It seems the running time did not improve.


On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 can't you just add more nodes in order to speed up the processing?

 Tobias


 On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi all,

 I have a problem using Spark Streaming to accept input data and update
 a result.

 The input data comes from Kafka, and the output is a map that is updated
 with historical data every minute. My current method is to set the batch
 size to 1 minute and use foreachRDD to update this map, outputting the
 map at the end of the foreachRDD function. However, the current issue is
 that the processing cannot be finished within one minute.

 I am thinking of updating the map whenever new data comes in instead of
 doing the update when the whole RDD arrives. Does anyone have an idea of
 how to achieve this with a better running time? Thanks!

 Bill

Re: Use Spark Streaming to update result whenever data come

2014-07-08 Thread Tobias Pfeiffer
Bill,

do the additional 100 nodes receive any tasks at all? (I don't know which
cluster manager you use, but with Mesos you could check the client logs in
the web interface.) You might want to try something like repartition(N) or
repartition(N*2) (where N is the number of your nodes) after you receive
your data.

Tobias


On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote:

 Hi Tobias,

 Thanks for the suggestion. I have tried adding more nodes, going from 300
 to 400. It seems the running time did not improve.


 On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 can't you just add more nodes in order to speed up the processing?

 Tobias


 On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi all,

 I have a problem using Spark Streaming to accept input data and update
 a result.

 The input data comes from Kafka, and the output is a map that is updated
 with historical data every minute. My current method is to set the batch
 size to 1 minute and use foreachRDD to update this map, outputting the
 map at the end of the foreachRDD function. However, the current issue is
 that the processing cannot be finished within one minute.

 I am thinking of updating the map whenever new data comes in instead of
 doing the update when the whole RDD arrives. Does anyone have an idea of
 how to achieve this with a better running time? Thanks!

 Bill