Bill, good to know you found your bottleneck. Unfortunately, I don't know how to solve this; until now, I have used Spark only with embarrassingly parallel operations such as map or filter. I hope someone else can provide more insight here.
Tobias

On Thu, Jul 10, 2014 at 9:57 AM, Bill Jay <bill.jaypeter...@gmail.com> wrote:
> Hi Tobias,
>
> Now I did the re-partition and ran the program again. I found a bottleneck
> in the whole program. In the streaming job, there is a stage marked as
> "combineByKey at ShuffledDStream.scala:42" in the Spark UI. This stage is
> executed repeatedly. However, during some batches, the number of executors
> allocated to this step is only 2, although I used 300 workers and specified
> the partition number as 300. In this case, the program is very slow, even
> though the data being processed are not big.
>
> Do you know how to solve this issue?
>
> Thanks!
>
> On Wed, Jul 9, 2014 at 5:51 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>> Bill,
>>
>> I haven't worked with Yarn, but I would try adding a repartition() call
>> after you receive your data from Kafka. I would be surprised if that
>> didn't help.
>>
>> On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay <bill.jaypeter...@gmail.com> wrote:
>>> Hi Tobias,
>>>
>>> I was using Spark 0.9 before, and the master I used was yarn-standalone.
>>> In Spark 1.0, the master will be either yarn-cluster or yarn-client. I am
>>> not sure whether that is the reason why more machines do not provide
>>> better scalability. What is the difference between these two modes in
>>> terms of efficiency? Thanks!
>>>
>>> On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>>>> Bill,
>>>>
>>>> Do the additional 100 nodes receive any tasks at all? (I don't know
>>>> which cluster manager you use, but with Mesos you could check the client
>>>> logs in the web interface.) You might want to try something like
>>>> repartition(N) or repartition(N*2) (with N the number of your nodes)
>>>> after you receive your data.
>>>>
>>>> Tobias
>>>>
>>>> On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay <bill.jaypeter...@gmail.com> wrote:
>>>>> Hi Tobias,
>>>>>
>>>>> Thanks for the suggestion. I have tried adding more nodes, from 300 to
>>>>> 400. It seems the running time did not improve.
>>>>>
>>>>> On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>>>>>> Bill,
>>>>>>
>>>>>> Can't you just add more nodes in order to speed up the processing?
>>>>>>
>>>>>> Tobias
>>>>>>
>>>>>> On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay <bill.jaypeter...@gmail.com> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have a problem using Spark Streaming to accept input data and
>>>>>>> update a result.
>>>>>>>
>>>>>>> The input data come from Kafka, and the output is a map that is
>>>>>>> updated with historical data every minute. My current method is to set
>>>>>>> the batch size to 1 minute and use foreachRDD to update this map,
>>>>>>> outputting the map at the end of the foreachRDD function. However, the
>>>>>>> current issue is that the processing cannot be finished within one
>>>>>>> minute.
>>>>>>>
>>>>>>> I am thinking of updating the map whenever new data come in, instead
>>>>>>> of doing the update when the whole RDD arrives. Is there any idea on
>>>>>>> how to achieve this with a better running time? Thanks!
>>>>>>>
>>>>>>> Bill
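---

For reference, below is a minimal Scala sketch of the ideas discussed in this thread; it is an illustrative outline, not code taken from any of the messages above. It repartitions the Kafka stream right after the receiver, passes an explicit partition count to the per-batch aggregation (the combineByKey stage visible in the UI), and uses updateStateByKey as one possible way to keep a running map across batches instead of rebuilding it inside foreachRDD. The ZooKeeper address, topic, consumer group, partition count of 300, key/value parsing, and checkpoint path are all placeholders, not values from the original job.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations (needed before Spark 1.3)
import org.apache.spark.streaming.kafka.KafkaUtils

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("repartition-sketch")
    val ssc  = new StreamingContext(conf, Seconds(60))  // 1-minute batches, as in the thread
    ssc.checkpoint("hdfs:///tmp/checkpoint")            // placeholder path; required for updateStateByKey

    // Receiver-based Kafka input (available in Spark 1.0); all connection values are placeholders.
    val lines = KafkaUtils
      .createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
      .map(_._2)

    // Spread records over all workers before any shuffle, so the combineByKey
    // stage is not confined to the few executors running receivers.
    val keyed = lines.repartition(300).map(line => (line, 1L))

    // Also give the per-batch aggregation an explicit partition count.
    val counts = keyed.reduceByKey(_ + _, 300)

    // Keep a running per-key total across batches instead of rebuilding a map in foreachRDD.
    val running = counts.updateStateByKey[Long] { (values: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + values.sum)
    }

    running.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Whether updateStateByKey is appropriate depends on what the map actually holds; it only fits if the update can be expressed per key, and the state is then written out once per batch rather than whenever individual records arrive.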