Bill,

good to know you found your bottleneck. Unfortunately, I don't know how to
solve this; until now, I have used Spark only for embarrassingly parallel
operations such as map or filter. I hope someone else can provide more
insight here.

Tobias


On Thu, Jul 10, 2014 at 9:57 AM, Bill Jay <bill.jaypeter...@gmail.com>
wrote:

> Hi Tobias,
>
> Now I have done the re-partition and run the program again, and I have
> found a bottleneck in the whole program. In the streaming job, there is a
> stage marked as *"combineByKey at ShuffledDStream.scala:42"* in the Spark
> UI. This stage is executed repeatedly. However, during some batches, the
> number of executors allocated to this step is only 2, although I used 300
> workers and specified the partition number as 300. In this case, the
> program is very slow, even though the data being processed are not big.
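>
> To illustrate the setup (a simplified sketch; the actual key and value
> types and the aggregation differ, and reduceByKey here stands in for
> whichever key-based operation produces that combineByKey stage):
>
>   // assumes the usual imports, e.g. org.apache.spark.streaming.StreamingContext._
>   // kafkaStream is the DStream received from Kafka (details omitted)
>   val repartitioned = kafkaStream.repartition(300)
>   // the "combineByKey at ShuffledDStream.scala" stage comes from the
>   // key-based aggregation; the second argument asks for 300 partitions
>   val counts = repartitioned
>     .map { case (_, value) => (value, 1L) }
>     .reduceByKey(_ + _, 300)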
>
> Do you know how to solve this issue?
>
> Thanks!
>
>
> On Wed, Jul 9, 2014 at 5:51 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>
>> Bill,
>>
>> I haven't worked with Yarn, but I would try adding a repartition() call
>> after you receive your data from Kafka. I would be surprised if that didn't
>> help.
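>>
>> A minimal sketch of what I have in mind, assuming a receiver created via
>> KafkaUtils.createStream (the ZooKeeper address, consumer group, topic
>> name, and the partition count of 300 are placeholders, not values from
>> your job):
>>
>>   import org.apache.spark.SparkConf
>>   import org.apache.spark.streaming.{Seconds, StreamingContext}
>>   import org.apache.spark.streaming.kafka.KafkaUtils
>>
>>   val conf = new SparkConf().setAppName("kafka-repartition-sketch")
>>   val ssc = new StreamingContext(conf, Seconds(60))
>>   val kafkaStream = KafkaUtils.createStream(
>>     ssc, "zk-host:2181", "my-consumer-group", Map("my-topic" -> 1))
>>   // spread the received data over the cluster before any shuffle stage
>>   val repartitioned = kafkaStream.repartition(300)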
>>
>>
>> On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay <bill.jaypeter...@gmail.com>
>> wrote:
>>
>>> Hi Tobias,
>>>
>>> I was using Spark 0.9 before, and the master I used was yarn-standalone.
>>> In Spark 1.0, the master is either yarn-cluster or yarn-client. I am not
>>> sure whether that is the reason why more machines do not provide better
>>> scalability. What is the difference between these two modes in terms of
>>> efficiency? Thanks!
>>>
>>>
>>> On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer <t...@preferred.jp>
>>> wrote:
>>>
>>>> Bill,
>>>>
>>>> do the additional 100 nodes receive any tasks at all? (I don't know
>>>> which cluster you use, but with Mesos you could check the client logs in
>>>> the web interface.) You might want to try something like repartition(N)
>>>> or repartition(N*2) (where N is the number of your nodes) after you
>>>> receive your data.
>>>>
>>>> Tobias
>>>>
>>>>
>>>> On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay <bill.jaypeter...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Tobias,
>>>>>
>>>>> Thanks for the suggestion. I have tried increasing the number of nodes
>>>>> from 300 to 400. It seems the running time did not improve.
>>>>>
>>>>>
>>>>> On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer <t...@preferred.jp>
>>>>> wrote:
>>>>>
>>>>>> Bill,
>>>>>>
>>>>>> can't you just add more nodes in order to speed up the processing?
>>>>>>
>>>>>> Tobias
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay <bill.jaypeter...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have a problem using Spark Streaming to accept input data and
>>>>>>> update a result.
>>>>>>>
>>>>>>> The input data come from Kafka, and the output is a map that is
>>>>>>> updated with historical data every minute. My current method is to
>>>>>>> set the batch size to 1 minute and use foreachRDD to update this map
>>>>>>> and output it at the end of the foreachRDD function. However, the
>>>>>>> current issue is that the processing cannot be finished within one
>>>>>>> minute.
>>>>>>>
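>>>>>>> Roughly, the pattern looks like this (a simplified sketch; the
>>>>>>> counting by message value and the mutable map stand in for my actual
>>>>>>> update logic):
>>>>>>>
>>>>>>>   import org.apache.spark.SparkContext._
>>>>>>>   import org.apache.spark.streaming.StreamingContext._
>>>>>>>   import scala.collection.mutable
>>>>>>>
>>>>>>>   // result map kept on the driver, merged with each 1-minute batch
>>>>>>>   val historical = mutable.Map[String, Long]()
>>>>>>>
>>>>>>>   // kafkaStream is the DStream received from Kafka (details omitted)
>>>>>>>   kafkaStream.foreachRDD { rdd =>
>>>>>>>     // aggregate the batch on the cluster, then merge into the map
>>>>>>>     val batchCounts = rdd
>>>>>>>       .map { case (_, value) => (value, 1L) }
>>>>>>>       .reduceByKey(_ + _)
>>>>>>>       .collect()
>>>>>>>     batchCounts.foreach { case (k, v) =>
>>>>>>>       historical(k) = historical.getOrElse(k, 0L) + v
>>>>>>>     }
>>>>>>>     // report the updated map at the end of the batch
>>>>>>>     println(historical)
>>>>>>>   }
>>>>>>>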
>>>>>>> I am thinking of updating the map whenever new data come in, instead
>>>>>>> of doing the update only when the whole RDD arrives. Does anyone have
>>>>>>> an idea how to achieve this with a better running time? Thanks!
>>>>>>>
>>>>>>> Bill
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
