The fact that the process is being killed constantly is a red flag. Also,
why are you running it as a client VM?

Check your nimbus.log to see why it's restarting.
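For example, assuming the default layout where the logs live under
$STORM_HOME/logs (adjust the path for your install):

    grep -iE 'error|exception' $STORM_HOME/logs/nimbus.log | tail -n 50

If something external (daemontools, supervisord, monit) is supervising the
process, its own log should also say why it keeps bouncing it.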

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
[email protected]


On Sun, Mar 2, 2014 at 7:50 PM, Sean Solbak <[email protected]> wrote:

>   uintx ErgoHeapSizeLimit                  = 0          {product}
>   uintx InitialHeapSize                   := 27080896   {product}
>   uintx LargePageHeapSizeThreshold         = 134217728  {product}
>   uintx MaxHeapSize                       := 698351616  {product}
>
>
> So an initial heap size of ~26 MB and a max of ~666 MB.
>
> It's a client process (not server; i.e., the command is "java -client
> -Dstorm.options...").  The process gets killed and restarted continuously
> with a new PID (which makes it tough to grab the PID and get stats on it).
> I don't have VisualVM, but if I run
>
> jstat -gc PID, I get
>
>  S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC       PU     YGC    YGCT   FGC   FGCT    GCT
> 832.0  832.0   0.0   352.9   7168.0   1115.9   17664.0     1796.0  21248.0  16029.6     5   0.268     0  0.000   0.268
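>
> Since the PID changes on every restart, one way to keep sampling is a loop
> that re-resolves it on each pass (a rough sketch; it assumes the nimbus
> command line is matchable with pgrep):
>
>     # grab the newest matching PID, sample GC stats for 5s, repeat
>     while true; do
>       pid=$(pgrep -f -n daemon.nimbus) && jstat -gc "$pid" 1s 5
>       sleep 1
>     done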
>
> At this point I'll likely just rebuild the cluster.  It's not in prod yet
> as I still need to tune it.  I should have written 2 separate emails :)
>
> Thanks,
> S
>
>
>
>
> On Sun, Mar 2, 2014 at 7:10 PM, Michael Rose <[email protected]> wrote:
>
>> I'm not seeing too much to substantiate that. What size heap are you
>> running, and is it nearly full? Perhaps attach VisualVM and check for GC
>> activity.
>>
>>  Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>> [email protected]
>>
>>
>> On Sun, Mar 2, 2014 at 6:54 PM, Sean Solbak <[email protected]> wrote:
>>
>>> Here it is.  Appears to be some kind of race condition.
>>>
>>> http://pastebin.com/dANT8SQR
>>>
>>>
>>> On Sun, Mar 2, 2014 at 6:42 PM, Michael Rose <[email protected]> wrote:
>>>
>>>> Can you do a thread dump and pastebin it? That's a good first step to
>>>> figuring this out.
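>>>>
>>>> Something like (assuming the JDK tools are on your PATH; use whatever
>>>> PID the nimbus process currently has):
>>>>
>>>>     jstack <pid> > nimbus-threads.txt   # or: kill -3 <pid> dumps to the JVM's stdout
>>>>
>>>> Since the CPU is pinned, "top -H -p <pid>" will show the hot thread IDs;
>>>> converted to hex, they match the nid= fields in the dump.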
>>>>
>>>> I just checked on our Nimbus and while it's on a larger machine, it's
>>>> using <1% CPU. Also look in your logs for any clues.
>>>>
>>>>
>>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>> [email protected]
>>>>
>>>>
>>>> On Sun, Mar 2, 2014 at 6:31 PM, Sean Solbak <[email protected]> wrote:
>>>>
>>>>> No, they are on separate machines.  It's a 4-machine cluster: 2
>>>>> workers, 1 nimbus, and 1 zookeeper.
>>>>>
>>>>> I suppose I could just create a new cluster, but I'd like to know why
>>>>> this is occurring to avoid future production outages.
>>>>>
>>>>> Thanks,
>>>>> S
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Mar 2, 2014 at 6:19 PM, Michael Rose <[email protected]> wrote:
>>>>>
>>>>>> Are you running Zookeeper on the same machine as the Nimbus box?
>>>>>>
>>>>>>  Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>>>> [email protected]
>>>>>>
>>>>>>
>>>>>> On Sun, Mar 2, 2014 at 6:16 PM, Sean Solbak <[email protected]> wrote:
>>>>>>
>>>>>>> This is the first step of 4. When I save to the db I'm actually saving
>>>>>>> to a queue (just using the db for now).  In the 2nd step we index the
>>>>>>> data, and in the 3rd we do aggregation/counts for reporting.  The last
>>>>>>> is a search that I'm planning on using DRPC for.  Within step 2 we pipe
>>>>>>> certain datasets in real time to the clients they apply to.  I'd like
>>>>>>> this and the DRPC call to be sub-2s, which should be reasonable.
>>>>>>>
>>>>>>> You're right that I could speed up step 1 by not using Trident, but
>>>>>>> our requirements seem like a good use case for the other 3 steps.  With
>>>>>>> many results per second, batching should help performance a ton as long
>>>>>>> as the batch size is small enough.
>>>>>>>
>>>>>>> What would cause nimbus to be at 100% CPU with the topologies
>>>>>>> killed?
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>> On Mar 2, 2014, at 5:46 PM, Sean Allen <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Is there a reason you are using Trident?
>>>>>>>
>>>>>>> If you don't need to handle the events as a batch, you are probably
>>>>>>> going to get better performance without it.
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Mar 2, 2014 at 2:23 PM, Sean Solbak <[email protected]> wrote:
>>>>>>>
>>>>>>>> I'm writing a fairly basic Trident topology as follows:
>>>>>>>>
>>>>>>>> - 4 spouts of events
>>>>>>>> - merges into one stream
>>>>>>>> - serializes each event object to a string
>>>>>>>> - saves to the db
>>>>>>>>
>>>>>>>> To speed things up, I split the serialization task away from the
>>>>>>>> spout, since it was CPU intensive.
>>>>>>>>
>>>>>>>> The problem I have is that after 10 minutes there are over 910k
>>>>>>>> tuples emitted/transferred but only 193k records saved.
>>>>>>>>
>>>>>>>> The overall load of the topology seems fine.
>>>>>>>>
>>>>>>>> - 536.404 ms complete latency at the topology level
>>>>>>>> - the highest capacity of any bolt is 0.3, and that's the
>>>>>>>> serialization one
>>>>>>>> - each bolt task has sub-20 ms execute latency and sub-40 ms process
>>>>>>>> latency
>>>>>>>>
>>>>>>>> So it seems Trident has all the records internally, but I need these
>>>>>>>> events as close to real time as possible.
>>>>>>>>
>>>>>>>> Does anyone have any guidance as to how to increase the throughput?
>>>>>>>> Is it simply a matter of tweaking max spout pending and the batch
>>>>>>>> size?
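>>>>>>>>
>>>>>>>> (By those I mean conf entries like the following; the values are just
>>>>>>>> placeholders, not recommendations:)
>>>>>>>>
>>>>>>>>     # cap on the number of unacked batches in flight from each spout
>>>>>>>>     topology.max.spout.pending: 10
>>>>>>>>     # Trident emits a new batch each interval; smaller means fresher data
>>>>>>>>     topology.trident.batch.emit.interval.millis: 500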
>>>>>>>>
>>>>>>>> I'm running it on 2 m1.smalls for now.  I don't see the need to
>>>>>>>> upgrade until the demand on the boxes is higher.  However, CPU usage
>>>>>>>> on the nimbus box is pinned at ~99%.  Why would that be?  It's at
>>>>>>>> 99% even when all the topologies are killed.
>>>>>>>>
>>>>>>>> We are currently targeting processing 200 million records per day,
>>>>>>>> which seems like it should be quite easy based on what I've read that
>>>>>>>> other people have achieved.  I realize that hardware can boost this
>>>>>>>> as well, but my first goal is to get Trident to push the records to
>>>>>>>> the db quicker.
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Sean
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> This is not a signature
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks,
>>>>>
>>>>> Sean Solbak, BSc, MCSD
>>>>> Solbak Technologies Inc.
>>>>> 780.893.7326 (m)
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks,
>>>
>>> Sean Solbak, BSc, MCSD
>>> Solbak Technologies Inc.
>>> 780.893.7326 (m)
>>>
>>
>>
>
>
> --
> Thanks,
>
> Sean Solbak, BSc, MCSD
> Solbak Technologies Inc.
> 780.893.7326 (m)
>
