    uintx ErgoHeapSizeLimit                         = 0               {product}
    uintx InitialHeapSize                          := 27080896        {product}
    uintx LargePageHeapSizeThreshold                = 134217728       {product}
    uintx MaxHeapSize                              := 698351616       {product}


so an initial heap size of 27080896 bytes (~26 MB) and a max of 698351616 bytes (~666 MB)
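
As a sanity check on how I'm reading those flags, a throwaway class like this
(the class name is made up), started with the same heap settings, should report
the same limits via the standard MemoryMXBean:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;

    public class HeapCheck {
        public static void main(String[] args) {
            // init/max here correspond to the InitialHeapSize/MaxHeapSize flags above
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            System.out.printf("heap init=%dMB used=%dMB committed=%dMB max=%dMB%n",
                    heap.getInit() / (1024 * 1024),
                    heap.getUsed() / (1024 * 1024),
                    heap.getCommitted() / (1024 * 1024),
                    heap.getMax() / (1024 * 1024));
        }
    }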

It's a client process (not server, i.e. the command is "java -client
-Dstorm.options...").  The process gets killed and restarted continuously
with a new PID (which makes it tough to pin down the PID and get stats on
it).  I don't have VisualVM, but if I run

jstat -gc PID, I get

 S0C    S1C    S0U    S1U      EC       EU        OC         OU        PC        PU     YGC    YGCT   FGC   FGCT    GCT
832.0  832.0   0.0   352.9   7168.0   1115.9   17664.0    1796.0    21248.0   16029.6     5   0.268     0  0.000   0.268
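
Since the PID changes on every restart and I don't have VisualVM, if the churn
turns out to be my own worker processes I could also log GC counters from inside
the topology code instead of chasing jstat - rough sketch only, the class name
and interval are placeholders:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcLogger implements Runnable {
        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                // roughly the counters jstat shows as YGC/YGCT and FGC/FGCT, keyed by collector name
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    System.out.printf("%s: collections=%d time=%dms%n",
                            gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
                }
                try {
                    Thread.sleep(5000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }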

At this point I'll likely just rebuild the cluster.  It's not in prod yet, as
I still need to tune it.  I should have written 2 separate emails :)

Thanks,
S




On Sun, Mar 2, 2014 at 7:10 PM, Michael Rose <[email protected]> wrote:

> I'm not seeing too much to substantiate that. What size heap are you
> running, and is it near filled? Perhaps attach VisualVM and check for GC
> activity.
>
>  Michael Rose (@Xorlev <https://twitter.com/xorlev>)
> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
> [email protected]
>
>
> On Sun, Mar 2, 2014 at 6:54 PM, Sean Solbak <[email protected]> wrote:
>
>> Here it is.  Appears to be some kind of race condition.
>>
>> http://pastebin.com/dANT8SQR
>>
>>
>> On Sun, Mar 2, 2014 at 6:42 PM, Michael Rose <[email protected]> wrote:
>>
>>> Can you do a thread dump and pastebin it? It's a nice first step to
>>> figure this out.
>>>
>>> I just checked on our Nimbus and while it's on a larger machine, it's
>>> using <1% CPU. Also look in your logs for any clues.
>>>
>>>
>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>> [email protected]
>>>
>>>
>>> On Sun, Mar 2, 2014 at 6:31 PM, Sean Solbak <[email protected]> wrote:
>>>
>>>> No, they are on separate machines.  It's a 4-machine cluster - 2
>>>> workers, 1 nimbus and 1 zookeeper.
>>>>
>>>> I suppose I could just create a new cluster, but I'd like to know why
>>>> this is occurring to avoid future production outages.
>>>>
>>>> Thanks,
>>>> S
>>>>
>>>>
>>>>
>>>> On Sun, Mar 2, 2014 at 6:19 PM, Michael Rose <[email protected]> wrote:
>>>>
>>>>> Are you running Zookeeper on the same machine as the Nimbus box?
>>>>>
>>>>>  Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>>> [email protected]
>>>>>
>>>>>
>>>>> On Sun, Mar 2, 2014 at 6:16 PM, Sean Solbak <[email protected]> wrote:
>>>>>
>>>>>> This is the first step of 4.  When I save to the db I'm actually saving
>>>>>> to a queue (just using the db for now).  In the 2nd step we index the
>>>>>> data, and in the 3rd we do aggregation/counts for reporting.  The last
>>>>>> step is a search that I'm planning on using drpc for.  Within step 2 we
>>>>>> pipe certain datasets in real time to the clients they apply to.  I'd
>>>>>> like this and the drpc to be sub-2s, which should be reasonable.
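>>>>>>
>>>>>> For the drpc search step, roughly what I have in mind is wiring a drpc
>>>>>> stream against whatever state the indexing step builds - just a sketch,
>>>>>> the "search" function name and the MapGet lookup are assumptions on my part:
>>>>>>
>>>>>>     import backtype.storm.LocalDRPC;
>>>>>>     import backtype.storm.tuple.Fields;
>>>>>>     import storm.trident.TridentState;
>>>>>>     import storm.trident.TridentTopology;
>>>>>>     import storm.trident.operation.builtin.MapGet;
>>>>>>
>>>>>>     public class SearchDrpcSketch {
>>>>>>         // "index" stands in for whatever map-backed TridentState step 2/3 produces
>>>>>>         public static void addSearch(TridentTopology topology,
>>>>>>                                      TridentState index, LocalDRPC drpc) {
>>>>>>             // LocalDRPC is for local testing; on the cluster the drpc argument is dropped
>>>>>>             topology.newDRPCStream("search", drpc)
>>>>>>                     .groupBy(new Fields("args"))
>>>>>>                     .stateQuery(index, new Fields("args"), new MapGet(), new Fields("result"));
>>>>>>         }
>>>>>>     }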
>>>>>>
>>>>>> You're right that I could speed up step 1 by not using trident, but our
>>>>>> requirements seem like a good use case for the other 3 steps.  With many
>>>>>> results per second, batching shouldn't affect performance a ton as long
>>>>>> as the batch size is small enough.
>>>>>>
>>>>>> What would cause nimbus to be at 100% CPU with the topologies killed?
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On Mar 2, 2014, at 5:46 PM, Sean Allen <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> Is there a reason you are using trident?
>>>>>>
>>>>>> If you don't need to handle the events as a batch, you are probably
>>>>>> going to get better performance w/o it.
>>>>>>
>>>>>>
>>>>>> On Sun, Mar 2, 2014 at 2:23 PM, Sean Solbak <[email protected]> wrote:
>>>>>>
>>>>>>> I'm writing a fairly basic trident topology as follows:
>>>>>>>
>>>>>>> - 4 spouts of events
>>>>>>> - merges into one stream
>>>>>>> - serializes each event object to a string
>>>>>>> - saves to db
>>>>>>>
>>>>>>> I split the serialization task away from the spout to speed it up, as
>>>>>>> it was cpu intensive.
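>>>>>>>
>>>>>>> In code it's roughly the following (just a sketch - the spout, function
>>>>>>> and state classes are stand-ins for the real ones, and each spout is
>>>>>>> assumed to emit a single "event" field):
>>>>>>>
>>>>>>>     import backtype.storm.tuple.Fields;
>>>>>>>     import storm.trident.Stream;
>>>>>>>     import storm.trident.TridentTopology;
>>>>>>>     import storm.trident.operation.Function;
>>>>>>>     import storm.trident.spout.IBatchSpout;
>>>>>>>     import storm.trident.state.StateFactory;
>>>>>>>     import storm.trident.state.StateUpdater;
>>>>>>>
>>>>>>>     public class EventTopologySketch {
>>>>>>>         public static TridentTopology build(IBatchSpout a, IBatchSpout b,
>>>>>>>                                             IBatchSpout c, IBatchSpout d,
>>>>>>>                                             Function serialize,
>>>>>>>                                             StateFactory dbState,
>>>>>>>                                             StateUpdater dbUpdater) {
>>>>>>>             TridentTopology topology = new TridentTopology();
>>>>>>>
>>>>>>>             // 4 event spouts merged into one stream
>>>>>>>             Stream merged = topology.merge(
>>>>>>>                     topology.newStream("events-a", a),
>>>>>>>                     topology.newStream("events-b", b),
>>>>>>>                     topology.newStream("events-c", c),
>>>>>>>                     topology.newStream("events-d", d));
>>>>>>>
>>>>>>>             merged
>>>>>>>                 // serialization split into its own step because it's cpu heavy
>>>>>>>                 .each(new Fields("event"), serialize, new Fields("json"))
>>>>>>>                 .parallelismHint(4)
>>>>>>>                 // the "db" here is really a queue for now
>>>>>>>                 .partitionPersist(dbState, new Fields("json"), dbUpdater);
>>>>>>>
>>>>>>>             return topology;
>>>>>>>         }
>>>>>>>     }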
>>>>>>>
>>>>>>> The problem I have is that after 10 minutes there are over 910k
>>>>>>> tuples emitted/transferred, but only 193k records are saved.
>>>>>>>
>>>>>>> The overall load of the topology seems fine.
>>>>>>>
>>>>>>> - 536.404 ms complete latency at the topology level
>>>>>>> - The highest capacity of any bolt is 0.3, which is the serialization
>>>>>>> one.
>>>>>>> - Each bolt task has sub-20 ms execute latency and sub-40 ms process
>>>>>>> latency.
>>>>>>>
>>>>>>> So it seems trident has all the records internally, but I need these
>>>>>>> events as close to realtime as possible.
>>>>>>>
>>>>>>> Does anyone have any guidance as to how to increase the throughput?
>>>>>>> Is it simply a matter of tweaking max spout pending and the batch size?
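>>>>>>>
>>>>>>> Something along these lines is what I was planning to experiment with
>>>>>>> (the numbers below are just starting guesses on my part):
>>>>>>>
>>>>>>>     import backtype.storm.Config;
>>>>>>>
>>>>>>>     public class TuningSketch {
>>>>>>>         public static Config conf() {
>>>>>>>             Config conf = new Config();
>>>>>>>             // max number of Trident batches allowed in flight at once
>>>>>>>             conf.setMaxSpoutPending(8);
>>>>>>>             // how often a new batch is kicked off; lower = smaller batches / lower latency
>>>>>>>             conf.put(Config.TOPOLOGY_TRIDENT_BATCH_EMIT_INTERVAL_MILLIS, 100);
>>>>>>>             return conf;
>>>>>>>         }
>>>>>>>     }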
>>>>>>>
>>>>>>> I'm running it on 2 m1-smalls for now.  I don't see the need to upgrade
>>>>>>> until the demand on the boxes is higher.  However, CPU usage on the
>>>>>>> nimbus box is pinned at around 99%.  Why would that be?  It's at 99%
>>>>>>> even when all the topologies are killed.
>>>>>>>
>>>>>>> We are currently targeting processing 200 million records per day,
>>>>>>> which seems like it should be quite easy based on what I've read other
>>>>>>> people have achieved.  I realize that hardware should be able to boost
>>>>>>> this as well, but my first goal is to get trident to push the records
>>>>>>> to the db quicker.
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Sean
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Ce n'est pas une signature
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>>
>>>> Sean Solbak, BsC, MCSD
>>>> Solbak Technologies Inc.
>>>> 780.893.7326 (m)
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks,
>>
>> Sean Solbak, BsC, MCSD
>> Solbak Technologies Inc.
>> 780.893.7326 (m)
>>
>
>


-- 
Thanks,

Sean Solbak, BsC, MCSD
Solbak Technologies Inc.
780.893.7326 (m)
